You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Data-Science-For-Beginners/1-Introduction/04-stats-and-probability/notebook.ipynb

1123 lines
155 KiB

{
"cells": [
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"# Introduction to Probability and Statistics\n",
"In this notebook, we will play around with some of the concepts we have previously discussed. Many concepts from probability and statistics are well-represented in major libraries for data processing in Python, such as `numpy` and `pandas`."
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 1,
3 years ago
"metadata": {},
"outputs": [],
"source": [
3 years ago
"import numpy as np\n",
"import pandas as pd\n",
"import random\n",
"import matplotlib.pyplot as plt"
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Random Variables and Distributions\n",
"Let's start with drawing a sample of 30 values from a uniform distribution from 0 to 9. We will also compute mean and variance."
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 2,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Sample: [4, 8, 5, 10, 5, 1, 1, 1, 7, 9, 7, 0, 2, 7, 3, 5, 9, 8, 3, 10, 2, 9, 2, 9, 9, 8, 1, 8, 7, 3]\n",
"Mean = 5.433333333333334\n",
"Variance = 10.178888888888887\n"
]
}
],
3 years ago
"source": [
"sample = [ random.randint(0,10) for _ in range(30) ]\n",
"print(f\"Sample: {sample}\")\n",
"print(f\"Mean = {np.mean(sample)}\")\n",
"print(f\"Variance = {np.var(sample)}\")"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"To visually estimate how many different values are there in the sample, we can plot the **histogram**:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 3,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAL4UlEQVR4nO3db4xlBXnH8e/PXYiCGNpyayzLdDQ1tMZEIROqJSEt2AaKAV+0CSQaa0zmjbXQmJi1b5q+o0lj9IUx2SBKIsVYhNRASzUqMSbttrtAW2AhtXQrq+gOMRawSSn26Yu5C+ty1znL3nPvw8z3k0zm/jmc+xxm9svZc8/hpqqQJPX1qmUPIEn62Qy1JDVnqCWpOUMtSc0ZaklqbvcYKz3vvPNqdXV1jFVL0rZ08ODBp6pqMuu5UUK9urrKgQMHxli1JG1LSf7zZM956EOSmjPUktScoZak5gy1JDVnqCWpOUMtSc1tGeokFyZ58Livp5PcuIDZJEkMOI+6qh4D3g6QZBfwXeCucceSJB1zqoc+rgD+vapOemK2JGm+TvXKxOuA22c9kWQdWAdYWVk5zbEk6eVb3XvPUl738E1Xj7LewXvUSc4ErgH+atbzVbWvqtaqam0ymXm5uiTpZTiVQx9XAfdX1Q/GGkaS9FKnEurrOclhD0nSeAaFOslZwG8Dd447jiTpRIPeTKyq/wZ+YeRZJEkzeGWiJDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJak5Qy1JzRlqSWrOUEtSc4Zakpoz1JLUnKGWpOYMtSQ1Z6glqTlDLUnNGWpJam7op5Cfm+SOJI8mOZTknWMPJknaNOhTyIFPAvdW1e8lORM4a8SZJEnH2TLUSV4HXAb8AUBVPQc8N+5YkqRjhhz6eBOwAXw2yQNJbk5y9okLJVlPciDJgY2NjbkPKkk71ZBQ7wYuBj5dVRcBPwb2nrhQVe2rqrWqWptMJnMeU5J2riGhPgIcqar90/t3sBluSdICbBnqqvo+8ESSC6cPXQE8MupUkqQXDD3r48PAbdMzPh4HPjDeSJKk4w0KdVU9CKyNO4okaRavTJSk5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJak5Qy1JzRlqSWrOUEtSc4Zakpoz1JLUnKGWpOYMtSQ1Z6glqTlDLUnNGWpJam7Qp5AnOQw8A/wEeL6q/ERySVqQQaGe+q2qemq0SSRJM3noQ5KaGxrqAr6S5GCS9VkLJFlPciDJgY2NjflNKEk73NBQX1pVFwNXAR9KctmJC1TVvqpaq6q1yWQy1yElaScbFOqq+t70+1HgLuCSMYeSJL1oy1AnOTvJOcduA78DPDT2YJKkTUPO+ng9cFeSY8v/ZVXdO+pUkqQXbBnqqnoceNsCZpEkzeDpeZLUnKGWpOYMtSQ1Z6glqTlDLUnNGWpJas5QS1JzhlqSmjPUktScoZak5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJam5waFOsivJA0nuHnMgSdJPO5U96huAQ2MNIkmabVCok+wBrgZuHnccSdKJdg9c7hPAR4FzTrZAknVgHWBlZeW0B1u01b33LO21D9909dJeW9vfMn+3NR9b7lEneTdwtKoO/qzlqmpfVa1V1dpkMpnbgJK00w059HEpcE2Sw8AXgMuTfH7UqSRJL9gy1FX1saraU1WrwHXA16vqvaNPJkkCPI9aktob+mYiAFV1H3DfKJNIkmZyj1qSmjPUktScoZak5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJak5Qy1JzRlqSWrOUEtSc4Zakpoz1JLUnKGWpOYMtSQ1Z6glqbktQ53k1Un+Mck/J3k4yZ8tYjBJ0qbdA5b5H+Dyqno2yRnAt5L8bVX9w8izSZIYEOqqKuDZ6d0zpl815lCSpBcN2aMmyS7gIPArwKeqav+MZdaBdYCVlZV5zrjtre69Z9kjLNzhm65eyusu69/1srZX28OgNxOr6idV9XZgD3BJkrfOWGZfVa1V1dpkMpnzmJK0c53SWR9V9SPgPuDKMYaRJL3UkLM+JknOnd5+DfAu4NGR55IkTQ05Rv0G4NbpcepXAV+sqrvHHUuSdMyQsz7+BbhoAbNIkmbwykRJas5QS1JzhlqSmjPUktScoZak5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJak5Qy1JzRlqSWrOUEtSc4Zakpoz1JLU3JahTnJBkm8kOZTk4SQ3LGIwSdKmLT+FHHge+EhV3Z/kHOBgkq9W1SMjzyZJYsAedVU9WVX3T28/AxwCzh97MEnSplM6Rp1kFbgI2D/KNJKklxgc6iSvBb4E3FhVT894fj3JgSQHNjY25jmjJO1og0Kd5Aw2I31bVd05a5mq2ldVa1W1NplM5jmjJO1oQ876CPAZ4FBVfXz8kSRJxxuyR30p8D7g8iQPTr9+d+S5JElTW56eV1XfArKAWSRJM3hloiQ1Z6glqTlDLUnNGWpJas5QS1JzhlqSmjPUktScoZak5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaM9SS1JyhlqTmDLUkNWeoJak5Qy1JzRlqSWrOUEtSc1uGOsktSY4meWgRA0mSftqQPerPAVeOPIck6SS2DHVVfRP44QJmkSTNsHteK0qyDqwDrKysvOz1rO69Z14jqTF/ztJwc3szsar2VdVaVa1NJpN5rVaSdjzP+pCk5gy1JDU35PS824G/By5MciTJB8cfS5J0zJZvJlbV9YsYRJI0m4c+JKk5Qy1JzRlqSWrOUEtSc4Zakpoz1JLUnKGWpOYMtSQ1Z6glqTlDLUnNGWpJas5QS1JzhlqSmjPUktScoZak5gy1JDVnqCWpOUMtSc0ZaklqzlBLUnOGWpKaGxTqJFcmeSzJt5PsHXsoSdKLtgx1kl3Ap4CrgLcA1yd5y9iDSZI2DdmjvgT4dlU9XlXPAV8Arh13LEnSMbsHLHM+8MRx948Av37iQknWgfXp3WeTPPYyZzoPeOpl/rOvVG7zNpc/31nbO7Xjtvk0f86/fLInhoQ6Mx6rlzxQtQ/YdwpDzX6x5EBVrZ3uel5J3Obtb6dtL7jN8zTk0McR4ILj7u8BvjfvQSRJsw0J9T8Bb07yxiRnAtcBXx53LEnSMVse+qiq55P8IfB3wC7glqp6eMSZTvvwySuQ27z97bTtBbd5blL1ksPNkqRGvDJRkpoz1JLUXJtQ77TL1JNckOQbSQ4leTjJDcueaVGS7EryQJK7lz3LIiQ5N8kdSR6d/rzfueyZxpbkj6e/1w8luT3Jq5c907wluSXJ0SQPHffYzyf5apJ/m37/uXm8VotQ79DL1J8HPlJVvwa8A/jQDtjmY24ADi17iAX6JHBvVf0q8Da2+bYnOR/4I2Ctqt7K5kkI1y13qlF8DrjyhMf2Al+rqjcDX5veP20tQs0OvEy9qp6sqvunt59h8w/v+cudanxJ9gBXAzcve5ZFSPI64DLgMwBV9VxV/WipQy3GbuA1SXYDZ7ENr72oqm8CPzzh4WuBW6e3bwXeM4/X6hLqWZepb/toHZNkFbgI2L/kURbhE8BHgf9b8hyL8iZgA/js9HDPzUnOXvZQY6qq7wJ/AXwHeBL4r6r6ynKnWpjXV9WTsLkzBvziPFbaJdSDLlPfjpK8FvgScGNVPb3secaU5N3A0ao6uOxZFmg3cDHw6aq6CPgxc/rrcFfT47LXAm8Efgk4O8l7lzvVK1uXUO/Iy9STnMFmpG+rqjuXPc8CXApck+Qwm4e3Lk/y+eWONLojwJGqOva3pTvYDPd29i7gP6pqo6r+F7gT+I0lz7QoP0jyBoDp96PzWGmXUO+4y9SThM3jloeq6uPLnmcRqupjVbWnqlbZ/Bl/vaq29Z5WVX0feCLJhdOHrgAeWeJIi/Ad4B1Jzpr+nl/BNn8D9ThfBt4/vf1+4K/nsdIh//e80S3hMvUOLgX
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.hist(sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Analyzing Real Data\n",
"\n",
"Mean and variance are very important when analyzing real-world data. Let's load the data about baseball players from [SOCR MLB Height/Weight Data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 4,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Team</th>\n",
" <th>Role</th>\n",
" <th>Height</th>\n",
" <th>Weight</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adam_Donachie</td>\n",
" <td>BAL</td>\n",
" <td>Catcher</td>\n",
" <td>74</td>\n",
" <td>180.0</td>\n",
" <td>22.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul_Bako</td>\n",
" <td>BAL</td>\n",
" <td>Catcher</td>\n",
" <td>74</td>\n",
" <td>215.0</td>\n",
" <td>34.69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Ramon_Hernandez</td>\n",
" <td>BAL</td>\n",
" <td>Catcher</td>\n",
" <td>72</td>\n",
" <td>210.0</td>\n",
" <td>30.78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Kevin_Millar</td>\n",
" <td>BAL</td>\n",
" <td>First_Baseman</td>\n",
" <td>72</td>\n",
" <td>210.0</td>\n",
" <td>35.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Chris_Gomez</td>\n",
" <td>BAL</td>\n",
" <td>First_Baseman</td>\n",
" <td>73</td>\n",
" <td>188.0</td>\n",
" <td>35.71</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1029</th>\n",
" <td>Brad_Thompson</td>\n",
" <td>STL</td>\n",
" <td>Relief_Pitcher</td>\n",
" <td>73</td>\n",
" <td>190.0</td>\n",
" <td>25.08</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1030</th>\n",
" <td>Tyler_Johnson</td>\n",
" <td>STL</td>\n",
" <td>Relief_Pitcher</td>\n",
" <td>74</td>\n",
" <td>180.0</td>\n",
" <td>25.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1031</th>\n",
" <td>Chris_Narveson</td>\n",
" <td>STL</td>\n",
" <td>Relief_Pitcher</td>\n",
" <td>75</td>\n",
" <td>205.0</td>\n",
" <td>25.19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1032</th>\n",
" <td>Randy_Keisler</td>\n",
" <td>STL</td>\n",
" <td>Relief_Pitcher</td>\n",
" <td>75</td>\n",
" <td>190.0</td>\n",
" <td>31.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1033</th>\n",
" <td>Josh_Kinney</td>\n",
" <td>STL</td>\n",
" <td>Relief_Pitcher</td>\n",
" <td>73</td>\n",
" <td>195.0</td>\n",
" <td>27.92</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1034 rows × 6 columns</p>\n",
"</div>"
3 years ago
],
"text/plain": [
" Name Team Role Height Weight Age\n",
"0 Adam_Donachie BAL Catcher 74 180.0 22.99\n",
"1 Paul_Bako BAL Catcher 74 215.0 34.69\n",
"2 Ramon_Hernandez BAL Catcher 72 210.0 30.78\n",
"3 Kevin_Millar BAL First_Baseman 72 210.0 35.43\n",
"4 Chris_Gomez BAL First_Baseman 73 188.0 35.71\n",
"... ... ... ... ... ... ...\n",
"1029 Brad_Thompson STL Relief_Pitcher 73 190.0 25.08\n",
"1030 Tyler_Johnson STL Relief_Pitcher 74 180.0 25.73\n",
"1031 Chris_Narveson STL Relief_Pitcher 75 205.0 25.19\n",
"1032 Randy_Keisler STL Relief_Pitcher 75 190.0 31.01\n",
"1033 Josh_Kinney STL Relief_Pitcher 73 195.0 27.92\n",
"\n",
"[1034 rows x 6 columns]"
]
},
"execution_count": 4,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"df = pd.read_csv(\"../../data/SOCR_MLB.tsv\",sep='\\t', header=None, names=['Name','Team','Role','Height','Weight','Age'])\n",
3 years ago
"df"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"> We are using a package called [**Pandas**](https://pandas.pydata.org/) here for data analysis. We will talk more about Pandas and working with data in Python later in this course.\n",
3 years ago
"\n",
"Let's compute average values for age, height and weight:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 5,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Age 28.736712\n",
"Height 73.697292\n",
"Weight 201.689255\n",
"dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"df[['Age','Height','Weight']].mean()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Now let's focus on height, and compute standard deviation and variance: "
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 6,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"[74, 74, 72, 72, 73, 69, 69, 71, 76, 71, 73, 73, 74, 74, 69, 70, 72, 73, 75, 78]\n"
]
}
],
3 years ago
"source": [
"print(list(df['Height'])[:20])"
]
},
{
"cell_type": "code",
"execution_count": 7,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Mean = 73.6972920696325\n",
"Variance = 5.316798081118074\n",
"Standard Deviation = 2.3058183105175645\n"
]
}
],
3 years ago
"source": [
"mean = df['Height'].mean()\n",
"var = df['Height'].var()\n",
"std = df['Height'].std()\n",
"print(f\"Mean = {mean}\\nVariance = {var}\\nStandard Deviation = {std}\")"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"In addition to mean, it makes sense to look at the median value and quartiles. They can be visualized using a **box plot**:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 8,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAACICAYAAAD6bB0zAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAATqUlEQVR4nO3dbWxW533H8d8/CYaV5cEJzcJmmNehhhSiZCXZMmcP1bIX3Rale9Fpi7aqzTImtslSK3Whq6U+vCjq1iXVxIuhpe0aVZOlNDIMWauVRSaIBZXxUCfQASpsEKCMAGEucopN5WsvfENunNsP55f4XOfE3490y8kdsP7+5hyfy5fvh0gpCQAAAMCE63IPAAAAAFQJC2QAAACgCQtkAAAAoAkLZAAAAKAJC2QAAACgyQ1z8UmXLFmSOjs75+JTAwAAAO+IvXv3nkspvXfy/XOyQO7s7NSePXvm4lPX2vnz53XbbbflHqNWaOahm4duHrp56Oahm4durUXE8Vb38xCLEu3fvz/3CLVDMw/dPHTz0M1DNw/dPHQrJubijULuu+++xA7yW42NjamtrS33GLVCMw/dPHTz0M1DNw/dPHRrLSL2ppTum3w/O8glev7553OPUDs089DNQzcP3Tx089DNQ7di2EEGAADAvMQOcgX09fXlHqF2aOahm4duHrp56Oahm4duxbCDDAAAgHmJHeQK4Ke34mjmoZuHbh66eejmoZuHbsWwgwwAAIB5iR3kChgYGMg9Qu3QzEM3D908dPPQzUM3D92KYQe5RCMjI1q8eHHuMWqFZh66eejmoZuHbh66eejWGjvIFTA0NJR7hNqhmYduHrp56Oahm4duHroVwwK5RCtWrMg9Qu3QzEM3D908dPPQzUM3D92KYYFcotOnT+ceoXZo5qGbh24eunno5qGbh27FsEAu0Y033ph7hNqhmYduHrp56Oahm4duHroVwwIZAAAAaMICuUQXL17MPULt0MxDNw/dPHTz0M1DNw/dimGBXKKlS5fmHqF2aOahm4duHrp56Oahm4duxbBALtGRI0dyj1A7NPPQzUM3D908dPPQzUO3YnijkBLxIt3F0cxDNw/dPHTz0M1DNw/dWuONQipgx44duUeoHZp56Oahm4duHrp56OahWzHsIAMAAGBeYge5Avr6+nKPUDs089DNQzcP3Tx089DNQ7di2EEGAADAvMQOcgXw01txNPPQzUM3D908dPPQzUO3YthBBgAAwLzEDnIF9Pf35x6hdmjmoZuHbh66eejmoZuHbsWwg1yisbExtbW15R6jVmjmoZuHbh66eejmoZuHbq2xg1wBO3fuzD1C7dDMQzcP3Tx089DNQzcP3YphgVyiu+++O/cItUMzD908dPPQzUM3D908dCuGBXKJjh07lnuE2qGZh24eunno5qGbh24euhXDArlES5YsyT1C7dDMQzcP3Tx089DNQzcP3YphgVyiS5cu5R6hdmjmoZuHbh66eejmoZuHbsWwQC7R5cuXc49QOzTz0M1DNw/dPHTz0M1Dt2JYIJeovb099wi1QzMP3Tx089DNQzcP3Tx0K4YFcolOnjyZe4TaoZmHbh66eejmoZuHbh66FcMCuUQrV67MPULt0MxDNw/dPHTz0M1DNw/dimGBXKLdu3fnHqF2aOahm4duHrp56Oahm4duxfBW0yUaHx/XddfxM0kRNPPQzUM3D908dPPQzUO31nir6QrYunVr7hFqh2Yeunno5qGbh24eunnoVgw7yAAAAJiX2EGugM2bN+ceoXZo5qGbh24eunno5qGbh27FsIMMAACAeYkd5ArYsmVL7hFqh2Yeunno5qGbh24eunnoVgw7yCXiGaTF0cxz66236sKFC7nHqJ30+ZsUX/xR7jFaam9v1+uvv557jJY4Tz1089DNQ7fW2EGugMHBwdwj1A7NPBcuXFBKiVvBm6TsM0x1q/IPPJynHrp56OahWzEskEt0//335x6hdmgGVB/nqYduHrp56FYMC+QSHTp0KPcItUMzoPo4Tz1089DNQ7diWCCX6IEHHsg9Qu10dHTkHgHADDhPPVXuFhG5R5hSlbtVGd2KmXGBHBHfiIjXIuJAGQO5uru7tWjRIkWEFi1apO7u7twj4R1Q5cddotrOvnFWnxj4hM79+FzuUd71OE89dCtm+fLligh1dHQoIrR8+fLcI11V5TXIldk6OjoqNVtvb69Wr16t66+/XqtXr1Zvb2/uka4xmx3kb0r68BzP8bZ0d3dr06ZN2rBhg0ZGRrRhwwZt2rSpMgcBfAsWLMg9Ampq0yubtO/MPm16eVPuUd71OE89dJu95cuX68SJE+rq6tL27dvV1dWlEydOVGKRXOU1SPNs+/btq8xsvb296unp0caNG3Xp0iVt3LhRPT091Vokz/KZ3Z2SDsz22dZr1qxJZVq4cGF68sknr7nvySefTAsXLix1jplM5EYRx44dyz1CLc33Y+21kdfSmm+tSau/uTqt+daadPaNs7P7i5+/aW4Hexuq/P+U89RT5W5VO94kpa6urpTSm926uroqMWeV1yDNs13pVoXZVq1alQYHB6+5b3BwMK1atar0WSTtSS3Wsu/YY5Aj4s8jYk9E7Dl16pSOHz+uw4cP68CBAzp16pR27dql4eFhvfDCCxofH7/6gtVX3vpwy5YtGh8f1wsvvKDh4WHt2rVLp06d0oEDB3T48GEdP35ce/fu1fnz5/Xiiy9qbGxM/f39kqTR0VGtW7dOfX19kqSBgQF97GMf0+joqM6cOaOhoSEdPXpUR48e1dDQkM6cOaOXXnpJIyMjGhgYkKSrf/fKx/7+fo2NjenFF1/U+fPntXfv3rf9NTU6cStw6+zszD5DHW+S7PNp8rkwMDCgkZERvfTSS5U6n6b7mj73nc9pPI1PdEjjemLzE7P6miRV9muq8vcPztN3XzdJlfoeIUmf/exnNTw8rB07dmh8fFyPPfbYO7aOeDtf0+Q1SF9fn9atW6fR0dFSv0e0+ppGR0d1xx13SJK2b9+ukZER3XPPPRodHc36vfzgwYMaHR295mu65ZZbdPDgwdKvT1NqtWqefBM7yO8IVeAn3bo5d+5c7hFqaT4fa827x1dus95FZgfZwnnqqXK3qh1vatpBvtKNHeSZNc92pVsVZptXO8g5rV27VuvXr9dTTz2lN954Q0899ZTWr1+vtWvX5h4Nb9P+/ftzj4Ca2fTKpqu7x1eMp3EeizyHOE89dJu9ZcuWaefOnXrwwQe1bds2Pfjgg9q5c6eWLVuWe7RKr0GaZ9u9e3dlZuvp6dHjjz+ubdu26fLly9q2bZsef/xx9fT0ZJ2r2azeajoiOiX1p5RWz+aT5nir6e7ubj399NMaHR3VwoULtXbtWm3cuLHUGWYSEZpNb7xpbGxMbW1tuceonfl8rH1060d1+MLht9x/Z/udeu6R56b/y1+4WfrC8BxN9vZU+f8p56mnyt2qeLxdeaLeFcuWLdOrr76acaI3VXkNUtXZent79aUvfUkHDx7UXXfdpZ6eHj366KOlzxFTvNX0jAvkiOiV9CFJSySdkfT5lNLXp/s7ORbIdVDFbzhV19/fr4cffjj3GLXDsWZigWzhPPVUuRvH27sP3VqzF8gOFshAXlW+uFUaC2QAmFemWiC/Kx6DXBczPmMSb0EzoPo4Tz1089DNQ7di2EEG3oXYbTSxgwwA8wo7yBXAT2/F0cyX+3VU63ircrf29vbMR9TUOE89dPPQzUO3YthBBgAAwLzEDnIFXHkXF8wezTx089DNQzcP3Tx089CtGHaQSzQyMqLFixfnHqNWaOahm4duHrp56Oahm4durbGDXAFDQ0O5R6gdmnno5qGbh24eunno5qFbMSyQS7RixYrcI9QOzTx089DNQzcP3Tx089CtGBbIJTp9+nTuEWqHZh66eejmoZuHbh66eehWDAvkEt144425R6gdmnno5qGbh24eunno5qFbMSyQAQAAgCYskEt08eLF3CPUDs08dPPQzUM3D908dPPQrRgWyCVaunRp7hFqh2Yeunno5qGbh24eunnoVgwL5BIdOXIk9wi
"text/plain": [
"<Figure size 720x144 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,2))\n",
"plt.boxplot(df['Height'], vert=False, showmeans=True)\n",
"plt.grid(color='gray', linestyle='dotted')\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"We can also make box plots of subsets of our dataset, for example, grouped by player role."
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 9,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAI4CAYAAAB3OR9vAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAABJy0lEQVR4nO3de5ycZX3//9eHBBIgHBQxCioRD7ghCprgMWrWeKIWtNZW9qsWdQvF+lvFtDbI+hWs3dZ4oNXUiuCqqHVREFSknIRdNCryBeQc0crBEyp4AAICJn5+f9z3kjubze4szO49s/N6Ph77yM49M/d85srcO++55rqvKzITSZIkSYXt6i5AkiRJaiUGZEmSJKnCgCxJkiRVGJAlSZKkCgOyJEmSVGFAliRJkioMyJJmvYjIiHhi3XXUKSJWRMTPJri+9jaKiDdGxLo6a5AkMCBLmkERcXNE/CEiNkTE7yLi7Ih4bN11jTKgPXRlG24q/4/vjIirIuLP665LkqbCgCxpph2SmQuARwO/AtbWXM+0iYi5dddQk++W/8e7A/8FnBoRu9dakSRNgQFZUi0y817gdGDx6LaI2C0iPhsRt0XELRHx7ojYLiIeHhE/i4hDytstiIj/jYi/KS9/JiJOjIgLIuKuiLg4IvYZ73EneIwu4ETgOWXv5++3cf/HR8Q3y8f5RkR8LCI+X163qByq0BsRPwEuKvf97vKxfl0+9m7l7bca9lD2sr+4/P34iDg9Ir5YPt4VEXFA5bZ7RcSXy+dyU0S8rXLdjmW7/C4irgcOauC/5c8i4saIuD0iPljWPi8ifhsRT63s+5HlNwF7TrSzzPwT8DlgZ+BJE7X/Ntr6KeX/6W8j4oaI+OsGnoMkPWQGZEm1iIidgNcCl1Q2rwV2A/YFXgj8DfCmzPwt8Gbg5Ih4JPDvwJWZ+dnKfV8HvA94BHAl8N/beOhtPcZ64CjK3s/M3H0b9/8CcCmwB3A88IZxbvNCoAt4GfDG8qe7fMwFwH9uY9/jeSVwGvDw8rG/EhHbl6HyLOAqYG9gJXB0RLysvN9xwBPKn5cBhzfwWH8BLAOeUT7umzPzPuBU4PWV2/UA38jM2ybaWUTMAd4E/BG4pdw8bvuPc9+dgQvK5/zI8jH/KyL2b+B5SNJDk5n++OOPPzPyA9wMbAB+D2wEfgE8tbxuDnAfsLhy+78DRiqX1wLXlPfbo7L9M8CplcsLgE3AY8vLCTxxssegCLLrJqj/cWXdO1W2fR74fPn7ovKx9q1cfyHw95XL+1EExrnACuBn47TRi8vfjwcuqVy3HXAr8HzgWcBPxtz3XcCny99vBF5eue7IsY815r455vZ/D1xY/v4s4KfAduXly4C/3sZ+3li20e/L5/mH0dtOpf0pPjx9a8y+PwEcV/fr2B9//Jn9P/YgS5ppr8qid3Ye8P8BF0fEoyh6fndgc08j5e97Vy6fBCyhCIG/GbPfn47+kpkbgN8Ce425TSOPMZG9gN9m5j3jPe42tu01zuPNBRY2+JjV5/Un4GflPvcB9oqI34/+AMdW9rvXmDqqNUz6WOXt9yof93vA3cALI+IpFB82vjbBfi4p/48fVt7u+eX2qbT/PsCzxjy/1wGPauB5SNJDYkCWVIvM3JSZZ1D09C4HbqfocayOHX4c8HN44Ov6TwCfBd4yzpRkD8yGERELKIYk/GLMbSZ8DIpe1IncCjy8HB6y1eNWn17l91+M83gbKU5QvBt4YF/lcxw7rrf6vLYDHlPu86fATZm5e+Vnl8z8s0qt1doeN8lzG/tcHseW7XcKxTCLNwCnZzGGfELlB5W/B94QEU9n8vav+ilw8ZjntyAz39LA85Ckh8SALKkWUXglRS/j+szcBHwJGIiIXcqT7FZRDGGAoncUirHIHwI+WwbKUX8WEcsjYgeKscjfy8wtencbeIxfAY8p97GVzLyFYnjB8RGxQ0Q8Bzhkkqc6BLyjPLlvAfCvwBczcyPwQ2B+RLwiIrYH3k3Rs161NCJeHcWMGEdTDFG4hGIc9J0Rsbo8IW9ORCyJiNGT8b4EvCsiHhYRjwH6JqkT4J3l7R8LvB34YuW6z1GMUX49xYeUhpQ9/Z8E3tNA+1d9HXhyRLyhHHO9fUQcVJ5MKUnTyoAsaaadFREbgDuBAeDwzLyuvK6Polf1RmAdxQlan4qIpRRB6m/KkLWGopf2mMp+v0BxYtpvgaUUX8ePZ9zHKK+7CLgO+GVE3L6N+78OeA7wG+BfKELkfRM8309RhMtvAjcB95Y1kJl3UPSwfpKiF/VuiiEUVV+lGI/7O4re21dn5h/LdjgEOLDc7+3lfnYr7/deiuELNwHnlzVM5qvA5RQnOZ4NDI5ekZk/A66gaPdvNbCvqv+g+ADzNCZu/wdk5l3AS4HDKHqyf0nx/z72A4QkNV1kTvaNoiS1toj4DMUJaO+u4bG/CPwgM4+bhn0fDzwxM18/2W1nQkR8CvhFHe0sSTOpUyexl6QHpRzC8FuKntmXUkyH9v5ai5oBEbEIeDXw9JpLkaRp5xALSZqaRwEjFNPVfRR4S2Z+v9aKpllEvA+4FvhgZt5Udz2SNN0cYiFJkiRV2IMsSZIkVRiQJUmSpAoDsiRJklRhQJYkSZIqDMiSJElShQFZkiRJqjAgS5IkSRUGZEmSJKnCgCxJkiRVGJAlqQ1ExLER8ckGb3t8RHx+umuSpNnKgCxJMyQibo6IF4/Z9saIWDfZfTPzXzPzb6erDknSZgZkSZIkqcKALEktIiL2iogvR8RtEXFTRLytct0WwyYi4m8i4paI+E1E/N9xeoV3iIjPRsRdEXFdRCwr7/c54HHAWRGxISL+acaeoCS1CQOyJLWAiNgOOAu4CtgbWAkcHREvG+e2i4H/Al4HPBrYrbxP1aHAqcDuwNeA/wTIzDcAPwEOycwFmfmB6Xg+ktTODMiSNLO+EhG/H/2hCLoABwF7ZuY/Z+b9mXkjcDJw2Dj7eA1wVmauy8z7gfcAOeY26zLzfzJzE/A54IBpeTaSNAvNrbsASeowr8rMb4xeiIg3An8L7APsVYbmUXOAb42zj72An45eyMx7IuI3Y27zy8rv9wDzI2JuZm58aOVL0uxnQJak1vBT4KbMfFIDt70V2G/0QkTsCOwxhcca29ssSapwiIUktYZLgTsjYnVE7BgRcyJiSUQcNM5tTwcOiYjnRsQOwHuBmMJj/QrYtwk1S9KsZECWpBZQjhU+BDgQuAm4HfgkxQl4Y297HdBHcRLercBdwK+B+xp8uH8D3l2Og/7Hh1y8JM0ykek3bZLUziJiAfB74EmZeVPN5UhS27MHWZLaUEQcEhE7RcTOwIeAa4Cb661KkmYHA7IktadXAr8of54EHJZ+JShJTeEQC0mSJKnCHmRJkiSpwoAsSZIkVczoQiGPeMQjctGiRTP5kA25++672Xnnnesuoy3YVo2zrabG9mqcbTU1tlfjbKvG2VZT06rtdfnll9+emXuO3T6jAXnRokVcdtllM/mQDRkZGWHFihV1l9EWbKvG2VZTY3s1zraaGturcbZV42yrqWnV9oqIW8bb7hALSZIkqcKALEmSJFUYkCVJkqQKA7IkSZJUYUCWJEmSKgzIkiRJUoUBWZIkSaowIEuSJEkVBmRJkiSpwoAsSZIkVRiQJUmSpAoDsiRJklRhQJYkSZIqDMiSJElShQFZktrE0NAQS5YsYeXKlSxZsoShoaG6S5KkWWlu3QVIkiY3NDREf38/g4ODbNq0iTlz5tDb2wtAT09PzdVJ0uxiD7IktYGBgQEGBwfp7u5m7ty5dHd3Mzg4yMDAQN2lSdKsY0CWpDawfv16li9fvsW25cuXs379+poqkqTZy4AsSW2gq6uLdevWbbFt3bp1dHV11VSRJM1eBmRJagP9/f309vYyPDzMxo0bGR4epre3l/7+/rpLk6RZx5P0JKkNjJ6I19fXx/r16+nq6mJgYMAT9CRpGhiQJalN9PT00NPTw8jICCtWrKi7HEmatRxiIUmSJFUYkCVJkqQKA7IkSZJUYUCWJEmSKgzIkiRJUoUBWZIkSaowIEuSJEkVBmRJkiS
"text/plain": [
"<Figure size 720x576 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"df.boxplot(column='Height', by='Role', figsize=(10,8))\n",
3 years ago
"plt.xticks(rotation='vertical')\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"> **Note**: This diagram suggests, that on average, the heights of first basemen are higher than heights of second basemen. Later we will learn how we can test this hypothesis more formally, and how to demonstrate that our data is statistically significant to show that. \n",
3 years ago
"\n",
"Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values: "
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 10,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGqCAYAAAAWf7K6AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAn10lEQVR4nO3de5hlZXnn/e9PUDS2AgatIJK0GkwE+g0TShIPMdWaUSNMMPOqwWEURmNHYw7GTt40mqjRkCEmaCZjoukEXjEqLSMeiJAoMTaoI2rDoA2iItIoBxsFBFoJSeM9f6xV8lDUqYu9a9fh+7muumrvZ6291r3vrq761VPPXjtVhSRJkqTO/UZdgCRJkrSUGJAlSZKkhgFZkiRJahiQJUmSpIYBWZIkSWoYkCVJkqSGAVnSgiU5PslH57nviUk+OeDz70jyC/3tVyf5uwEee1eSx/S335Hkjwd47Lcn+cNBHW8PzvvyJDv75/bDi33+PZWkkvz4qOuQtPoYkKVVJslJSc6bMnblDGPHzXasqnp3VT1jQHVtTfKrC318Vf1JVc35+Pmep6rWVNXXFlpPc757/WJQVS+rqjfe12PvYR33B94MPKN/bjdN2b62D6SXTBk/IMm/JdnRjP3gF5Mp+04k+X4fwHcluS7JH81S0+Q5J/ffkWTTfX6yknQfGZCl1edC4MlJ9gJI8iPA/YGfnjL24/2+q0qSvUddw5CMAQ8ELp9jvwcnOby5/1+Aq/fgPNf3AXwN8BTgJUmeM8dj9uv3fwHw2iTP2oPzDdTk/wFJq5sBWVp9PkcXiI/o7z8V+Djw5SljV1XV9Un2TXJakhv6GcE/boL0PWZHkzwjyZeT3Jrkr5NcMHW2NsmfJ7klydVJfrEfOxn4OeCt/UziW6crPMkLk1yT5KYkr5my7fVJ3tXffmCSd/X7fSfJ55KMzXSefhbzFUmuBK5sxto/7x+Q5Pwkt/fP68f6/SZnQfduatma5FeTPB54O/DE/nzf6bffY8lGkpcm+WqSm5Ock+SRzbZK8rJ+Rv+WJH+VJDP0Z58kf5Hk+v7jL/qxx/X/vgDfSfIv0z2+9/fACc39FwHvnGX/GVXV1cD/Bg6d5/6fpgvwh0/dluToJP8nyW1JvpHk9c22c5P85pT9vzAZzJP8ZP9vd3P/9fn8Zr93JHlbkvOSfBdYn+TZSb7Y/1tfl+R3F/D0JS1jBmRplamqfwM+QxeC6T9/AvjklLHJ2eMzgN10M8r/AXgGcK8lCkkOAN4HnAT8MF0ge9KU3X6mHz8AeBNwWpJU1Wv6Gn6jn338jWmOfyjwNuCFwCP7czxqhqd5ArAvcHC/38uAO+Y4z3P6+mYKc8cDb+xrvxR49wz7/UBVXdGf+9P9+fab5nk9DfjvwPOBA4FrgC1TdjsGeALwU/1+z5zhlK8BfpbuF52fAo4C/qCqvgIc1u+zX1U9bZay3wUcl2SvPuA/hO7rZY8lOQR4MnDRPPZNkif3df6faXb5Ll1Y3w84Gnh5MzN9BvBfm2P9FHAQcF6SBwPnA+8BHkE3S/3XSQ7jbv8FOJnuuX4SOA34tap6CF1Yn+0XCkkrkAFZWp0u4O4w/HN0ofETU8YuSDIG/CLwyqr6blXdCLwFmG5t8rOBy6vq/VW1G/hL4JtT9rmmqv62qu6iCzUH0v3pfz6eC3y4qi6sqjuBPwS+P8O+/04XjH+8qu6qqour6rY5jv/fq+rmqrpjhu3nNud+Dd2s8MHzrH02xwOnV9Ul/bFP6o+9ttnnlKr6TlV9nW62/4hZjvWGqrqxqr4F/BHdLxR74lq6X2J+ge4XjT2dPX5kP2t/G/AVunA914szvw3cDPwdsKmqPjZ1h6raWlXbq+r7VfUF4Ezg5/vNHwIO6QM5dM/5vf0vg8cAO6rq/6+q3VV1CXA23dfTpA9V1af6Y/8r3dfPoUkeWlW39I+RtIoYkKXV6ULgKUn2Bx5eVVfS/Sn8Sf3Y4f0+P0a3HOOGPvR8B/gbupm4qR4JfGPyTlUVXdhqfbPZ/r3+5pp51jz1+N8Fbpph378HPgJs6ZcavCndi9Rm8435bq+qXXSB7pEz7z5vj6SbNW6PfRPdDOik9heN7zFzz+5xrP72Qmp8J3Ai3Wzru/bwsddX1X5V9VC62d476H4Zms0BVbV/VT2+qv5yuh2S/EySjyf5VpJb6WbmDwDof7E4C/ivSe7X1/33/UN/DPiZya/f/mv4eOBHmsNP/bf/f+l+4bumX07zxPk9dUkrhQFZWp0+TbcEYQPwKYB+hvX6fuz6fv3oN4A76QLMfpPBp6oOm+aYN9AseejXyc60BGI6Ncf2G+iWTEwe/4foZonvfaCqf6+qP6qqQ+mWeRxD9+f52c4z1/nbc68BHkbXr+/2wz/U7NuGr7mOez1diJs89oPpntd1czxuzmMBP9qP7amz6ZYxfK2qrplr55lU1a10Sxv+00KP0XgPcA5wcFXtS7e2u12LfQZd8H068L1+PTN0X8MXNF+/+/XLXV7eljql7s9V1bF0vwh+kC58S1pFDMjSKtQvI9gGvIpuacWkT/ZjF/b73QB8FDg1yUOT3C/JY5P8/NRjAucC65I8p3/B2iu4Z1Ccy07gMbNsfx9wTJKnJHkA8AZm+B6WZH2SdeleTHgb3Z/M75rneWby7ObcbwQ+U1Xf6JcyXEc3e7lXkhcDj53yvB7VP2467wH+W5IjkuwD/El/7B0LqPFM4A+SPLxfE/5a9nwGeHJ2/mlMs9a8cf90L4ac/LjX1T/6XySOY+4rZ8zHQ4Cbq+pfkxxFt264rfnTdEtuTuXu2WOADwOPS/cCz/v3H0/o11ffS5IHpLu+975V9e90Xz93TbevpJXLgCytXhfQzZC160M/0Y+1l3d7EfAA4IvALXRB9cCpB6uqbwPPo3vx3U10L3bbRjcDPR//A3huf6WGe/2Zvaoupwvd76GbTb6Fey/hmPQjfZ23AVfQPdfJoDjreWbxHuB1dEsrjqSbrZz0UuD36J73YXTLVSb9C11A/GaSb0/zvD5Gt5767P55PZbp13jPxx/T9fwLwHbgkn5sj1XVtqq6apZdzqNbPjH58fp+/JHpr2tMt8TjYdyzVwv168AbktxOF/ynm9V9J7CO5peCqrqd7oWlx9HNpn8T+FNgn1nO9UJgR7+O+mU0LwCUtDqkWyYoSYPVrwW9Fji+qj4+6nq08iV5EbChqp4y6lokLW/OIEsamCTPTLJfv1Tg1XRrROe8xJd0X/Vr0n8d2DzqWiQtfwZkSYP0ROAqust2/SfgObNcNk0aiCTPBL5Ft977PSMuR9IK4BILSZIkqeEMsiRJktQwIEuSJEkNA7IkSZLUMCBLkiRJDQOyJEmS1DAgS5IkSQ0DsiRJktQwIEuSJEkNA7IkSZLUMCBLkiRJDQOyJEmS1Nh71AXcFwcccECtXbt21GUsad/97nd58IMfPOoyVhR7Ohz2dfDs6XDY18Gzp8NhX+d28cUXf7uqHj51fFkH5LVr17Jt27ZRl7Gkbd26lYmJiVGXsaLY0+Gwr4NnT4fDvg6ePR0O+zq3JNdMN+4SC0mSJKlhQJYkSZIaBmRJkiSpMbSAnOTgJB9PckWSy5P8dj/+sCTnJ7my/7x/85iTknw1yZeTPHNYtUmSJEkzGeYM8m5gY1U9HvhZ4BVJDgU2AR+rqkOAj/X36bcdBxwGPAv46yR7DbE+SZIk6V6GFpCr6oaquqS/fTtwBXAQcCxwRr/bGcBz+tvHAluq6s6quhr4KnDUsOqTJEmSppOqGv5JkrXAhcDhwNerar9m2y1VtX+StwIXVdW7+vHTgH+sqvdNOdYGYAPA2NjYkVu2bBl6/cvZrl27WLNmzajLWFHs6XDY18Gzp8NhXwfPng6HfZ3b+vXrL66q8anjQ78OcpI1wNnAK6vqtiQz7jrN2L3Se1VtBjYDjI+Pl9f3m53XQBw8ezoc9nXw7Olw2NfBs6fDYV8XbqhXsUhyf7pw/O6qen8/vDPJgf32A4Eb+/FrgYObhz8KuH6Y9UmSJElTDfMqFgFOA66oqjc3m84BTuhvnwB8qBk/Lsk+SR4NHAJ
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"df['Weight'].hist(bins=15, figsize=(10,6))\n",
3 years ago
"plt.suptitle('Weight distribution of MLB Players')\n",
"plt.xlabel('Weight')\n",
"plt.ylabel('Count')\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Normal Distribution\n",
"\n",
"Let's create an artificial sample of weights that follows a normal distribution with the same mean and variance as our real data:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 11,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([73.46072234, 70.40678311, 70.23689776, 73.81190675, 72.41091792,\n",
" 76.00127651, 71.91641414, 77.18162239, 76.7173353 , 73.93996587,\n",
" 74.2862748 , 76.88034696, 72.15184905, 74.43537605, 76.37723417,\n",
" 65.66976051, 74.3200533 , 77.3235274 , 72.8840488 , 77.50300255])"
]
},
"execution_count": 11,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"generated = np.random.normal(mean, std, 1000)\n",
3 years ago
"generated[:20]"
]
},
{
"cell_type": "code",
"execution_count": 12,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAXh0lEQVR4nO3df4xlZ33f8c+33oQWQoqJB2SM3TXIoELVGrpy0yIQrUlwcIQhEtRWSt0EdUEyUmj5gwWkQiNZchoc/kkDWmIXJyLGDsbBkkmL60ahSOXHGhxjYzu2YYG1t+sNTgMpiHTNt3/M2eTxcmdnPPfeGa/39ZJG997nnjPz+Nmrs2+fPXNvdXcAAIBVf2u7JwAAAE8kAhkAAAYCGQAABgIZAAAGAhkAAAY7tnsCSXLaaaf1zp07t3saAACcRG677bY/6+6VY8efEIG8c+fO7Nu3b7unAQDASaSqvjFr3CUWAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwWDeQq+rMqvqjqrq7qu6qql+Zxp9ZVbdU1X3T7anDPu+qqvur6t6qevUy/wMAAGCRNnIG+UiSd3T330/y00kuq6oXJdmT5NbuPifJrdPjTM9dnOTFSS5I8ltVdcoyJg8AAIu2biB398Hu/tJ0/7tJ7k5yRpKLklwzbXZNktdN9y9K8rHu/kF3fz3J/UnOW/C8AQBgKR7XNchVtTPJS5J8Psmzu/tgshrRSZ41bXZGkm8Nux2Yxo79Xrural9V7Tt8+PAmpg4AAIu34UCuqp9IckOSt3f3d4636Yyx/pGB7r3dvau7d62srGx0GgAAsFQbCuSq+rGsxvFHu/sT0/Chqjp9ev70JA9P4weSnDns/twkDy1mugAAsFwbeReLSnJVkru7+zeGp25Kcul0/9IknxzGL66qp1TV2UnOSfKFxU0ZAACWZ8cGtnlZkjcl+UpV3T6NvTvJFUmur6o3J/lmkjckSXffVVXXJ/lqVt8B47LufnTREwc41s49N2/3FNa1/4oLt3sKAKxj3UDu7s9m9nXFSXL+GvtcnuTyOeYFAADbwifpAQDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBAIAMAwEAgAwDAQCADAMBg3UCuqqur6uGqunMYu66qbp++9lfV7dP4zqr6/vDch5Y4dwAAWLgdG9jmI0l+M8nvHB3o7n959H5VXZnkL4btH+jucxc0PwAA2FLrBnJ3f6aqds56rqoqyRuT/IsFzwvgSWnnnpu3ewrr2n/Fhds9BYBtNe81yC9Pcqi77xvGzq6qL1fVH1fVy+f8/gAAsKU2conF8VyS5Nrh8cEkZ3X3t6vqHyf5g6p6cXd/59gdq2p3kt1JctZZZ805DQAAWIxNn0Guqh1JfiHJdUfHuvsH3f3t6f5tSR5I8oJZ+3f33u7e1d27VlZWNjsNAABYqHkusXhVknu6+8DRgapaqapTpvvPS3JOkq/NN0UAANg6G3mbt2uT/K8kL6yqA1X15umpi/PYyyuS5BVJ7qiqP0ny8SRv7e5HFjlhAABYpo28i8Ula4z/mxljNyS5Yf5pAQDA9vBJegAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADAQyAAAMBDIAAAwEMgAADBYN5Cr6uqqeriq7hzG3ldVD1bV7dPXa4bn3lVV91fVvVX16mVNHAAAlmHHBrb5SJLfTPI7x4x/oLvfPw5U1YuSXJzkxUmek+S/V9ULuvvRBcwV2GY799y83VMAgKVb9wxyd38mySMb/H4XJflYd/+gu7+e5P4k580xPwAA2FIbOYO8lrdV1b9Osi/JO7r7z5OckeRzwzYHprEfUVW7k+xOkrPOOmuOaQCwSCfCvxTsv+LC7Z4C8CS22V/S+2CS5yc5N8nBJFdO4zVj2571Dbp7b3fv6u5dKysrm5wGAAAs1qYCubsPdfej3f3DJB/O31xGcSDJmcOmz03y0HxTBACArbOpQK6q04eHr09y9B0ubkpycVU9parOTnJOki/MN0UAANg6616DXFXXJnllktOq6kCS9yZ5ZVWdm9XLJ/YneUuSdPddVXV9kq8mOZLkMu9gAQDAiWTdQO7uS2YMX3Wc7S9Pcvk8kwIAgO3ik/QAAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYLBuIFfV1VX1cFXdOYz9elXdU1V3VNWNVfWMaXxnVX2/qm6fvj60xLkDAMDCbeQM8keSXHDM2C1J/kF3/8Mkf5rkXcNzD3T3udPXWxczTQAA2BrrBnJ3fybJI8eMfbq7j0wPP5fkuUuYGwAAbLlFXIP8y0n+cHh8dlV9uar+uKpevtZOVbW7qvZV1b7Dhw8vYBoAADC/uQK5qt6T5EiSj05DB5Oc1d0vSfLvk/xeVf3krH27e2937+ruXSsrK/NMAwAAFmbTgVxVlyb5+SS/2N2dJN39g+7+9nT/tiQPJHnBIiYKAABbYVOBXFUXJHlnktd29/eG8ZWqOmW6/7wk5yT52iImCgAAW2HHehtU1bVJXpnktKo6kOS9WX3XiqckuaWqkuRz0ztWvCLJr1bVkSSPJnlrdz8y8xsDAMAT0LqB3N2XzBi+ao1tb0hyw7yTAgCA7eKT9AAAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgIJABAGAgkAEAYCCQAQBgsG4gV9XVVfVwVd05jD2zqm6pqvum21OH595VVfdX1b1V9eplTRwAAJZhI2eQP5LkgmPG9iS5tbvPSXLr9DhV9aIkFyd58bTPb1XVKQubLQAALNm6gdzdn0nyyDHDFyW5Zrp/TZLXDeMf6+4fdPfXk9yf5LzFTBUAAJZvs9cgP7u7DybJdPusafyMJN8atjswjQEAwAlh0b+kVzPGeuaGVbural9V7Tt8+PCCpwEAAJuz2UA+VFWnJ8l0+/A0fiDJmcN2z03y0Kxv0N17u3tXd+9aWVnZ5DQAAGCxNhvINyW5dLp/aZJPDuMXV9VTqursJOck+cJ8UwQAgK2zY70NquraJK9MclpVHUjy3iRXJLm+qt6c5JtJ3pAk3X1XVV2f5KtJjiS5rLsfXdLcAQBg4dYN5O6+ZI2nzl9j+8uTXD7PpAAAYLv4JD0AABgIZAAAGAhkAAAYCGQAABgIZAAAGAhkAAAYrPs2b8DW2Lnn5u2eAgAQZ5ABAOAxBDIAAAwEMgAADAQyAAAMBDIAAAy8iwUAJ5wn+ru+7L/iwu2eAjAHZ5ABAGAgkAEAYCC
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,6))\n",
"plt.hist(generated, bins=15)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 13,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAaH0lEQVR4nO3dfayk51kf4N+NExIaQEkU2zj+6LqqQTgpJOjIDYqE0jglLo7itKrRooK2rSv/4/AhUeE1kYpQtdKqSBSkQisrpDUixKyAyKuYNnEMUVopxFmHJMR20qwS115sYkNAQCsZ2bn7x5lVHu+es2fOnpkzH+e6JGtmnnln5t7X58z5zT3P+7zV3QEAADZ906ILAACAZSIgAwDAQEAGAICBgAwAAAMBGQAABgIyAAAMpgrIVfXKqvrtqvpCVT1WVd9fVa+uqgeq6kuTy1cN299VVaer6otV9fb5lQ8AALNV06yDXFX3JPmf3f3eqvrmJH8nyc8m+Vp3H6+qo0le1d13VtX1ST6Q5IYkr03y0STf2d0vbPf8r3nNa/rQoUN7/9cAAMCUHn744T/r7kvPHX/JTg+sqm9P8gNJ/mWSdPffJvnbqrolyVsmm92T5GNJ7kxyS5J7u/u5JF+pqtPZDMuf2O41Dh06lFOnTu3inwMAAHtTVf9nq/Fpplj8vSTPJvmvVfVHVfXeqnpFksu7++kkmVxeNtn+yiRPDo8/MxkDAIClN01AfkmS70vyn7v7jUn+b5KjF9i+thg7bx5HVd1eVaeq6tSzzz47VbEAADBv0wTkM0nOdPcnJ7d/O5uB+atVdUWSTC6fGba/enj8VUmeOvdJu/vu7t7o7o1LLz1v6gcAACzEjgG5u/80yZNV9V2ToRuTPJrkZJIjk7EjSe6bXD+Z5HBVvayqrk1yXZKHZlo1AADMyY4H6U38eJL3T1aw+HKSf5XNcH2iqm5L8kSSW5Okux+pqhPZDNHPJ7njQitYAADAMpkqIHf3Z5JsbHHXjdtsfyzJsYsvCwAAFsOZ9AAAYCAgAwDAQEAGAICBgAwAAAMBGQAABgIyAAAMBGQAABgIyAAAMBCQAQBgICADAMBAQAYAgIGADLCmDh29P4eO3j+z7QAOCgEZAAAGAjIAAAwEZAAAGAjIAAAwEJABAGAgIAMAwEBABgCAgYAMAAADARkAAAYCMgBJnFEP4CwBGQAABgIyAAAMBGQAABgIyAArxDxhgPkTkAEAYCAgAywBnWGA5SEgAwDAQEAGYFd0u4F1JyADAMBAQAYAgIGADDAHpiEArC4BGQAABgIyAAAMBGQAABgIyAAAMBCQAfaBg/YAVoeADLCCBG6A+XnJogsAWGc7hdhpQ+4iwvDZ13z8+M37/toAi6SDDLDCdtNJ1nUGmI4OMsCaE4oBdkcHGQAABjrIAHsw73m6ur8A+08HGQAABjrIAAugMwywvARkgCUkQAMsjikWAFyQ5eGAg0ZABmCuBGxg1ZhiAbBmhFGAvdFBBgCAgYAMwEUxdQJYVwIyADMhMAPrQkAGAICBgAwAAAOrWADwIqZJAAfdVB3kqnq8qv64qj5TVacmY6+uqgeq6kuTy1cN299VVaer6otV9fZ5FQ8AALO2mw7yP+ruPxtuH03yYHcfr6qjk9t3VtX1SQ4neV2S1yb5aFV9Z3e/MLOqAVgaOs7AutnLFItbkrxlcv2eJB9Lcudk/N7ufi7JV6rqdJIbknxiD68FwAUIqQCzM+1Bep3kI1X1cFXdPhm7vLufTpLJ5WWT8SuTPDk89sxkDAAAlt60HeQ3d/dTVXVZkgeq6gsX2La2GOvzNtoM2rcnyTXXXDNlGQDMyry6zmef9/HjN8/l+QHmbaqA3N1PTS6fqaoPZnPKxFer6orufrqqrkjyzGTzM0muHh5+VZKntnjOu5PcnSQbGxvnBWiAdWQqBMDy23GKRVW9oqq+7ez1JD+Y5PNJTiY5MtnsSJL7JtdPJjlcVS+rqmuTXJfkoVkXDgAA8zBNB/nyJB+sqrPb/2Z3/4+q+lSSE1V1W5InktyaJN39SFWdSPJokueT3GEFC4Dp6DADLN6OAbm7v5zke7cY//MkN27zmGNJju25OgAA2GfOpAcwQzrAAKtv2mXeAADgQNBBBpgBnWOA9SEgAzAX231osE4ysOxMsQAAgIGADAAAAwEZAAAGAjIAAAwEZAAAGAjIAAAwsMwbwBQsTWatZ+Dg0EEG2MKho/cLhAAHlA4ywEUQngHWlw4yAAAMBGQAABiYYgFwwJgeAnBhOsgAADAQkAEAYCAgAwDAQEAGAICBgAwAAAMBGQAABgIyAAAMrIMMcAHnrhlsDWGA9aeDDAAAAwEZAAAGAjIAAAwEZAAAGAjIAAAwEJABAGAgIAOwUIeO3m/5PGCpWAcZgH0hBAOrQgcZWHs6lADshoAMwFLxgQZYNAEZAAAGAjIAAAwEZAAAGAjIwIFknutq8f8L2E8CMgAADKyDDDDQpQRABxkAAAYCMgAADEyxAGAhTGcBlpWADKwtAQyAi2GKBQAADARkAAAYmGIBwFIwJQZYFjrIAAAwEJABAGAgIAMAwEBABgCAgYP0AOIAMQC+QUAGDgwhGIBpmGIBAAADARkAAAYCMgAADKaeg1xVlyQ5leRPuvsdVfXqJL+V5FCSx5P8cHf/xWTbu5LcluSFJD/R3R+ecd0ArDlzxoFF2U0H+SeTPDbcPprkwe6+LsmDk9upquuTHE7yuiQ3JfnVSbgGAIClN1VArqqrktyc5L3D8C1J7plcvyfJu4bxe7v7ue7+SpLTSW6YSbUAADBn03aQfynJzyT5+jB2eXc/nSSTy8sm41cmeXLY7sxk7EWq6vaqOlVVp5599tnd1g0AAHOxY0Cuqnckeaa7H57yOWuLsT5voPvu7t7o7o1LL710yqcGAID5muYgvTcneWdV/VCSlyf59qr6jSRfraoruvvpqroiyTOT7c8kuXp4/FVJnppl0QAAMC87dpC7+67uvqq7D2Xz4Lvf7+4fTXIyyZHJZkeS3De5fjLJ4ap6WVVdm+S6JA/NvHKAGTh09H6rJQDwIns51fTxJCeq6rYkTyS5NUm6+5GqOpHk0STPJ7mju1/Yc6UAALAPdhWQu/tjST42uf7nSW7cZrtjSY7tsTYAANh3zqQHwMowJQbYDwIyAAAMBGQAABgIyAAAMBCQAQBgICADa8MBXADMwl7WQQaAhTj3g9Djx29eUCXAOtJBBmDt+DYB2AsBGQAABgIyAAAMBGQAABg4SA9YeeaaAjBLOsgAADAQkAEAYCAgAwDAQEAGAICBgAwAAAMBGQAABpZ5A2BtWPIPmAUBGYCVJxgDs2SKBQAADARkAAAYCMgAADAQkAEAYCAgAwDAQEAGAICBgAzA2jp09H5LwAG7JiADAMDAiUKAlXFuJ/Dx4zcvqBIA1pmADKwdX6kDsBemWAAAwEBABgCAgYAMAAADARkAAAYCMgAADARkAA4MJw4BpmGZN2BlCToAzIOADCwtAZhZ8bME7IYpFgAAMBCQAQBgICADcGA5aA/YioAMAAADARkAAAYCMgAADARkAAAYCMgAADAQkAE48KxmAYwEZAAAGAjIAAAwEJABAGDwkkUXAAD7zXxj4EJ0kAEAYCAgAwDAQEAGloaltgBYBgIyAEz4kAYkAjKwj4QPAFbBjgG5ql5eVQ9V1Wer6pGq+vnJ+Kur6oGq+tLk8lXDY+6qqtNV9cWqevs8/wEAADBL03SQn0vy1u7+3iRvSHJTVb0pydEkD3b3dUkenNxOVV2f5HCS1yW5KcmvVtUlc6gdAObCtx1wsO0YkHvT30xuvnTyXye5Jck9k/F7krxrcv2WJPd293Pd/ZUkp5PcMMuiAQBgXqY6UcikA/xwkr+f5Fe6+5NVdXl3P50k3f10VV022fzKJH84PPzMZOzc57w9ye1Jcs0111z8vwBYOed25h4/fvOCKgGA800VkLv7hSRvqKpXJvlgVb3+ApvXVk+xxXPeneTuJNnY2DjvfuDg8FU2AMtkV6tYdPdfJvlYNucWf7WqrkiSyeUzk83OJLl6eNhVSZ7aa6E
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,6))\n",
"plt.hist(np.random.normal(0,1,50000), bins=300)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Since most values in real life are normally distributed, we should not use a uniform random number generator to generate sample data. Here is what happens if we try to generate weights with a uniform distribution (generated by `np.random.rand`):"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 14,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAATQElEQVR4nO3db6ykd3nf4e9db4FCFGHLx+7GNl1TbUgMapv0hKaNWkV10zoxst1WREakWgVLWyoSSNUorItUV4qQnCbqnxdNpS1xs2opxCKktorSYC35o7wAugaSYAy1G4y99sZekhSSRjI13H1xJs7tk13WPnPOzK73uiRrZn4zc+Z+8dPZj57zeJ7q7gAAAFv+zLoHAACA84lABgCAQSADAMAgkAEAYBDIAAAw7Fv3AEly+eWX94EDB9Y9BgAAF5H777//i929sX39vAjkAwcO5MSJE+seAwCAi0hVfeFM606xAACAQSADAMAgkAEAYBDIAAAwnDOQq+quqnqqqj491n6yqj5bVb9ZVb9QVa8cz91eVQ9X1eeq6u/t0dwAALAnns8R5J9NcsO2tfuSvK67/1KS/5Xk9iSpquuS3JrktYv3/HRVXbJr0wIAwB47ZyB3968l+b1tax/u7mcWDz+a5OrF/ZuTvL+7n+7uzyd5OMnrd3FeAADYU7txDvJbkvzi4v5VSR4bz51crAEAwAVhqUCuqncleSbJe/946Qwv67O893BVnaiqE6dPn15mDAAA2DU7DuSqOpTkDUne3N1/HMEnk1wzXnZ1kifO9P7uPtrdm929ubHxp67wBwAAa7GjQK6qG5K8M8lN3f1H46l7k9xaVS+tqmuTHEzy8eXHBACA1dh3rhdU1fuSfHeSy6vqZJI7svWtFS9Ncl9VJclHu/ut3f1AVd2d5DPZOvXibd391b0aHgAAdlv9ydkR67O5udknTpxY9xgAAFxEqur+7t7cvu5KegAAMAhkAAAYBDIAAAwCGQAAhnN+iwW8GBw48qF1j7Byj9x547pHAIALkiPIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwuJLeRehivKocAMDz5QgyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwHDRf4uFb3QAAGByBBkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAw75zvaCq7kryhiRPdffrFmuXJfm5JAeSPJLk+7v79xfP3Z7ktiRfTfL27v6lPZkc+LoOHPnQukdYuUfuvHHdIwDwIvB8jiD/bJIbtq0dSXK8uw8mOb54nKq6LsmtSV67eM9PV9UluzYtAADssXMGcnf/WpLf27Z8c5Jji/vHktwy1t/f3U939+eTPJzk9bszKgAA7L2dnoN8ZXefSpLF7RWL9auSPDZed3KxBgAAF4RznoP8AtUZ1vqML6w6nORwkrzqVa/a5TEAeDFzjj0vVvb2+WGnR5CfrKr9SbK4fWqxfjLJNeN1Vyd54kw/oLuPdvdmd29ubGzscAwAANhdOw3ke5McWtw/lOSesX5rVb20qq5NcjDJx5cbEQAAVuf5fM3b+5J8d5LLq+pkkjuS3Jnk7qq6LcmjSd6YJN39QFXdneQzSZ5J8rbu/uoezQ4AALvunIHc3W86y1PXn+X1707y7mWGAgCAdXElPQAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwLBv3QMA7JYDRz607hFW7pE7b1z3CAAvOo4gAwDAIJABAGAQyAAAMAhkAAAYBDIAAAwCGQAABoEMAACDQAYAgMGFQgDgAuBCOLA6jiADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAMO+dQ8AwM4dOPKhdY8A8KLjCDIAAAwCGQAABoEMAACDc5ABgPOSc+xZF0eQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGJYK5Kr6p1X1QFV9uqreV1Uvq6rLquq+qnpocXvpbg0LAAB7bceBXFVXJXl7ks3ufl2SS5LcmuRIkuPdfTDJ8cVjAAC4ICx7isW+JH+uqvYleXmSJ5LcnOTY4vljSW5Z8jMAAGBldhzI3f14kp9K8miSU0m+1N0fTnJld59avOZUkit2Y1AAAFiFZU6xuDRbR4uvTfJNSV5RVT/wAt5/uKpOVNWJ06dP73QMAADYVcucYvF3kny+u0939/9L8sEkfyPJk1W1P0kWt0+d6c3dfbS7N7t7c2NjY4kxAABg9ywTyI8m+c6qenlVVZLrkzyY5N4khxavOZTknuVGBACA1dm30zd298eq6gNJPpHkmSSfTHI0yTckubuqbstWRL9xNwYFAIBV2HEgJ0l335Hkjm3LT2fraDIAAFxwXEkPAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABgEMgAADAIZAAAGgQwAAINABgCAQSADAMAgkAEAYBDIAAAwCGQAABiWCuSqemVVfaCqPltVD1bVX6+qy6rqvqp6aHF76W4NCwAAe23ZI8j/Lsn/6O5vSfKXkzyY5EiS4919MMnxxWMAALgg7DiQq+obk/ytJD+TJN39le7+P0luTnJs8bJjSW5ZbkQAAFidZY4gvzrJ6ST/qao+WVXvqapXJLmyu08lyeL2il2YEwAAVmKZQN6X5NuT/Ifu/rYk/zcv4HSKqjpcVSeq6sTp06eXGAMAAHbPMoF8MsnJ7v7Y4vEHshXMT1bV/iRZ3D51pjd399Hu3uzuzY2NjSXGAACA3bPjQO7u30nyWFW9ZrF0fZLPJLk3yaHF2qEk9yw1IQAArNC+Jd//w0neW1UvSfLbSX4wW9F9d1XdluTRJG9c8jMAAGBllgrk7v5Uks0zPHX9Mj8XAADWxZX0AABgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwCCQAQBgEMgAADAIZAAAGAQyAAAMAhkAAAaBDAAAg0AGAIBBIAMAwLB0IFfVJVX1yar674vHl1XVfVX10OL20uXHBACA1diNI8jvSPLgeHwkyfHuPpjk+OIxAABcEJYK5Kq6OsmNSd4zlm9Ocmxx/1iSW5b5DAAAWKVljyD/2yQ/luRrY+3K7j6VJIvbK870xqo6XFUnqurE6dOnlxwDAAB2x44DuarekOSp7r5/J+/v7qPdvdndmxsbGzsdAwAAdtW+Jd77XUluqqrvS/KyJN9YVf8lyZNVtb+7T1XV/iRP7cagAACwCjs+gtz
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"wrong_sample = np.random.rand(1000)*2*std+mean-std\n",
"plt.figure(figsize=(10,6))\n",
3 years ago
"plt.hist(wrong_sample)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Confidence Intervals\n",
"\n",
"Let's now calculate confidence intervals for the weights and heights of baseball players. We will use the code [from this stackoverflow discussion](https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data):"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 15,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"p=0.85, mean = 201.73 ± 0.94\n",
"p=0.90, mean = 201.73 ± 1.08\n",
"p=0.95, mean = 201.73 ± 1.28\n"
]
}
],
3 years ago
"source": [
"import scipy.stats\n",
"\n",
"def mean_confidence_interval(data, confidence=0.95):\n",
" a = 1.0 * np.array(data)\n",
" n = len(a)\n",
" m, se = np.mean(a), scipy.stats.sem(a)\n",
" h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)\n",
" return m, h\n",
"\n",
"for p in [0.85, 0.9, 0.95]:\n",
" m, h = mean_confidence_interval(df['Weight'].fillna(method='pad'),p)\n",
" print(f\"p={p:.2f}, mean = {m:.2f} ± {h:.2f}\")"
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Hypothesis Testing\n",
"\n",
"Let's explore different roles in our baseball players dataset:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 16,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Height</th>\n",
" <th>Weight</th>\n",
" <th>Count</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Role</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Catcher</th>\n",
" <td>72.723684</td>\n",
" <td>204.328947</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Designated_Hitter</th>\n",
" <td>74.222222</td>\n",
" <td>220.888889</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>First_Baseman</th>\n",
" <td>74.000000</td>\n",
" <td>213.109091</td>\n",
" <td>55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Outfielder</th>\n",
" <td>73.010309</td>\n",
" <td>199.113402</td>\n",
" <td>194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Relief_Pitcher</th>\n",
" <td>74.374603</td>\n",
" <td>203.517460</td>\n",
" <td>315</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Second_Baseman</th>\n",
" <td>71.362069</td>\n",
" <td>184.344828</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Shortstop</th>\n",
" <td>71.903846</td>\n",
" <td>182.923077</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Starting_Pitcher</th>\n",
" <td>74.719457</td>\n",
" <td>205.163636</td>\n",
" <td>221</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Third_Baseman</th>\n",
" <td>73.044444</td>\n",
" <td>200.955556</td>\n",
" <td>45</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
3 years ago
],
"text/plain": [
" Height Weight Count\n",
"Role \n",
"Catcher 72.723684 204.328947 76\n",
"Designated_Hitter 74.222222 220.888889 18\n",
"First_Baseman 74.000000 213.109091 55\n",
"Outfielder 73.010309 199.113402 194\n",
"Relief_Pitcher 74.374603 203.517460 315\n",
"Second_Baseman 71.362069 184.344828 58\n",
"Shortstop 71.903846 182.923077 52\n",
"Starting_Pitcher 74.719457 205.163636 221\n",
"Third_Baseman 73.044444 200.955556 45"
]
},
"execution_count": 16,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"df.groupby('Role').agg({ 'Height' : 'mean', 'Weight' : 'mean', 'Age' : 'count'}).rename(columns={ 'Age' : 'Count'})"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Let's test the hypothesis that First Basemen are taller than Second Basemen. The simplest way to do this is to test the confidence intervals:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 17,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Conf=0.85, 1st basemen height: 73.62..74.38, 2nd basemen height: 71.04..71.69\n",
"Conf=0.90, 1st basemen height: 73.56..74.44, 2nd basemen height: 70.99..71.73\n",
"Conf=0.95, 1st basemen height: 73.47..74.53, 2nd basemen height: 70.92..71.81\n"
]
}
],
3 years ago
"source": [
"for p in [0.85,0.9,0.95]:\n",
" m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)\n",
" m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)\n",
" print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"We can see that the intervals do not overlap.\n",
3 years ago
"\n",
"A statistically more correct way to prove the hypothesis is to use a **Student t-test**:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 18,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"T-value = 7.65\n",
"P-value: 9.137321189738925e-12\n"
]
}
],
3 years ago
"source": [
"from scipy.stats import ttest_ind\n",
"\n",
"tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)\n",
"print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"The two values returned by the `ttest_ind` function are:\n",
"* p-value can be considered as the probability of two distributions having the same mean. In our case, it is very low, meaning that there is strong evidence supporting that first basemen are taller.\n",
"* t-value is the intermediate value of normalized mean difference that is used in the t-test, and it is compared against a threshold value for a given confidence value."
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"## Simulating a Normal Distribution with the Central Limit Theorem\n",
3 years ago
"\n",
"The pseudo-random generator in Python is designed to give us a uniform distribution. If we want to create a generator for normal distribution, we can use the central limit theorem. To get a normally distributed value we will just compute a mean of a uniform-generated sample."
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 19,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAARLElEQVR4nO3df4zkd13H8ddblgbkR4DcghU4Fgghlj/4kbOIGFNDMEiNQIIJJGI1mFMjBJREL/yh/FnjryZGMRWQGn6FQPkRriqkkqCJEq9QQpuCIFQsXLg2KKAxIS0f/9g5eLfdc7fznd3v7O3jkUxu5rszO+/93Ox+n/e9mZ0aYwQAANj2A3MPAAAA60QgAwBAI5ABAKARyAAA0AhkAABoNg7yzo4dOza2trYO8i4BAGBHN910011jjM37bj/QQN7a2sqZM2cO8i4BAGBHVfXvO233FAsAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQCGQAAGg25h4AgAdm69TpuUeYxe1XXzn3CMAR4QgyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCg2TWQq+qJVfXxqrqtqm6tqtcttj+mqj5WVV9Y/Pno/R8XAAD2116OIN+d5A1jjB9J8mNJfqOqLktyKsmNY4ynJblxcRkAAA61XQN5jHF2jPGpxflvJ7ktyeOTvCTJdYurXZfkpfs0IwAAHJgH9BzkqtpK8uwkn0zyuDHG2WQ7opM89gK3OVlVZ6rqzJ133jlxXAAA2F97DuSqeniS9yd5/RjjW3u93Rjj2jHGiTHGic3NzWVmBACAA7OnQK6qB2c7jt85xrh+sfnrVXXp4uOXJjm3PyMCAMDB2ctvsagkb01y2xjjj9uHPpzkqsX5q5J8aPXjAQDAwdrYw3Wen+RVST5bVTcvtr0xydVJ3ltVr07ylSQ/vy8TAgDAAdo1kMcY/5ikLvDhF6x2HAAAmJd30gMAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoNuYeAGCKrVOn5x4BgIuMI8gAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBm10CuqrdV1bmquqVte1NVfbWqbl6cXry/YwIAwMHYyxHktyd50Q7b/2SM8azF6YbVjgUAAPPYNZDHGJ9I8o0DmAUAAGa3MeG2r6mqX0xyJskbxhj/udOVqupkkpNJcvz48Ql3BwBHz9ap03OPcOBuv/rKuUfgiFv2RXpvTvLUJM9KcjbJH13oimOMa8cYJ8YYJzY3N5e8OwAAOBhLBfIY4+tjjHvGGN9N8pdJLl/tWAAAMI+lArmqLm0XX5bklgtdFwAADpNdn4NcVe9OckWSY1V1R5LfS3JFVT0ryUhye5Jf3b8RAQDg4OwayGOMV+6w+a37MAsAAMzOO+kBAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQbMw9AADsxdap03OPABwRjiADAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAEAjkAEAoBHIAADQCGQAAGgEMgAANAIZAAAagQwAAI1ABgCARiADAECzMfcAwGpsnTo99wgAcFFwBBkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANDsGshV9baqOldVt7Rtj6mqj1XVFxZ/Pnp/xwQAgIOxlyPIb0/yovtsO5XkxjHG05LcuLgMAACH3q6BPMb4RJJv3GfzS5Jctzh/XZKXrnYsAACYx8aSt3vcGONskowxzlbVYy90xao6meRkkhw/fnzJuwMAjoqtU6fnHmEWt1995dwjsLDvL9IbY1w7xjgxxjixubm533cHAACTLBvIX6+qS5Nk8ee51Y0EAADzWTaQP5zkqsX5q5J8aDXjAADAvPbya97eneSfkjy9qu6oqlcnuTrJC6vqC0leuLgMAACH3q4v0htjvPICH3rBimcBAIDZeSc9AABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAACNQAYAgEYgAwBAI5ABAKARyAAA0AhkAABoBDIAADQCGQAAGoEMAADNxpQbV9XtSb6d5J4kd48xTqxiKAAAmMukQF74qTHGXSv4PAAAMDtPsQAAgGZqII8kH62qm6rq5CoGAgCAOU19isXzxxhfq6rHJvlYVX1ujPGJfoVFOJ9MkuPHj0+8OwCAi9PWqdNzjzCL26++cu4R7mfSEeQxxtcWf55L8oEkl+9wnWvHGCfGGCc2Nzen3B0AAOy7pQO5qh5WVY84fz7JTye5ZVWDAQDAHKY8xeJxST5QVec/z7vGGH+7kqkAAGAmSwfyGONLSZ65wlkAAGB2fs0bAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQLMx9wCwalunTs89AgBwiDmCDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAjUAGAIBGIAMAQCOQAQCgEcgAANAIZAAAaAQyAAA0AhkAABqBDAAAzcb
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"def normal_random(sample_size=100):\n",
" sample = [random.uniform(0,1) for _ in range(sample_size) ]\n",
" return sum(sample)/sample_size\n",
"\n",
"sample = [normal_random() for _ in range(100)]\n",
"plt.figure(figsize=(10,6))\n",
3 years ago
"plt.hist(sample)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Correlation and Evil Baseball Corp\n",
"\n",
"Correlation allows us to find relations between data sequences. In our toy example, let's pretend there is an evil baseball corporation that pays its players according to their height - the taller the player is, the more money he/she gets. Suppose there is a base salary of $1000, and an additional bonus from $0 to $100, depending on height. We will take the real players from MLB, and compute their imaginary salaries:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 20,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"[(74, 1075.2469071629068), (74, 1075.2469071629068), (72, 1053.7477908306478), (72, 1053.7477908306478), (73, 1064.4973489967772), (69, 1021.4991163322591), (69, 1021.4991163322591), (71, 1042.9982326645181), (76, 1096.746023495166), (71, 1042.9982326645181)]\n"
]
}
],
3 years ago
"source": [
"heights = df['Height']\n",
"salaries = 1000+(heights-heights.min())/(heights.max()-heights.mean())*100\n",
"print(list(zip(heights, salaries))[:10])"
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Let's now compute covariance and correlation of those sequences. `np.cov` will give us a so-called **covariance matrix**, which is an extension of covariance to multiple variables. The element $M_{ij}$ of the covariance matrix $M$ is a correlation between input variables $X_i$ and $X_j$, and diagonal values $M_{ii}$ is the variance of $X_{i}$. Similarly, `np.corrcoef` will give us the **correlation matrix**."
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 21,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Covariance matrix:\n",
"[[ 5.31679808 57.15323023]\n",
" [ 57.15323023 614.37197275]]\n",
"Covariance = 57.153230230544736\n",
"Correlation = 1.0\n"
]
}
],
3 years ago
"source": [
"print(f\"Covariance matrix:\\n{np.cov(heights, salaries)}\")\n",
"print(f\"Covariance = {np.cov(heights, salaries)[0,1]}\")\n",
"print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"A correlation equal to 1 means that there is a strong **linear relation** between two variables. We can visually see the linear relation by plotting one value against the other:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 22,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAcYklEQVR4nO3dcYyndX0n8Penu4q0Vw49Fs8C3mqLNBoSrHNCL6dn6nlyxVZqQyqhSq5eqF7bxDa1XSKeuYsm2/OS5trkbNBDsNZtaKNoshpqaSx3jdAMhXaXWiJYhAUO1hJ7xFKs+Lk/5ln6ZXZmZ2eZ2d9vh9crefL8fp/f85v5/L6ZfXjzzPf5TnV3AACAJd816wYAAGCeCMgAADAQkAEAYCAgAwDAQEAGAIDB9lk3sJbTTjutd+7cOes2AADYYm677bavd/eO5fW5D8g7d+7M4uLirNsAAGCLqaqvrVQ3xQIAAAYCMgAADARkAAAYCMgAADAQkAEAYCAgAwDAQEAGAIDBmgG5qq6pqkeqav9Qu6Sq7qyq71TVwlC/rKruGLbvVNV502tfrKq7htdO35RPBAAAz8DRXEG+NsmFy2r7k7wlyc1jsbt/p7vP6+7zkrwtyb3dfcdwyGWHXu/uR465awAA2CRr/iW97r65qnYuq305SarqSG+9NMmeZ9IcAAAcb5s5B/mncnhA/tg0veJ9dYR0XVVXVNViVS0ePHhwE1sEAICn25SAXFXnJ/m77t4/lC/r7nOTvGba3rba+7v76u5e6O6FHTt2bEaLAACwos26gvzWLLt63N0PTPvHknwyyas36XsDAMAxW3MO8npV1XcluSTJa4fa9iSndvfXq+o5Sd6U5A83+nsDAHBiuOH2B/KhG+/Kg994PN936sl5zxvPycWvPGPWbSU5ioBcVXuSvC7JaVV1IMn7kzya5DeT7Eiyt6ru6O43Tm95bZID3f3V4cuclOTGKRxvy1I4/siGfQoAAE4YN9z+QK781L48/g9PJkke+MbjufJT+5JkLkLy0axicekqL316leO/mOSCZbVvJnnVepsDAGDr+dCNdz0Vjg95/B+ezIduvGsuArK/pAcAwHH14DceX1f9eBOQAQA4rr7v1JPXVT/eBGQAAI6r97zxnJz8nG1Pq538nG15zxvPmVFHT7fhq1gAAMCRHJpnfMKuYgEAABvt4leeMTeBeDlTLAAAYCAgAwDAQEAGAICBgAwAAAMBGQAABgIyAAAMBGQAABgIyAAAMBCQAQBgICADAMBAQAYAgIGADAAAAwEZAAAGAjIAAAwEZAAAGGyfdQMAAGyeq27Ylz233p8nu7OtKpeef1Y+cPG5s25rrgnIAABb1FU37MsnbrnvqedPdj/1XEhenSkWAABb1J5b719XnSUCMgDAFvVk97rqLBGQAQC2qG1V66qzREAGANiiLj3/rHXVWeImPQCALerQjXhWsVif6jmfg7KwsNCLi4uzbgMAgC2mqm7r7oXldVMsAABgICADAMBAQAYAgIGADAAAAwEZAAAGAjIAAAwEZAAAGAjIAAAwEJABAGAgIAMAwEBABgCAgYAMAAADARkAAAYCMgAADLbPugEAgK3iqhv2Zc+t9+fJ7myryqXnn5UPXHzurNtinda8glxV11TVI1W1f6hdUlV3VtV3qmphqO+sqser6o5p+63htVdV1b6quruqfqOqauM/DgDAbFx1w7584pb78mR3kuTJ7nzilvty1Q37ZtwZ63U0UyyuTXLhstr+JG9JcvMKx9/T3edN2zuH+oeTXJHk7Glb/jUBAE5Ye269f1115teaAbm7b07y6LLal7v7rqP9JlX1oiSndPeXuruTfDzJxevsFQBgbh26cny0debXZtyk95Kqur2q/riqXjPVzkhyYDjmwFRbUVVdUVWLVbV48ODBTWgRAGBjbVtl9uhqdebXRgfkh5K8uLtfmeSXknyyqk5JstJPxqr/O9XdV3f3Qncv7NixY4NbBADYeJeef9a66syvDV3ForufSPLE9Pi2qronycuydMX4zOHQM5M8uJHfGwBglg6tVmEVixPfhgbkqtqR5NHufrKqXpqlm/G+2t2PVtVjVXVBkluTvD3Jb27k9wYAmLUPXHyuQLwFHM0yb3uSfCnJOVV1oKreUVU/UVUHkvxwkr1VdeN0+GuT/EVV/XmS30/yzu4+dIPfu5J8NMndSe5J8vkN/iwAAPCMVc/5nZULCwu9uLg46zYAANhiquq27l5YXvenpgEAYCAgAwDAQEAGAICBgAwAAAMBGQAABgIyAAAMBGQAABgIyAAAMBCQAQBgICADAMBAQAYAgIGADAAAg+2zbgAAYL3O/+AX8vBj33rq+Qu/97m59b1vmGFHbCWuIAMAJ5Tl4ThJHn7sWzn/g1+YUUdsNQIyAHBCWR6O16rDegnIAAAwEJABAGAgIAMAJ5QXfu9z11WH9RKQAYATyq3vfcNhYdgqFmwky7wBACccYZjN5AoyAAAMBGQAABgIyAAAMBCQAQBgICADAMBAQAYAgIGADAAAAwEZAAAGAjIAAAwEZAAAGAjIAAAwEJABAGAgIAMAwEBABgCAgYAMAACD7bNuAACYXzt37T2sdu/ui2bQCRw/riADACtaKRwfqQ5bhYAMAAADARkAAAYCMgAADARkAAAYCMgAwIpWW63CKhZsdZZ5AwBWJQzzbLTmFeSquqaqHqmq/UPtkqq6s6q+U1ULQ/0NVXVbVe2b9j8yvPbFqrqrqu6YttM3/uMAAMAzczRTLK5NcuGy2v4kb0ly87L615P8WHefm+TyJL+97PXLuvu8aXvkGPoFAIBNteYUi+6+uap2Lqt9OUmqavmxtw9P70zyvKo6qbufeOatAgDA5tvMm/R+Msnty8Lxx6bpFe+r5el6UFVXVNViVS0ePHhwE1sEAICn25SAXFWvSPJrSX52KF82Tb14zbS9bbX3d/fV3b3Q3Qs7duzYjBYBAGBFGx6Qq+rMJJ9O8vbuvudQvbsfmPaPJflkkldv9PcGAIBnakMDclWdmmRvkiu7+0+G+vaqOm16/Jwkb8rSjX4AADBXjmaZtz1JvpTknKo6UFXvqKqfqKoDSX44yd6qunE6/OeT/ECS9y1bzu2kJDdW1V8kuSPJA0k+sgmfBwAAnpHq7ln3cEQLCwu9uLg46zYAANhiquq27l5YXvenpgEAYCAgAwDAQEAGAICBgAwAAAMBGQAABttn3QAAkOzctfew2r27L5pBJ4AryAAwYyuF4yPVgc0lIAMAwEBABgCAgYAMAAADARkAAAYCMgDM2GqrVVjFAmbDMm8AMAeEYZgfriADAMBAQAYAgIGADAAAAwEZAAAGAjIAAAwEZAAAGAjIAAAwEJABAGAgIAMAwEBABgCAgYAMAAADARkAAAYCMgAADARkAAAYbJ91AwBwPO3ctfew2r27L5pBJ8C8cgUZgGeNlcLxkerAs5OADAAAAwEZAAAGAjIAAAwEZAAAGAjIADxrrLZahVUsgJFl3gB4VhGGgbW4ggwAAAMBGQAABgIyAAAMBGQAABgIyAAAMBCQAQBgICADAMBAQAYAgMGaAbmqrqmqR6pq/1C7pKrurKrvVNXCsuOvrKq7q+quqnrjUH9VVe2bXvuNqqqN/SgAAPDMHc0V5GuTXListj/JW5LcPBar6uVJ3prkFdN7/mdVbZte/nCSK5KcPW3LvyYAAMzcmgG5u29O8uiy2pe7+64VDn9zkt/t7ie6+6+T3J3k1VX1oiSndPeXuruTfDzJxc+4ewAA2GAbPQf5jCT3D88PTLUzpsfL6yuqqiuqarGqFg8ePLjBLQIAwOo2OiCvNK+4j1BfUXdf3d0L3b2wY8eODWsOAADWstEB+UCSs4bnZyZ5cKqfuUIdAADmykYH5M8meWtVnVRVL8nSzXh/2t0PJXmsqi6YVq94e5LPbPD3BgCAZ2z7WgdU1Z4kr0tyWlUdSPL+LN2095tJdiTZW1V3dPcbu/vOqro+yV8m+XaSn+vuJ6cv9a4srYhxcpLPTxsAW9TOXXsPq927+6IZdAKwPrW0qMT8WlhY6MXFxVm3AcA6rBS
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,6))\n",
3 years ago
"plt.scatter(heights,salaries)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Let's see what happens if the relation is not linear. Suppose that our corporation decided to hide the obvious linear dependency between heights and salaries, and introduced some non-linearity into the formula, such as `sin`:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 23,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Correlation = 0.9835304456670837\n"
]
}
],
3 years ago
"source": [
"salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100\n",
"print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
3 years ago
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"In this case, the correlation is slightly smaller, but it is still quite high. Now, to make the relation even less obvious, we might want to add some extra randomness by adding some random variable to the salary. Let's see what happens:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 24,
3 years ago
"metadata": {},
"outputs": [
{
"name": "stdout",
3 years ago
"output_type": "stream",
"text": [
"Correlation = 0.9363097848296155\n"
]
}
],
3 years ago
"source": [
"salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100+np.random.random(size=len(heights))*20-10\n",
"print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 25,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAlY0lEQVR4nO3dcZTdZ3kn9u/jsUzGzqFjg02tMa4IdZQT44BiLThlt8su64p2E5h1IIsLB5+Wxrs07WmSEwXrrNuQc5zFG+2Slm7LqUNZTE29JcQZ2ANUy5qwbDkWWRGFCCfo2BAwGjnYiREhMAExfvvHXImfxnOluaPR/O7MfD7n6Ny5z7139Or1zJ2v33l+71uttQAAAIsu6nsAAAAwTgRkAADoEJABAKBDQAYAgA4BGQAAOi7uewDn8tznPrft2LGj72EAALDJfPazn/2z1tqVS+tjH5B37NiRQ4cO9T0MAAA2mar6ynJ1LRYAANAhIAMAQIeADAAAHQIyAAB0CMgAANAhIAMAQIeADAAAHQIyAAB0CMgAANAhIAMAQIeADAAAHQIyAAB0CMgAANBxcd8DAADgwpk9PJf9B47m+In5bJ+azN49OzOza7rvYY01ARkAYJOaPTyXfQ8cyfzJhSTJ3In57HvgSJIIyWehxQIAYJPaf+Do6XB8yvzJhew/cLSnEW0MAjIAwCZ1/MT8SHUWabEAAFgj49bvu31qMnPLhOHtU5M9jGbjsIIMALAGTvX7zp2YT8v3+31nD8/1Nqa9e3ZmctvEGbXJbRPZu2dnTyPaGARkAIA1MI79vjO7pvP2W27I9NRkKsn01GTefssNLtA7By0WAABrYFz7fWd2TQvEI7KCDACwBob19er33XgEZACANaDfd/PQYgEAsAZOtTGM0y4WrI6ADACwRvT7rty4bYnXJSADALCuxv0IbD3IAACsq3HcEq/LCjIAwBoZ57aBcTKuW+KdYgUZAGANjONJeuNq3LfEE5ABANbAuLcNjJNx3xJPiwUAwBoY97aBcTLuW+IJyADAUHpqV2771GTmlgnD49I2MG7GeUs8LRYAwLL01I5m3NsGWDkryADAss7WU9v3yt84rmyPe9sAKycgA8AYuHP2SO7/zFez0FomqnLry56fu2Zu6HVM49pTO86HTIxz2wArp8UCAHp25+yR3HfwsSy0liRZaC33HXwsd84e6XVc47oVl90iuNAEZADo2fsPPjZSfb2Ma0/tuK5ss3kIyADQszZifb3M7JrO22+5IdNTk6kk01OTefstN/TeQjCuK9tsHnqQAYChxrGndu+enWf0ICfjsbLN5nHOFeSqek9VPVFVn+/UXldVD1fV01W1e8nz91XVo1V1tKr2dOo3VtWRwWPvrKpa238KALAVjOvKNpvHSlaQ35vknyd5X6f2+SS3JPk/uk+sqh9N8vok1yfZnuTfVNUPt9YWkrwrye1JDib5aJJXJfnYeY4fADa8N950be5bpt/4jTdd28NozjSO26kl47myzeZxzoDcWvtUVe1YUvvjJFlmEfg1Sf5la+07Sf6kqh5N8tKq+nKSZ7fWHhq87n1JZiIgA8Dp7dzGbZu3cd5ObVyDO5vDWvcgT2dxhfiUY4PaycHHS+vLqqrbs7janGuv7f//ngHgQrtr5obeA/FS43pQyDgHdzaHtd7FYrm+4naW+rJaa/e01na31nZfeeWVazY4ABhXs4fn8vK7P5EX3PGRvPzuT4zFcc7jup2afZC50NZ6BflYkud37l+T5Pigfs0ydQDY8sZ1RXTq0m35+rdPLlvv09yQgD6sDqNa6xXkDyd5fVU9q6pekOS6JL/XWns8yTer6qbB7hVvSvKhNf67AWBDGtcV0Tbkd73D6utlYshGWMPqMKpzriBX1f1JXpHkuVV1LMmvJHkqyf+a5MokH6mqP2it7WmtPVxVH0jyR0m+l+TnBjtYJMlbsrgjxmQWL85zgR4AZHxbGb4x/8zV47PV18vCkIQ+rA6jWskuFrcOeeh3hjz/15L82jL1Q0leNNLoAGALGNdWhu1Tk8u2LfR9Yt30kHFNO0mPNeKoaQC2lHG8GO47S9orzlVfLzues3zgHFZfL3v37Mzktokzak7SYy05ahqALWP28Fz2fvBzObmw+Kv4uRPz2fvBzyXp92K4b598eqT6ejn4pa+PVF8vp/5b2QeZC0VABmDL+NV/9fDpcHzKyYWWX/1XDwtXyxjnXl8n6XEhabEAYMtYrs/3bPX1MjW5fK/xsPp6sVsEW5WADAA9+8kXXz1Sfb3c9EOXj1SHzUJABoCe/e4Xnhypvl6+/OfLbzM3rA6bhYAMAD0b15PhxnV/ZrjQBGQAtoxtQ37qDatvdcP2O+57H2S40LwlALBlfG/IrmnD6lud/YbZqmzzBsCWMWxzsv43LRtP9htmqxKQAaBnl267aNlDQS4dg94P+w2zFfX/nQcAW9wtN14zUh24sARkAOjZuG7zBluVgAwAPbOdGowXPcgAXBCzh+dc3LVC26cml93z2HZqw/n64kISkAFYc7OH57LvgSOZP7mQZPHAi30PHEkSIWYZO56zfEDe8RwBeTm+vrjQtFgAsOb2Hzh6OrycMn9yIfsPHO1pROPt4Je+PlJ9Pc0ensvL7/5EXnDHR/Lyuz+R2cNzfQ/J1xcXnBVkANbcuB6dPK4W2vI7MQ+rr5fZw3PZ+1ufy8mnF8cxd2I+e3/rc0n6XanVs82FZgUZAHo2UTVSfb287cMPnw7Hp5x8uuVtH364pxEtcgQ2F5qADAA9e+4Pbhupvl5OzJ8cqb5eHIHNhabFAgB69rVvfnek+lbnCOzR2PFjdAIyALDhOAJ7Zez4sTpaLAAANik7fqyOgAwAPds25KfxsDqslB0/Vse3HgD07O+/9NqR6rBSdvxYHQEZAHr2u194cqQ643mAyTiy48fqCMgA0LNx/TX4y194xUj19XLqwrO5E/Np+f6FZ0LyM83sms7bb7kh01OTqSTTU5N5+y03uEDvHOxiAQA92z41uewpg33/Gvz9P/sTecNvPpRPf/Gp07WXv/CKvP9nf6LHUZ39wjPB75ns+DE6ARkAeva3fuTK3HfwsWXrfes7DC9nXFfc2Ty0WABAz/Qgj8aFZ1xoAjIA9Gy59oqz1bc6F55xoWmxAICeTVRlobVl630bx2OKHTXNhSYgA0DPlgvHZ6uvl3E+ptiFZ1xIWiwAoGfTQ3pnh9XXi2OK2aqsIANsAnfOHsn9n/lqFlrLRFVufdnzc9fMDX0PixXa8Zzlt3nb8Zx+A7LdItiqrCADbHB3zh7JfQcfO/3r+IXWct/Bx3Ln7JGeR8ZKHfzS10eqrxe7RbBVCcgAG9z//Zln7p97tjrjZ1x7kO0WwValxQJgg3t6SIYaVmf8XFTL//e6qOdNLOwWwVYlIANAzyaGBOSJ/nd5s1sEW5IWCwDo2cmnR6sDF5aADMCWcdklEyPVga1JQAZgy3h6yEVvw+rA1iQgA7BlzA/pWRhWXy/DTpQeg5OmYUsSkAGgZ2942bUj1YELyy4WANCzU6ceOg0RxoOADMCWcdklE/nWdxeWrfftrpkbBGIYE1osANgy2pCL8YbVga1JQAZgy/j2kIvxhtWBrUlABgCADgEZAAA6BGQAAOg4Z0CuqvdU1RNV9flO7Yqq+nhVPTK4vXxQ31ZV91bVkar646ra13nNjYP6o1X1zirbnwOwvob94PEDCehayQrye5O8akntjiQPttauS/Lg4H6SvC7Js1prNyS5Mck/qKodg8feleT2JNcN/iz9nABwQQ3bq8IeFkDXOQNya+1TSZ5aUn5NknsHH9+bZObU05NcVlUXJ5lM8t0kf1FVVyd5dmvtoba4l877Oq8BAICxsdoe5Oe11h5PksHtVYP6B5N8K8njSR5L8k9ba08lmU5yrPP6Y4MaAACMlbU+Se+lSRaSbE9yeZJ/V1X/Jsu3dw39jVZV3Z7Fdoxce61z6AFYG5Xlf/joQQa6VruC/LVB20QGt08M6v9lkv+3tXaytfZEkk8n2Z3FFeNrOq+/JsnxYZ+8tXZPa213a233lVdeucohAsC
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,6))\n",
3 years ago
"plt.scatter(heights, salaries)\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"> Can you guess why the dots line up into vertical lines like this?\n",
"\n",
"We have observed the correlation between an artificially engineered concept like salary and the observed variable *height*. Let's also see if the two observed variables, such as height and weight, correlate too:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 26,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., nan],\n",
" [nan, nan]])"
]
},
"execution_count": 26,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"np.corrcoef(df['Height'],df['Weight'])"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"Unfortunately, we did not get any results - only some strange `nan` values. This is due to the fact that some of the values in our series are undefined, represented as `nan`, which causes the result of the operation to be undefined as well. By looking at the matrix we can see that `Weight` is the problematic column, because self-correlation between `Height` values has been computed.\n",
3 years ago
"\n",
"> This example shows the importance of **data preparation** and **cleaning**. Without proper data we cannot compute anything.\n",
"\n",
"Let's use `fillna` method to fill the missing values, and compute the correlation: "
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 27,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.52959196],\n",
" [0.52959196, 1. ]])"
]
},
"execution_count": 27,
"metadata": {},
3 years ago
"output_type": "execute_result"
}
],
3 years ago
"source": [
"np.corrcoef(df['Height'],df['Weight'].fillna(method='pad'))"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
"There is indeed a correlation, but not such a strong one as in our artificial example. Indeed, if we look at the scatter plot of one value against the other, the relation would be much less obvious:"
3 years ago
]
},
{
"cell_type": "code",
"execution_count": 28,
3 years ago
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAGoCAYAAABbtxOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAABCr0lEQVR4nO3df3Td5XXn+8+2kEEQiKAxpBZ27XgcpTBOcavEppreUjpeYqA3aPk2Db7QlZnmktUObeqQqLWLV7NyF1x76lzSzGp714Um03TsOiGJR82MIa47Dp2Jr20qYhI1EA9QiI2cAq1jYIhijLzvH+cc+fz6SufYPt9nH533ay0tpK0f3jzne77aes7z7MfcXQAAAAAK5qVOAAAAAIiEAhkAAAAoQ4EMAAAAlKFABgAAAMpQIAMAAABlLkidwLl429ve5kuWLEmdBgAAANrQ448//o/uvqA63tYF8pIlSzQ2NpY6DQAAALQhM/tevThLLAAAAIAyFMgAAABAGQpkAAAAoAwFMgAAAFCGAhkAAAAoQ4EMAAAAlKFABgAAAMpQIAMAAABlKJABAACAMhTIAAAAQBkKZAAAAKAMBTIAAABQhgIZAAAAKHNB6gQAAEB8o4cmtHX3YR07MamFvT0aGerX8Mq+1GkBLUGBDAAAZjR6aEIbd45r8tSUJGnixKQ27hyXJIpkzEkssQAAADPauvvwdHFcMnlqSlt3H06UEdBaFMgAAGBGx05MNhUH2h0FMgAAmNHC3p6m4kC7o0AGAAAzGhnqV093V0Wsp7tLI0P9iTICWotNegAAYEaljXh0sUCnoEAGAACzGl7ZR0GMjsESCwAAAKAMBTIAAABQhgIZAAAAKEOBDAAAAJShQAYAAADKUCADAAAAZSiQAQAAgDIUyAAAAEAZCmQAAACgDAUyAAAAUIYCGQAAAChDgQwAAACUoUAGAAAAylAgAwAAAGUokAEAAIAyFMgAAABAmZYVyGa2yMy+bmZPmdl3zOy3i/HrzOyAmT1hZmNm9t6y79loZs+Y2WEzG2pVbgAAAECWC1r4s9+U9DF3/6aZXSrpcTPbI+kPJH3S3R8xs5uLH99gZtdIuk3StZIWSvprM3unu0+1MEcAAACgQstmkN39++7+zeL7r0l6SlKfJJd0WfHL3irpWPH9WyV9wd1Puvtzkp6R9F4BAAAAOWrlDPI0M1siaaWkg5LWS9ptZp9SoUD/2eKX9Uk6UPZtLxRj1T/rw5I+LEmLFy9uWc4AAADoTC3fpGdmb5H0FUnr3f1VSb8h6aPuvkjSRyV9tvSldb7dawLuD7j7gLsPLFiwoFVpAwAAoEO1dAbZzLpVKI63u/vOYviDkn67+P6XJP1p8f0XJC0q+/ardWb5BQCgQaOHJrR192EdOzGphb09Ghnq1/DKmhfkAAAZWtnFwlSYHX7K3e8v+9QxST9ffP9GSU8X3/+qpNvM7EIzWyppuaTHWpUfAMxFo4cmtHHnuCZOTMolTZyY1Mad4xo9NJE6NQBoG62cQR6U9KuSxs3siWLs9yTdKekzZnaBpB+puJ7Y3b9jZg9JelKFDhh30cECAJqzdfdhTZ6qvHVOnprS1t2HmUUGgAa1rEB292+o/rpiSfqZjO+5T9J9rcoJAOa6Yycmm4oDAGpxkh4AzCELe3uaigMAalEgA8AcMjLUr57uropYT3eXRob6E2UEAO0nlz7IAIB8lNYZ08UCAM4eBTIAzDHDK/soiAHgHFAgAwA6Fj2jAdRDgQwA6EilntGltnilntGSKJKBDscmPQBAR5qpZzSAzkaBDADoSPSMBpCFAhkA0JHoGQ0gCwUyAKAj0TMaQBY26QEAOhI9owFkoUAGAHQsekYDqIclFgAAAEAZCmQAAACgDAUyAAAAUIYCGQAAACjDJj0AmGNGD03QmQEAzgEFMgDMIaOHJrRx5/j0EcoTJya1cee4JFEkA0CDWGIBAHPI1t2Hp4vjkslTU9q6+3CijACg/VAgA8AccuzEZFNxAEAtCmQAmEMW9vY0FQcA1KJABoA5ZGSoXz3dXRWxnu4ujQz1J8oIANoPm/QAYA4pbcSjiwUAnD0KZACYY4ZX9lEQA8A5YIkFAAAAUIYCGQAAAChDgQwAAACUoUAGAAAAylAgAwAAAGUokAEAAIAyFMgAAABAGQpkAAAAoAwFMgAAAFCGAhkAAAAoQ4EMAAAAlKFABgAAAMpQIAMAAABlKJABAACAMhTIAAAAQBkKZAAAAKBMywpkM1tkZl83s6fM7Dtm9ttln/stMztcjP9BWXyjmT1T/NxQq3IDAAAAslzQwp/9pqSPufs3zexSSY+b2R5JV0m6VdK73f2kmV0pSWZ2jaTbJF0raaGkvzazd7r7VAtzBFDH6KEJbd19WMdOTGphb49Ghvo1vLIvdVpoc1xX6BRc642LOlYtK5Dd/fuSvl98/zUze0pSn6Q7JW1x95PFz71U/JZbJX2hGH/OzJ6R9F5J+1uVI4Bao4cmtHHnuCZPFf42nTgxqY07xyUpxE0L7YnrCp2Ca71xkccqlzXIZrZE0kpJByW9U9LPmdlBM/sbM3tP8cv6JB0t+7YXijEAOdq6+/D0zapk8tSUtu4+nCgjzAVcV+gUXOuNizxWrVxiIUkys7dI+oqk9e7+qpldIOlySaslvUfSQ2b2DklW59u9zs/7sKQPS9LixYtbljfQqY6dmGwqDjSC6wqdgmu9cZHHqqUzyGbWrUJxvN3ddxbDL0ja6QWPSTot6W3F+KKyb79a0rHqn+nuD7j7gLsPLFiwoJXpAx1pYW9PU3GgEVxX6BRc642LPFat7GJhkj4r6Sl3v7/sU6OSbix+zTslzZf0j5K+Kuk2M7vQzJZKWi7psVblB6C+kaF+9XR3VcR6urs0MtSfKCPMBVxX6BRc642LPFatXGIxKOlXJY2b2RPF2O9J+pykz5nZ30l6Q9IH3d0lfcfMHpL0pAodMO6igwWQv9LGiIi7itG+uK7QKbjWGxd5rKxQm7angYEBHxsbS50GAKABUds5AehcZva4uw9Ux1u+SQ8AgMjtnACgGkdNAwBaLnI7JwCoRoEMAGi5yO2cAKAaBTIAoOUit3MCgGoUyACAlovazmn00IQGt+zV0g27NLhlr0YPTSTNB0AMbNIDALRcxHZObBwEkIUZZABAR2LjIIAszCADAFou4mwtGwcBZGEGGQDQchFna9k4CCALBTIAoOUiztZG3TgIID0KZABAy0WcrR1e2afNa1eor7dHJqmvt0eb165ggx4A1iADAFpvZKi/Yg2yFGO2dnhlHwUxgBoUyACAlovY5g0AslAgAwBywWwtgHbBGmQAAACgDAUyAAAAUIYlFkBio4cmWJfZIMaqvd3+4H7te/b49MeDy67Q9juvT5hRzJzQOO4JjWOsmsMMMpBQ6XSxiROTcp05XWz00ETq1MJhrNpbdSEqSfuePa7bH9yfKKOYOaFx3BMax1g1jwIZSCji6WJRMVbtrboQnS2eh4g5oXHcExrHWDWPJRZAQhFPF4uKsUIn4eXw2XFPaBxj1TxmkIGEIp4uFhVjhU7By+GN4Z7QOMaqeRTIQEIjQ/3q6e6qiEU4XSwixqq9DS67oql4HiLmJPFyeKO4JzSOsWoeBTKQ0PDKPm1eu0J9vT0ySX29Pdq8dgUvpdbBWLW37XdeX1N4pu4YETEniZfDG8U9oXGMVfPM3VPncNYGBgZ8bGwsdRoAAJw3g1v2aqJOMdzX26N9G25MkBEwd5nZ4+4+UB1nBhkAgEB4ORxIjy4WAAAEUnrZmy4WQDoUyAAwx9AirP0Nr+zjMQMSokAGgDmk1CKs1AWh1CJMEgUXADSINcgAMIfQIgwAzh0zyABwDjaNjmvHwaOacleXmdatWqR7h1cky4cWYc2J9vhFxtIddBIKZAA4S5tGx7XtwJHpj6fcpz9OVWTNv2CeTr55um4clSI+flGxdAedhjsmgBqjhyY0uGWvlm7YpcEtezniNsOOg0ebiuehXnE8UzxP0a6riI9fVCzdQadhBhlABWaKGjeVcdBSVryTRbyuePwax9IddBpmkAFUYKYIrRD
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
3 years ago
]
},
"metadata": {
"needs_background": "light"
},
3 years ago
"output_type": "display_data"
}
],
3 years ago
"source": [
"plt.figure(figsize=(10,6))\n",
3 years ago
"plt.scatter(df['Height'],df['Weight'])\n",
"plt.xlabel('Height')\n",
"plt.ylabel('Weight')\n",
"plt.tight_layout()\n",
3 years ago
"plt.show()"
]
},
{
"cell_type": "markdown",
3 years ago
"metadata": {},
"source": [
3 years ago
"## Conclusion\n",
"\n",
"In this notebook we have learnt how to perform basic operations on data to compute statistical functions. We now know how to use a sound apparatus of math and statistics in order to prove some hypotheses, and how to compute confidence intervals for arbitrary variables given a data sample. "
3 years ago
]
}
],
"metadata": {
3 years ago
"interpreter": {
"hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
3 years ago
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
3 years ago
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
3 years ago
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
3 years ago
}