From 22c9d768b1f568a8ec812e7c83f174dbe97235cb Mon Sep 17 00:00:00 2001
From: Dmitri Soshnikov
Date: Thu, 19 Aug 2021 15:01:14 +0300
Subject: [PATCH] Continue work on prob and statistics

---
 .../04-stats-and-probability/README.md | 78 +++++-
 .../04-stats-and-probability/notebook.ipynb | 253 +++++++++++++++++-
 2 files changed, 321 insertions(+), 10 deletions(-)

diff --git a/1-Introduction/04-stats-and-probability/README.md b/1-Introduction/04-stats-and-probability/README.md
index 504b6c0f..f2fa0c22 100644
--- a/1-Introduction/04-stats-and-probability/README.md
+++ b/1-Introduction/04-stats-and-probability/README.md
@@ -51,7 +51,7 @@ To help us understand the distribution of data, it is helpful to talk about **qu
 
 Graphically we can represent relationship between median and quartiles in a diagram called the **box plot**:
 
-![Box Plot](images/boxplot_explanation.png)
+
 
 Here we also computer **inter-quartile range** IQR=Q3-Q1, and so-called **outliers** - values, that lie outside the boundaries [Q1-1.5*IQR,Q3+1.5*IQR].
 
@@ -92,6 +92,82 @@ If we plot the histogram of the generated samples we will see the picture very s
 
 *Normal Distribution with mean=0 and std.dev=1*
 
+## Confidence Intervals
+
+When we talk about weights of baseball players, we assume that there is a certain **random variable W** that corresponds to the ideal probability distribution of weights of all baseball players, the so-called **population**. Our sequence of weights corresponds to a subset of all baseball players that we call a **sample**. An interesting question is, can we know the parameters of the distribution of W, i.e. mean and variance?
+
+The easiest answer would be to calculate the mean and variance of our sample. However, it could happen that our random sample does not accurately represent the complete population. Thus it makes sense to talk about a **confidence interval**.
+
+Suppose we have a sample X1, ..., Xn from our distribution. Each time we draw a sample from our distribution, we would end up with a different mean value μ. 
Thus μ can be considered to be a random variable. A **confidence interval** with confidence p is a pair of values (Lp,Rp), such that **P**(Lp≤μ≤Rp) = p, i.e. the probability of the measured mean value falling within the interval equals p.
+
+It goes beyond our short intro to discuss how those confidence intervals are calculated. Some more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). An example of calculating confidence intervals for weights and heights is given in the [accompanying notebook](notebook.ipynb).
+
+| p | Weight mean |
+|-----|-----------|
+| 0.85 | 201.73±0.94 |
+| 0.90 | 201.73±1.08 |
+| 0.95 | 201.73±1.28 |
+
+Notice that the higher the confidence probability, the wider the confidence interval.
+
+## Hypothesis Testing
+
+In our baseball players dataset, there are different player roles, which are summarized below (look at the [accompanying notebook](notebook.ipynb) to see how this table can be calculated):
+
+| Role | Mean Height | Mean Weight | Count |
+|------|--------|--------|-------|
+| Catcher | 72.723684 | 204.328947 | 76 |
+| Designated_Hitter | 74.222222 | 220.888889 | 18 |
+| First_Baseman | 74.000000 | 213.109091 | 55 |
+| Outfielder | 73.010309 | 199.113402 | 194 |
+| Relief_Pitcher | 74.374603 | 203.517460 | 315 |
+| Second_Baseman | 71.362069 | 184.344828 | 58 |
+| Shortstop | 71.903846 | 182.923077 | 52 |
+| Starting_Pitcher | 74.719457 | 205.163636 | 221 |
+| Third_Baseman | 73.044444 | 200.955556 | 45 |
+
+We can notice that the mean height of first basemen is higher than that of second basemen. Thus, we may be tempted to conclude that **first basemen are taller than second basemen**.
+
+> This statement is called **a hypothesis**, because we do not know whether the fact is actually true or not.
+
+However, it is not always obvious whether we can make this conclusion. 
From the discussion above we know that each mean has an associated confidence interval, and thus this difference can just be a statistical error. We need some more formal way to test our hypothesis.
+
+Let's compute confidence intervals separately for heights of first and second basemen:
+
+| Confidence | First Basemen | Second Basemen |
+|------------|---------------|----------------|
+| 0.85 | 73.62..74.38 | 71.04..71.69 |
+| 0.90 | 73.56..74.44 | 70.99..71.73 |
+| 0.95 | 73.47..74.53 | 70.92..71.81 |
+
+We can see that the intervals do not overlap at any of these confidence levels. This supports our hypothesis that first basemen are taller than second basemen.
+
+More formally, the problem we are solving is to see if **two probability distributions are the same**, or at least have the same parameters. Depending on the distribution, we need to use different tests for that. If we know that our distributions are normal, we can apply the **[Student t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)**.
+
+In the Student t-test, we compute the so-called **t-value**, which indicates the difference between means, taking into account the variance. It can be shown that the t-value follows the **Student distribution**, which allows us to get the threshold value for a given confidence level **p** (this can be computed, or looked up in numerical tables). We then compare the t-value to this threshold to accept or reject the hypothesis.
+
+In Python, we can use the **SciPy** package, which includes the `ttest_ind` function (in addition to many other useful statistical functions!). It computes the t-value for us, and also does the reverse lookup of the corresponding p-value, so that we can just look at the p-value to draw the conclusion. 
+
+For example, our comparison between heights of first and second basemen gives us the following results:
+```python
+from scipy.stats import ttest_ind
+
+tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)
+print(f"T-value = {tval[0]:.2f}\nP-value: {pval[0]}")
+```
+```
+T-value = 7.65
+P-value: 9.137321189738925e-12
+```
+In our case, the p-value is very low, meaning that there is strong evidence supporting that first basemen are taller.
+
+> **Challenge**: Use the sample code in the notebook to test the hypotheses that: (1) first basemen are older than second basemen; (2) first basemen are taller than third basemen; (3) shortstops are taller than second basemen
+
+
+There are different types of hypotheses that we might want to test, for example:
+* To prove that a given sample follows some distribution. In our case we have assumed that heights are normally distributed, but that needs formal statistical verification. 
+* To prove that a mean value of a sample corresponds to some predefined value
+* To prove that
 
 ## Law of Large Numbers and Central Limit Theorem
 
diff --git a/1-Introduction/04-stats-and-probability/notebook.ipynb b/1-Introduction/04-stats-and-probability/notebook.ipynb
index c94d8b3a..8b83ab72 100644
--- a/1-Introduction/04-stats-and-probability/notebook.ipynb
+++ b/1-Introduction/04-stats-and-probability/notebook.ipynb
@@ -93,9 +93,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 168,
    "source": [
-    "df = pd.read_csv(\"../../data/SOCR_MLB.tsv\",sep='\\t',header=None,names=['Name','Team','Rome','Height','Weight','Age'])\r\n",
+    "df = pd.read_csv(\"../../data/SOCR_MLB.tsv\",sep='\\t',header=None,names=['Name','Team','Role','Height','Weight','Age'])\r\n",
     "df"
    ],
    "outputs": [
@@ -103,7 +103,7 @@
      "output_type": "execute_result",
      "data": {
       "text/plain": [
-       " Name Team Rome Height Weight Age\n",
+       " Name Team Role Height Weight Age\n",
        "0 Adam_Donachie BAL Catcher 74 180.0 22.99\n",
        "1 Paul_Bako BAL Catcher 74 215.0 34.69\n",
        "2 Ramon_Hernandez BAL Catcher 72 210.0 30.78\n",
@@ -139,7 +139,7 @@
        " \n",
        " Name\n",
        " Team\n",
-       " Rome\n",
+       " Role\n",
        " Height\n",
        " Weight\n",
        " Age\n",
@@ -252,7 +252,7 @@
       ]
      },
      "metadata": {},
-     "execution_count": 26
+     "execution_count": 168
     }
    ],
    "metadata": {}
@@ -298,10 +298,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 148,
+   "execution_count": 158,
    "source": [
     "plt.figure(figsize=(10,2))\r\n",
-    "plt.boxplot(df['Height'],vert=False)\r\n",
+    "plt.boxplot(df['Height'],vert=False,showmeans=True)\r\n",
     "plt.grid(color='gray',linestyle='dotted')\r\n",
     "plt.show()"
    ],
@@ -312,8 +312,8 @@
       "text/plain": [
        ""
       ],
-      "image/svg+xml": "[SVG box plot of df['Height']; markup stripped. 2021-08-17T12:38:47.369488, image/svg+xml, Matplotlib v3.4.2, https://matplotlib.org/]",
-      "image/png": ""
+      "image/svg+xml": "[SVG box plot of df['Height']; markup stripped. 2021-08-17T13:50:40.404536, image/svg+xml, Matplotlib v3.4.2, https://matplotlib.org/]",
+      "image/png": ""
      },
      "metadata": {}
     }
@@ -499,6 +499,241 @@
    ],
    "metadata": {}
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Confidence Intervals\r\n",
+    "\r\n",
+    "Let's now calculate confidence intervals for the weights and heights of baseball players. 
We will use the code [from this Stack Overflow discussion](https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data):"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 181,
+   "source": [
+    "import scipy.stats\r\n",
+    "\r\n",
+    "def mean_confidence_interval(data, confidence=0.95):\r\n",
+    "    a = 1.0 * np.array(data)\r\n",
+    "    n = len(a)\r\n",
+    "    m, se = np.mean(a), scipy.stats.sem(a)\r\n",
+    "    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)\r\n",
+    "    return m, h\r\n",
+    "\r\n",
+    "for p in [0.85, 0.9, 0.95]:\r\n",
+    "    m, h = mean_confidence_interval(df['Weight'].fillna(method='pad'),p)\r\n",
+    "    print(f\"p={p:.2f}, mean = {m:.2f}±{h:.2f}\")"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "p=0.85, mean = 201.73±0.94\n",
+      "p=0.90, mean = 201.73±1.08\n",
+      "p=0.95, mean = 201.73±1.28\n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Hypothesis Testing\r\n",
+    "\r\n",
+    "Let's explore different roles in our baseball players dataset:"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 175,
+   "source": [
+    "df.groupby('Role').agg({ 'Height' : 'mean', 'Weight' : 'mean', 'Age' : 'count'}).rename(columns={ 'Age' : 'Count'})"
+   ],
+   "outputs": [
+    {
+     "output_type": "execute_result",
+     "data": {
+      "text/plain": [
+       " Height Weight Count\n",
+       "Role \n",
+       "Catcher 72.723684 204.328947 76\n",
+       "Designated_Hitter 74.222222 220.888889 18\n",
+       "First_Baseman 74.000000 213.109091 55\n",
+       "Outfielder 73.010309 199.113402 194\n",
+       "Relief_Pitcher 74.374603 203.517460 315\n",
+       "Second_Baseman 71.362069 184.344828 58\n",
+       "Shortstop 71.903846 182.923077 52\n",
+       "Starting_Pitcher 74.719457 205.163636 221\n",
+       "Third_Baseman 73.044444 200.955556 45"
+      ],
+      "text/html": [
+       "[HTML table; markup stripped]\n",
+       " Height Weight Count\n",
+       "Role\n",
+       "Catcher 72.723684 204.328947 76\n",
+       "Designated_Hitter 74.222222 220.888889 18\n",
+       "First_Baseman 74.000000 213.109091 55\n",
+       "Outfielder 73.010309 199.113402 194\n",
+       "Relief_Pitcher 74.374603 203.517460 315\n",
+       "Second_Baseman 71.362069 184.344828 58\n",
+       "Shortstop 71.903846 182.923077 52\n",
+       "Starting_Pitcher 74.719457 205.163636 221\n",
+       "Third_Baseman 73.044444 200.955556 45\n",
+       ""
+      ]
+     },
+     "metadata": {},
+     "execution_count": 175
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's test the hypothesis that First Basemen are taller than Second Basemen. The simplest way to do it is to compare the confidence intervals:"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 188,
+   "source": [
+    "for p in [0.85,0.9,0.95]:\r\n",
+    "    m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)\r\n",
+    "    m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)\r\n",
+    "    print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "Conf=0.85, 1st basemen height: 73.62..74.38, 2nd basemen height: 71.04..71.69\n",
+      "Conf=0.90, 1st basemen height: 73.56..74.44, 2nd basemen height: 70.99..71.73\n",
+      "Conf=0.95, 1st basemen height: 73.47..74.53, 2nd basemen height: 70.92..71.81\n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We can see that the intervals do not overlap.\r\n",
+    "\r\n",
+    "A more statistically correct way to test the hypothesis is to use the **Student t-test**:"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 200,
+   "source": [
+    "from scipy.stats import ttest_ind\r\n",
+    "\r\n",
+    "tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)\r\n",
+    "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "T-value = 7.65\n",
+      "P-value: 9.137321189738925e-12\n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "The two values returned by the `ttest_ind` function are:\r\n",
+    "* p-value is the probability of observing a difference in means at least this large if the two distributions actually had the same mean. In our case, it is very low, meaning that there is strong evidence supporting that first basemen are taller\r\n",
+    "* t-value is the intermediate value of the normalized mean difference that is used in the t-test, and it is compared against a threshold value for a given confidence value "
+   ],
+   "metadata": {}
+  },
   {
    "cell_type": "markdown",
    "source": [