diff --git a/1-Introduction/04-stats-and-probability/README.md b/1-Introduction/04-stats-and-probability/README.md
index f2fa0c2..83e3959 100644
--- a/1-Introduction/04-stats-and-probability/README.md
+++ b/1-Introduction/04-stats-and-probability/README.md
@@ -55,7 +55,7 @@ Graphically we can represent relationship between median and quartiles in a diag
Here we also computer **inter-quartile range** IQR=Q3-Q1, and so-called **outliers** - values, that lie outside the boundaries [Q1-1.5*IQR,Q3+1.5*IQR].
-For finite distribution that contains small number of possible values, a good "typical" value is the one that appears the most frequently, which is called **mode**. It is often applied to categorical data, such as colors. Consider a situation when we have two groups of people - some that strongly prefer red, and others who prefer blue. If we code colors by numbers, the mean value for a favourite color would be somewhere in the orange-green spectrum, which does not indicate the actual preference on neither group. However, the mode would be either one of the colors, or both colors, if the number of people voting for them is equal (in this case we call the sample **multimodal**).
+For a finite distribution that contains a small number of possible values, a good "typical" value is the one that appears most frequently, which is called the **mode**. It is often applied to categorical data, such as colors. Consider a situation where we have two groups of people: some who strongly prefer red, and others who prefer blue. If we code colors by numbers, the mean value for a favorite color would be somewhere in the orange-green spectrum, which does not indicate the actual preference of either group. However, the mode would be either one of the colors, or both colors if the number of people voting for them is equal (in this case we call the sample **multimodal**).
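
The idea of a mode for categorical data can be sketched with Python's standard `statistics` module. The survey answers below are hypothetical and exist only for illustration; `multimode` returns every value that ties for the highest count, which corresponds to the multimodal case.

```python
import statistics

# Hypothetical survey answers, for illustration only
favorite_colors = ["red", "blue", "red", "blue", "red"]

# A single most frequent value
single_mode = statistics.mode(favorite_colors)
print(f"Mode: {single_mode}")

# Equal number of votes for both colors: the sample is multimodal
tied_colors = ["red", "blue", "blue", "red"]
all_modes = statistics.multimode(tied_colors)
print(f"Modes of the tied sample: {all_modes}")
```

Averaging numeric codes for these colors would produce a value that matches neither answer, which is why the mode is the more meaningful summary here.
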
## Real-world Data
When we analyze data from real life, they often are not random variables as such, in a sense that we do not perform experiments with unknown result. For example, consider a team of baseball players, and their body data, such as height, weight and age. Those numbers are not exactly random, but we can still apply the same mathematical concepts. For example, a sequence of people's weights can be considered to be a sequence of values drawn from some random variable. Below is the sequence of weights of actual baseball players from [Major League Baseball](http://mlb.mlb.com/index.jsp), taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (for your convenience, only first 20 values are shown):
@@ -64,6 +64,8 @@ When we analyze data from real life, they often are not random variables as such
[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]
```
+> **Note**: To see an example of working with this dataset, have a look at the [accompanying notebook](notebook.ipynb). There are also a number of challenges throughout this lesson, and you may complete them by adding some code to that notebook. If you are not sure how to operate on data, do not worry: we will come back to working with data using Python at a later time.
+
Here is the box plot showing mean, median and quartiles for our data:
![Weight Box Plot](images/weight-boxplot.png)
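
As a rough sketch of how the quantities behind such a box plot can be obtained, the snippet below computes the quartiles, the inter-quartile range and the outlier boundaries for the 20 weights listed above, using only Python's standard `statistics` module. The `inclusive` quantile method is an assumption; other methods, and the plotting library used for the picture, may give slightly different quartile values.

```python
import statistics

# First 20 player weights quoted earlier in this lesson
weights = [180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
           188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]

# Quartiles Q1, Q2 (median) and Q3; the method choice is an assumption
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

# Boundaries [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; values outside them are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [w for w in weights if w < lower or w > upper]

print(f"Mean = {statistics.mean(weights):.2f}, Median = {q2}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Outlier boundaries = [{lower}, {upper}], outliers = {outliers}")
```
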
@@ -94,13 +96,23 @@ If we plot the histogram of the generated samples we will see the picture very s
## Confidence Intervals
-When we talk about weights of baseball players, we assume that there is certain **random variable W** that corresponds to ideal probability distribution of weights of all baseball players. Our sequence of weights corresponds to a subset of all baseball players that we call **population**. An interesting question is, can we know the parameters of distribution of W, i.e. mean and variance?
+When we talk about weights of baseball players, we assume that there is a certain **random variable W** that corresponds to the ideal probability distribution of weights of all baseball players (the so-called **population**). Our sequence of weights corresponds to a subset of all baseball players that we call a **sample**. An interesting question is: can we know the parameters of the distribution of W, i.e. the mean and variance of the population?
+
+The easiest answer would be to calculate the mean and variance of our sample. However, it could happen that our random sample does not accurately represent the complete population. Thus it makes sense to talk about a **confidence interval**.
-The easiest answer would be to calculate mean and variance of our sample. However, it could happen that our random sample does not accurately represent complete population. Thus it makes sense to talk about **confidence interval**.
+> **Confidence interval** is the estimation of the true mean of the population given our sample, which is accurate with a certain probability (or **level of confidence**).
Suppose we have a sample X1, ..., Xn from our distribution. Each time we draw a sample from our distribution, we would end up with different mean value μ. Thus μ can be considered to be a random variable. A **confidence interval** with confidence p is a pair of values (Lp,Rp), such that **P**(Lp≤μ≤Rp) = p, i.e. a probability of measured mean value falling within the interval equals to p.
-It does beyond our short intro to discuss how those confidence intervals are calculated. Some more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). An example of calculating confidence interval for weights and heights is given in the [accompanying notebooks](notebook.ipynb).
+It goes beyond our short intro to discuss in detail how those confidence intervals are calculated. Some more details can be found [on Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval). In short, we define the distribution of the computed sample mean relative to the true mean of the population, which is called the **Student distribution**.
+
+> **Interesting fact**: The Student distribution is named after the mathematician William Sealy Gosset, who published his paper under the pseudonym "Student". He worked at the Guinness brewery and, according to one version of the story, his employer did not want the general public to know that they were using statistical tests to determine the quality of raw materials.
+
+If we want to estimate the mean μ of our population with confidence p, we need to take the *(1-p)/2-th percentile* of a Student distribution, A, which can either be taken from tables or computed using built-in functions of statistical software (e.g. Python, R, etc.). Then the interval for μ would be given by X±A*D/√n, where X is the obtained mean of the sample, D is the standard deviation of the sample, and n is the sample size.
+
+> **Note**: We also omit the discussion of the important concept of [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)), which plays a key role in the Student distribution. You can refer to more complete books on statistics to understand this concept more deeply.
+
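
The formula X±A*D/√n can be illustrated with a minimal sketch that uses only the Python standard library and the 20 weights shown earlier. The critical value is an assumption: it is the two-sided value for p = 0.95 and 19 degrees of freedom, read from a standard Student distribution table rather than computed, and the constant name `T_CRIT_95_DF19` is introduced here purely for illustration.

```python
import math
import statistics

# First 20 player weights quoted earlier in this lesson
weights = [180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
           188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]

n = len(weights)
mean = statistics.mean(weights)   # X in the formula
std = statistics.stdev(weights)   # D: sample standard deviation (n - 1 in the denominator)

# Assumption: table value of the Student distribution for p = 0.95
# and n - 1 = 19 degrees of freedom (two-sided)
T_CRIT_95_DF19 = 2.093

# Half-width of the interval: A * D / sqrt(n)
half_width = T_CRIT_95_DF19 * std / math.sqrt(n)
low, high = mean - half_width, mean + half_width

print(f"p = 0.95: mean = {mean:.2f} ± {half_width:.2f}, interval = ({low:.2f}, {high:.2f})")
```

A higher confidence level needs a larger table value and therefore yields a wider interval, while a larger sample shrinks the interval through the √n term.
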
+An example of calculating confidence interval for weights and heights is given in the [accompanying notebooks](notebook.ipynb).
| p | Weight mean |
|-----|-----------|
@@ -163,11 +175,10 @@ In our case, p-value is very low, meaning that there is strong evidence supporti
> **Challenge**: Use the sample code in the notebook to test other hypothesis that: (1) First basemen and older that second basemen; (2) First basemen and taller than third basemen; (3) Shortstops are taller than second basemen
-
-There are different types of hypothesis that we might want to test, for example:
+There are also other types of hypotheses that we might want to test, for example:
* To prove that a given sample follows some distribution. In our case we have assumed that heights are normally distributed, but that needs formal statistical verification.
* To prove that a mean value of a sample corresponds to some predefined value
-* To prove that
+* To compare means of a number of samples (e.g. what is the difference in happiness levels among different age groups)
## Law of Large Numbers and Central Limit Theorem
@@ -205,9 +216,6 @@ In our case, the value 0.53 indicates that there is some correlation between wei
> More examples of correlation and covariance can be found in [accompanying notebook](notebook.ipynb).
-
-
-
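
As a small illustration of covariance and correlation, the sketch below computes both for a hypothetical set of heights and weights using only the Python standard library. The numbers are invented for the example and are not taken from the baseball dataset, and the helper names `covariance` and `correlation` are introduced here for illustration.

```python
import statistics

# Hypothetical heights (inches) and weights (pounds), for illustration only
heights = [70, 72, 74, 75, 73, 71, 76, 69]
weights = [175, 190, 210, 215, 200, 180, 230, 170]

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance normalized by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

r = correlation(heights, weights)
print(f"Covariance = {covariance(heights, weights):.2f}")
print(f"Correlation = {r:.2f}")
```

Because correlation is normalized, it can be compared across pairs of variables measured in different units, which is not true of raw covariance.
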
## 🚀 Challenge
@@ -217,7 +225,12 @@ In our case, the value 0.53 indicates that there is some correlation between wei
## Review & Self Study
+Probability and statistics is such a broad topic that it deserves its own course. If you are interested in going deeper into the theory, you may want to continue reading some of the following books:
+
+1. [Carlos Fernandez-Granda](https://cims.nyu.edu/~cfgranda/) from New York University has great lecture notes, [Probability and Statistics for Data Science](https://cims.nyu.edu/~cfgranda/pages/stuff/probability_stats_for_DS.pdf) (available online)
+1. [Peter and Andrew Bruce. Practical Statistics for Data Scientists.](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/) [[sample code in R](https://github.com/andrewgbruce/statistics-for-data-scientists)].
+1. [James D. Miller. Statistics for Data Science](https://www.packtpub.com/product/statistics-for-data-science/9781788290678) [[sample code in R](https://github.com/PacktPublishing/Statistics-for-Data-Science)]
## Assignment
-[Assignment Title](assignment.md)
+[Small Diabetes Study](assignment.md)
diff --git a/1-Introduction/04-stats-and-probability/assignment.ipynb b/1-Introduction/04-stats-and-probability/assignment.ipynb
new file mode 100644
index 0000000..a6f8147
--- /dev/null
+++ b/1-Introduction/04-stats-and-probability/assignment.ipynb
@@ -0,0 +1,252 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Introduction to Probability and Statistics\r\n",
+ "## Assignment\r\n",
+ "\r\n",
+ "In this assignment, we will use the dataset of diabetes patients taken [from here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "source": [
+ "import pandas as pd\r\n",
+ "import numpy as np\r\n",
+ "\r\n",
+ "df = pd.read_csv(\"../../data/diabetes.tsv\",sep='\\t')\r\n",
+ "df.head()"
+ ],
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y\n",
+ "0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151\n",
+ "1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75\n",
+ "2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141\n",
+ "3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206\n",
+ "4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 13
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\r\n",
+ "In this dataset, the columns are as follows:\r\n",
+ "* Age and sex are self-explanatory\r\n",
+ "* BMI is body mass index\r\n",
+ "* BP is average blood pressure\r\n",
+ "* S1 through S6 are different blood measurements\r\n",
+ "* Y is the quantitative measure of disease progression over one year\r\n",
+ "\r\n",
+ "Let's study this dataset using methods of probability and statistics.\r\n",
+ "\r\n",
+ "### Task 1: Compute mean values and variance for all values"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "source": [],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Task 2: Plot boxplots for BMI, BP and Y depending on gender"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "source": [],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Task 3: What is the distribution of Age, Sex, BMI and Y variables?"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "source": [],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Task 4: Test the correlation between different variables and disease progression (Y)\r\n",
+ "\r\n",
+ "> **Hint** Correlation matrix would give you the most useful information on which values are dependent."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Task 5: Test the hypothesis that the degree of diabetes progression is different between men and women"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "orig_nbformat": 4,
+ "language_info": {
+ "name": "python",
+ "version": "3.8.8",
+ "mimetype": "text/x-python",
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "pygments_lexer": "ipython3",
+ "nbconvert_exporter": "python",
+ "file_extension": ".py"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3.8.8 64-bit (conda)"
+ },
+ "interpreter": {
+ "hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
diff --git a/1-Introduction/04-stats-and-probability/assignment.md b/1-Introduction/04-stats-and-probability/assignment.md
index b7af641..08ac35a 100644
--- a/1-Introduction/04-stats-and-probability/assignment.md
+++ b/1-Introduction/04-stats-and-probability/assignment.md
@@ -1,8 +1,25 @@
-# Title
+# Small Diabetes Study
+
+In this assignment, we will work with a small dataset of diabetes patients taken from [here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).
+
+| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
+|---|-----|-----|-----|----|----|----|----|----|----|----|----|
+| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
+| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
+| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
+| ... | ... | ... | ... | ...| ...| ...| ...| ...| ...| ...| ... |
## Instructions
+* Open the [assignment notebook](assignment.ipynb) in a Jupyter notebook environment
+* Complete all tasks listed in the notebook, namely:
+  * [ ] Compute mean values and variance for all values
+  * [ ] Plot boxplots for BMI, BP and Y depending on gender
+  * [ ] What is the distribution of Age, Sex, BMI and Y variables?
+  * [ ] Test the correlation between different variables and disease progression (Y)
+  * [ ] Test the hypothesis that the degree of diabetes progression is different between men and women
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
+All required tasks are complete, graphically illustrated and explained | Most of the tasks are complete, explanations or takeaways from graphs and/or obtained values are missing | Only basic tasks such as computation of mean/variance and basic plots are complete, no conclusions are made from the data
\ No newline at end of file
diff --git a/1-Introduction/04-stats-and-probability/notebook.ipynb b/1-Introduction/04-stats-and-probability/notebook.ipynb
index 8b83ab7..5ecac5d 100644
--- a/1-Introduction/04-stats-and-probability/notebook.ipynb
+++ b/1-Introduction/04-stats-and-probability/notebook.ipynb
@@ -11,7 +11,7 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 212,
"source": [
"import numpy as np\r\n",
"import pandas as pd\r\n",
@@ -33,7 +33,7 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 213,
"source": [
"sample = [ random.randint(0,10) for _ in range(30) ]\r\n",
"print(f\"Sample: {sample}\")\r\n",
@@ -45,9 +45,9 @@
"output_type": "stream",
"name": "stdout",
"text": [
- "Sample: [4, 6, 3, 0, 3, 4, 7, 7, 9, 6, 8, 2, 0, 3, 10, 7, 2, 0, 2, 1, 1, 6, 5, 0, 9, 0, 1, 8, 2, 9]\n",
- "Mean = 4.166666666666667\n",
- "Variance = 10.272222222222222\n"
+ "Sample: [1, 1, 0, 5, 6, 3, 7, 5, 1, 6, 5, 6, 7, 0, 3, 6, 2, 4, 2, 8, 1, 5, 7, 10, 8, 5, 7, 10, 6, 8]\n",
+ "Mean = 4.833333333333333\n",
+ "Variance = 7.938888888888889\n"
]
}
],
@@ -62,7 +62,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 214,
"source": [
"plt.hist(sample)\r\n",
"plt.show()"
@@ -74,8 +74,8 @@
"text/plain": [
"
"
],
- "image/svg+xml": "\r\n\r\n\r\n",
+ "image/svg+xml": "\r\n\r\n\r\n",
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAy0AAADFCAYAAABZ7x10AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAlCUlEQVR4nO3de3RV5Z3/8c+5JEEwwE9YLgwBRqGtFhkVZXTUepkqVP2Nsux1ZhRdXkYHrS22oDPjGvBSZLXViNgYEZZOG+soLVKGdkRsQRFwpEQcQZGr5Y6L8WcSRM5tf39/MMmYEG4m373PPr5fa2UlJseV57w5Z+88eZ59kjAzEwAAAAAUqWTUAwAAAACAQ2HSAgAAAKCoMWkBAAAAUNSYtAAAAAAoakxaAAAAABQ1Ji0AAAAAihqTFgAAAABFLR32NwyCQNu3b1dlZaUSiUTY3x4AAABAkTAzNTc3q6qqSsnkwddTQp+0bN++XQMGDAj72wIAAAAoUlu2bFF1dfVBvx76pKWyslLS/oH17Nkz7G8fa4VCQW+//baGDRumVCoV9XBKDn190dcXfX3R1xd9fdHXF307p6mpSQMGDGidIxxMwswspDFJ2j+wXr16qbGxkUkLAAAA8Dl2pHMDLsSPkWw2q5kzZyqbzUY9lJJEX1/09UVfX/T1RV9f9PVF33AwaYmRVCqlc845h6VHJ/T1RV9f9PVFX1/09UVfX/QNB9vDAAAAAESC7WElKJvNqra2luVHJ/T1RV9f9PVFX1/09UVfX/QNB5OWGEmn0xo5cqTS6dBf9O1zgb6+6OuLvr7o64u+vujri77hYHsYAAAAgEiwPawEZTIZPfzww8pkMlEPpSTR1xd9fdHXF3190dcXfX3RNxystMRIEATatm2b+vfvr2SS+WZXo68v+vqiry/6+qKvL/r6om/nHOncgEkLAAAAgEiwPawEZTIZPfjggyw/OqGvL/r6oq8v+vqiry/6+qJvOFhpiZEgCLR792717duX5UcH9PVFX1/09UVfX/T1RV9f9O0ctocBAAAAKGpsDytBmUxG9957L8uPTujri76+6OuLvr7o64u+vugbDlZaYsTM1NzcrMrKSiUSiaiHU3Lo64u+vujri76+6OuLvr7o2zmstJSoioqKqIdQ0ujri76+6OuLvr7o64u+vujrj0lLjGSzWU2ZMkXZbDbqoZQk+vqiry/6+qKvL/r6oq8v+oaD7WExYmbKZrMqLy9n+dEBfX3R1xd9fdHXF3190dcXfTuH7WEliou8fNHXF3190dcXfX3R1xd9fdHXH5OWGMlms6qpqWH50Ql9fdHXF3190dcXfX3R1xd9w8H2MAAAAACRYHtYCQqCQB988IGCIIh6KCWJvr7o64u+vujri76+6OuLvuFg0hIjuVxOM2fOVC6Xi3ooJYm+vujri76+6OuLvr7o64u+4WB7GAAAAIBIsD2sBAVBoC1btrD86IS+vujri76+6OuLvr7o64u+4WDSEiO5XE6zZs1i+dEJfX3R1xd9fdHXF3190dcXfcPB9jAAAAAAkWB7WAkKgkDr169n+dEJfX3R1xd9fdHXF3190dcXfcPBpCVG8vm8XnrpJeXz+aiHUpLo64u+vujri76+6OuLvr7oGw62hwEAAACIBNvDSlChUNDq1atVKBSiHkpJoq8v+vqiry/6+qKvL/r6om84mLTESKFQ0Ouvv86Twgl9fdHXF3190dcXfX3R1xd9w8H2MAAAAACRYHtYCSoUCmpoaGAm74S+vujri76+6OuLvr7o64u+4WDSEiOFQkHvvPMOTwon9PVFX1/09UVfX/T1RV9f9A0H28MAAAAARILtYSUon89r2bJlvA64E/r6oq8v+vqiry/6+qKvL/qGg0lLjJiZtm7dqpAXxz436OuLvr7o64u+vujri76+6BsOtocBAAAAiATbw0pQPp/XokWLWH50Ql9f9PVFX1/09UVfX/T1Rd9wMGmJET
NTU1MTy49O6OuLvr7o64u+vujri76+6BsOtocBAAAAiATbw0pQPp/X/PnzWX50Ql9f9PVFX1/09UVfX/T1Rd9wMGkBAAAAUNTYHgYAAAAgEmwPK0G5XE5z585VLpeLeiglib6+6OuLvr7o64u+vujri77hYNISI4lEQj179lQikYh6KCWJvr7o64u+vujri76+6OuLvuFgexgAAACASLA9rATlcjnNmjWL5Ucn9PVFX1/09UVfX/T1RV9f9A0Hk5YYSSQSqq6uZvnRCX190dcXfX3R1xd9fdHXF33DwfYwAAAAAJFge1gJymazqq+vVzabjXooJYm+vujri76+6OuLvr7o64u+4WDSEiOpVEpf/vKXlUqloh5KSaKvL/r6oq8v+vqiry/6+qJvONgeBgAAACASbA8rQdlsVjNnzmT50Ql9fdHXF3190dcXfX3R1xd9w8GkJUZSqZTOOecclh+d0NcXfX3R1xd9fdHXF3190TccbA8DAAAAEAm2h5WgbDar2tpalh+d0NcXfX3R1xd9fdHXF3190TccTFpiJJ1Oa+TIkUqn01EPpSTR1xd9fdHXF3190dcXfX3RNxxsDwMAAAAQCbaHlaBMJqOHH35YmUwm6qGUJPr6oq8v+vqiry/6+qKvL/qGg5WWGAmCQNu2bVP//v2VTDLf7Gr09UVfX/T1RV9f9PVFX1/07ZwjnRswaQEAAAAQiSOdG3DFUIy0LD/eeeedqqioiHo4JYe+vjKZjP7pn/5J3/zmN1VeXh71cEpO/pMm/eH5J/RX37pF6WPi8QuhyspKfeELX4h6GEeE44Mv+vqiry/6hoOVlhgJgkC7d+9W3759WX50QF9f7733nk4++eSoh1GyzuiXVMMtx2r4E3v05s4g6uEcsbVr18Zi4sLxwRd9fdHXF307h5WWEpRMJnX88cdHPYySRV9fH3/8sSSpvr5ep5xySsSjKT3HfLRWevUWPfPMM/qk9xejHs5hvfvuu7rmmmvU3Nwc9VCOCMcHX/T1RV9f9A0Hk5YYyWQymjJliu6++26WHx3Q11fLH90aPHiwhg8fHvFoSk/2TwXp1f19ywfRt6txfPBFX1/09UXfcLCGFSPl5eUaN24c1wM4oa+vsrKyNu/RtcrK0m3eo2txfPBFX1/09UXfcDBpiRlm8L7oC+BgOD74oq8v+vqirz8mLTHy0Ucf6Y477tBHH30U9VBKUjab1ZQpU1q3MaFr5XK5Nu/RtXK5fJv36FocH3zFre/evXvV0NCgvXv3Rj2UIxK3vnFD33B8bict2WxWjzzyiL773e/qkUceicUDbePGjZo+fbo2btwY9VBKUnl5ue6++26Wd52wPcwX28N8cXzwFbe+a9as0Zlnnqk1a9ZEPZTDamxs1Fe/+lXNmDFDX/3qV9XY2Bj1kA4pbj+fZbNZ1dbWaufOnaqtrS368RYKBS1atEjPPvusFi1apEKhEPWQjtjnctIyYcIE9ejRQ+PGjdNjjz2mcePGqUePHpowYULUQ0PEMplM1EMAtGz7Ml015yot274s6qHgUzg++KJv1xsyZIh69+6tJUuWaOvWrVqyZIl69+6tIUOGRD20DsXt57OW8d55552qq6vTnXfeWdTjnT17toYMGaKLL75Yf/u3f6uLL75YQ4YM0ezZs6Me2hE56knLq6++qr/+679WVVWVEomE5syZ4zAsPxMmTNBPfvIT9enTR08++aR27NihJ598Un369NFPfvKTon2gSWyv8ZbNZlVTU1P0vyWJKx6/R8bMNLVhqjY2btTUhqk60j+lxfYwXxwffNG36w0ZMkQbNmyQJI0aNUo33HCDRo0aJUnasGFD0U1c4vbz2afHW1tbqx/84Aeqra0t2vHOnj1b3/jGNzRs2DAtW7ZMzc3NWrZsmYYNG6ZvfOMbsZi4HPUfl/yP//gPLVmyRMOHD9fXv/51vfDCCxo9evQR//9R/nHJbDarHj16qE+fPtq6davS6f/dRpHP51VdXa
3//u//1scff1yUS9QNDQ0688wztWLFCl4yFrHD4/fILNm2RLe+fGvrf9ddUqfz+p93+P9x+0pp+oXS378iVZ3uNr6uwuMBcVbsj9/Gxkb17t1b0v6/kdW9e/fWr+3du1c9evSQtP9a2V69ekUxxDbi9vNZ3MZbKBQ0ZMgQDRs2THPmzGnzBzCDINDo0aO1atUqrVu3TqlUKvTxuf1xycsuu0yXXXbZEd8+k8m0WfJtamqS9L+/bc3n9/9WMJ1OK5fLKZFIHPBxNptVKpVSKpU64ON0Oq1kMqlMJqOysrIOP255wEydOlX5fF7333+/CoWC0um0giBQLpdTRUWF7r33Xt16662qra3VHXfcoXw+r/LychUKBRUKhQM+zufzMjOVlZUd8LHHfdqzZ48k6a233mr97Wsul2u9RiCXy6m8vFxBELSOvf3HhUJBZWVlKhQKCoLggI8/Pfaj/TiXyymZTCqVSh3wcSqVUjKZbHP/2n/86fsRxX0yMzU3N+vYY49VMpksiftUTP9Oq1atkiTt2bNH2Ww28udTR8eIlnG1fFxRUdHmGNH+464+RmSzWU1rmKZkIqnAAiUTSU17c5rO6nuW0un0Ie/TvkxG3SQFZsoV0X062L9Tyx+VfPfdd2PxfMpms2psbFSfPn0UBEHkz6dSO0bk83l9+OGH6tu3r8ys6O9Ty7Usn3zySVE8n9p/fPnll0vav8LSrVs3BUGg7du3q1+/furevbsuvfRSLViwQFdccYV+//vfR36MmDZtmvL5vCZNmqREItE6llQqpXQ6rX/5l3/RbbfdptraWv3DP/xD5MfylvHee++9kvb/4L9z50717dtX5eXlmjhxosaOHava2lqNHTs28vPTH/7wB73//vuqr69v85xruU/jx4/XBRdcoMWLF+v8888P/Zx7xOsn1gmS7IUXXjjkbSZOnGiSDnh79tlnzczsxRdftBdffNHMzH7zm9/YwoULzczs+eeft6VLl5qZ2S9+8QtbsWKFmZnNmDHDVq1aZWZmP/vZz2zdunVmZvbQQw/Z5s2bzcxs8uTJtmvXLjMzmzRpkjU2Ntq+ffvsL/7iL0ySrV271iZNmmRmZrt27bLJkyebmdny5ctNkt1+++22bt06+9nPfmZmZqtWrbIZM2aYmdmKFSvsF7/4hZmZLV261J5//nkzM1u4cKH95je/cb1Pjz76aIcteeMtTm+TJ08uiudTR8eISZMm2b59+6yxsbHDY8TmzZvtoYceMjNzOUZMfnaynfr0qQe83TPznsPep5kP3GE2sadl3n+jqO7Twf6dbrrppsgfi7zx1tm3+vr6ong+tT/u9evXzyTZd7/7Xdu8eXPrsWDLli1mZnb99debJKuuri6KY8TXv/51k2TTpk3r8D796Ec/Mmn/z2fFcCy//PLLTZL98pe/tBdffNH27dtn9913ny1YsMDMzOrq6lrHWwznp5tvvtkk2eLFizu8T7/97W/b3J+wz7lbtmwxSdbY2GiHctTbwz4tkUgcdntYRystAwYM0O7du9WnT59Qf4v605/+VBMmTND06dM1ZsyYA2amTzzxhG699VbV1NQU5UrL8uXL9ZWvfEVPP/20Tj31VEn8do77FJ/7tHr1al1//fV69dVXdfbZZ0f+fCq2lZZUKqXvzPuO1vy/NQosaD1mJhNJnfx/TtYzlz3Tel87uk/ZPy1X+VOXSH//ijJ9TimK+3Sof6dXXnlFF110kerr6zVkyBCeT9ynWN2nNWvW6JprrtFrr72mv/zLv4z8+dT+44svvlhLly7VqFGj9Lvf/e6A48XIkSO1YMECnXfeeUWx0vLoo4/qhz/8oR5//HHdfPPNB9yn2tpa3XbbbaqpqSmKlZapU6fqhz/8oerq6nTjjTcecJ8ef/xxjR07VjU1NUWz0jJy5Ei99tprGjFixAH3afHixbrgggu0cOHCSFZampqa1Lt378NuD3OftLTHNS2f3R//+EeNGDFCy5cv11lnnR
X1cEpOEATatm2b+vfv32a/J7oGj99Da38tS3uHu7Yl2Pamkk9epODmRUr2P8NjiF2q2K8JaI/jg6+49S32x2/7a1q6devW2nffvn1c09JJ7cebTCZb+wZBUHTjLZVrWor/yNCFysvLNW7cOO3atUvV1dWaPn26tm/frunTp6u6ulq7du3SuHHjiuIB1pGW31K0vEfXyuVymjVrVuv1VuhaPH4Pzsw07c1pSijR4dcTSmjam9MOue+Xvr44Pviib9fq1auXBg8eLEnq0aOHvva1r2nKlCn62te+1jphGTx4cFFMWKT4/XzWfrx1dXWaMWOG6urqinK8qVRKDz30kObNm6fRo0e3efWw0aNHa968efrpT38ayYTlaHzu/grZj3/8Y0lSTU2NbrnlltbPp9NpjR8/vvXrxajlwV8sT4JSU1FRoTvvvDPqYZQsHr8Hlwty2vnxTpk6npSYTDs/3qlckFN5quN+5f+zfaXlPboWxwdf9O1669evb33Z4wULFmjBggWtXxs8eLDWr18f4egOFLefzz493ttuu63188U63quvvlq/+tWv9IMf/EDnnntu6+dPPPFE/epXv9LVV18d4eiOzFFPWvbs2dPmgb5p0yatXLlSxx13nAYOHNilg/Py4x//WA888IBqa2u1YcMGDR48WGPHji36H6ZaXrGm5T26VhAE2rhxo0466aRYbE+IGx6/B1eeKte//d9/04f7PjzobY7rdtxBJyzS/lcNS37qPboWxwdf9PWxfv16NTY26vLLL2/t+7vf/a5oVljai9vPZy3jfeyxx/Tmm2/qjDPO0O23316047366qt11VVXafHixdqxY4dOOOEEfeUrXyn6FZYWRz1p+eMf/6iLL7649b9bfjNy3XXX6emnn+6ygXkrLy/X97///aiHcVQKhUKb9+ha+XxeL730km666aaiPeDEGY/fQ+vXo5/69ej3mf//QiGv5Kfeo2txfPBFXz+9evXSwoULNWPGjFj0jdvPZ+Xl5br99ttj0zeVSumiiy6KehifSacuxP8sorwQP+727t2rNWvW6OSTT27zh6KAOCj2C1djjz8uCYSG8zHQdbgQvwRVVFS0vqHrFQoFrV69mpUAJ6y0+Cr8z7a7AtvvXHB88BW3vt27d9fw4cNjM2GJW9+4oW84mLTESKFQ0Ouvv86Twgl9fXFNi68gKLR5j67F8cEXfX3R1xd9w/G5e/WwOCsvL9eNN94Y9TBKFn19tfxxtjJe3cpFWbqszXt0LY4Pvujri76+6BsOJi0xUigU9NZbb+m0006LzSs9xAl9fTU3N0va/2Ie6HoVH67RUEmr33lHmZ3Fv5r17rvvRj2Eo8LxwRd9fdHXF33DwaQlRgqFgt555x2deuqpPCkc0NdXyw+pn379fXSdM/ol1XDLsbr22mv1ZgwmLS0qKyujHsIR4fjgi76+6OuLvuHg1cMAhGL37t2aM2cOr7bjJJHfp257NmvfsQNl6W5RD+eIVFZW6gtf+ELUwwAAROhI5wastMRIPp/X8uXLNWLECKXT/NN1Nfr66t27t4YOHUpfJ/sfvwmNGE5fDxwffNHXF3190TccvHpYjJiZtm7dqpAXxz436OuLvr7o64u+vujri76+6BsOtocBAAAAiAR/XLIE5fN5LVq0SPl8PuqhlCT6+qKvL/r6oq8v+vqiry/6hoNJS4yYmZqamlh+dEJfX/T1RV9f9PVFX1/09UXfcLA9DAAAAEAk2B5WgvL5vObPn8/yoxP6+qKvL/r6oq8v+vqiry/6hoNJCwAAAICixvYwAAAAAJFge1gJyuVymjt3rnK5XNRDKUn09UVfX/T1RV9f9PVFX1/0DQeTlhhJJBLq2bOnEolE1EMpSfT1RV9f9PVFX1/09UVfX/QNB9vDAAAAAESC7WElKJfLadasWSw/OqGvL/r6oq8v+vqiry/6+qJvOJi0xEgikVB1dTXLj07o64u+vujri76+6OuLvr7oGw62hwEAAA
CIBNvDSlA2m1V9fb2y2WzUQylJ9PVFX1/09UVfX/T1RV9f9A0Hk5YYSaVS+vKXv6xUKhX1UEoSfX3R1xd9fdHXF3190dcXfcPB9jAAAAAAkWB7WAnKZrOaOXMmy49O6OuLvr7o64u+vujri76+6BsOJi0xkkqldM4557D86IS+vujri76+6OuLvr7o64u+4WB7GAAAAIBIsD2sBGWzWdXW1rL86IS+vujri76+6OuLvr7o64u+4WDSEiPpdFojR45UOp2Oeiglib6+6OuLvr7o64u+vujri77hYHsYAAAAgEiwPawEZTIZPfzww8pkMlEPpSTR1xd9fdHXF3190dcXfX3RNxystMRIEATatm2b+vfvr2SS+WZXo68v+vqiry/6+qKvL/r6om/nHOncgEkLAAAAgEiwPawEZTIZPfjggyw/OqGvL/r6oq8v+vqiry/6+qJvOFhpiZEgCLR792717duX5UcH9PVFX1/09UVfX/T1RV9f9O0ctocBAAAAKGpsDytBmUxG9957L8uPTujri76+6OuLvr7o64u+vugbDlZaYsTM1NzcrMrKSiUSiaiHU3Lo64u+vujri76+6OuLvr7o2zmstJSoioqKqIdQ0ujri76+6OuLvr7o64u+vujrj0lLjGSzWU2ZMkXZbDbqoZQk+vqiry/6+qKvL/r6oq8v+oaD7WExYmbKZrMqLy9n+dEBfX3R1xd9fdHXF3190dcXfTuH7WEliou8fNHXF3190dcXfX3R1xd9fdHXH5OWGMlms6qpqWH50Ql9fdHXF3190dcXfX3R1xd9w8H2MAAAAACRYHtYCQqCQB988IGCIIh6KCWJvr7o64u+vujri76+6OuLvuFg0hIjuVxOM2fOVC6Xi3ooJYm+vujri76+6OuLvr7o64u+4WB7GAAAAIBIsD2sBAVBoC1btrD86IS+vujri76+6OuLvr7o64u+4WDSEiO5XE6zZs1i+dEJfX3R1xd9fdHXF3190dcXfcPB9jAAAAAAkWB7WAkKgkDr169n+dEJfX3R1xd9fdHXF3190dcXfcPBpCVG8vm8XnrpJeXz+aiHUpLo64u+vujri76+6OuLvr7oGw62hwEAAACIBNvDSlChUNDq1atVKBSiHkpJoq8v+vqiry/6+qKvL/r6om84mLTESKFQ0Ouvv86Twgl9fdHXF3190dcXfX3R1xd9w8H2MAAAAACRYHtYCSoUCmpoaGAm74S+vujri76+6OuLvr7o64u+4WDSEiOFQkHvvPMOTwon9PVFX1/09UVfX/T1RV9f9A0H28MAAAAARILtYSUon89r2bJlvA64E/r6oq8v+vqiry/6+qKvL/qGg0lLjJiZtm7dqpAXxz436OuLvr7o64u+vujri76+6BsOtocBAAAAiATbw0pQPp/XokWLWH50Ql9f9PVFX1/09UVfX/T1Rd9wMGmJETNTU1MTy49O6OuLvr7o64u+vujri76+6BsOtocBAAAAiATbw0pQPp/X/PnzWX50Ql9f9PVFX1/09UVfX/T1Rd9wMGkBAAAAUNTYHgYAAAAgEkc6N0iHOCZJar1IqampKexvHXu5XE4LFizQpZdeqrKysqiHU3Lo64u+vujri76+6OuLvr7o2zktc4LDraOEPmlpbm6WJA0YMCDsbw0AAACgCDU3N6tXr14H/Xro28OCIND27dtVWVmpRCIR5reOvaamJg0YMEBbtmxha50D+vqiry/6+qKvL/r6oq8v+naOmam5uVlVVVVKJg9+uX3oKy3JZFLV1dVhf9uS0rNnT54Ujujri76+6OuLvr7o64u+vuj72R1qhaUFrx4GAAAAoKgxaQEAAABQ1Ji0xEhFRYUmTpyoioqKqIdSkujri76+6OuLvr7o64u+vugbjtAvxAcAAACAo8FKCwAAAICixqQFAAAAQFFj0gIAAACgqDFpAQAAAFDUmLQAAAAAKGpMWorQtm3bdM0116hPnz7q3r27Tj/9dK1YsaL163
v27NHtt9+u6upqHXPMMTrllFP0+OOPRzji+PizP/szJRKJA95uu+02SZKZadKkSaqqqtIxxxyjiy66SKtXr4541PFxqL65XE533XWXhg0bph49eqiqqkpjxozR9u3box52bBzu8ftpt9xyixKJhB555JHwBxpTR9L33Xff1ZVXXqlevXqpsrJS55xzjjZv3hzhqOPjcH05t3VOPp/XPffcoxNPPFHHHHOMTjrpJN13330KgqD1NpzjPrvD9eUcFwJDUfnwww9t0KBBdv3119t//ud/2qZNm+zll1+29evXt97mpptussGDB9vChQtt06ZN9sQTT1gqlbI5c+ZEOPJ4+OCDD2zHjh2tbwsWLDBJtnDhQjMzmzJlilVWVtqvf/1re/vtt+3b3/62nXDCCdbU1BTtwGPiUH0/+ugju+SSS+y5556zNWvW2LJly+zss8+2M888M+phx8bhHr8tXnjhBTvttNOsqqrKampqIhlrHB2u7/r16+24446z8ePHW0NDg23YsMHmzZtnu3btinbgMXG4vpzbOueBBx6wPn362Lx582zTpk02a9YsO/bYY+2RRx5pvQ3nuM/ucH05x/lj0lJk7rrrLjv//PMPeZuhQ4fafffd1+Zzw4cPt3vuucdzaCXpe9/7ng0ePNiCILAgCKxfv342ZcqU1q/v27fPevXqZXV1dRGOMr4+3bcjb7zxhkmyP/3pTyGPrDR01Hfr1q3Wv39/W7VqlQ0aNIhJSye07/vtb3/brrnmmohHVTra9+Xc1jlXXHGF3XDDDW0+d/XVV7c+ZjnHdc7h+naEc1zXYntYkZk7d67OOussffOb39Txxx+vM844Q08++WSb25x//vmaO3eutm3bJjPTwoULtXbtWo0aNSqiUcdTNptVfX29brjhBiUSCW3atEk7d+7UyJEjW29TUVGhCy+8UEuXLo1wpPHUvm9HGhsblUgk1Lt373AHVwI66hsEga699lqNHz9eQ4cOjXiE8da+bxAE+u1vf6svfvGLGjVqlI4//nidffbZmjNnTtRDjaWOHr+c2zrn/PPP1+9//3utXbtWkvTWW2/ptdde0+WXXy5JnOM66XB9O8I5rotFPWtCWxUVFVZRUWH/+I//aA0NDVZXV2fdunWzf/3Xf229TSaTsTFjxpgkS6fTVl5ebj//+c8jHHU8Pffcc5ZKpWzbtm1mZrZkyRKT1PrfLW6++WYbOXJkFEOMtfZ92/vkk0/szDPPtL/7u78LeWSloaO+kydPtksvvbT1N9estHx27fvu2LHDJFn37t3t4YcftjfffNMefPBBSyQStmjRoohHGz8dPX45t3VOEAR29913WyKRsHQ6bYlEwiZPntz6dc5xnXO4vu1xjut66UhnTDhAEAQ666yzNHnyZEnSGWecodWrV+vxxx/XmDFjJEmPPvqoXn/9dc2dO1eDBg3Sq6++qrFjx+qEE07QJZdcEuXwY2XmzJm67LLLVFVV1ebz7VcFzOygKwU4uIP1lfZfsPid73xHQRCotrY2gtHFX/u+K1as0NSpU9XQ0MDjtQu079tyse1VV12lcePGSZJOP/10LV26VHV1dbrwwgsjG2scdXR84NzWOc8995zq6+v1y1/+UkOHDtXKlSv1/e9/X1VVVbruuutab8c57rM50r4S5zg3Uc+a0NbAgQPtxhtvbPO52tpaq6qqMjOzvXv3WllZmc2bN6/NbW688UYbNWpUaOOMu/fff9+SyWSbCzw3bNhgkqyhoaHNba+88kobM2ZM2EOMtY76tshmszZ69Gj78z//c9u9e3cEo4u/jvrW1NRYIpGwVCrV+ibJksmkDRo0KLrBxlBHfTOZjKXTabv//vvb3HbChAl27rnnhj3EWOuoL+e2zquurrbHHnuszefuv/9++9KXvmRmnOM663B9W3CO88M1LUXmvPPO03vvvdfmc2vXrtWgQYMk7Z+953I5JZNt/+lSqVSblzXEoT311FM6/vjjdcUVV7R+7s
QTT1S/fv20YMGC1s9ls1m98sorOvfcc6MYZmx11Ffa//j91re+pXXr1unll19Wnz59IhphvHXU99prr9V//dd/aeXKla1vVVVVGj9+vObPnx/haOOno77l5eUaMWLEIY/PODId9eXc1nl79+49ZD/OcZ1zuL4S5zh3Uc+a0NYbb7xh6XTafvSjH9m6devsmWeese7du1t9fX3rbS688EIbOnSoLVy40DZu3GhPPfWUdevWzWprayMceXwUCgUbOHCg3XXXXQd8bcqUKdarVy+bPXu2vf322/Y3f/M3vBzkUTpY31wuZ1deeaVVV1fbypUr27z0aSaTiWi08XOox297XNNy9A7Vd/bs2VZWVmbTp0+3devW2bRp0yyVStnixYsjGGk8Haov57bOue6666x///6tL8k7e/Zs69u3r02YMKH1NpzjPrvD9eUc549JSxH693//dzv11FOtoqLCTj75ZJs+fXqbr+/YscOuv/56q6qqsm7dutmXvvQle+ihhw76srJoa/78+SbJ3nvvvQO+FgSBTZw40fr162cVFRV2wQUX2Ntvvx3BKOPrYH03bdpkkjp8a/93RnBwh3r8tsek5egdru/MmTNtyJAh1q1bNzvttNP4GyJH6VB9Obd1TlNTk33ve9+zgQMHWrdu3eykk06yf/7nf27zAzPnuM/ucH05x/lLmJmFvrwDAAAAAEeIa1oAAAAAFDUmLQAAAACKGpMWAAAAAEWNSQsAAACAosakBQAAAEBRY9ICAAAAoKgxaQEAAABQ1Ji0AAAAAChqTFoAAAAAFDUmLQAAAACKGpMWAAAAAEXt/wNU43GrtGZYZQAAAABJRU5ErkJggg=="
},
"metadata": {}
@@ -323,18 +369,16 @@
{
"cell_type": "markdown",
"source": [
- "Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values: "
+ "We can also make box plots of subsets of our dataset, for example, grouped by player role."
],
"metadata": {}
},
{
"cell_type": "code",
- "execution_count": 40,
+ "execution_count": 210,
"source": [
- "df['Weight'].hist(bins=15)\r\n",
- "plt.suptitle('Weight distribution of MLB Players')\r\n",
- "plt.xlabel('Weight')\r\n",
- "plt.ylabel('Count')\r\n",
+ "df.boxplot(column='Height',by='Role')\r\n",
+ "plt.xticks(rotation='vertical')\r\n",
"plt.show()"
],
"outputs": [
@@ -344,8 +388,8 @@
"text/plain": [
"
"
],
- "image/svg+xml": "\r\n\r\n\r\n",
- "image/png": ""
+ "image/svg+xml": "\r\n\r\n\r\n",
+ "image/png": ""
},
"metadata": {}
}
@@ -353,40 +397,35 @@
"metadata": {}
},
{
- "cell_type": "code",
- "execution_count": 44,
+ "cell_type": "markdown",
"source": [
- "print(list(df['Weight'])[:20])"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]\n"
- ]
- }
+ "> **Note**: This diagram suggests that, on average, first basemen are taller than second basemen. Later we will learn how to test this hypothesis more formally, and how to demonstrate that the difference we observe in our data is statistically significant. \r\n",
+ "\r\n",
+ "Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values: "
],
"metadata": {}
},
{
"cell_type": "code",
- "execution_count": 49,
+ "execution_count": 211,
"source": [
- "mean = df['Weight'].mean()\r\n",
- "var = df['Weight'].var()\r\n",
- "std = df['Weight'].std()\r\n",
- "print(f\"Mean = {mean}\\nVariance = {var}\\nStandard Deviation = {std}\")"
+ "df['Weight'].hist(bins=15)\r\n",
+ "plt.suptitle('Weight distribution of MLB Players')\r\n",
+ "plt.xlabel('Weight')\r\n",
+ "plt.ylabel('Count')\r\n",
+ "plt.show()"
],
"outputs": [
{
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Mean = 201.6892545982575\n",
- "Variance = 440.6426848120547\n",
- "Standard Deviation = 20.991490771549664\n"
- ]
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/svg+xml": "\r\n\r\n\r\n",
+ "image/png": ""
+ },
+ "metadata": {}
}
],
"metadata": {}
@@ -1023,10 +1062,12 @@
"metadata": {}
},
{
- "cell_type": "code",
- "execution_count": null,
- "source": [],
- "outputs": [],
+ "cell_type": "markdown",
+ "source": [
+ "## Conclusion\r\n",
+ "\r\n",
+ "In this notebook, we have learned how to perform basic operations on data to compute statistical functions. We now know how to use the sound apparatus of math and statistics to test hypotheses, and how to compute confidence intervals for a random variable given a data sample. "
+ ],
"metadata": {}
}
],
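
Note on the assignment notebook added in the next file: Task 1 asks for the mean and variance of every column, while `df.describe()` reports the mean and the standard deviation (`std`), not the variance. A minimal sketch of one way to get both in a single table is shown here. It assumes a small synthetic DataFrame whose column names only mimic the diabetes data (the values are made up for illustration), and it relies on pandas' default sample variance (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diabetes table; values are invented for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AGE": rng.integers(19, 80, size=50),
    "BMI": rng.normal(26.0, 4.0, size=50),
    "BP": rng.normal(95.0, 14.0, size=50),
})

# One row per statistic, one column per variable.
# pandas uses the sample (ddof=1) definitions for both var and std.
summary = df.agg(["mean", "var", "std"])
print(summary)
```

With the real data, the same call on the loaded `df` would give the variance directly; squaring the `std` row of `df.describe()` should agree with it.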
diff --git a/1-Introduction/04-stats-and-probability/solution/assignment.ipynb b/1-Introduction/04-stats-and-probability/solution/assignment.ipynb
new file mode 100644
index 0000000..da16d87
--- /dev/null
+++ b/1-Introduction/04-stats-and-probability/solution/assignment.ipynb
@@ -0,0 +1,945 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Introduction to Probability and Statistics\r\n",
+ "## Assignment\r\n",
+ "\r\n",
+ "In this assignment, we will use the dataset of diabetes patients taken [from here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "source": [
+ "import pandas as pd\r\n",
+ "import numpy as np\r\n",
+ "import matplotlib.pyplot as plt\r\n",
+ "\r\n",
+ "df = pd.read_csv(\"../../../data/diabetes.tsv\",sep='\\t')\r\n",
+ "df.head()"
+ ],
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y\n",
+ "0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151\n",
+ "1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75\n",
+ "2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141\n",
+ "3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206\n",
+ "4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135"
+ ],
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th><th>AGE</th><th>SEX</th><th>BMI</th><th>BP</th><th>S1</th><th>S2</th><th>S3</th><th>S4</th><th>S5</th><th>S6</th><th>Y</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>59</td><td>2</td><td>32.1</td><td>101.0</td><td>157</td><td>93.2</td><td>38.0</td><td>4.0</td><td>4.8598</td><td>87</td><td>151</td></tr>\n",
+ "    <tr><th>1</th><td>48</td><td>1</td><td>21.6</td><td>87.0</td><td>183</td><td>103.2</td><td>70.0</td><td>3.0</td><td>3.8918</td><td>69</td><td>75</td></tr>\n",
+ "    <tr><th>2</th><td>72</td><td>2</td><td>30.5</td><td>93.0</td><td>156</td><td>93.6</td><td>41.0</td><td>4.0</td><td>4.6728</td><td>85</td><td>141</td></tr>\n",
+ "    <tr><th>3</th><td>24</td><td>1</td><td>25.3</td><td>84.0</td><td>198</td><td>131.4</td><td>40.0</td><td>5.0</td><td>4.8903</td><td>89</td><td>206</td></tr>\n",
+ "    <tr><th>4</th><td>50</td><td>1</td><td>23.0</td><td>101.0</td><td>192</td><td>125.4</td><td>52.0</td><td>4.0</td><td>4.2905</td><td>80</td><td>135</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 13
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\r\n",
+ "The columns in this dataset are as follows:\r\n",
+ "* Age and sex are self-explanatory\r\n",
+ "* BMI is body mass index\r\n",
+ "* BP is average blood pressure\r\n",
+ "* S1 through S6 are different blood measurements\r\n",
+ "* Y is the quantitative measure of disease progression over one year\r\n",
+ "\r\n",
+ "Let's study this dataset using methods of probability and statistics.\r\n",
+ "\r\n",
+ "### Task 1: Compute mean values and variance for all columns"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "source": [
+ "df.describe()"
+ ],
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " AGE SEX BMI BP S1 S2 \\\n",
+ "count 442.000000 442.000000 442.000000 442.000000 442.000000 442.000000 \n",
+ "mean 48.518100 1.468326 26.375792 94.647014 189.140271 115.439140 \n",
+ "std 13.109028 0.499561 4.418122 13.831283 34.608052 30.413081 \n",
+ "min 19.000000 1.000000 18.000000 62.000000 97.000000 41.600000 \n",
+ "25% 38.250000 1.000000 23.200000 84.000000 164.250000 96.050000 \n",
+ "50% 50.000000 1.000000 25.700000 93.000000 186.000000 113.000000 \n",
+ "75% 59.000000 2.000000 29.275000 105.000000 209.750000 134.500000 \n",
+ "max 79.000000 2.000000 42.200000 133.000000 301.000000 242.400000 \n",
+ "\n",
+ " S3 S4 S5 S6 Y \n",
+ "count 442.000000 442.000000 442.000000 442.000000 442.000000 \n",
+ "mean 49.788462 4.070249 4.641411 91.260181 152.133484 \n",
+ "std 12.934202 1.290450 0.522391 11.496335 77.093005 \n",
+ "min 22.000000 2.000000 3.258100 58.000000 25.000000 \n",
+ "25% 40.250000 3.000000 4.276700 83.250000 87.000000 \n",
+ "50% 48.000000 4.000000 4.620050 91.000000 140.500000 \n",
+ "75% 57.750000 5.000000 4.997200 98.000000 211.500000 \n",
+ "max 99.000000 9.090000 6.107000 124.000000 346.000000 "
+ ],
+ "text/html": [
+ "