diff --git a/1-Introduction/04-stats-and-probability/README.md b/1-Introduction/04-stats-and-probability/README.md
index 4ea65986..9f673ecc 100644
--- a/1-Introduction/04-stats-and-probability/README.md
+++ b/1-Introduction/04-stats-and-probability/README.md
@@ -34,7 +34,7 @@ Another important distribution is **normal distribution**, which we will talk ab
## Mean, Variance and Standard Deviation
-Suppose we draw a sequence of n samples of a random variable X: x1, x2, ..., xn. We can define **mean** (or **arithmetic average**) value of the sequence in the traditional way as (x1+x2+xn)/n. As we grow the size of the sample (i.e. take the limit with n→∞), we will obtain the mean (also called **expectation**) of the distribution.
+Suppose we draw a sequence of n samples of a random variable X: x1, x2, ..., xn. We can define the **mean** (or **arithmetic average**) value of the sequence in the traditional way as (x1+x2+...+xn)/n. As we grow the size of the sample (i.e. take the limit as n→∞), we obtain the mean (also called the **expectation**) of the distribution. We will denote the expectation by **E**(X).
> It can be demonstrated that for any discrete distribution with values {x1, x2, ..., xN} and corresponding probabilities p1, p2, ..., pN, the expectation would equal to E(X)=x1p1+x2p2+...+xNpN.
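+
+Both definitions can be checked numerically. Here is a minimal sketch (the distribution values and probabilities are made up for illustration):
+
+```python
+import numpy as np
+
+# A made-up discrete distribution: values and their probabilities
+values = np.array([1, 2, 3, 4])
+probs = np.array([0.1, 0.2, 0.3, 0.4])
+
+# Expectation by the formula E(X) = x1*p1 + x2*p2 + ... + xN*pN
+expectation = np.sum(values * probs)
+
+# The mean of a large sample drawn from the distribution approaches E(X)
+rng = np.random.default_rng(0)
+samples = rng.choice(values, size=100_000, p=probs)
+print(expectation, samples.mean())
+```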
@@ -80,7 +80,15 @@ One of the reasons why normal distribution is so important is so-called **centra
From central limit theorem it also follows that, when N→∞, the probability of the sample mean to be equal to μ becomes 1. This is known as **the law of large numbers**.
+## Covariance and Correlation
+One of the things Data Science helps us do is find relations between data. We say that two sequences **correlate** when they exhibit similar behavior at the same time, i.e. they either rise and fall simultaneously, or one sequence rises when the other falls and vice versa. In other words, there seems to be some relation between the two sequences.
+
+> Correlation does not necessarily indicate a causal relationship between two sequences; sometimes both variables depend on some external cause, or the two sequences may correlate purely by chance. However, a strong mathematical correlation is a good indication that the two variables are somehow connected.
+
+Mathematically, the main concept that shows the relation between two random variables is **covariance**, computed as Cov(X,Y) = **E**\[(X-**E**(X))(Y-**E**(Y))\]. We compute the deviations of both variables from their mean values, and then the product of those deviations. If both variables deviate together, the product will always be positive, adding up to a positive covariance. If the variables deviate out of sync (i.e. one falls below average when the other rises above average), the products will always be negative, adding up to a negative covariance. If the deviations are independent, they will add up to roughly zero.
+
+The absolute value of the covariance does not tell us much about how strong the correlation is, because it depends on the magnitude of the actual values. To normalize it, we can divide the covariance by the standard deviations of both variables to get the **correlation**. The good thing is that correlation is always in the range [-1,1], where 1 indicates a strong positive correlation between values, -1 a strong negative correlation, and 0 no linear correlation at all (note that zero correlation does not necessarily mean the variables are independent).
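+
+As a quick sketch of these formulas in code (the two sequences here are synthetic, generated just for illustration):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+x = rng.normal(size=1000)
+y = 2 * x + 0.5 * rng.normal(size=1000)  # y mostly follows x
+
+# Covariance by the definition Cov(X,Y) = E[(X-E(X))(Y-E(Y))]
+cov = np.mean((x - x.mean()) * (y - y.mean()))
+
+# Dividing by both standard deviations normalizes it into [-1,1]
+corr = cov / (x.std() * y.std())
+print(cov, corr)
+```
+
+Here `corr` comes out close to 1 (about 0.97 in theory for these coefficients), reflecting the strong positive linear dependence; flipping the sign of the factor 2 would drive it toward -1, and an independent `y` would give a value near 0.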
## 🚀 Challenge
diff --git a/1-Introduction/04-stats-and-probability/notebook.ipynb b/1-Introduction/04-stats-and-probability/notebook.ipynb
index ad4cd849..7a3580ae 100644
--- a/1-Introduction/04-stats-and-probability/notebook.ipynb
+++ b/1-Introduction/04-stats-and-probability/notebook.ipynb
@@ -504,6 +504,256 @@
],
"metadata": {}
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Correlation and Evil Baseball Corp\r\n",
+ "\r\n",
+        "Correlation allows us to find inner connections between data sequences. In our toy example, let's pretend there is an evil baseball corporation that pays its players according to their height - the taller the player is, the more money he/she gets. Suppose there is a base salary of $1000, and an additional bonus from $0 to $100, depending on height. We will take real players from MLB, and compute their imaginary salaries:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 99,
+ "source": [
+ "heights = df['Height']\r\n",
+    "salaries = 1000+(heights-heights.min())/(heights.max()-heights.min())*100\r\n",
+ "print(list(zip(heights,salaries))[:10])"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+     "[(74, 1043.75), (74, 1043.75), (72, 1031.25), (72, 1031.25), (73, 1037.5), (69, 1012.5), (69, 1012.5), (71, 1025.0), (76, 1056.25), (71, 1025.0)]\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "Let's now compute covariance and correlation of those sequences. `np.cov` will give us the so-called **covariance matrix**, which is an extension of covariance to multiple variables. The element $M_{ij}$ of the covariance matrix $M$ is the covariance between input variables $X_i$ and $X_j$, and the diagonal value $M_{ii}$ is the variance of $X_i$. Similarly, `np.corrcoef` will give us the **correlation matrix**."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 100,
+ "source": [
+ "print(f\"Covariance matrix:\\n{np.cov(heights,salaries)}\")\r\n",
+ "print(f\"Covariance = {np.cov(heights,salaries)[0,1]}\")\r\n",
+ "print(f\"Correlation = {np.corrcoef(heights,salaries)[0,1]}\")"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Covariance matrix:\n",
+ "[[ 5.31679808 57.15323023]\n",
+ " [ 57.15323023 614.37197275]]\n",
+ "Covariance = 57.15323023054467\n",
+ "Correlation = 1.0\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "Correlation equal to 1 means that there is a perfect **linear relation** between the two variables. We can see the linear relation visually by plotting one value against the other:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 101,
+ "source": [
+ "plt.scatter(heights,salaries)\r\n",
+ "plt.show()"
+ ],
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/svg+xml": "\r\n\r\n