From 6dd1c054bab3b28081cc59f7d9259fcf41c08e9a Mon Sep 17 00:00:00 2001 From: Dmitri Soshnikov Date: Thu, 12 Aug 2021 14:57:28 +0300 Subject: [PATCH] Continue working on intro to probability --- .../04-stats-and-probability/README.md | 20 +- .../images/weight-histogram.png | Bin 0 -> 3948 bytes .../04-stats-and-probability/notebook.ipynb | 373 ++++++++++++++++++ 3 files changed, 390 insertions(+), 3 deletions(-) create mode 100644 1-Introduction/04-stats-and-probability/images/weight-histogram.png diff --git a/1-Introduction/04-stats-and-probability/README.md b/1-Introduction/04-stats-and-probability/README.md index 6c768e0a..4f762e2b 100644 --- a/1-Introduction/04-stats-and-probability/README.md +++ b/1-Introduction/04-stats-and-probability/README.md @@ -20,7 +20,7 @@ In the case of discrete random variables, it is easy to describe the probability The most well-known discrete distribution is **uniform distribution**, in which there is a sample space of N elements, with equal probability of 1/N for each of them. -It is more difficult to describe the probability distribution of a continuous variable. Consider the case of bus arrival time. In fact, for each exact arrival time $t$, the probability of a bus arriving at exactly that time is 0! +It is more difficult to describe the probability distribution of a continuous variable, with values drawn from some interval [a,b], or the whole set of real numbers ℝ. Consider the case of bus arrival time. In fact, for each exact arrival time $t$, the probability of a bus arriving at exactly that time is 0! > Now you know that events with 0 probability happen, and very often! At least each time when the bus arrives! @@ -34,9 +34,23 @@ Another important distribution is **normal distribution**, which we will talk ab ## Mean, Variance and Standard Deviation -Suppose we draw n samples of a random variable X: {x1, x2, ..., xn}. 
We can define **mean** (or ** arithmetic average**) value of the sequence in the traditional way as (x1+x2+xn)/n. As we grow the size of the sample (i.e. take the limit with n→∞), we will obtain the mean (also called **expectation**) of the distribution. +Suppose we draw a sequence of n samples of a random variable X: x1, x2, ..., xn. We can define the **mean** (or **arithmetic average**) value of the sequence in the traditional way as (x1+x2+...+xn)/n. As we grow the size of the sample (i.e. take the limit with n→∞), we will obtain the mean (also called **expectation**) of the distribution. + +> It can be demonstrated that for any discrete distribution with values {x1, x2, ..., xN} and corresponding probabilities p1, p2, ..., pN, the expectation equals E(X)=x1p1+x2p2+...+xNpN. + +To identify how far the values are spread, we can compute the variance σ² = ∑(xi - μ)²/n, where μ is the mean of the sequence. The value σ is called the **standard deviation**, and σ² is called the **variance**. + +## Real-world Data + +When we analyze data from real life, they are often not random variables as such, in the sense that we do not perform experiments with an unknown result. For example, consider a team of baseball players, and their body data, such as height, weight and age. Those numbers are not exactly random, but we can still apply the same mathematical concepts. For instance, a sequence of people's weights can be considered to be a sequence of values drawn from some random variable. 
Below is the sequence of weights of actual baseball players from [Major League Baseball](http://mlb.mlb.com/index.jsp), taken from [this dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights) (for your convenience, only first 20 values are shown): + +``` +[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0] +``` + +> When working with real data, we assume that data points are samples drawn from some probability distribution. This assumption allows us to apply machine learning techniques and build working predictive models. + -> It can be demonstrated that for any discrete distribution with values x1, x2, ..., xN and corresponding probabilities p1, p2, ..., pN, the expectation would equal to E(X)=x1p1+x2p2+...+xNpN. ## Normal Distribution diff --git a/1-Introduction/04-stats-and-probability/images/weight-histogram.png b/1-Introduction/04-stats-and-probability/images/weight-histogram.png new file mode 100644 index 0000000000000000000000000000000000000000..832c41ddff9091b806e11686644481aaf70d45d3 GIT binary patch literal 3948 zcmb7HXH-+$wvI$;(vBzsLf{<1C{=}kfj~q-L?lQD2_iMpA)zEhKtu(lgeHhUP!Q=L zy>}#`Mw)bx0HK5y6e%LSc+WZaz90A9JKh>&&%M{2YtJ#)T;KPNv7X;SXmfK4a{>SW zZk?N&h5!JIDgbcg%TXZHQ@m6x#C&mh-n75~0K7K8o+H`tEGMQD>aAt&ZRFwT?Pr5_ z0O;9xd%AjfyE@s5`#PX8P9E+Ga*A@7WyBwNdwXJ3T_N?UxDNgko)RF}^q2@Aeizw-FfAL#u62Z2qmuUG4oSOT3l9G`taPxPhOn z@Sk~S!k=TEi2Etib$4a)b%n`F$It66m>O{}pA%h>fYzPrw5$yxsUE&I6}R$@ArVhM zlQt|`|0A5{lw-Dy-E>Kmu5t8QHQf5NL5^+XGh$;l7`_(>CZAh z-9}5Cw}xWfk%_u$qaQNIo$e-}Z$pMw*NlGz2YxgzePy7RlWDWAOC_FGD+Lv$yGenJ z;F-jM29E-^SnQZ<{@w8EY=*spo$^gsivS?#q058K0GbQ$lwge#dt57frNF zw#U*e4r|0GiJS_h7pv{Tx7MdlW@Y_BQZP#Hn9vn4G&%L7J;mO{c(6OSxAB+(IJq}G z2Uj90G@KS#|2}U_)QP6#nlauNnJ2DiIk~13k&kH5h58)0&*t`iqjeVyxYItnV}P>&YH$NCqoL0O_h`rd-Hhxs0}Ybr`B@4UP@Z|V)Uss?@|_bn|+0~ z>oaf(no25&Z&k2x70xXu#|f@-8$68GdW-;0wXTi#!^_bvKgJpM^hcQtX3*u&^+E=J 
zNQ56o-dnrWRVj)0F%7DcJf)U<9ByQUvklLZJgrFRmE+OxrH_;*H7pvOn!K$%+ZNF~ zcSe)|HaVYcS*@M7Or$?OPh6WUuaOO34p?3;uw3?uFriz_VQqJhB~?#cly-4(vFuZ* zmG&QOUP!GX>x_jH^9x=hgP2v0r)!Tzs)L4DQztTT%d@#=(UEDwhGy$4bx}xH$Mkl7 zd74AMg)TH0zCI5M&074%xVP4q^;6wHWn!R5aLcWlG#|6;HQK3$zA!)<08NXWd;rRD z>xy7ma;#uwS12dzNsX0Cxj7nBw)ToF)e>73;FH6cpL+<0s7$Bvc`yIMuY~twqNU5< zkOXhaJi~iEH`IOeO32D?M~4$crmk$XxHghrLcOHBeB|j;4w0&JxS`_TcXlhGE#FH= zS=Y~a4QeIti zv6jWdq&RoDQT5OCz_y~#kE2=!>wLEPZtJ8@LEi|9`D}w!o{hFb?BM0mcfw8C+k_K15#_!6ms$8{05`VeNM7^v#4ak630Q%pY1nO01NNMVzDgP1ciR@?`$KDT0k({ zX37;sU4$Z~y(+))b8_g1xi$jHLa%R?PptuV0i|Ev$5l;AB)PiX;DxEKS>-jvBoQUn z_XkGH`G~8n*Se$@xrbx*JWyw_4diz$|J3LITz9XglS8GLGiu+454STOJC{E0X1^@9Ma-xhM$^+G}-| z-j@Tn;;l+!q(Z0X?Zf1MMD;wIo}aU4Cx*!jUzxt27W=C2 zQL<$_=o?2j+DhngMw-9o4i=eQFd&*YCy7Wm!o7E&%x&gxvlCdK_GsdNh4)u1Y@P*S zSrrR~VL%LT4i2DCRnJNC;)N*|0&n(Ix7gnRSvt#W=@gU=E)yZFUs7*elA(|jCQN8MNc@e7X*qR)T?`u$K-ARN(>cTJ zLPSm-=It?k0O6qVk-=UA*tsub6CJV0LA@mQBBiLAzbOuR_t5{jm;RTkg_rZ(u`>)& zfrI(e!KhDWjQ<7$3>i0;V3H>`ckW^g?NVWZy=o;dn*UX{av`*O6$A)p-B(Y@HL%`8 zn4_r_e(A0`m-^p;eY0jKxVOM6%;+#D2)X8EKVj_f_* z`QXc3A{FZOA)K{8Q_Iq$U0eprn0>Bi?Vn${DelkWU$~<{2%VUtV zc08EjzExZtYK4e$WeRPb3w4IURM^U{jwZ+{a7HtRDqg|x^n==HMOi16ew^$IS!7(G z@&DWosZ9|XnwULG3H5_MXV?EUFPj!l%FnMG>H3uwHCU31ADy2I+r)l+gcAI(ye0>x zslzrjBT%F>BBB4<75F4&+z`c%yHPPLc4$mA-y&Ux1;gwh7J{rqLN2@kMN|-~KToPc z7d&*xgH0P896a>ousqz~{+$m04me;OAd9$Z7HfhYsghvr@_2aDS&WK*VO`w`^8~iR zRqcSFxxiZi*}xCmpQ>0BlO`4l#J=$Z@j1K!fMytnJfw9NZo8@Ci)H0G&i1l&Hm!1M zM+q6Q2;2$6qdxWD;QPp+Y$(y)M<#X8I~U<^UhwbKuv~PiMVu(X$DAm|jz4xZkXMS7 z9g;6T>Gr%D%O~kC)M@zEI#(g?QAJO(z_s(4>o$lW-0s3cce3&s?7@K{!i8yazZMx7dc~91$a|wUMjx)MwRT_-B^j-9HQB`{Qp4eXeb8g! 
zR~-6FChUlSGL{Z!Qw?uAR?xOaWZ=k0XVX%DGmrQ6&CPnI_zJX781ujQDNJgiLN=E~ zMhl6nyQ=!ys9A_mt`oq>XZPmO;U0Fx{g|R%L|B>Y>HAf(u;5Q{Yvy|8^IRevgsNa6 zhK%>r?%K0=7p*GQt9~m~jp65LNrbGd{jt_|XRL{z{pXu*pPH#R*dG#&bxqveVGD{4 zk+*{HY8v@u^js+%&lzWtE@+t(dw6idis7F^<*50R=}*EdUq)(*9#ogr%Rh)&XC37U7Ts)jw}pxL&& z1ta%vvb;SV0kMiKgq-IROb5Kfj3cQ<{X;KlkYKi<`K;lWv<|9gM?>*Y_+^}oE-zH`BjS?V$;0WmCS%-XAI`JgvLHvTI_ zN@3A^Azk_c^L$S-510?savgi1Sa;Rum1xUX@fR;f0cL^0ZV|Ns&rz`8xD9gF!_2@Q zmAb|>K4Pc=&6GM`6xaTA1nIYRRK;(rljk@)fm2l$K8oD0fxF@~E~vW*udYf3HKg(r zy)*)=zhA0{_1Fv_U(q9r=djzBRiB&D)K86k*J<8*T;PgC`giju3+jl9HtC H{^S1voNmIn literal 0 HcmV?d00001 diff --git a/1-Introduction/04-stats-and-probability/notebook.ipynb b/1-Introduction/04-stats-and-probability/notebook.ipynb index e69de29b..41dda760 100644 --- a/1-Introduction/04-stats-and-probability/notebook.ipynb +++ b/1-Introduction/04-stats-and-probability/notebook.ipynb @@ -0,0 +1,373 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Introduction to Probability and Statistics\r\n", + "|\r\n", + "In this notebook, we will play around with some of the concepts we have previously discussed. Many concepts from probability and statistics are well-represented in major libraries for data processing in Python, such as `numpy` and `pandas`." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 6, + "source": [ + "import numpy as np\r\n", + "import pandas as pd\r\n", + "import random\r\n", + "import matplotlib.pyplot as plt" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "\r\n", + "## Random Variables and Distributions\r\n", + "\r\n", + "Let's start with drawing a sample of 30 variables from a uniform disribution from 0 to 9. We will also compute mean and variance." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 17, + "source": [ + "sample = [ random.randint(0,10) for _ in range(30) ]\r\n", + "print(f\"Sample: {sample}\")\r\n", + "print(f\"Mean = {np.mean(sample)}\")\r\n", + "print(f\"Variance = {np.var(sample)}\")" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Sample: [4, 6, 3, 0, 3, 4, 7, 7, 9, 6, 8, 2, 0, 3, 10, 7, 2, 0, 2, 1, 1, 6, 5, 0, 9, 0, 1, 8, 2, 9]\n", + "Mean = 4.166666666666667\n", + "Variance = 10.272222222222222\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "To visually estimate how often each value occurs in the sample, we can plot a **histogram**:" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 18, + "source": [ + "plt.hist(sample)\r\n", + "plt.show()" + ], + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/svg+xml": "[SVG figure: histogram of sample; 2021-08-12T14:31:22.124750; Matplotlib v3.4.2, https://matplotlib.org/]", + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAhYAAAGdCAYAAABO2DpVAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAUzklEQVR4nO3df4zXBf3A8dcJ8QHt7goK4caBWBQKYgZWICmlshFjudYPTY1l/WEDg26VoG2KC45suVoUhmu2VgZrhdJMFv0Aco2EmyRD54+JeuUPZrU7vObHCe/vH81b9xXUz/H63IfPx8dj+/zxft/7c+/X3rvd+7n3vT/3biqKoggAgAQn1XoAAKBxCAsAII2wAADSCAsAII2wAADSCAsAII2wAADSCAsAIM3wod7hkSNH4umnn47m5uZoamoa6t0DAINQFEUcOnQo2tra4qSTjn1dYsjD4umnn4729vah3i0AkKC7uzsmTJhwzK8PeVg0NzdHxH8Ha2lpGerdAwCD0NvbG+3t7f3n8WMZ8rB45c8fLS0twgIA6szr3cbg5k0AII2wAADSCAsAII2wAADSCAsAII2wAADSCAsAII2wAADSCAsAII2wAADSVBQWN954YzQ1NQ14jRs3rlqzAQB1puJnhUybNi1+//vf9y8PGzYsdSAAoH5VHBbDhw93lQIAOKqK77F49NFHo62tLSZPnhyXXnppPP7446+5fblcjt7e3gEvAKAxVXTF4oMf/GD89Kc/jfe85z3x3HPPxTe/+c2YM2dO7N+/P8aMGXPU93R2dsaqVatShn09p624e0j2k+mJtQtrPQIApGkqiqIY7Jv7+vriXe96V3z961+Pjo6Oo25TLpejXC73L/f29kZ7e3v09PRES0vLYHd9VMICAKqjt7c3WltbX/f8XfE9Fv/rlFNOibPOOiseffTRY25TKpWiVCodz24AgDpxXP/Holwux0MPPRTjx4/PmgcAqGMVhcVXv/rV2LFjRxw4cCD++te/xic/+cno7e2NxYsXV2s+AKCOVPSnkL///e9x2WWXxfPPPx/vfOc740Mf+lDs2rUrJk2aVK35AIA6UlFYbNy4sVpzAAANwLNCAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASCMsAIA0wgIASHNcYdHZ2RlNTU2xfPnypHEAgHo26LDYvXt3bNiwIWbMmJE5DwBQxwYVFi+88EJcfvnlcdttt8Xb3/727JkAgDo1qLBYsmRJLFy4MC666KLX3bZcLkdvb++AFwDQmIZX+oaNGzdGV1dX7Nmz5w1t39nZGatWrap4ME5cp624u9YjVOyJtQtrPQJQh/y+q1xFVyy6u7tj2bJl8fOf/zxGjhz5ht6zcuXK6Onp6X91d3cPalAA4MRX0RWLrq6uOHjwYMycObN/3eHDh2Pnzp2xbt26KJfLMWzYsAHvKZVKUSqVcqYFAE5oFYXFhRdeGPv27Ruw7vOf/3xMnTo1rr322ldFBQDw5lJRWDQ3N8f06dMHrDvllFNizJgxr1oPALz5+M+bAECaij8V8v9t3749YQwAoBG4YgEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBE
WAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApKkoLNavXx8zZsyIlpaWaGlpidmzZ8c999xTrdkAgDpTUVhMmDAh1q5dG3v27Ik9e/bERz/60fj4xz8e+/fvr9Z8AEAdGV7JxosWLRqwvHr16li/fn3s2rUrpk2bljoYAFB/KgqL/3X48OH45S9/GX19fTF79uxjblcul6NcLvcv9/b2DnaXAMAJruKw2LdvX8yePTtefPHFeOtb3xqbN2+OM88885jbd3Z2xqpVq45ryEZ22oq7az3Cm0K9Hucn1i6s9QhvCvX681Fv/Dy/OVT8qZD3vve9sXfv3ti1a1d86UtfisWLF8eDDz54zO1XrlwZPT09/a/u7u7jGhgAOHFVfMVixIgR8e53vzsiImbNmhW7d++O733ve/GjH/3oqNuXSqUolUrHNyUAUBeO+/9YFEUx4B4KAODNq6IrFtddd10sWLAg2tvb49ChQ7Fx48bYvn17bN26tVrzAQB1pKKweO655+LKK6+MZ555JlpbW2PGjBmxdevWuPjii6s1HwBQRyoKix//+MfVmgMAaACeFQIApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApKkoLDo7O+Pcc8+N5ubmGDt2bFxyySXx8MMPV2s2AKDOVBQWO3bsiCVLlsSuXbti27Zt8fLLL8f8+fOjr6+vWvMBAHVkeCUbb926dcDy7bffHmPHjo2urq44//zzUwcDAOpPRWHx//X09ERExOjRo4+5TblcjnK53L/c29t7PLsEAE5ggw6Loiiio6Mj5s6dG9OnTz/mdp2dnbFq1arB7gbe1E5bcXetR6jYE2sX1noETlD1+PNM5Qb9qZClS5fGAw88EL/4xS9ec7uVK1dGT09P/6u7u3uwuwQATnCDumJxzTXXxJYtW2Lnzp0xYcKE19y2VCpFqVQa1HAAQH2pKCyKoohrrrkmNm/eHNu3b4/JkydXay4AoA5VFBZLliyJO+64I+66665obm6OZ599NiIiWltbY9SoUVUZEACoHxXdY7F+/fro6emJefPmxfjx4/tfmzZtqtZ8AEAdqfhPIQAAx+JZIQBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKQRFgBAGmEBAKSpOCx27twZixYtira2tmhqaoo777yzCmMBAPWo4rDo6+uLs88+O9a
tW1eNeQCAOja80jcsWLAgFixYUI1ZAIA6V3FYVKpcLke5XO5f7u3trfYuAYAaqXpYdHZ2xqpVq6q9G+AEcdqKu2s9AlBDVf9UyMqVK6Onp6f/1d3dXe1dAgA1UvUrFqVSKUqlUrV3AwCcAPwfCwAgTcVXLF544YV47LHH+pcPHDgQe/fujdGjR8fEiRNThwMA6kvFYbFnz574yEc+0r/c0dERERGLFy+On/zkJ2mDAQD1p+KwmDdvXhRFUY1ZAIA65x4LACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACCNsAAA0ggLACDNoMLihz/8YUyePDlGjhwZM2fOjD//+c/ZcwEAdajisNi0aVMsX748rr/++rj//vvjwx/+cCxYsCCeeuqpaswHANSRisPilltuiS984QvxxS9+Mc4444z47ne/G+3t7bF+/fpqzAcA1JHhlWz80ksvRVdXV6xYsWLA+vnz58df/vKXo76nXC5HuVzuX+7p6YmIiN7e3kpnfV1Hyv9J/54AUE+qcX793+9bFMVrbldRWDz//PNx+PDhOPXUUwesP/XUU+PZZ5896ns6Oztj1apVr1rf3t5eya4BgDeg9bvV/f6HDh2K1tbWY369orB4RVNT04Dloihete4VK1eujI6Ojv7lI0eOxL/+9a8YM2bMMd8zGL29vdHe3h7d3d3R0tKS9n0ZyHEeOo710HCch4bjPDSqeZyLoohDhw5FW1vba25XUVi84x3viGHDhr3q6sTBgwdfdRXjFaVSKUql0oB1b3vb2yrZbUVaWlr80A4Bx3noONZDw3EeGo7z0KjWcX6tKxWvqOjmzREjRsTMmTNj27ZtA9Zv27Yt5syZU9l0AEDDqfhPIR0dHXHllVfGrFmzYvbs2bFhw4Z46qmn4uqrr67GfABAHak4LD7zmc/EP//5z7jpppvimWeeienTp8dvf/vbmDRpUjXme8NKpVLccMMNr/qzC7kc56HjWA8Nx3loOM5D40Q4zk3F631uBADgDfKsEAAgjbAAANIICwAgjbAAANI0TFh4lHt1dXZ2xrnnnhvNzc0xduzYuOSSS+Lhhx+u9VgNr7OzM5qammL58uW1HqXh/OMf/4grrrgixowZEyeffHK8733vi66urlqP1VBefvnl+MY3vhGTJ0+OUaNGxemnnx433XRTHDlypNaj1b2dO3fGokWLoq2tLZqamuLOO+8c8PWiKOLGG2+Mtra2GDVqVMybNy/2798/JLM1RFh4lHv17dixI5YsWRK7du2Kbdu2xcsvvxzz58+Pvr6+Wo/WsHbv3h0bNmyIGTNm1HqUhvPvf/87zjvvvHjLW94S99xzTzz44IPxne98p6r/FfjN6Fvf+lbceuutsW7dunjooYfi5ptvjm9/+9vx/e9/v9aj1b2+vr44++yzY926dUf9+s033xy33HJLrFu3Lnbv3h3jxo2Liy++OA4dOlT94YoG8IEPfKC4+uqrB6ybOnVqsWLFihpN1PgOHjxYRESxY8eOWo/SkA4dOlRMmTKl2LZtW3HBBRcUy5Ytq/VIDeXaa68t5s6dW+sxGt7ChQuLq666asC6T3ziE8UVV1xRo4kaU0QUmzdv7l8+cuRIMW7cuGLt2rX961588cWitbW1uPXWW6s+T91fsXjlUe7z588fsP61HuXO8evp6YmIiNGjR9d4ksa0ZMmSWLh
wYVx00UW1HqUhbdmyJWbNmhWf+tSnYuzYsXHOOefEbbfdVuuxGs7cuXPjD3/4QzzyyCMREfG3v/0t7r333vjYxz5W48ka24EDB+LZZ58dcF4slUpxwQUXDMl5cVBPNz2RDOZR7hyfoiiio6Mj5s6dG9OnT6/1OA1n48aN0dXVFXv27Kn1KA3r8ccfj/Xr10dHR0dcd911cd9998WXv/zlKJVK8bnPfa7W4zWMa6+9Nnp6emLq1KkxbNiwOHz4cKxevTouu+yyWo/W0F459x3tvPjkk09Wff91HxavqORR7hyfpUuXxgMPPBD33ntvrUdpON3d3bFs2bL43e9+FyNHjqz1OA3ryJEjMWvWrFizZk1ERJxzzjmxf//+WL9+vbBItGnTpvjZz34Wd9xxR0ybNi327t0by5cvj7a2tli8eHGtx2t4tTov1n1YDOZR7gzeNddcE1u2bImdO3fGhAkTaj1Ow+nq6oqDBw/GzJkz+9cdPnw4du7cGevWrYtyuRzDhg2r4YSNYfz48XHmmWcOWHfGGWfEr371qxpN1Ji+9rWvxYoVK+LSSy+NiIizzjornnzyyejs7BQWVTRu3LiI+O+Vi/Hjx/evH6rzYt3fY+FR7kOjKIpYunRp/PrXv44//vGPMXny5FqP1JAuvPDC2LdvX+zdu7f/NWvWrLj88stj7969oiLJeeed96qPSz/yyCM1f5hio/nPf/4TJ5008DQzbNgwHzetssmTJ8e4ceMGnBdfeuml2LFjx5CcF+v+ikWER7kPhSVLlsQdd9wRd911VzQ3N/dfIWptbY1Ro0bVeLrG0dzc/Kr7Vk455ZQYM2aM+1kSfeUrX4k5c+bEmjVr4tOf/nTcd999sWHDhtiwYUOtR2soixYtitWrV8fEiRNj2rRpcf/998ctt9wSV111Va1Hq3svvPBCPPbYY/3LBw4ciL1798bo0aNj4sSJsXz58lizZk1MmTIlpkyZEmvWrImTTz45PvvZz1Z/uKp/7mSI/OAHPygmTZpUjBgxonj/+9/vY5DJIuKor9tvv73WozU8Hzetjt/85jfF9OnTi1KpVEydOrXYsGFDrUdqOL29vcWyZcuKiRMnFiNHjixOP/304vrrry/K5XKtR6t7f/rTn476O3nx4sVFUfz3I6c33HBDMW7cuKJUKhXnn39+sW/fviGZzWPTAYA0dX+PBQBw4hAWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAEAaYQEApBEWAECa/wM2GWyYQzH6qgAAAABJRU5ErkJggg==" + }, + "metadata": {} + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Analyzing Real Data\r\n", + "\r\n", + "Mean and variance are very important when analyzing real-world data. 
Let's load the data about baseball players from [SOCR MLB Height/Weight Data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 26, + "source": [ + "df = pd.read_csv(\"../../data/SOCR_MLB.tsv\",sep='\\t',header=None,names=['Name','Team','Rome','Height','Weight','Age'])\r\n", + "df" + ], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Name Team Rome Height Weight Age\n", + "0 Adam_Donachie BAL Catcher 74 180.0 22.99\n", + "1 Paul_Bako BAL Catcher 74 215.0 34.69\n", + "2 Ramon_Hernandez BAL Catcher 72 210.0 30.78\n", + "3 Kevin_Millar BAL First_Baseman 72 210.0 35.43\n", + "4 Chris_Gomez BAL First_Baseman 73 188.0 35.71\n", + "... ... ... ... ... ... ...\n", + "1029 Brad_Thompson STL Relief_Pitcher 73 190.0 25.08\n", + "1030 Tyler_Johnson STL Relief_Pitcher 74 180.0 25.73\n", + "1031 Chris_Narveson STL Relief_Pitcher 75 205.0 25.19\n", + "1032 Randy_Keisler STL Relief_Pitcher 75 190.0 31.01\n", + "1033 Josh_Kinney STL Relief_Pitcher 73 195.0 27.92\n", + "\n", + "[1034 rows x 6 columns]" + ], + "text/html": [ + "
\n", + "
" + ] + }, + "metadata": {}, + "execution_count": 26 + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "> We are using a package called **Pandas** here for data analysis. We will talk more about Pandas and working with data in Python later in this course.\r\n", + "\r\n", + "Let's compute average values for age, height and weight:" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 30, + "source": [ + "df[['Age','Height','Weight']].mean()" + ], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Age 28.736712\n", + "Height 73.697292\n", + "Weight 201.689255\n", + "dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 30 + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values: " + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 40, + "source": [ + "df['Weight'].hist(bins=15)\r\n", + "plt.suptitle('Weight distribution of MLB Players')\r\n", + "plt.xlabel('Weight')\r\n", + "plt.ylabel('Count')\r\n", + "plt.show()" + ], + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/svg+xml": "[SVG figure: histogram 'Weight distribution of MLB Players' (x: Weight, y: Count); 2021-08-12T14:47:40.679219; Matplotlib v3.4.2, https://matplotlib.org/]", + "image/png": "" + }, + "metadata": {} + } + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 44, + "source": [ + "print(list(df['Weight'])[:20])" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "[180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0, 188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "source": [], + "outputs": [], + "metadata": {} + } + ], + "metadata": { + "orig_nbformat": 4, + "language_info": { + "name": "python", + "version": "3.8.8", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3.8.8 64-bit (conda)" + }, + "interpreter": { + "hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file
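
The definitions this patch adds to the README (mean, variance, standard deviation, and the expectation of a discrete distribution) can be checked with a short sketch. It is not part of the patch: the variable names are illustrative, the weights are the 20 values quoted in the README hunk, and the uniform range 0..10 is an assumption taken from the `random.randint(0,10)` call in the notebook cell.

```python
import math
import statistics

# The 20 weights quoted in the README hunk of this patch.
weights = [180.0, 215.0, 210.0, 210.0, 188.0, 176.0, 209.0, 200.0, 231.0, 180.0,
           188.0, 180.0, 185.0, 160.0, 180.0, 185.0, 197.0, 189.0, 185.0, 219.0]

n = len(weights)
mean = sum(weights) / n                               # (x1 + x2 + ... + xn) / n
variance = sum((x - mean) ** 2 for x in weights) / n  # sum((xi - mu)^2) / n
std_dev = math.sqrt(variance)                         # sigma

# Discrete uniform distribution over 0..10 (assumed range, 11 equally likely values).
values = range(0, 11)
p = 1 / len(values)
expectation = sum(x * p for x in values)              # E(X) = x1*p1 + ... + xN*pN
uniform_variance = sum((x - expectation) ** 2 * p for x in values)

print(f"Sample mean = {mean}, variance = {variance}, std dev = {std_dev}")
print(f"Uniform 0..10: E(X) = {expectation}, variance = {uniform_variance}")
```

The division by n matches the variance formula in the README; `np.var`, which the notebook uses, also divides by n by default (`ddof=0`), and `statistics.pvariance` follows the same convention. The 30-value sample in the notebook output (mean ≈ 4.17, variance ≈ 10.27) sits reasonably close to the theoretical expectation and variance computed here for the uniform case.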