@@ -3,9 +3,9 @@
  {
  "cell_type": "markdown",
  "source": [
- "## Introduction to Probability and Statistics\r\n",
- "## Assignment\r\n",
- "\r\n",
+ "## Introduction to Probability and Statistics\n",
+ "## Assignment\n",
+ "\n",
  "In this assignment, we will use the dataset of diabetes patients taken [from here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)."
  ],
  "metadata": {}
@@ -14,11 +14,11 @@
  "cell_type": "code",
  "execution_count": 13,
  "source": [
- "import pandas as pd\r\n",
- "import numpy as np\r\n",
- "import matplotlib.pyplot as plt\r\n",
- "\r\n",
- "df = pd.read_csv(\"../../../data/diabetes.tsv\",sep='\\t')\r\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "df = pd.read_csv(\"../../../data/diabetes.tsv\",sep='\\t')\n",
  "df.head()"
  ],
  "outputs": [
@@ -150,16 +150,16 @@
  {
  "cell_type": "markdown",
  "source": [
- "\r\n",
- "In this dataset, columns as the following:\r\n",
- "* Age and sex are self-explanatory\r\n",
- "* BMI is body mass index\r\n",
- "* BP is average blood pressure\r\n",
- "* S1 through S6 are different blood measurements\r\n",
- "* Y is the qualitative measure of disease progression over one year\r\n",
- "\r\n",
- "Let's study this dataset using methods of probability and statistics.\r\n",
- "\r\n",
+ "\n",
+ "In this dataset, the columns are as follows:\n",
+ "* Age and sex are self-explanatory\n",
+ "* BMI is body mass index\n",
+ "* BP is average blood pressure\n",
+ "* S1 through S6 are different blood measurements\n",
+ "* Y is a quantitative measure of disease progression over one year\n",
+ "\n",
+ "Let's study this dataset using methods of probability and statistics.\n",
+ "\n",
  "### Task 1: Compute the mean and variance of every column"
  ],
  "metadata": {}
@@ -355,7 +355,7 @@
  "cell_type": "code",
  "execution_count": 8,
  "source": [
- "# Another way\r\n",
+ "# Another way\n",
  "pd.DataFrame([df.mean(),df.var()],index=['Mean','Variance']).head()"
  ],
  "outputs": [
@@ -447,7 +447,7 @@
  "cell_type": "code",
  "execution_count": 9,
  "source": [
- "# Or, more simply, for the mean (variance can be done similarly)\r\n",
+ "# Or, more simply, for the mean (variance can be done similarly)\n",
  "df.mean()"
  ],
  "outputs": [
@@ -486,8 +486,8 @@
  "cell_type": "code",
  "execution_count": 17,
  "source": [
- "for col in ['BMI','BP','Y']:\r\n",
- "    df.boxplot(column=col,by='SEX')\r\n",
+ "for col in ['BMI','BP','Y']:\n",
+ "    df.boxplot(column=col,by='SEX')\n",
  "plt.show()"
  ],
  "outputs": [
@@ -538,8 +538,8 @@
  "cell_type": "code",
  "execution_count": 19,
  "source": [
- "for col in ['AGE','SEX','BMI','Y']:\r\n",
- "    df[col].hist()\r\n",
+ "for col in ['AGE','SEX','BMI','Y']:\n",
+ "    df[col].hist()\n",
  "    plt.show()"
  ],
  "outputs": [
@@ -593,9 +593,9 @@
  {
  "cell_type": "markdown",
  "source": [
- "Conclusions:\r\n",
- "* Age - normal\r\n",
- "* Sex - uniform\r\n",
+ "Conclusions:\n",
+ "* Age - normal\n",
+ "* Sex - uniform\n",
  "* BMI, Y - hard to tell"
  ],
  "metadata": {}
@@ -603,8 +603,8 @@
  {
  "cell_type": "markdown",
  "source": [
- "### Task 4: Test the correlation between different variables and disease progression (Y)\r\n",
- "\r\n",
+ "### Task 4: Test the correlation between different variables and disease progression (Y)\n",
+ "\n",
  "> **Hint**: A correlation matrix gives the most useful information about which variables are dependent."
  ],
  "metadata": {}
@@ -847,7 +847,7 @@
  {
  "cell_type": "markdown",
  "source": [
- "Conclusion:\r\n",
+ "Conclusion:\n",
  "* Y correlates most strongly with BMI and S5 (one of the blood measurements). This sounds reasonable."
  ],
  "metadata": {}
@@ -856,10 +856,10 @@
  "cell_type": "code",
  "execution_count": 26,
  "source": [
- "fig, ax = plt.subplots(1,3,figsize=(10,5))\r\n",
- "for i,n in enumerate(['BMI','S5','BP']):\r\n",
- "    ax[i].scatter(df['Y'],df[n])\r\n",
- "    ax[i].set_title(n)\r\n",
+ "fig, ax = plt.subplots(1,3,figsize=(10,5))\n",
+ "for i,n in enumerate(['BMI','S5','BP']):\n",
+ "    ax[i].scatter(df['Y'],df[n])\n",
+ "    ax[i].set_title(n)\n",
  "plt.show()"
  ],
  "outputs": [
@@ -888,9 +888,9 @@
  "cell_type": "code",
  "execution_count": 27,
  "source": [
- "from scipy.stats import ttest_ind\r\n",
- "\r\n",
- "tval, pval = ttest_ind(df.loc[df['SEX']==1,['Y']], df.loc[df['SEX']==2,['Y']],equal_var=False)\r\n",
+ "from scipy.stats import ttest_ind\n",
+ "\n",
+ "tval, pval = ttest_ind(df.loc[df['SEX']==1,['Y']], df.loc[df['SEX']==2,['Y']],equal_var=False)\n",
  "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
  ],
  "outputs": [