parent
87d54575ba
commit
c119d0b3a9
@ -0,0 +1,252 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Introduction to Probability and Statistics\r\n",
|
||||
"## Assignment\r\n",
|
||||
"\r\n",
|
||||
"In this assignment, we will use the dataset of diabetes patients taken [from here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)."
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"source": [
|
||||
"import pandas as pd\r\n",
|
||||
"import numpy as np\r\n",
|
||||
"\r\n",
|
||||
"df = pd.read_csv(\"../../data/diabetes.tsv\",sep='\\t')\r\n",
|
||||
"df.head()"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "execute_result",
|
||||
"data": {
|
||||
"text/plain": [
|
||||
" AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y\n",
|
||||
"0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151\n",
|
||||
"1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75\n",
|
||||
"2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141\n",
|
||||
"3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206\n",
|
||||
"4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135"
|
||||
],
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>AGE</th>\n",
|
||||
" <th>SEX</th>\n",
|
||||
" <th>BMI</th>\n",
|
||||
" <th>BP</th>\n",
|
||||
" <th>S1</th>\n",
|
||||
" <th>S2</th>\n",
|
||||
" <th>S3</th>\n",
|
||||
" <th>S4</th>\n",
|
||||
" <th>S5</th>\n",
|
||||
" <th>S6</th>\n",
|
||||
" <th>Y</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>59</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>32.1</td>\n",
|
||||
" <td>101.0</td>\n",
|
||||
" <td>157</td>\n",
|
||||
" <td>93.2</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" <td>4.8598</td>\n",
|
||||
" <td>87</td>\n",
|
||||
" <td>151</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>48</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>21.6</td>\n",
|
||||
" <td>87.0</td>\n",
|
||||
" <td>183</td>\n",
|
||||
" <td>103.2</td>\n",
|
||||
" <td>70.0</td>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" <td>3.8918</td>\n",
|
||||
" <td>69</td>\n",
|
||||
" <td>75</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>72</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>30.5</td>\n",
|
||||
" <td>93.0</td>\n",
|
||||
" <td>156</td>\n",
|
||||
" <td>93.6</td>\n",
|
||||
" <td>41.0</td>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" <td>4.6728</td>\n",
|
||||
" <td>85</td>\n",
|
||||
" <td>141</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>24</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>25.3</td>\n",
|
||||
" <td>84.0</td>\n",
|
||||
" <td>198</td>\n",
|
||||
" <td>131.4</td>\n",
|
||||
" <td>40.0</td>\n",
|
||||
" <td>5.0</td>\n",
|
||||
" <td>4.8903</td>\n",
|
||||
" <td>89</td>\n",
|
||||
" <td>206</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>50</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>23.0</td>\n",
|
||||
" <td>101.0</td>\n",
|
||||
" <td>192</td>\n",
|
||||
" <td>125.4</td>\n",
|
||||
" <td>52.0</td>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" <td>4.2905</td>\n",
|
||||
" <td>80</td>\n",
|
||||
" <td>135</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"execution_count": 13
|
||||
}
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"\r\n",
|
||||
"In this dataset, the columns are as follows:\r\n",
|
||||
"* Age and sex are self-explanatory\r\n",
|
||||
"* BMI is body mass index\r\n",
|
||||
"* BP is average blood pressure\r\n",
|
||||
"* S1 through S6 are different blood measurements\r\n",
|
||||
"* Y is a quantitative measure of disease progression one year after baseline\r\n",
|
||||
"\r\n",
|
||||
"Let's study this dataset using methods of probability and statistics.\r\n",
|
||||
"\r\n",
|
||||
"### Task 1: Compute mean values and variance for all columns"
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"source": [],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Task 2: Plot boxplots for BMI, BP and Y depending on gender"
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"source": [],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Task 3: What is the distribution of the Age, Sex, BMI and Y variables?"
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"source": [],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Task 4: Test the correlation between different variables and disease progression (Y)\r\n",
|
||||
"\r\n",
|
||||
"> **Hint** A correlation matrix would give you the most useful information on which variables are dependent."
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Task 5: Test the hypothesis that the degree of diabetes progression is different between men and women"
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [],
|
||||
"metadata": {}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"orig_nbformat": 4,
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.8.8",
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"nbconvert_exporter": "python",
|
||||
"file_extension": ".py"
|
||||
},
|
||||
"kernelspec": {
|
||||
"name": "python3",
|
||||
"display_name": "Python 3.8.8 64-bit (conda)"
|
||||
},
|
||||
"interpreter": {
|
||||
"hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
@ -1,8 +1,25 @@
|
||||
# Title
|
||||
# Small Diabetes Study
|
||||
|
||||
In this assignment, we will work with a small dataset of diabetes patients taken from [here](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).
|
||||
|
||||
| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|
||||
|---|-----|-----|-----|----|----|----|----|----|----|----|----|
|
||||
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
|
||||
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
|
||||
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
|
||||
| ... | ... | ... | ... | ...| ...| ...| ...| ...| ...| ...| ... |
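
As a possible starting point, the sketch below loads a few rows in the same tab-separated layout and computes the mean and variance of each column. The three rows are copied from the `df.head()` output in the notebook; the `SAMPLE_TSV` and `summary` names are illustrative assumptions, and for the real assignment the buffer would be replaced by the path to `diabetes.tsv`.

```python
import io

import pandas as pd

# Three rows copied from the notebook's df.head() output,
# tab-separated like the diabetes.tsv file (SAMPLE_TSV is an illustrative name).
SAMPLE_TSV = (
    "AGE\tSEX\tBMI\tBP\tS1\tS2\tS3\tS4\tS5\tS6\tY\n"
    "59\t2\t32.1\t101.0\t157\t93.2\t38.0\t4.0\t4.8598\t87\t151\n"
    "48\t1\t21.6\t87.0\t183\t103.2\t70.0\t3.0\t3.8918\t69\t75\n"
    "72\t2\t30.5\t93.0\t156\t93.6\t41.0\t4.0\t4.6728\t85\t141\n"
)

# For the real file, pass its path instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

# One row per column: mean and sample variance (pandas uses ddof=1 by default).
summary = df.agg(["mean", "var"]).T
print(summary)
```

Note that `var` here is the sample variance (divides by n - 1); pass `ddof=0` to `DataFrame.var` if the population variance is wanted instead.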
|
||||
|
||||
## Instructions
|
||||
|
||||
* Open the [assignment notebook](assignment.ipynb) in a Jupyter notebook environment
|
||||
* Complete all tasks listed in the notebook, namely:
|
||||
  * [ ] Compute mean values and variance for all columns
|
||||
  * [ ] Plot boxplots for BMI, BP and Y depending on gender
|
||||
  * [ ] What is the distribution of the Age, Sex, BMI and Y variables?
|
||||
  * [ ] Test the correlation between different variables and disease progression (Y)
|
||||
  * [ ] Test the hypothesis that the degree of diabetes progression is different between men and women
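
For the last task, one common choice (an assumption here, not something the assignment prescribes) is Welch's two-sample t-test on Y, grouped by the SEX code. The sketch below computes the statistic by hand with the standard library, using only the five Y values visible in the notebook's `df.head()` output; `welch_t`, `y_sex1` and `y_sex2` are illustrative names, and such a tiny sample cannot support any real conclusion.

```python
import math
import statistics

# Y values from the five rows shown by df.head(), split by the SEX code.
# The dataset does not say which code denotes which sex.
y_sex1 = [75, 206, 135]
y_sex2 = [151, 141]


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / group size).
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


t_stat, dof = welch_t(y_sex1, y_sex2)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

On the full dataset, the two lists would come from something like `df.loc[df["SEX"] == 1, "Y"]`, and `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.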
|
||||
## Rubric
|
||||
|
||||
Exemplary | Adequate | Needs Improvement
|
||||
--- | --- | ---
|
||||
All required tasks are complete, graphically illustrated and explained | Most of the tasks are complete, explanations or takeaways from graphs and/or obtained values are missing | Only basic tasks such as computation of mean/variance and basic plots are complete, no conclusions are made from the data
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long