Add computation and testing

pull/2/head
benjas 5 years ago
parent 2a858eafef
commit cef63020a0

@ -35,12 +35,222 @@
"左下图是线性相关一般的,右下图有一定的相关但不是线性相关;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 皮尔逊相关系数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 连续变量的相关分析\n",
"<ul>\n",
" <li>连续变量即数据变量,它的取值之间可以比较大小,可以用加减法计算出差异的大小。\n",
" <li>如“年龄”、“收入”、“成绩\"等变量当两个变量都是正态连续变量,而且两者之间呈线性关系时,通常用 Pearson相关系数来衡量。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pearson.相关系数\n",
"**协方差:**\n",
"协方差是一个反映两个随机变量相关程度的指标,如果一个变量跟随着另一个变量同时变大或者变小,那么这两个变量的协方差就是正值\n",
"$$\n",
"cov(X, Y) = \\frac{\\sum_n^{i=1}(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"虽然协方差能反映两个随机变量的相关程度(协方差大于0的时候表示两者正相关小于0的时候表示两者负相关),但是协方差值的大小并不能很好地度量两个随机变量的关联程度\n",
"<br><br>在二维空间中分布着一些数据我们想知道数据点坐标X轴和Y轴的相关程度如果X与Y的相关程度较小但是数据分布的比较离散这样会导致求出的协方差值较大用这个值来度量相关程度是不合理的\n",
"<img src=\"assets/20201120215604.png\" width=\"50%\">\n",
"为了更好的度量两个随机变量的相关程度, 引入Pearson相关系数其在协方差的基础上除了两个随机变量的标准差"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Pearson相关系数**\n",
"$$\n",
"P_{X,Y} = \\frac{cov(X,Y)}{σXσY} = \\frac{E[(X-μX)(Y-μY)]}{σXσY} \n",
"$$\n",
"pearson是一个介于-1和1之间的值当两个变量的线性关系增强时相关系数趋于1或-1当一个变量增大另一个变量也增大时表明它们之间是正相关的相关系数大于0如果一个变量增大另一个变量却减小表明它们之间是负相关的相关系数小于0如果相关系数等于0表明它们之间不存在线性相关关系\n",
"\n",
"<img src=\"assets/20201120220149.png\" width=\"50%\">\n",
"np.corrcoef(a)可结算行与行之间的相关系数np.corrcoef(a,rowvar=0)用于计算各列之间的相关系数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 计算与检验"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"data": {
"text/plain": [
"array([[10, 10, 8, 9, 7],\n",
" [ 4, 5, 4, 3, 3],\n",
" [ 3, 3, 1, 1, 1]])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"matrix_t = np.array([[10,10,8,9,7],\n",
" [4,5,4,3,3],\n",
" [3,3,1,1,1]])\n",
"matrix_t"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.64168895, 0.84016805],\n",
" [0.64168895, 1. , 0.76376262],\n",
" [0.84016805, 0.76376262, 1. ]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.corrcoef(matrix_t) # 计算行相关系数,对角线是自己与自己比较"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.98898224, 0.9526832 , 0.9939441 , 0.97986371],\n",
" [0.98898224, 1. , 0.98718399, 0.99926008, 0.99862543],\n",
" [0.9526832 , 0.98718399, 1. , 0.98031562, 0.99419163],\n",
" [0.9939441 , 0.99926008, 0.98031562, 1. , 0.99587059],\n",
" [0.97986371, 0.99862543, 0.99419163, 0.99587059, 1. ]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.corrcoef(matrix_t, rowvar=0) # 计算列相关系数,对角线是自己与自己比较"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"计算伦敦市月平均气温与降水量\n",
"<img src=\"assets/20201120221030.png\" width=\"50%\">\n",
"计算伦敦市月平均气温t与降水量p之间的相关系数\n",
"$$\n",
"r_tp = \\frac{\\sum_{i=1}^{12}(t_i-\\overline{t})(p_i-\\overline{p})}\n",
"{\\sqrt{\\sum^{12}_{t=1}(t_i-\\overline{t})^2}\\sqrt{\\sum^{12}_{i=1}(p_i-\\overline{p})}}\n",
"=\n",
"\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
"$$\n",
"$$\n",
"= \\frac{-300.91}{15.83*38.84} = -0.4895\n",
"$$\n",
"计算结果表明伦敦市的月平均气温t与降水量p呈负相关即异向相关"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 相关系数的显著性检验\n",
"假设\n",
"<ul>\n",
" <li>H0p=0\n",
" <li>H1p≠0\n",
"</ul>\n",
"统计量\n",
"$$\n",
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10个学生初一数学分数与初二数学分数的相关系数为087问从总体上来说初一与初二数学分数是否存在相关?\n",
"<img src=\"assets/20201120222349.png\" width=\"50%\">\n",
"计算检验统计量:\n",
"$$\n",
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}} \n",
"= \\frac{0.78\\sqrt{10-2}}{\\sqrt{1-0.78^2}}\n",
"= 3.524\n",
"$$\n",
"\n",
"$$\n",
"t = 3.524 > 3.355 = t_{(8)0.01}\n",
"$$\n",
"所以,总体来说初一和初二的成绩存在正相关。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"correlation: 0.9891763198690562\n",
"pvalue: 5.926875946481138e-08\n"
]
}
],
"source": [
"from scipy import stats\n",
"x = [10.35, 6.24,3.18,8.46,3.21,7.65,4.32,8.66,9.12,10.31]\n",
"y = [5.1, 3.15,1.67,4.33,1.76,4.11,2.11,4.88,4.99,5.12]\n",
"correlation, pvalue = stats.stats.pearsonr(x,y)\n",
"print('correlation:', correlation)\n",
"print('pvalue:', pvalue)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 等级变量的相关分析\n",
"当测量得到的数据不是等距或等比数据,而是具有等级顺序的数据;或者得到的数据是等距或等比数据,但其所来自的总体分布不是正态的,不满足求皮尔森相关系数(积差相关)的要求。这时就要运用等级相关系数。"
]
}
],
"metadata": {

Binary file not shown.
Size: 44 KiB

Binary file not shown.
Size: 40 KiB

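Review note: the significance test in the notebook is worked by hand against a table value. As a hedged sketch, not part of the commit, the same check can be done with `scipy.stats.t`. The function name `correlation_t_test` and its `alpha` parameter are hypothetical; the inputs r = 0.78 and n = 10 come from the worked example, and a two-tailed test is assumed.

```python
import math

from scipy import stats


def correlation_t_test(r, n, alpha=0.01):
    """Two-tailed t test of H0: rho = 0 for a sample correlation r of size n."""
    df = n - 2
    t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    # Critical value that leaves alpha/2 in each tail.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Two-tailed p-value from the survival function.
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, t_crit, p_value


t_stat, t_crit, p_value = correlation_t_test(0.78, 10)
print("t statistic:", t_stat)
print("critical value:", t_crit)
print("p-value:", p_value)
print("reject H0:", abs(t_stat) > t_crit)
```

Computed without intermediate rounding, the statistic can differ from a hand calculation in the third decimal place; the comparison with the critical value is unaffected.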
@ -60,7 +60,7 @@
"**协方差:**\n",
"协方差是一个反映两个随机变量相关程度的指标,如果一个变量跟随着另一个变量同时变大或者变小,那么这两个变量的协方差就是正值\n",
"$$\n",
"cov(X, Y) = \\frac{\\sum_n^i=1(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
"cov(X, Y) = \\frac{\\sum_n^{i=1}(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
"$$"
]
},
@ -80,18 +80,177 @@
"source": [
"**Pearson相关系数**\n",
"$$\n",
"PX,Y = \\frac{cov(X,Y)}{σXσY} = \\frac{E[(X-μX)(Y-μY)]}{σXσY} \n",
"P_{X,Y} = \\frac{cov(X,Y)}{σXσY} = \\frac{E[(X-μX)(Y-μY)]}{σXσY} \n",
"$$\n",
"pearson是一个介于-1和1之间的值当两个变量的线性关系增强时相关系数趋于1或-1当一个变量增大另一个变量也增大时表明它们之间是正相关的相关系数大于0如果一个变量增大另一个变量却减小表明它们之间是负相关的相关系数小于0如果相关系数等于0表明它们之间不存在线性相关关系\n",
"\n",
"<img src=\"assets/20201120220149.png\" width=\"50%\">\n",
"np.corrcoef(a)可结算行与行之间的相关系数np.corrcoe"
"np.corrcoef(a)可结算行与行之间的相关系数np.corrcoef(a,rowvar=0)用于计算各列之间的相关系数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
"source": [
"## 计算与检验"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[10, 10, 8, 9, 7],\n",
" [ 4, 5, 4, 3, 3],\n",
" [ 3, 3, 1, 1, 1]])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"matrix_t = np.array([[10,10,8,9,7],\n",
" [4,5,4,3,3],\n",
" [3,3,1,1,1]])\n",
"matrix_t"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.64168895, 0.84016805],\n",
" [0.64168895, 1. , 0.76376262],\n",
" [0.84016805, 0.76376262, 1. ]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.corrcoef(matrix_t) # 计算行相关系数,对角线是自己与自己比较"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.98898224, 0.9526832 , 0.9939441 , 0.97986371],\n",
" [0.98898224, 1. , 0.98718399, 0.99926008, 0.99862543],\n",
" [0.9526832 , 0.98718399, 1. , 0.98031562, 0.99419163],\n",
" [0.9939441 , 0.99926008, 0.98031562, 1. , 0.99587059],\n",
" [0.97986371, 0.99862543, 0.99419163, 0.99587059, 1. ]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.corrcoef(matrix_t, rowvar=0) # 计算列相关系数,对角线是自己与自己比较"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"计算伦敦市月平均气温与降水量\n",
"<img src=\"assets/20201120221030.png\" width=\"50%\">\n",
"计算伦敦市月平均气温t与降水量p之间的相关系数\n",
"$$\n",
"r_tp = \\frac{\\sum_{i=1}^{12}(t_i-\\overline{t})(p_i-\\overline{p})}\n",
"{\\sqrt{\\sum^{12}_{t=1}(t_i-\\overline{t})^2}\\sqrt{\\sum^{12}_{i=1}(p_i-\\overline{p})}}\n",
"=\n",
"\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
"$$\n",
"$$\n",
"= \\frac{-300.91}{15.83*38.84} = -0.4895\n",
"$$\n",
"计算结果表明伦敦市的月平均气温t与降水量p呈负相关即异向相关"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 相关系数的显著性检验\n",
"假设\n",
"<ul>\n",
" <li>H0p=0\n",
" <li>H1p≠0\n",
"</ul>\n",
"统计量\n",
"$$\n",
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10个学生初一数学分数与初二数学分数的相关系数为087问从总体上来说初一与初二数学分数是否存在相关?\n",
"<img src=\"assets/20201120222349.png\" width=\"50%\">\n",
"计算检验统计量:\n",
"$$\n",
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}} \n",
"= \\frac{0.78\\sqrt{10-2}}{\\sqrt{1-0.78^2}}\n",
"= 3.524\n",
"$$\n",
"\n",
"$$\n",
"t = 3.524 > 3.355 = t_{(8)0.01}\n",
"$$\n",
"所以,总体来说初一和初二的成绩存在正相关。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"correlation: 0.9891763198690562\n",
"pvalue: 5.926875946481138e-08\n"
]
}
],
"source": [
"from scipy import stats\n",
"x = [10.35, 6.24,3.18,8.46,3.21,7.65,4.32,8.66,9.12,10.31]\n",
"y = [5.1, 3.15,1.67,4.33,1.76,4.11,2.11,4.88,4.99,5.12]\n",
"correlation, pvalue = stats.stats.pearsonr(x,y)\n",
"print('correlation:', correlation)\n",
"print('pvalue:', pvalue)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 等级变量的相关分析\n",
"当测量得到的数据不是等距或等比数据,而是具有等级顺序的数据;或者得到的数据是等距或等比数据,但其所来自的总体分布不是正态的,不满足求皮尔森相关系数(积差相关)的要求。这时就要运用等级相关系数。"
]
}
],
"metadata": {
