|
|
@ -35,12 +35,222 @@
|
|
|
|
"左下图是线性相关一般的,右下图有一定的相关但不是线性相关;"
|
|
|
|
"左下图是线性相关一般的,右下图有一定的相关但不是线性相关;"
|
|
|
|
]
|
|
|
|
]
|
|
|
|
},
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## 皮尔逊相关系数"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"### 连续变量的相关分析\n",
|
|
|
|
|
|
|
|
"<ul>\n",
|
|
|
|
|
|
|
|
" <li>连续变量即数据变量,它的取值之间可以比较大小,可以用加减法计算出差异的大小。\n",
|
|
|
|
|
|
|
|
" <li>如“年龄”、“收入”、“成绩\"等变量当两个变量都是正态连续变量,而且两者之间呈线性关系时,通常用 Pearson相关系数来衡量。"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"### Pearson.相关系数\n",
|
|
|
|
|
|
|
|
"**协方差:**\n",
|
|
|
|
|
|
|
|
"协方差是一个反映两个随机变量相关程度的指标,如果一个变量跟随着另一个变量同时变大或者变小,那么这两个变量的协方差就是正值\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"cov(X, Y) = \\frac{\\sum_n^{i=1}(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
|
|
|
|
|
|
|
|
"$$"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
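{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the covariance formula above, the next cell computes the sample covariance by hand and compares it with `np.cov`, which also divides by n - 1 by default. The arrays `x` and `y` are made-up illustration data and are not taken from this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical illustration data (not from the original notebook)\n",
"x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])\n",
"\n",
"# Sample covariance: sum of products of deviations, divided by n - 1\n",
"cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)\n",
"\n",
"# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)\n",
"cov_numpy = np.cov(x, y)[0, 1]\n",
"print(cov_manual, cov_numpy)"
]
},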
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"虽然协方差能反映两个随机变量的相关程度(协方差大于0的时候表示两者正相关,小于0的时候表示两者负相关),但是协方差值的大小并不能很好地度量两个随机变量的关联程度\n",
|
|
|
|
|
|
|
|
"<br><br>在二维空间中分布着一些数据,我们想知道数据点坐标X轴和Y轴的相关程度,如果X与Y的相关程度较小但是数据分布的比较离散,这样会导致求出的协方差值较大,用这个值来度量相关程度是不合理的\n",
|
|
|
|
|
|
|
|
"<img src=\"assets/20201120215604.png\" width=\"50%\">\n",
|
|
|
|
|
|
|
|
"为了更好的度量两个随机变量的相关程度, 引入Pearson相关系数,其在协方差的基础上除了两个随机变量的标准差"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"**Pearson相关系数**\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"P_{X,Y} = \\frac{cov(X,Y)}{σXσY} = \\frac{E[(X-μX)(Y-μY)]}{σXσY} \n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"pearson是一个介于-1和1之间的值,当两个变量的线性关系增强时,相关系数趋于1或-1;当一个变量增大,另一个变量也增大时,表明它们之间是正相关的,相关系数大于0;如果一个变量增大,另一个变量却减小,表明它们之间是负相关的,相关系数小于0;如果相关系数等于0,表明它们之间不存在线性相关关系\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"<img src=\"assets/20201120220149.png\" width=\"50%\">\n",
|
|
|
|
|
|
|
|
"np.corrcoef(a)可结算行与行之间的相关系数,np.corrcoef(a,rowvar=0)用于计算各列之间的相关系数"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
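{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definition above can be checked numerically. The following sketch divides the sample covariance by the product of the sample standard deviations and compares the result with `np.corrcoef`. The values of `x` and `y` are the first two rows of the `matrix_t` array used in the code cells of this section; the variable names are chosen here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Same values as the first two rows of matrix_t used in this section\n",
"x = np.array([10, 10, 8, 9, 7], dtype=float)\n",
"y = np.array([4, 5, 4, 3, 3], dtype=float)\n",
"\n",
"# Pearson r = sample covariance / (sample std of x * sample std of y)\n",
"cov_xy = np.cov(x, y)[0, 1]  # divides by n - 1\n",
"r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))\n",
"\n",
"# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y)\n",
"r_numpy = np.corrcoef(x, y)[0, 1]\n",
"print(r_manual, r_numpy)"
]
},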
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## 计算与检验"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
{
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": null,
|
|
|
|
"execution_count": 2,
|
|
|
|
"metadata": {},
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [],
|
|
|
|
"outputs": [
|
|
|
|
"source": []
|
|
|
|
{
|
|
|
|
|
|
|
|
"data": {
|
|
|
|
|
|
|
|
"text/plain": [
|
|
|
|
|
|
|
|
"array([[10, 10, 8, 9, 7],\n",
|
|
|
|
|
|
|
|
" [ 4, 5, 4, 3, 3],\n",
|
|
|
|
|
|
|
|
" [ 3, 3, 1, 1, 1]])"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"execution_count": 2,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import numpy as np\n",
|
|
|
|
|
|
|
|
"matrix_t = np.array([[10,10,8,9,7],\n",
|
|
|
|
|
|
|
|
" [4,5,4,3,3],\n",
|
|
|
|
|
|
|
|
" [3,3,1,1,1]])\n",
|
|
|
|
|
|
|
|
"matrix_t"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 3,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"data": {
|
|
|
|
|
|
|
|
"text/plain": [
|
|
|
|
|
|
|
|
"array([[1. , 0.64168895, 0.84016805],\n",
|
|
|
|
|
|
|
|
" [0.64168895, 1. , 0.76376262],\n",
|
|
|
|
|
|
|
|
" [0.84016805, 0.76376262, 1. ]])"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"execution_count": 3,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"np.corrcoef(matrix_t) # 计算行相关系数,对角线是自己与自己比较"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 4,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"data": {
|
|
|
|
|
|
|
|
"text/plain": [
|
|
|
|
|
|
|
|
"array([[1. , 0.98898224, 0.9526832 , 0.9939441 , 0.97986371],\n",
|
|
|
|
|
|
|
|
" [0.98898224, 1. , 0.98718399, 0.99926008, 0.99862543],\n",
|
|
|
|
|
|
|
|
" [0.9526832 , 0.98718399, 1. , 0.98031562, 0.99419163],\n",
|
|
|
|
|
|
|
|
" [0.9939441 , 0.99926008, 0.98031562, 1. , 0.99587059],\n",
|
|
|
|
|
|
|
|
" [0.97986371, 0.99862543, 0.99419163, 0.99587059, 1. ]])"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"execution_count": 4,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"np.corrcoef(matrix_t, rowvar=0) # 计算列相关系数,对角线是自己与自己比较"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"计算伦敦市月平均气温与降水量\n",
|
|
|
|
|
|
|
|
"<img src=\"assets/20201120221030.png\" width=\"50%\">\n",
|
|
|
|
|
|
|
|
"计算伦敦市月平均气温(t)与降水量(p)之间的相关系数\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"r_tp = \\frac{\\sum_{i=1}^{12}(t_i-\\overline{t})(p_i-\\overline{p})}\n",
|
|
|
|
|
|
|
|
"{\\sqrt{\\sum^{12}_{t=1}(t_i-\\overline{t})^2}\\sqrt{\\sum^{12}_{i=1}(p_i-\\overline{p})}}\n",
|
|
|
|
|
|
|
|
"=\n",
|
|
|
|
|
|
|
|
"\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"= \\frac{-300.91}{15.83*38.84} = -0.4895\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"计算结果表明,伦敦市的月平均气温(t)与降水量(p)呈负相关,即异向相关"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"### 相关系数的显著性检验\n",
|
|
|
|
|
|
|
|
"假设\n",
|
|
|
|
|
|
|
|
"<ul>\n",
|
|
|
|
|
|
|
|
" <li>H0:p=0\n",
|
|
|
|
|
|
|
|
" <li>H1:p≠0\n",
|
|
|
|
|
|
|
|
"</ul>\n",
|
|
|
|
|
|
|
|
"统计量\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}}\n",
|
|
|
|
|
|
|
|
"$$"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"10个学生初一数学分数与初二数学分数的相关系数为087,问从总体上来说,初一与初二数学分数是否存在相关?\n",
|
|
|
|
|
|
|
|
"<img src=\"assets/20201120222349.png\" width=\"50%\">\n",
|
|
|
|
|
|
|
|
"计算检验统计量:\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}} \n",
|
|
|
|
|
|
|
|
"= \\frac{0.78\\sqrt{10-2}}{\\sqrt{1-0.78^2}}\n",
|
|
|
|
|
|
|
|
"= 3.524\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"t = 3.524 > 3.355 = t_{(8)0.01}\n",
|
|
|
|
|
|
|
|
"$$\n",
|
|
|
|
|
|
|
|
"所以,总体来说初一和初二的成绩存在正相关。"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
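{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hand calculation above can be reproduced with a short sketch. It takes r = 0.78 and n = 10 from the worked example, and assumes a two-sided test at the 0.01 level, which is what the critical value 3.355 corresponds to. The variable names are illustrative; `scipy.stats.t.ppf` and `scipy.stats.t.sf` give the critical value and the tail probability of the t distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt\n",
"from scipy import stats\n",
"\n",
"r = 0.78  # sample correlation from the worked example\n",
"n = 10    # number of students\n",
"\n",
"# t statistic for H0: rho = 0, with n - 2 degrees of freedom\n",
"t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)\n",
"\n",
"# Two-sided critical value at alpha = 0.01 (assumed significance level)\n",
"t_crit = stats.t.ppf(1 - 0.01 / 2, df=n - 2)\n",
"\n",
"# Two-sided p-value from the upper tail of the t distribution\n",
"p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)\n",
"print(t_stat, t_crit, p_value)"
]
},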
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 7,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
|
|
|
"text": [
|
|
|
|
|
|
|
|
"correlation: 0.9891763198690562\n",
|
|
|
|
|
|
|
|
"pvalue: 5.926875946481138e-08\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"from scipy import stats\n",
|
|
|
|
|
|
|
|
"x = [10.35, 6.24,3.18,8.46,3.21,7.65,4.32,8.66,9.12,10.31]\n",
|
|
|
|
|
|
|
|
"y = [5.1, 3.15,1.67,4.33,1.76,4.11,2.11,4.88,4.99,5.12]\n",
|
|
|
|
|
|
|
|
"correlation, pvalue = stats.stats.pearsonr(x,y)\n",
|
|
|
|
|
|
|
|
"print('correlation:', correlation)\n",
|
|
|
|
|
|
|
|
"print('pvalue:', pvalue)"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"### 等级变量的相关分析\n",
|
|
|
|
|
|
|
|
"当测量得到的数据不是等距或等比数据,而是具有等级顺序的数据;或者得到的数据是等距或等比数据,但其所来自的总体分布不是正态的,不满足求皮尔森相关系数(积差相关)的要求。这时就要运用等级相关系数。"
|
|
|
|
|
|
|
|
]
|
|
|
|
}
|
|
|
|
}
|
|
|
|
],
|
|
|
|
],
|
|
|
|
"metadata": {
|
|
|
|
"metadata": {
|
|
|
|