diff --git a/notebook_必备数学基础/相关分析/.ipynb_checkpoints/相关分析-checkpoint.ipynb b/notebook_必备数学基础/相关分析/.ipynb_checkpoints/相关分析-checkpoint.ipynb
index d15e054..0f44ecc 100644
--- a/notebook_必备数学基础/相关分析/.ipynb_checkpoints/相关分析-checkpoint.ipynb
+++ b/notebook_必备数学基础/相关分析/.ipynb_checkpoints/相关分析-checkpoint.ipynb
@@ -35,12 +35,222 @@
"The lower-left plot shows a moderate linear correlation; the lower-right plot shows some relationship, but not a linear one."
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Pearson correlation coefficient"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
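+ "### Correlation analysis for continuous variables\n",
+ "\n",
+ " - A continuous variable is a numeric variable: its values can be ordered, and the size of the difference between two values can be computed by subtraction.\n",
+ " - Examples include age, income, and test scores. When both variables are continuous and normally distributed, and the relationship between them is linear, the Pearson correlation coefficient is the usual measure."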
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
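+ "### Pearson correlation coefficient\n",
+ "**Covariance:**\n",
+ "Covariance measures how two random variables vary together. If one variable tends to increase when the other increases, and decrease when it decreases, their covariance is positive.\n",
+ "$$\n",
+ "cov(X, Y) = \\frac{\\sum_{i=1}^{n}(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",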
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
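+ "Covariance does indicate the direction of the relationship between two random variables (positive when it is greater than 0, negative when it is less than 0), but its magnitude is not a good measure of how strongly they are related.\n",
+ "Suppose some data points are scattered in a two-dimensional plane and we want to know how strongly their X and Y coordinates are related. If X and Y are only weakly related but the data are widely spread, the covariance can still be large, so using it to measure the strength of the relationship is not reasonable.\n",
+ "\n",
+ "To measure the relationship between two random variables more reliably, we introduce the Pearson correlation coefficient, which divides the covariance by the product of the two variables' standard deviations."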
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
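+ "**Pearson correlation coefficient**\n",
+ "$$\n",
+ "\\rho_{X,Y} = \\frac{cov(X,Y)}{\\sigma_X\\sigma_Y} = \\frac{E[(X-\\mu_X)(Y-\\mu_Y)]}{\\sigma_X\\sigma_Y}\n",
+ "$$\n",
+ "The Pearson coefficient takes values between -1 and 1. The stronger the linear relationship between the two variables, the closer the coefficient is to 1 or -1. If one variable tends to increase as the other increases, they are positively correlated and the coefficient is greater than 0; if one tends to decrease as the other increases, they are negatively correlated and the coefficient is less than 0. A coefficient of 0 means there is no linear correlation between them.\n",
+ "\n",
+ "\n",
+ "`np.corrcoef(a)` computes the correlation coefficients between the rows of `a`; `np.corrcoef(a, rowvar=0)` computes them between the columns."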
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Computation and significance testing"
+ ]
+ },
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
"metadata": {},
- "outputs": [],
- "source": []
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[10, 10, 8, 9, 7],\n",
+ " [ 4, 5, 4, 3, 3],\n",
+ " [ 3, 3, 1, 1, 1]])"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "matrix_t = np.array([[10,10,8,9,7],\n",
+ " [4,5,4,3,3],\n",
+ " [3,3,1,1,1]])\n",
+ "matrix_t"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1. , 0.64168895, 0.84016805],\n",
+ " [0.64168895, 1. , 0.76376262],\n",
+ " [0.84016805, 0.76376262, 1. ]])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.corrcoef(matrix_t) # correlation coefficients between rows; each diagonal entry compares a row with itself"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1. , 0.98898224, 0.9526832 , 0.9939441 , 0.97986371],\n",
+ " [0.98898224, 1. , 0.98718399, 0.99926008, 0.99862543],\n",
+ " [0.9526832 , 0.98718399, 1. , 0.98031562, 0.99419163],\n",
+ " [0.9939441 , 0.99926008, 0.98031562, 1. , 0.99587059],\n",
+ " [0.97986371, 0.99862543, 0.99419163, 0.99587059, 1. ]])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.corrcoef(matrix_t, rowvar=0) # correlation coefficients between columns; each diagonal entry compares a column with itself"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
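+ "Monthly mean temperature and precipitation in London\n",
+ "\n",
+ "Compute the correlation coefficient between London's monthly mean temperature (t) and precipitation (p):\n",
+ "$$\n",
+ "r_{tp} = \\frac{\\sum_{i=1}^{12}(t_i-\\overline{t})(p_i-\\overline{p})}\n",
+ "{\\sqrt{\\sum_{i=1}^{12}(t_i-\\overline{t})^2}\\sqrt{\\sum_{i=1}^{12}(p_i-\\overline{p})^2}}\n",
+ "=\n",
+ "\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
+ "$$\n",
+ "$$\n",
+ "= \\frac{-300.91}{15.83 \\times 38.84} = -0.4895\n",
+ "$$\n",
+ "The result shows that London's monthly mean temperature (t) and precipitation (p) are negatively correlated, i.e. they tend to move in opposite directions."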
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Significance test for the correlation coefficient\n",
+ "Hypotheses\n",
+ "\n",
+ " - $H_0$: $\\rho = 0$\n",
+ " - $H_1$: $\\rho \\neq 0$\n",
+ "\n",
+ "Test statistic\n",
+ "$$\n",
+ "t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
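+ "The correlation coefficient between the first-year and second-year junior high math scores of 10 students is 0.78. Are the two sets of scores correlated in the population?\n",
+ "\n",
+ "Compute the test statistic:\n",
+ "$$\n",
+ "t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}} \n",
+ "= \\frac{0.78\\sqrt{10-2}}{\\sqrt{1-0.78^2}}\n",
+ "= 3.524\n",
+ "$$\n",
+ "\n",
+ "$$\n",
+ "t = 3.524 > 3.355 = t_{(8)0.01}\n",
+ "$$\n",
+ "We therefore reject $H_0$ at the 0.01 level: in the population, first-year and second-year scores are positively correlated."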
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "correlation: 0.9891763198690562\n",
+ "pvalue: 5.926875946481138e-08\n"
+ ]
+ }
+ ],
+ "source": [
+ "from scipy import stats\n",
+ "x = [10.35, 6.24,3.18,8.46,3.21,7.65,4.32,8.66,9.12,10.31]\n",
+ "y = [5.1, 3.15,1.67,4.33,1.76,4.11,2.11,4.88,4.99,5.12]\n",
+ "correlation, pvalue = stats.pearsonr(x, y)\n",
+ "print('correlation:', correlation)\n",
+ "print('pvalue:', pvalue)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
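+ "### Correlation analysis for ordinal variables\n",
+ "When the measured data are not interval or ratio data but ordinal (ranked) data, or when the data are interval or ratio data but come from a population that is not normally distributed, the requirements of the Pearson (product-moment) correlation coefficient are not met. In these cases a rank correlation coefficient is used instead."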
+ ]
}
],
"metadata": {
diff --git a/notebook_必备数学基础/相关分析/assets/20201120221030.png b/notebook_必备数学基础/相关分析/assets/20201120221030.png
new file mode 100644
index 0000000..5560838
Binary files /dev/null and b/notebook_必备数学基础/相关分析/assets/20201120221030.png differ
diff --git a/notebook_必备数学基础/相关分析/assets/20201120222349.png b/notebook_必备数学基础/相关分析/assets/20201120222349.png
new file mode 100644
index 0000000..335f51d
Binary files /dev/null and b/notebook_必备数学基础/相关分析/assets/20201120222349.png differ
diff --git a/notebook_必备数学基础/相关分析/相关分析.ipynb b/notebook_必备数学基础/相关分析/相关分析.ipynb
index 0f5984e..0f44ecc 100644
--- a/notebook_必备数学基础/相关分析/相关分析.ipynb
+++ b/notebook_必备数学基础/相关分析/相关分析.ipynb
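@@ -60,7 +60,7 @@
"**Covariance:**\n",
"Covariance measures how two random variables vary together. If one variable tends to increase when the other increases, and decrease when it decreases, their covariance is positive.\n",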
"$$\n",
- "cov(X, Y) = \\frac{\\sum_n^i=1(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
+ "cov(X, Y) = \\frac{\\sum_{i=1}^{n}(X_i-\\overline{X})(Y_i-\\overline{Y})}{n-1}\n",
"$$"
]
},
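@@ -80,18 +80,177 @@
"source": [
"**Pearson correlation coefficient**\n",
"$$\n",
- "PX,Y = \\frac{cov(X,Y)}{σXσY} = \\frac{E[(X-μX)(Y-μY)]}{σXσY} \n",
+ "\\rho_{X,Y} = \\frac{cov(X,Y)}{\\sigma_X\\sigma_Y} = \\frac{E[(X-\\mu_X)(Y-\\mu_Y)]}{\\sigma_X\\sigma_Y}\n",
"$$\n",
"The Pearson coefficient takes values between -1 and 1. The stronger the linear relationship between the two variables, the closer the coefficient is to 1 or -1. If one variable tends to increase as the other increases, they are positively correlated and the coefficient is greater than 0; if one tends to decrease as the other increases, they are negatively correlated and the coefficient is less than 0. A coefficient of 0 means there is no linear correlation between them.\n",
"\n",
"\n",
- "np.corrcoef(a) computes the correlation coefficients between rows, np.corrcoe"
+ "`np.corrcoef(a)` computes the correlation coefficients between the rows of `a`; `np.corrcoef(a, rowvar=0)` computes them between the columns."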
]
},
{
"cell_type": "markdown",
"metadata": {},
- "source": []
+ "source": [
+ "## Computation and significance testing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[10, 10, 8, 9, 7],\n",
+ " [ 4, 5, 4, 3, 3],\n",
+ " [ 3, 3, 1, 1, 1]])"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "matrix_t = np.array([[10,10,8,9,7],\n",
+ " [4,5,4,3,3],\n",
+ " [3,3,1,1,1]])\n",
+ "matrix_t"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1. , 0.64168895, 0.84016805],\n",
+ " [0.64168895, 1. , 0.76376262],\n",
+ " [0.84016805, 0.76376262, 1. ]])"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.corrcoef(matrix_t) # correlation coefficients between rows; each diagonal entry compares a row with itself"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1. , 0.98898224, 0.9526832 , 0.9939441 , 0.97986371],\n",
+ " [0.98898224, 1. , 0.98718399, 0.99926008, 0.99862543],\n",
+ " [0.9526832 , 0.98718399, 1. , 0.98031562, 0.99419163],\n",
+ " [0.9939441 , 0.99926008, 0.98031562, 1. , 0.99587059],\n",
+ " [0.97986371, 0.99862543, 0.99419163, 0.99587059, 1. ]])"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.corrcoef(matrix_t, rowvar=0) # correlation coefficients between columns; each diagonal entry compares a column with itself"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
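+ "Monthly mean temperature and precipitation in London\n",
+ "\n",
+ "Compute the correlation coefficient between London's monthly mean temperature (t) and precipitation (p):\n",
+ "$$\n",
+ "r_{tp} = \\frac{\\sum_{i=1}^{12}(t_i-\\overline{t})(p_i-\\overline{p})}\n",
+ "{\\sqrt{\\sum_{i=1}^{12}(t_i-\\overline{t})^2}\\sqrt{\\sum_{i=1}^{12}(p_i-\\overline{p})^2}}\n",
+ "=\n",
+ "\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
+ "$$\n",
+ "$$\n",
+ "= \\frac{-300.91}{15.83 \\times 38.84} = -0.4895\n",
+ "$$\n",
+ "The result shows that London's monthly mean temperature (t) and precipitation (p) are negatively correlated, i.e. they tend to move in opposite directions."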
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Significance test for the correlation coefficient\n",
+ "Hypotheses\n",
+ "\n",
+ " - $H_0$: $\\rho = 0$\n",
+ " - $H_1$: $\\rho \\neq 0$\n",
+ "\n",
+ "Test statistic\n",
+ "$$\n",
+ "t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
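+ "The correlation coefficient between the first-year and second-year junior high math scores of 10 students is 0.78. Are the two sets of scores correlated in the population?\n",
+ "\n",
+ "Compute the test statistic:\n",
+ "$$\n",
+ "t = \\frac{r\\sqrt{n-2}}{\\sqrt{1-r^2}} \n",
+ "= \\frac{0.78\\sqrt{10-2}}{\\sqrt{1-0.78^2}}\n",
+ "= 3.524\n",
+ "$$\n",
+ "\n",
+ "$$\n",
+ "t = 3.524 > 3.355 = t_{(8)0.01}\n",
+ "$$\n",
+ "We therefore reject $H_0$ at the 0.01 level: in the population, first-year and second-year scores are positively correlated."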
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "correlation: 0.9891763198690562\n",
+ "pvalue: 5.926875946481138e-08\n"
+ ]
+ }
+ ],
+ "source": [
+ "from scipy import stats\n",
+ "x = [10.35, 6.24,3.18,8.46,3.21,7.65,4.32,8.66,9.12,10.31]\n",
+ "y = [5.1, 3.15,1.67,4.33,1.76,4.11,2.11,4.88,4.99,5.12]\n",
+ "correlation, pvalue = stats.pearsonr(x, y)\n",
+ "print('correlation:', correlation)\n",
+ "print('pvalue:', pvalue)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
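+ "### Correlation analysis for ordinal variables\n",
+ "When the measured data are not interval or ratio data but ordinal (ranked) data, or when the data are interval or ratio data but come from a population that is not normally distributed, the requirements of the Pearson (product-moment) correlation coefficient are not met. In these cases a rank correlation coefficient is used instead."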
+ ]
}
],
"metadata": {