Update partial correlation and multiple correlation

pull/2/head
benjas 4 years ago
parent 5bb7ad3c0e
commit fed080b133

@ -181,7 +181,7 @@
"\\frac{-300.91}{\\sqrt{250.55}\\sqrt{1508.34}}\n",
"$$\n",
"$$\n",
"= \\frac{-300.91}{15.83*38.84} = -0.4895\n",
"= \\frac{-300.91}{15.83×38.84} = -0.4895\n",
"$$\n",
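"The result shows that London's monthly mean temperature t and precipitation p are negatively correlated, that is, they vary in opposite directions"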
]
@ -421,7 +421,7 @@
"Compute the rank correlation coefficient\n",
"$$\n",
"r_R = 1-\\frac{6\\sum D^2}{n(n^2-1)}\n",
"= 1-\\frac{6*18}{10(10^2-1)}\n",
"= 1-\\frac{6×18}{10(10^2-1)}\n",
"=0.891\n",
"$$\n",
"**Significance test for the rank correlation coefficient**\n",
@ -503,7 +503,7 @@
"Since no rater assigned the same rank to any two of the 6 papers (there are no ties)\n",
"$$\n",
"S=\\sum^6_{i=1}R_i^2-\\frac{1}{6}(\\sum^6_{i=1}R_i)^2\n",
"= 3192- \\frac{1}{6} * 126^2 = 546\n",
"= 3192- \\frac{1}{6} × 126^2 = 546\n",
"$$\n",
"$$\n",
"W = \\frac{S}{\\frac{1}{12}K^2(N^3-N)} = \\frac{546}{\\frac{1}{12}6^2(6^3-6)}\n",
@ -532,11 +532,11 @@
"Rater C: T = (2^3 - 2) + (2^3 - 2) = 12\n",
"$$\n",
"S=\\sum^6_{i=1}R_i^2-\\frac{1}{6}(\\sum^6_{i=1}R_i)^2\n",
"=791.5-\\frac{1}{6} * 63^2 = 130.00\n",
"=791.5-\\frac{1}{6} × 63^2 = 130.00\n",
"$$\n",
"$$\n",
"W = \\frac{S}{\\frac{1}{12}[K^2(N^3-N)-K\\sum^K_{i=1}T_i]}\n",
"=\\frac{130}{\\frac{1}{12}[3^2(6^3-6)-3*(6+12)]}\n",
"=\\frac{130}{\\frac{1}{12}[3^2(6^3-6)-3×(6+12)]}\n",
"=\\frac{130}{153} = 0.849\n",
"$$\n",
"W = 0.849 indicates that the experts' ratings are highly consistent"
@ -577,6 +577,275 @@
"print('p_value', p_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
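"## Qualitative-quantitative correlation analysis\n",
"Qualitative-quantitative correlation is the correlation between two variables when one is qualitative (categorical) and the other is quantitative. IQ, subject scores, height, and weight are quantitative variables; male/female, good/poor, and pass/fail are qualitative variables.\n",
"\n",
"The main types are biserial correlation, point-biserial correlation, and multiserial correlation."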
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
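"### Biserial correlation\n",
"When both variables are normally distributed continuous variables but one of them has been artificially dichotomized (for example, exam scores split by some criterion into pass/fail or admitted/not admitted, a physical fitness test result split into passed/not passed or up to standard/not up to standard, or health status split into good/poor), the correlation between the two variables is called the biserial correlation.\n",
"\n",
"**Conditions for using the biserial correlation:**\n",
"<ul>\n",
"    <li>Both variables are continuous, and the population is normally distributed, approximately normal, or at least unimodal and symmetric.\n",
"    <li>The relationship between the two variables is linear.\n",
"    <li>The dichotomy is artificial, and the cut point should be as close to the median as possible.\n",
"    <li>The sample size should be greater than 80.\n",
"</ul>\n",
"$$\n",
"R = \\frac{\\overline{X}_p-\\overline{X}_q}{σ} × \\frac{pq}{Y}\n",
"$$\n",
"\n",
"where\n",
"<ul>\n",
"<li>$p$ is the proportion of cases in one category of the dichotomous variable\n",
"<li>$q$ is the proportion of cases in the other category\n",
"<li>$\\overline{X}_p$ is the mean of the continuous variable for the cases in category p\n",
"<li>$\\overline{X}_q$ is the mean of the continuous variable for the cases in category q\n",
"<li>$σ$ is the standard deviation of the continuous variable\n",
"<li>$Y$ is the height of the ordinate of the normal curve at the point corresponding to p\n",
"</ul>"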
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
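"**Biserial correlation example:**\n",
"<br>The scores of 10 examinees are given below, including the total score and the score on one essay question. Find the discrimination of the essay question, where a score of 6 or above counts as a pass.\n",
"<img src=\"assets/20201121183005.png\" width=\"50%\">\n",
"The essay question has been artificially split into two categories, pass and fail, so the biserial correlation applies.\n",
"<br>\n",
"For $p=0.6$, the normal distribution table gives $x=0.25$. Substituting $x=0.25$ into the standard normal density\n",
"$$\n",
"Y=\\frac{1}{\\sqrt{2π}}e^{-\\frac{x^2}{2}}\n",
"$$\n",
"gives $Y=0.3866$. The remaining quantities are\n",
"$$\n",
"\\overline{X}_p = 67.33, \\overline{X}_q=61.25,σ=6.12\n",
"$$\n",
"The biserial correlation coefficient then follows from the formula:\n",
"$$\n",
"R=\\frac{\\overline{X}_p-\\overline{X}_q}{σ}×\\frac{pq}{Y}\n",
"=\\frac{67.33-61.25}{6.12}×\\frac{0.6×0.4}{0.3866} ≈0.62\n",
"$$\n",
"The discrimination is fairly high."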
]
},
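The biserial calculation above can be reproduced with a short sketch. It assumes Python 3.8+ for `statistics.NormalDist`; the helper name `biserial_r` is hypothetical and not part of the notebook. Because it uses the exact normal quantile for p = 0.6 instead of the rounded table value x = 0.25, the ordinate Y differs slightly in the fourth decimal, but the coefficient still rounds to 0.62.

```python
from statistics import NormalDist


def biserial_r(mean_p, mean_q, sigma, p):
    """Biserial correlation from summary statistics (hypothetical helper)."""
    q = 1.0 - p
    std_normal = NormalDist()        # standard normal, mean 0 and sd 1
    z = std_normal.inv_cdf(p)        # point with cumulative area p to its left
    y = std_normal.pdf(z)            # ordinate height Y at that point
    return (mean_p - mean_q) / sigma * (p * q) / y


# Summary values taken from the worked example above
r = biserial_r(mean_p=67.33, mean_q=61.25, sigma=6.12, p=0.6)
print(round(r, 2))
```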
{
"cell_type": "markdown",
"metadata": {},
"source": [
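"### Point-biserial correlation\n",
"When one of the two variables is a normally distributed continuous variable and the other is a genuinely dichotomous nominal variable (for example, male/female, married/unmarried, color-blind/not color-blind, alive/dead), the correlation between them is called the point-biserial correlation.\n",
"$$\n",
"R = \\frac{\\overline{X}_p-\\overline{X}_q}{σ} × \\sqrt{pq}\n",
"$$\n",
"\n",
"where\n",
"<ul>\n",
"<li>$p$ is the proportion of cases in one category of the dichotomous variable\n",
"<li>$q$ is the proportion of cases in the other category\n",
"<li>$\\overline{X}_p$ is the mean of the continuous variable for the cases in category p\n",
"<li>$\\overline{X}_q$ is the mean of the continuous variable for the cases in category q\n",
"<li>$σ$ is the standard deviation of the continuous variable\n",
"</ul>"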
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
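"**Point-biserial correlation example:**\n",
"<br>A test has 50 multiple-choice questions worth 2 points each. Given the total scores of 20 students and whether each answered question 5 correctly, how strongly is question 5 correlated with the total score?\n",
"<img src=\"assets/20201121183858.png\" width=\"50%\">\n",
"p (the proportion of students who answered correctly) = 10/20 = 0.5, q = 1 - p = 0.5\n",
"$$\n",
"\\overline{X}_p=88.4, \\overline{X}_q=74.8, σ=8.66\n",
"$$\n",
"$$\n",
"R = \\frac{\\overline{X}_p-\\overline{X}_q}{σ} × \\sqrt{pq}\n",
"= \\frac{88.4-74.8}{8.66}\\sqrt{0.5×0.5} = 0.785\n",
"$$\n",
"The correlation coefficient is fairly high: performance on question 5 is consistent with the total score (the discrimination is high)."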
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PointbiserialrResult(correlation=0.7849870641173371, pvalue=4.145927973490392e-05)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
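"# Import repeated here so the cell runs on its own\n",
"from scipy import stats\n",
"\n",
"# Using the example above: 1 = correct, 0 = wrong; x is the outcome on question 5, y is the total score\n",
"x = [1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,1,1,0,0,0]\n",
"y = [84,82,76,60,72,74,76,84,88,90,78,80,92,94,96,88,90,78,76,74]\n",
"stats.pointbiserialr(x, y)  # the coefficient is 0.7849, matching the hand calculation above"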
]
},
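As a cross-check on the library call, the point-biserial formula can also be evaluated by hand with the standard library alone. This is a minimal sketch; the variable names `correct_scores`, `wrong_scores`, and `r_pb` are illustrative, and it assumes σ is the population standard deviation of all 20 scores, which is what makes the formula agree with the Pearson correlation.

```python
from statistics import mean, pstdev

# Same data as the notebook cell: x = 1 if question 5 was answered correctly
x = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
y = [84, 82, 76, 60, 72, 74, 76, 84, 88, 90, 78, 80, 92, 94, 96, 88, 90, 78, 76, 74]

# Split the total scores by the outcome on question 5
correct_scores = [score for flag, score in zip(x, y) if flag == 1]
wrong_scores = [score for flag, score in zip(x, y) if flag == 0]

p = len(correct_scores) / len(x)   # proportion answering correctly
q = 1 - p
sigma = pstdev(y)                  # population standard deviation of all scores

r_pb = (mean(correct_scores) - mean(wrong_scores)) / sigma * (p * q) ** 0.5
print(round(r_pb, 4))
```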
{
"cell_type": "markdown",
"metadata": {},
"source": [
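"## Attribute correlation analysis\n",
"When both variables are divided into categories by quality, the correlation between them is called attribute correlation.\n",
"\n",
"For example, one variable is sex (male/female) and the other is subject performance (pass/fail); or one variable is school type (key/non-key) and the other is subject performance (excellent/good/average/poor)."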
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
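"**Contingency coefficient**\n",
"\n",
"When both variables are divided into more than two categories, or one of them is, the degree of correlation between them can be measured with the contingency coefficient. An example is whether respondent group (administrators, in-service teachers, parents) is related to attitude toward the current examination system (in favor, neutral, against).\n",
"\n",
"Suppose variable x is divided into a categories and y into b categories, with at least one of a and b greater than 2. The contingency coefficient between x and y is denoted C. Let $m_{ij}$ be the number of observations falling in category i of x (i = 1, 2, ..., a) and category j of y (j = 1, 2, ..., b). Define\n",
"$$\n",
"a_i = \\sum^b_{j=1}m_{ij} \\quad (i=1,2,...,a)\n",
"$$\n",
"$$\n",
"b_j = \\sum^a_{i=1}m_{ij} \\quad (j=1,2,...,b)\n",
"$$\n",
"and construct the statistic\n",
"$$\n",
"X^2 = N(\\sum \\sum \\frac{m_{ij}^2}{a_ib_j}-1)\n",
"$$\n",
"where $N= \\sum \\sum m_{ij}$. The contingency coefficient C is then\n",
"$$\n",
"C = \\sqrt{\\frac{X^2}{N+X^2}}\n",
"$$"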
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
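"**Example:**\n",
"A sample survey was conducted of 2531 students and teachers. Compute the contingency coefficient between respondent group and attitude, and test its statistical significance.\n",
"<img src=\"assets/20201121200013.png\" width=\"50%\">\n",
"Solution: compute $X^2$ from the formula\n",
"$$\n",
"X^2 = 2531(\\frac{446^2}{981×977}+\\frac{212^2}{730×977}+...+\\frac{177^2}{820×764}-1)\n",
"≈130.02\n",
"$$\n",
"$$\n",
"C=\\sqrt{\\frac{X^2}{N+X^2}}=\\sqrt{\\frac{130.02}{2531+130.02}}≈0.221\n",
"$$\n",
"The $X^2$ distribution table gives the critical value\n",
"$$\n",
"X^2_{0.01}(4)=13.277\n",
"$$\n",
"Since $X^2=130.02>13.277$, the contingency coefficient $C=0.221$ is statistically significant."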
]
},
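The contingency-coefficient formulas can be sketched in a few lines of Python. The function name `contingency_coefficient` and the small `demo` table are hypothetical (the survey table itself is only available as an image), so only the final step, recomputing C from the reported X² = 130.02 and N = 2531, uses the example's numbers.

```python
import math


def contingency_coefficient(table):
    """Return (X^2, C) for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]         # a_i
    col_totals = [sum(col) for col in zip(*table)]   # b_j
    n = sum(row_totals)                              # N
    ratio_sum = sum(
        m * m / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, m in enumerate(row)
    )
    # Clamp at zero so floating-point rounding cannot produce a negative X^2
    x2 = max(n * (ratio_sum - 1.0), 0.0)
    c = math.sqrt(x2 / (n + x2))
    return x2, c


# Hypothetical 2x3 table, NOT the survey data from the example
demo = [[30, 10, 20], [10, 30, 20]]
x2_demo, c_demo = contingency_coefficient(demo)
print(round(x2_demo, 2), round(c_demo, 3))

# C recomputed from the example's reported X^2 and N
c_example = math.sqrt(130.02 / (2531 + 130.02))
print(round(c_example, 3))
```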
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When both variables have exactly 2 categories, a different set of formulas is used"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
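"## Partial correlation analysis\n",
"In a geographic system made up of many factors, studying how closely two factors are related while setting aside the influence of the other factors is called partial correlation. The statistic that measures the degree of partial correlation is the partial correlation coefficient.\n",
"\n",
"When analyzing the net correlation between x1 and x2 after controlling for the linear effect of x3, the first-order partial correlation coefficient between x1 and x2 is defined as\n",
"$$\n",
"r_{12.3} = \\frac{r_{12}-r_{13}r_{23}}{\\sqrt{(1-r_{13}^2)(1-r_{23}^2)}}\n",
"$$"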
]
},
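The first-order partial correlation formula translates directly into code. This is a minimal sketch: the function name `partial_corr` is hypothetical, and the input correlations are made-up illustration values, not entries from the notebook's correlation matrix (which is only shown as an image).

```python
import math


def partial_corr(r12, r13, r23):
    """First-order partial correlation r_{12.3} from three simple correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Made-up simple correlations, for illustration only
r12_3 = partial_corr(r12=0.5, r13=0.3, r23=0.4)
print(round(r12_3, 4))
```

When x3 is uncorrelated with both x1 and x2 (r13 = r23 = 0), the function returns r12 unchanged, as the formula implies.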
{
"cell_type": "markdown",
"metadata": {},
"source": [
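"For 23 samples of four geographic factors x1, x2, x3, x4, the following matrix of simple correlation coefficients was computed:\n",
"<img src=\"assets/20201121202149.png\" width=\"50%\">\n",
"From it, some of the partial correlation coefficients can be computed:\n",
"<img src=\"assets/20201121202207.png\" width=\"50%\">"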
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
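"**Properties of the partial correlation coefficient**\n",
"<ul>\n",
"    <li>The partial correlation coefficient ranges from -1 to 1.\n",
"    <li>The larger its absolute value, the stronger the partial correlation.\n",
"    <li>Its absolute value is less than or at most equal to the multiple correlation coefficient computed from the same data, that is, $R_{1.23} \\ge |r_{12.3}|$.\n",
"</ul>\n",
"\n",
"**Significance test for the partial correlation coefficient**\n",
"\n",
"The statistic\n",
"$$\n",
"t=\\frac{r\\sqrt{n-k-2}}{\\sqrt{1-r^2}}\n",
"$$\n",
"follows a t distribution with n-k-2 degrees of freedom, where\n",
"<ul>\n",
"<li>n is the sample size\n",
"<li>k is the number of variables that are controlled for\n",
"<li>r is the partial correlation coefficient\n",
"</ul>\n",
"With 3 factors there are three partial correlation coefficients, called first-order partial correlation coefficients.\n",
"\n",
"With 4 factors there are six partial correlation coefficients, called second-order partial correlation coefficients."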
]
},
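The test statistic for a partial correlation coefficient can be computed as below. This is a sketch under stated assumptions: the function name `partial_corr_t` is hypothetical, and the value r = 0.5 is made up for illustration; only n = 23 comes from the notebook's example. The resulting t would then be compared with a critical value of the t distribution with n - k - 2 degrees of freedom.

```python
import math


def partial_corr_t(r, n, k):
    """t statistic for a partial correlation r, sample size n, k controlled variables."""
    dof = n - k - 2                       # degrees of freedom of the t distribution
    return r * math.sqrt(dof) / math.sqrt(1 - r ** 2), dof


# r = 0.5 is a made-up value; n = 23 matches the sample size in the example
t_stat, dof = partial_corr_t(r=0.5, n=23, k=1)
print(round(t_stat, 3), dof)
```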
{
"cell_type": "markdown",
"metadata": {},
"source": [
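"## Multiple correlation coefficient\n",
"<ul>\n",
"<li>It reflects the degree of correlation between several factors and one factor. The multiple correlation coefficient lies between 0 and 1.\n",
"<li>The larger it is, the closer the relationship among the factors (variables). A value of 1 means perfect correlation; a value of 0 means no correlation at all.\n",
"<li>It is greater than or at least equal to the absolute value of any simple correlation coefficient.\n",
"</ul>\n",
"\n",
"For a dependent variable y with two independent variables:\n",
"$$\n",
"R_{y.12}=\\sqrt{1-(1-r^2_{y1})(1-r^2_{y2.1})}\n",
"$$\n",
"With three independent variables:\n",
"$$\n",
"R_{y.123}=\\sqrt{1-(1-r^2_{y1})(1-r^2_{y2.1})(1-r^2_{y3.12})}\n",
"$$"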
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
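"**Example:**\n",
"\n",
"In the example above, take x4 as the dependent variable and x1, x2, x3 as the independent variables, and compute the multiple correlation coefficient between x4 and x1, x2, x3.\n",
"$$\n",
"R_{4.123}=\\sqrt{1-(1-r^2_{41})(1-r^2_{42.1})(1-r^2_{43.12})}\n",
"$$\n",
"$$\n",
"=\\sqrt{1-(1-0.579^2)(1-0.956^2)(1-0.337^2)} = 0.974\n",
"$$"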
]
},
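The arithmetic in this example is easy to verify numerically. The sketch below uses a hypothetical helper `multiple_corr` that takes the chain of coefficients in the order they appear in the formula (r_41, r_42.1, r_43.12); the three values are those quoted in the example.

```python
import math


def multiple_corr(*coefficients):
    """Multiple correlation from r_{y1}, r_{y2.1}, r_{y3.12}, ... (hypothetical helper)."""
    unexplained = 1.0
    for r in coefficients:
        unexplained *= 1 - r ** 2   # share of variance left unexplained at each step
    return math.sqrt(1 - unexplained)


# Coefficients quoted in the worked example: r_41, r_42.1, r_43.12
R_4_123 = multiple_corr(0.579, 0.956, 0.337)
print(round(R_4_123, 3))
```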
{
"cell_type": "code",
"execution_count": null,
