{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a logistic regression model - Lesson 4\n",
"\n",
"#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n",
"\n",
"#### Introduction\n",
"\n",
"In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns that predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- Techniques for logistic regression\n",
"\n",
"✅ Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott)\n",
"\n",
"## Prerequisite\n",
"\n",
"Having worked with the pumpkin data, we are now familiar enough with it to realize that there is one binary category we can work with: `Color`.\n",
"\n",
"Let's build a logistic regression model to predict, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n",
"\n",
"> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.\n",
"\n",
"For this lesson, we will need the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor](https://github.com/sfirke/janitor) package provides simple tools for examining and cleaning dirty data.\n",
"\n",
"- `ggbeeswarm`: The [ggbeeswarm](https://github.com/eclarke/ggbeeswarm) package provides methods to create beeswarm-style plots using ggplot2.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you if any are missing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Define the question**\n",
"\n",
"For our purposes, we will express this as a binary: 'White' or 'Not White'. There is also a 'striped' category in our dataset, but there are very few instances of it, so we will not use it. It disappears once we remove null values from the dataset anyway.\n",
"\n",
"> 🎃 Fun fact: we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they are cool looking! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. 👻\n",
"\n",
"## **About logistic regression**\n",
"\n",
"Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n",
"\n",
"#### **Binary classification**\n",
"\n",
"Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continuous values`, for example, given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n",
"\n",
"### Other classifications\n",
"\n",
"There are other types of logistic regression, including multinomial and ordinal:\n",
"\n",
"- **Multinomial**, which involves having more than two categories - \"Orange, White, and Striped\".\n",
"\n",
"- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\n",
"\n",
"#### **Variables DO NOT have to correlate**\n",
"\n",
"Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to correlate. That works for this data, which has somewhat weak correlations.\n",
"\n",
"#### **You need a lot of clean data**\n",
"\n",
"Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n",
"\n",
"✅ Think about the types of data that would lend themselves well to logistic regression\n",
"\n",
"## Exercise - tidy the data\n",
"\n",
"First, clean the data a bit, dropping null values and selecting only some of the columns:\n",
"\n",
"1. Add the following code:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Load the core tidyverse packages and janitor (provides clean_names())\n",
"library(tidyverse)\n",
"library(janitor)\n",
"\n",
"# Import the data and clean column names\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>%\n",
"  clean_names()\n",
"\n",
"# Select desired columns\n",
"pumpkins_select <- pumpkins %>%\n",
"  select(city_name, package, variety, origin, item_size, color)\n",
"\n",
"# Drop rows containing missing values and encode color as factor (category)\n",
"pumpkins_select <- pumpkins_select %>%\n",
"  drop_na() %>%\n",
"  mutate(color = factor(color))\n",
"\n",
"# View the first few rows\n",
"pumpkins_select %>%\n",
"  slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always take a quick look at your new dataframe by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function, as shown below:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"pumpkins_select %>%\n",
"  glimpse()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's confirm that we will actually be dealing with a binary classification problem:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Subset distinct observations in outcome column\n",
"pumpkins_select %>%\n",
"  distinct(color)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization - categorical plot\n",
"By now you have loaded the pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using the ggplot library.\n",
"\n",
"The ggplot library offers some neat ways to visualize your data. For example, you can compare the distributions of the data for each Variety and Color in a categorical plot.\n",
"\n",
"1. Create such a plot by using the `geom_bar` function with our pumpkin data, specifying a color mapping for each pumpkin category (orange or white):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Specify colors for each value of the fill variable\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# Create the bar plot\n",
"ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n",
"  geom_bar(position = \"dodge\") +\n",
"  scale_fill_manual(values = palette) +\n",
"  labs(y = \"Variety\", fill = \"Color\") +\n",
"  theme_minimal()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the data, you can see how the Color data relates to Variety.\n",
"\n",
"✅ Given this categorical plot, what are some interesting explorations you can envision?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data pre-processing: feature encoding\n",
"\n",
"Our pumpkins dataset contains string values for all its columns. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work well with numbers. That's why encoding is a very important step in the data pre-processing phase: it lets us turn categorical data into numerical data without losing any information. Good encoding leads to building a good model.\n",
"\n",
"For feature encoding there are two main types of encoders:\n",
"\n",
"1. **Ordinal encoder**: It suits ordinal variables well, which are categorical variables whose data follows a logical ordering, like the `item_size` column in our dataset. It creates a mapping in which each category is represented by a number, which is the position of the category in the ordering.\n",
"\n",
"2. **Categorical encoder**: It suits nominal variables well, which are categorical variables whose data does not follow a logical ordering, like all the features other than `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable is equal to 1 if the pumpkin belongs to that category (for example, a given Variety) and 0 otherwise.\n",
"\n",
"Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/) - a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers, `prep` it to estimate the quantities and statistics needed by any operations, and finally `bake` it to apply the computations to new data.\n",
"\n",
"> Normally, recipes is used as a preprocessor for modelling, where it defines what steps should be applied to a data set in order to get it ready for modelling. In that case, it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll see all this in just a moment.\n",
">\n",
"> However, for now we are using recipes + prep + bake to specify what steps should be applied to a data set in order to get it ready for data analysis, and then to extract the preprocessed data with the steps applied.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Preprocess and extract data to allow some data analysis\n",
"baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n",
"  # Define ordering for item_size column\n",
"  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
"  # Convert factors to numbers using the order defined above (Ordinal encoding)\n",
"  step_integer(item_size, zero_based = FALSE) %>%\n",
"  # Encode all other predictors using one hot encoding\n",
"  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n",
"  # Estimate the preprocessing steps on pumpkins_select\n",
"  prep(training = pumpkins_select) %>%\n",
"  # Return the preprocessed version of the training data\n",
"  bake(new_data = NULL)\n",
"\n",
"# Display the first few rows of preprocessed data\n",
"baked_pumpkins %>%\n",
"  slice_head(n = 5)\n"
]
},
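{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two encodings used in the recipe above can be sketched in base R on a tiny made-up example. This is only an illustration under stated assumptions: the `sizes` and `varieties` values are hypothetical, and `as.integer()` on an ordered factor and `model.matrix()` merely stand in for what `step_integer()` and `step_dummy(one_hot = TRUE)` do inside a recipe.\n",
"\n",
"```r\n",
"# Hypothetical toy data (not taken from the pumpkin dataset)\n",
"sizes <- c('med', 'sml', 'lge', 'med')\n",
"varieties <- factor(c('A', 'B', 'A', 'C'))\n",
"\n",
"# Ordinal encoding: declare the order, then use each level's position\n",
"size_levels <- c('sml', 'med', 'lge')\n",
"size_encoded <- as.integer(factor(sizes, levels = size_levels, ordered = TRUE))\n",
"print(size_encoded)\n",
"\n",
"# One-hot encoding: one 0/1 indicator column per category\n",
"one_hot <- model.matrix(~ varieties - 1)\n",
"print(one_hot)\n",
"```\n",
"\n",
"The ordinal codes preserve the size ordering in a single numeric column, while the one-hot matrix has exactly one 1 per row and implies no ordering among the categories.\n"
]
},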
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✅ What are the advantages of using an ordinal encoder for the Item Size column?\n",
"\n",
"### Analyse relationships between variables\n",
"\n",
"Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is to plot the data.\n",
"We'll once again use the ggplot `geom_boxplot` function to visualize the relationships between Item Size, Variety and Color in a categorical plot. To plot the data better, we'll use the encoded Item Size column and the unencoded Variety column.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Define the color palette\n",
"palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n",
"\n",
"# We need the encoded Item Size column to use it as the x-axis values in the plot\n",
"pumpkins_select_plot <- pumpkins_select\n",
"pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n",
"\n",
"# Create the grouped box plot\n",
"ggplot(pumpkins_select_plot, aes(x = item_size, y = color, fill = color)) +\n",
"  geom_boxplot() +\n",
"  facet_grid(variety ~ ., scales = \"free_x\") +\n",
"  scale_fill_manual(values = palette) +\n",
"  labs(x = \"Item Size\", y = \"\") +\n",
"  theme_minimal() +\n",
"  theme(strip.text = element_text(size = 12)) +\n",
"  theme(axis.text.x = element_text(size = 10)) +\n",
"  theme(axis.title.x = element_text(size = 12)) +\n",
"  theme(axis.title.y = element_blank()) +\n",
"  theme(legend.position = \"bottom\") +\n",
"  guides(fill = guide_legend(title = \"Color\")) +\n",
"  theme(panel.spacing = unit(0.5, \"lines\")) +\n",
"  theme(strip.text.y = element_text(size = 4, hjust = 0))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use a swarm plot\n",
"\n",
"Since Color is a binary category (White or Not White), it needs '[a specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)' to visualization.\n",
"\n",
"Try a `swarm plot` to show the distribution of color with respect to the item size.\n",
"\n",
"We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm), which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create beeswarm plots of color and item_size\n",
"baked_pumpkins %>%\n",
"  mutate(color = factor(color)) %>%\n",
"  ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n",
"  geom_quasirandom() +\n",
"  scale_color_brewer(palette = \"Dark2\", direction = -1) +\n",
"  theme(legend.position = \"none\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n",
"\n",
"## Build your model\n",
"\n",
"Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Split data into 80% for training and 20% for testing\n",
"set.seed(2056)\n",
"pumpkins_split <- pumpkins_select %>%\n",
"  initial_split(prop = 0.8)\n",
"\n",
"# Extract the data in each split\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Print out the first 5 rows of the training set\n",
"pumpkins_train %>%\n",
"  slice_head(n = 5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"🙌 We are now ready to train a model by fitting the training features to the training label (color).\n",
"\n",
"We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e. encoding categorical variables into a set of integers. Just as for `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it, since it will be bundled into a workflow, which you will see in just a few steps.\n",
"\n",
"There are several ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we'll specify a logistic regression model via the default `stats::glm()` engine.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Create a recipe that specifies preprocessing steps for modelling\n",
"pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>%\n",
"  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n",
"  step_integer(item_size, zero_based = FALSE) %>%\n",
"  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n",
"\n",
"# Create a logistic model specification\n",
"log_reg <- logistic_reg() %>%\n",
"  set_engine(\"glm\") %>%\n",
"  set_mode(\"classification\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a recipe and a model specification, we need a way of bundling them together into an object that will first preprocess the data (prep + bake behind the scenes), fit the model on the preprocessed data, and also allow for potential post-processing activities.\n",
"\n",
"In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/), and it holds your modeling components.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Bundle modelling components in a workflow\n",
"log_reg_wf <- workflow() %>%\n",
"  add_recipe(pumpkins_recipe) %>%\n",
"  add_model(log_reg)\n",
"\n",
"# Print out the workflow\n",
"log_reg_wf\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate the recipe and preprocess the data before training, so we won't have to do that manually using prep and bake.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Train the model\n",
"wf_fit <- log_reg_wf %>%\n",
"  fit(data = pumpkins_train)\n",
"\n",
"# Print the trained workflow\n",
"wf_fit\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model printout shows the coefficients learned during training.\n",
"\n",
"Now that we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the predicted probability of `WHITE` (`.pred_WHITE`) is more than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make predictions for color and corresponding probabilities\n",
"results <- pumpkins_test %>% select(color) %>%\n",
"  bind_cols(wf_fit %>%\n",
"              predict(new_data = pumpkins_test)) %>%\n",
"  bind_cols(wf_fit %>%\n",
"              predict(new_data = pumpkins_test, type = \"prob\"))\n",
"\n",
"# Compare predictions\n",
"results %>%\n",
"  slice_head(n = 10)\n"
]
},
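{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 0.5 cut-off described above can be illustrated with a minimal base R sketch. It assumes some hypothetical log-odds values (`log_odds`, made up for illustration and not produced by the fitted workflow) and shows how the logistic function turns them into probabilities and then into class labels.\n",
"\n",
"```r\n",
"# Hypothetical log-odds (linear predictor) values for three pumpkins\n",
"log_odds <- c(-2, 0, 1.5)\n",
"\n",
"# The logistic (sigmoid) function maps log-odds to probabilities in (0, 1)\n",
"sigmoid <- function(z) 1 / (1 + exp(-z))\n",
"prob_white <- sigmoid(log_odds)\n",
"\n",
"# Classify as WHITE when the probability exceeds 0.5, otherwise ORANGE\n",
"pred_class <- ifelse(prob_white > 0.5, 'WHITE', 'ORANGE')\n",
"print(data.frame(log_odds, prob_white, pred_class))\n",
"```\n",
"\n",
"The helper `sigmoid()` is written out only for clarity; base R's `plogis()` computes the same function.\n"
]
},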
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very nice! This provides some more insight into how logistic regression works.\n",
"\n",
"### Better comprehension via a confusion matrix\n",
"\n",
"Comparing each prediction with its corresponding \"ground truth\" value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.\n",
"\n",
"One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by the model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n",
"\n",
"The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Confusion matrix for prediction results\n",
"conf_mat(data = results, truth = color, estimate = .pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `white` and category `not-white`.\n",
"\n",
"- If your model predicts a pumpkin as white and it belongs to category 'white' in reality, we call it a `true positive`, shown in the top-left cell of the table below.\n",
"\n",
"- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality, we call it a `false negative`, shown in the bottom-left cell.\n",
"\n",
"- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality, we call it a `false positive`, shown in the top-right cell.\n",
"\n",
"- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality, we call it a `true negative`, shown in the bottom-right cell.\n",
"\n",
"| | Truth: WHITE | Truth: ORANGE |\n",
"|---------------|--------|-------|\n",
"| **Predicted: WHITE** | TP | FP |\n",
"| **Predicted: ORANGE** | FN | TN |\n",
"\n",
"> Note: `conf_mat()` lists classes in factor-level order, so its printed output shows `ORANGE` before `WHITE`; match cells by their row and column labels rather than by position.\n",
"\n",
"As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.\n",
"\n",
"The confusion matrix is helpful because it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n",
"\n",
"🎓 Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\").\n",
"\n",
"🎓 Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples that were actually positive. Also known as `sensitivity`.\n",
"\n",
"🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples that were actually negative.\n",
"\n",
"🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)` The percentage of labels predicted accurately for a sample.\n",
"\n",
"🎓 F Measure: The harmonic mean of precision and recall (in its default F1 form), with the best being 1 and the worst being 0.\n",
"\n",
"Let's calculate these metrics!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Combine metric functions and calculate them all at once\n",
"eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n",
"eval_metrics(data = results, truth = color, estimate = .pred_class)\n"
]
},
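{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the metric definitions above follow from a confusion matrix, here is a small worked sketch in base R. The four counts are hypothetical values chosen for illustration; they are not results from this notebook.\n",
"\n",
"```r\n",
"# Hypothetical confusion-matrix counts (not results from this notebook)\n",
"TP <- 40; FP <- 5; FN <- 10; TN <- 45\n",
"\n",
"precision   <- TP / (TP + FP)\n",
"recall      <- TP / (TP + FN)\n",
"specificity <- TN / (TN + FP)\n",
"accuracy    <- (TP + TN) / (TP + TN + FP + FN)\n",
"\n",
"# F1 is the harmonic mean of precision and recall\n",
"f1 <- 2 * precision * recall / (precision + recall)\n",
"\n",
"print(round(c(precision = precision, recall = recall,\n",
"              specificity = specificity, accuracy = accuracy, f1 = f1), 3))\n",
"```\n",
"\n",
"When comparing hand calculations like these with yardstick output, keep in mind that yardstick treats the first factor level (here `ORANGE`) as the event of interest by default; passing `event_level = \"second\"` to a metric should make `WHITE` the positive class instead.\n"
]
},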
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the ROC curve of this model\n",
"\n",
"Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Make a roc_curve\n",
"results %>%\n",
"  roc_curve(color, .pred_ORANGE) %>%\n",
"  autoplot()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the diagonal line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.\n",
"\n",
"Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Calculate area under curve\n",
"results %>%\n",
"  roc_auc(color, .pred_ORANGE)\n"
]
},
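{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking interpretation of AUC can be checked by hand on a toy example. The sketch below uses hypothetical scores (`pos_scores` and `neg_scores` are made up for illustration) and simply counts how often a positive example is scored above a negative one; it does not call yardstick.\n",
"\n",
"```r\n",
"# Hypothetical predicted scores for positive and negative examples\n",
"pos_scores <- c(0.9, 0.8, 0.4)\n",
"neg_scores <- c(0.7, 0.3, 0.2)\n",
"\n",
"# Compare every positive with every negative; ties count as half\n",
"pairs <- expand.grid(pos = pos_scores, neg = neg_scores)\n",
"auc <- mean((pairs$pos > pairs$neg) + 0.5 * (pairs$pos == pairs$neg))\n",
"print(auc)\n",
"```\n",
"\n",
"Here 8 of the 9 positive-negative pairs are ranked correctly, so the pairwise estimate equals 8/9.\n"
]
},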
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is around `0.975`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n",
"\n",
"In future lessons on classification, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n",
"\n",
"## 🚀Challenge\n",
"\n",
"There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n",
"\n",
"## Review & Self Study\n",
"\n",
"Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression task that we have studied up to this point. What would work best?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "feaf125f481a89c468fa115bf2aed580",
"translation_date": "2025-09-06T13:34:32+00:00",
"source_file": "2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb",
"language_code": "sw"
}
},
"nbformat": 4,
"nbformat_minor": 1
}