{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_1-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "c18d3bd0bd8ae3878597e89dcd1fa5c1",
"translation_date": "2025-09-06T13:43:37+00:00",
"source_file": "2-Regression/1-Tools/solution/R/lesson_1-R.ipynb",
"language_code": "sw"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "YJUHCXqK57yz"
}
},
{
"cell_type": "markdown",
"source": [
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you are looking for. If you want to predict the probable height of a person of a given age, you would use `linear regression`, because you are seeking a **numeric value**. If you are interested in discovering whether a type of cuisine should be considered vegetarian or not, you are looking for a **category assignment**, so you would use `logistic regression`. You will learn more about logistic regression later. Think a bit about the questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you want to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might reveal information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"<p>\n",
" <img src=\"../../images/encouRage.jpg\"\n",
" width=\"630\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"</p>\n",
"\n",
"<!--![Artwork by \\@allison_horst](../../../../../../2-Regression/1-Tools/images/encouRage.jpg)<br>Artwork by @allison_horst-->\n"
],
"metadata": {
"id": "LWNNzfqd6feZ"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Setting up our tools\n",
"\n",
"For this task, we will need the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs any that are missing.\n"
],
"metadata": {
"id": "FIo2YhO26wI9"
}
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"pacman::p_load(tidyverse, tidymodels)"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Loading required package: pacman\n",
"\n"
]
}
],
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
}
},
{
"cell_type": "markdown",
"source": [
"Now, let's load these awesome packages and make them available in our current R session. (This is for illustration only; `pacman::p_load()` already did that for you.)\n"
],
"metadata": {
"id": "gpO_P_6f9WUG"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# load the core Tidyverse packages\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# load the core Tidymodels packages\r\n",
"library(tidymodels)\r\n"
],
"outputs": [],
"metadata": {
"id": "NLMycgG-9ezO"
}
},
{
"cell_type": "markdown",
"source": [
"## 2. The diabetes dataset\n",
"\n",
"In this exercise, we will put our regression skills to work by making predictions on a diabetes dataset. The [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt) contains `442 samples` of diabetes-related data, with 10 predictor features, `age`, `sex`, `body mass index`, `average blood pressure`, and `six blood serum measurements`, as well as an outcome variable `y`: a quantitative measure of disease progression one year after baseline.\n",
"\n",
"|Property|Value|\n",
"|:---|:---|\n",
"|Number of observations|442|\n",
"|Number of predictors|The first 10 columns are numeric predictors|\n",
"|Outcome/Target|Column 11 is a quantitative measure of disease progression one year after baseline|\n",
"|Predictor information|age: age in years<br>sex<br>bmi: body mass index<br>bp: average blood pressure<br>s1: tc, total serum cholesterol<br>s2: ldl, low-density lipoproteins<br>s3: hdl, high-density lipoproteins<br>s4: tch, total cholesterol / HDL<br>s5: ltg, possibly log of serum triglycerides level<br>s6: glu, blood sugar level|\n",
"\n",
"> 🎓 Remember, this is supervised learning, so we need a named target, 'y'.\n",
"\n",
"Before you can manipulate data with R, you need to import the data into R's memory, or build a connection to the data that R can use to access it remotely.\n",
"\n",
"> The [readr](https://readr.tidyverse.org/) package, which is part of the Tidyverse, provides a fast and friendly way to read rectangular data into R.\n",
"\n",
"Now, let's load the diabetes dataset provided at this source URL: <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html>\n",
"\n",
"We will also perform a sanity check on our data using `glimpse()` and display the first 5 rows using `slice()`.\n",
"\n",
"Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n",
"\n",
"The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n"
],
"metadata": {
"id": "KM6iXLH996Cl"
}
},
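{
"cell_type": "markdown",
"source": [
"As a minimal sketch of the \"and then\" reading of the pipe, the cell below computes the same value twice: once with nested calls and once with `%>%`. The vector `x` and the names `nested` and `piped` are made up for illustration, and `magrittr` (the package that provides `%>%` and is installed alongside the Tidyverse) is loaded explicitly so the cell can run on its own.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Illustrative only: `x`, `nested` and `piped` are hypothetical names\n",
"library(magrittr)\n",
"\n",
"x <- c(1, 4, 9)\n",
"\n",
"# Nested calls are read inside out: sqrt, then mean, then round\n",
"nested <- round(mean(sqrt(x)), 1)\n",
"\n",
"# The pipe reads left to right: take x, and then sqrt, and then mean, and then round\n",
"piped <- x %>% sqrt() %>% mean() %>% round(1)\n",
"\n",
"# Check that both spellings give the same result\n",
"identical(nested, piped)"
],
"outputs": [],
"metadata": {}
},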
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Import the data set (read_table() replaces read_table2(), which is deprecated as of readr 2.0)\r\n",
"diabetes <- read_table(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\r\n",
"\r\n",
"\r\n",
"# Get a glimpse and dimensions of the data\r\n",
"glimpse(diabetes)\r\n",
"\r\n",
"\r\n",
"# Select the first 5 rows of the data\r\n",
"diabetes %>% \r\n",
" slice(1:5)"
],
"outputs": [],
"metadata": {
"id": "Z1geAMhM-bSP"
}
},
{
"cell_type": "markdown",
"source": [
"`glimpse()` shows that this data has 442 rows and 11 columns, with all columns of data type `double`.\n",
"\n",
"<br>\n",
"\n",
"> glimpse() and slice() are functions in [`dplyr`](https://dplyr.tidyverse.org/). Dplyr, part of the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs to help you solve the most common data manipulation challenges.\n",
"\n",
"<br>\n",
"\n",
"Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This requires us to select the desired columns. So, how do we do this?\n",
"\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame.\n"
],
"metadata": {
"id": "UwjVT1Hz-c3Z"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select predictor feature `bmi` and outcome `y`\r\n",
"diabetes_select <- diabetes %>% \r\n",
" select(c(bmi, y))\r\n",
"\r\n",
"# Print the first 10 rows\r\n",
"diabetes_select %>% \r\n",
" slice(1:10)"
],
"outputs": [],
"metadata": {
"id": "RDY1oAKI-m80"
}
},
{
"cell_type": "markdown",
"source": [
"## 3. Training and Testing data\n",
"\n",
"It is common practice in supervised learning to *split* the data into two subsets: a (typically larger) set used to train the model, and a smaller \"hold-back\" set used to see how the model performed.\n",
"\n",
"Now that we have the data ready, we can see whether a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that holds the information on *how* to split the data, and then use two more rsample functions to extract the resulting training and testing sets:\n"
],
"metadata": {
"id": "SDk668xK-tc3"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"set.seed(2056)\r\n",
"# Split 67% of the data for training and the rest for testing\r\n",
"diabetes_split <- diabetes_select %>% \r\n",
" initial_split(prop = 0.67)\r\n",
"\r\n",
"# Extract the resulting train and test sets\r\n",
"diabetes_train <- training(diabetes_split)\r\n",
"diabetes_test <- testing(diabetes_split)\r\n",
"\r\n",
"# Print the first 10 rows of the training set\r\n",
"diabetes_train %>% \r\n",
" slice(1:10)"
],
"outputs": [],
"metadata": {
"id": "EqtHx129-1h-"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Train a linear regression model with Tidymodels\n",
"\n",
"Now we are ready to train our model!\n",
"\n",
"In Tidymodels, you specify models with the `parsnip` package by defining three concepts:\n",
"\n",
"- Model **type** differentiates models such as linear regression, logistic regression, decision tree models, and so forth.\n",
"\n",
"- Model **mode** includes common options like regression and classification; some model types support either of these, while others have only one mode.\n",
"\n",
"- Model **engine** is the computational tool that will be used to fit the model. Often these are R functions or packages, such as **`\"lm\"`** or **`\"ranger\"`**\n",
"\n",
"This modeling information is captured in a model specification, so let's build one!\n"
],
"metadata": {
"id": "sBOS-XhB-6v7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Build a linear model specification\r\n",
"lm_spec <- \r\n",
" # Type\r\n",
" linear_reg() %>% \r\n",
" # Engine\r\n",
" set_engine(\"lm\") %>% \r\n",
" # Mode\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Print the model specification\r\n",
"lm_spec"
],
"outputs": [],
"metadata": {
"id": "20OwEw20--t3"
}
},
{
"cell_type": "markdown",
"source": [
"After a model has been *specified*, it can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically with a formula and some data.\n",
"\n",
"`y ~ .` means we will fit `y` as the predicted quantity/target, explained by all the predictors/features, i.e. `.` (in this case, we have only one predictor: `bmi`)\n"
],
"metadata": {
"id": "_oDHs89k_CJj"
}
},
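{
"cell_type": "markdown",
"source": [
"To see what the `.` shorthand expands to, here is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the diabetes data). It calls base R's `lm()` directly, the function behind the `\"lm\"` engine, rather than going through parsnip, and checks that `y ~ .` and `y ~ bmi` give the same coefficients when `bmi` is the only other column.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Hypothetical toy data, for illustration only\n",
"toy <- data.frame(\n",
"  bmi = c(21.5, 24.0, 27.3, 30.1, 33.8),\n",
"  y   = c(90, 110, 150, 170, 220)\n",
")\n",
"\n",
"# `.` stands for every column in `data` other than the outcome\n",
"fit_dot <- lm(y ~ ., data = toy)\n",
"\n",
"# Naming the single predictor explicitly\n",
"fit_bmi <- lm(y ~ bmi, data = toy)\n",
"\n",
"# Compare the fitted intercept and slope of the two models\n",
"all.equal(coef(fit_dot), coef(fit_bmi))"
],
"outputs": [],
"metadata": {}
},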
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Build a linear model specification\r\n",
"lm_spec <- linear_reg() %>% \r\n",
" set_engine(\"lm\") %>%\r\n",
" set_mode(\"regression\")\r\n",
"\r\n",
"\r\n",
"# Train a linear regression model\r\n",
"lm_mod <- lm_spec %>% \r\n",
" fit(y ~ ., data = diabetes_train)\r\n",
"\r\n",
"# Print the model\r\n",
"lm_mod"
],
"outputs": [],
"metadata": {
"id": "YlsHqd-q_GJQ"
}
},
{
"cell_type": "markdown",
"source": [
"From the model output, we can see the coefficients learned during training. They are the coefficients of the line of best fit, the line that minimizes the overall error (for the `lm` engine, the sum of squared residuals) between the actual and predicted values.\n",
"\n",
"<br>\n",
"\n",
"## 5. Make predictions on the test set\n",
"\n",
"Now that we have trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups.\n"
],
"metadata": {
"id": "kGZ22RQj_Olu"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for the test set\r\n",
"predictions <- lm_mod %>% \r\n",
" predict(new_data = diabetes_test)\r\n",
"\r\n",
"# Print out some of the predictions\r\n",
"predictions %>% \r\n",
" slice(1:5)"
],
"outputs": [],
"metadata": {
"id": "nXHbY7M2_aao"
}
},
{
"cell_type": "markdown",
"source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n",
"When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a format usable for subsequent operations such as plotting.\n",
"\n",
"`dplyr::bind_cols()` efficiently binds multiple data frames by column.\n"
],
"metadata": {
"id": "R_JstwUY_bIs"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Combine the predictions and the original test set\r\n",
"results <- diabetes_test %>% \r\n",
" bind_cols(predictions)\r\n",
"\r\n",
"\r\n",
"results %>% \r\n",
" slice(1:5)"
],
"outputs": [],
"metadata": {
"id": "RybsMJR7_iI8"
}
},
{
"cell_type": "markdown",
"source": [
"## 6. Plot the modeling results\n",
"\n",
"Now it is time to see this visually 📈. We will create a scatter plot of all the `y` and `bmi` values in the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n",
"\n",
"R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. It lets you compose graphs by **combining independent components**.\n"
],
"metadata": {
"id": "XJbYbMZW_n_s"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set a theme for the plot\r\n",
"theme_set(theme_light())\r\n",
"# Create a scatter plot\r\n",
"results %>% \r\n",
" ggplot(aes(x = bmi)) +\r\n",
" # Add a scatter plot\r\n",
" geom_point(aes(y = y), size = 1.6) +\r\n",
" # Add a line plot (ggplot2 >= 3.4.0 uses `linewidth` for lines; `size` is deprecated there)\r\n",
" geom_line(aes(y = .pred), color = \"blue\", linewidth = 1.5)"
],
"outputs": [],
"metadata": {
"id": "R9tYp3VW_sTn"
}
},
{
"cell_type": "markdown",
"source": [
"> ✅ Think a bit about what is going on here. A straight line runs through many small dots of data, but what is it doing exactly? Can you see how you could use this line to predict where a new, unseen data point should fall in relation to the plot's y axis? Try to put the practical use of this model into words.\n",
"\n",
"Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n"
],
"metadata": {
"id": "zrPtHIxx_tNI"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}
]
}