{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_3-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "5015d65d61ba75a223bfc56c273aa174",
"translation_date": "2025-09-06T13:21:17+00:00",
"source_file": "2-Regression/3-Linear/solution/R/lesson_3-R.ipynb",
"language_code": "sw"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "EgQw8osnsUV-"
}
},
{
"cell_type": "markdown",
"source": [
"## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"#### Introduction\n",
"\n",
"So far you have explored what regression is, using sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪\n",
"\n",
"Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.\n",
"\n",
"> Throughout this curriculum, we assume minimal knowledge of math and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid comprehension.\n",
"\n",
"#### Preparation\n",
"\n",
"As a reminder, you are loading this data so that you can ask questions of it.\n",
"\n",
"- When is the best time to buy pumpkins?\n",
"\n",
"- What price can I expect for a case of miniature pumpkins?\n",
"\n",
"- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.\n",
"\n",
"In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points, and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️♀️\n",
"\n",
"For this task, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n",
"\n",
"- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"corrplot\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case any are missing.\n"
],
"metadata": {
"id": "WqQPS1OAsg3H"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n",
"\n",
"pacman::p_load(tidyverse, tidymodels, janitor, corrplot)"
],
"outputs": [],
"metadata": {
"id": "tA4C2WN3skCf",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "c06cd805-5534-4edc-f72b-d0d1dab96ac0"
}
},
{
"cell_type": "markdown",
"source": [
"We'll later load these awesome packages and make them available in our current R session. (This is for mere illustration, `pacman::p_load()` already did that for you)\n",
"\n",
"## 1. A linear regression line\n",
"\n",
"As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a *line* *of* *best fit* to:\n",
"\n",
"- **Show variable relationships**. Show the relationship between variables.\n",
"\n",
"- **Make predictions**. Make accurate predictions about where a new data point would fall in relation to that line.\n",
"\n",
"To draw this type of line, we use a statistical technique called **Least-Squares Regression**. The term `least-squares` means that the vertical distance from each data point to the regression line is squared, and the squares are then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`. As such, the line of best fit is the line that gives us the lowest value for the sum of the squared errors, hence the name *least squares regression*.\n",
"\n",
"We do so because we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them because we are concerned with their magnitude rather than their direction.\n",
"\n",
"> **🧮 Show me the math**\n",
">\n",
"> This line, called the *line of best fit*, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):\n",
">\n",
"> Y = a + bX\n",
">\n",
"> `X` is the '`explanatory variable` or `predictor`'. `Y` is the '`dependent variable` or `outcome`'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.\n",
">\n",
"> Infographic by Jen Looper\n",
">\n",
"> First, calculate the slope `b`.\n",
">\n",
"> In other words, and referring to our pumpkin data's original question: \"predict the price of a pumpkin per bushel by month\", `X` would refer to the price and `Y` would refer to the month of sale.\n",
">\n",
"> Infographic by Jen Looper\n",
">\n",
"> Calculate the value of Y. If you're paying around \\$4, it must be April!\n",
">\n",
"> The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.\n",
">\n",
"> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) web site. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.\n",
"\n",
"Not so scary, right? 🤓\n",
"\n",
"#### Correlation\n",
"\n",
"One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with data points lined up neatly has high correlation, but a plot with data points scattered everywhere between X and Y has low correlation.\n",
"\n",
"A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.\n"
],
"metadata": {
"id": "cdX5FRpvsoP5"
}
},
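{
"cell_type": "markdown",
"source": [
"To make the least-squares idea concrete, here is a minimal sketch in base R. The `x` and `y` values are made up for illustration (they are not taken from the pumpkin data), and the variable names are our own. It computes the slope `b` and intercept `a` from their textbook formulas, then checks them against R's built-in `lm()` and reports the correlation coefficient with `cor()`.\n",
"\n",
"```r\n",
"# Toy data, made up for illustration only\n",
"x <- c(1, 2, 3, 4, 5)\n",
"y <- c(2.1, 3.9, 6.2, 7.8, 10.1)\n",
"\n",
"# Slope: how x and y vary together, divided by how much x varies\n",
"b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)\n",
"\n",
"# Intercept: the fitted line passes through (mean(x), mean(y))\n",
"a <- mean(y) - b * mean(x)\n",
"\n",
"# The quantity least squares minimizes: the sum of squared errors\n",
"sse <- sum((y - (a + b * x))^2)\n",
"\n",
"# Hand-computed values next to the coefficients lm() estimates\n",
"c(intercept = a, slope = b, sse = sse)\n",
"coef(lm(y ~ x))\n",
"\n",
"# Correlation coefficient between x and y\n",
"cor(x, y)\n",
"```\n",
"\n",
"The hand-computed `a` and `b` should agree with `coef(lm(y ~ x))` up to floating-point rounding, which is a quick way to convince yourself that `lm()` performs ordinary least squares here."
],
"metadata": {}
},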
{
"cell_type": "markdown",
"source": [
"## **2. A dance with data: creating a data frame that will be used for modelling**\n",
"\n",
"<p >\n",
" <img src=\"../../images/janitor.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "WdUKXk7Bs8-V"
}
},
{
"cell_type": "markdown",
"source": [
"Load the required libraries and the dataset. Convert the data to a data frame containing a subset of the data:\n",
"\n",
"- Only get pumpkins priced by the bushel\n",
"\n",
"- Convert the date to a month\n",
"\n",
"- Calculate the price as the average of the high and low prices\n",
"\n",
"- Convert the price to reflect the pricing by bushel quantity\n",
"\n",
"> We covered these steps in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2-R.ipynb).\n"
],
"metadata": {
"id": "fMCtu2G2s-p8"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"library(lubridate)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"# Print the first 5 rows of the data set\n",
"pumpkins %>%\n",
"  slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "ryMVZEEPtERn"
}
},
{
"cell_type": "markdown",
"source": [
"In the spirit of sheer adventure, let's explore the [`janitor package`](https://github.com/sfirke/janitor), which provides simple functions for examining and cleaning dirty data. For instance, let's take a look at the column names of our data:\n"
],
"metadata": {
"id": "xcNxM70EtJjb"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Return column names\n",
"pumpkins %>%\n",
"  names()"
],
"outputs": [],
"metadata": {
"id": "5XtpaIigtPfW"
}
},
{
"cell_type": "markdown",
"source": [
"🤔 We can do better. Let's make these column names `friendR` by converting them to the [snake_case](https://en.wikipedia.org/wiki/Snake_case) convention using `janitor::clean_names`. To find out more about this function: `?clean_names`\n"
],
"metadata": {
"id": "IbIqrMINtSHe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Clean names to the snake_case convention\n",
"pumpkins <- pumpkins %>%\n",
"  clean_names(case = \"snake\")\n",
"\n",
"# Return column names\n",
"pumpkins %>%\n",
"  names()"
],
"outputs": [],
"metadata": {
"id": "a2uYvclYtWvX"
}
},
{
"cell_type": "markdown",
"source": [
"Much tidyR 🧹! Now, a dance with the data using `dplyr`, as in the previous lesson! 💃\n"
],
"metadata": {
"id": "HfhnuzDDtaDd"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>%\n",
"  select(variety, city_name, package, low_price, high_price, date)\n",
"\n",
"# Extract the month from the dates to a new column\n",
"pumpkins <- pumpkins %>%\n",
"  mutate(date = mdy(date),\n",
"         month = month(date)) %>%\n",
"  select(-date)\n",
"\n",
"# Create a new column for the average price\n",
"pumpkins <- pumpkins %>%\n",
"  mutate(price = (low_price + high_price) / 2)\n",
"\n",
"# Retain only pumpkins with the string \"bushel\"\n",
"new_pumpkins <- pumpkins %>%\n",
"  filter(str_detect(string = package, pattern = \"bushel\"))\n",
"\n",
"# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel\n",
"new_pumpkins <- new_pumpkins %>%\n",
"  mutate(price = case_when(\n",
"    str_detect(package, \"1 1/9\") ~ price / (1 + 1/9),\n",
"    str_detect(package, \"1/2\") ~ price * 2,\n",
"    TRUE ~ price))\n",
"\n",
"# Relocate column positions\n",
"new_pumpkins <- new_pumpkins %>%\n",
"  relocate(month, .before = variety)\n",
"\n",
"# Display the first 5 rows\n",
"new_pumpkins %>%\n",
"  slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "X0wU3gQvtd9f"
}
},
{
"cell_type": "markdown",
"source": [
"Good job!👌 You now have a clean, tidy data set on which you can build your new regression model!\n",
"\n",
"Mind a scatter plot?\n"
],
"metadata": {
"id": "UpaIwaxqth82"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set theme\n",
"theme_set(theme_light())\n",
"\n",
"# Make a scatter plot of month and price\n",
"new_pumpkins %>%\n",
"  ggplot(mapping = aes(x = month, y = price)) +\n",
"  geom_point(size = 1.6)\n"
],
"outputs": [],
"metadata": {
"id": "DXgU-j37tl5K"
}
},
{
"cell_type": "markdown",
"source": [
"The scatter plot reminds us that we only have data for the months from August through December. We probably need more data to be able to draw conclusions in a linear fashion.\n",
"\n",
"Let's take a look at our modelling data again:\n"
],
"metadata": {
"id": "Ve64wVbwtobI"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Display first 5 rows\n",
"new_pumpkins %>%\n",
"  slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "HFQX2ng1tuSJ"
}
},
{
"cell_type": "markdown",
"source": [
"What if we wanted to predict the `price` of a pumpkin based on the `city_name` or `package` columns, which are of type character? Or even more simply, how could we find the correlation (which requires both of its inputs to be numeric) between, say, `package` and `price`? 🤷🤷\n",
"\n",
"Machine learning models work best with numeric features rather than text values, so you generally need to convert categorical features into numeric representations.\n",
"\n",
"This means that we have to find a way to reformat our predictors to make them easier for a model to use effectively, a process known as `feature engineering`.\n"
],
"metadata": {
"id": "7hsHoxsStyjJ"
}
},
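{
"cell_type": "markdown",
"source": [
"Before bringing in any extra packages, it may help to see what turning categories into numbers looks like. The sketch below uses base R only; the `package` values are illustrative and the variable names are our own. It numbers the distinct values of a character vector, starting from 0.\n",
"\n",
"```r\n",
"# Illustrative values only (a hypothetical character column)\n",
"package <- c(\"bushel cartons\", \"1/2 bushel cartons\",\n",
"             \"bushel cartons\", \"1 1/9 bushel cartons\")\n",
"\n",
"# factor() collects the sorted distinct values as levels;\n",
"# as.integer() returns each value's 1-based level position\n",
"package_int <- as.integer(factor(package)) - 1\n",
"\n",
"# Original text next to its integer code\n",
"data.frame(package, package_int)\n",
"```\n",
"\n",
"`factor()` orders the distinct values alphabetically (the exact order can depend on your locale), so the integers identify a category and nothing more. The next section performs the same kind of encoding with the `recipes` package."
],
"metadata": {}
},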
{
"cell_type": "markdown",
"source": [
"## 3. Preprocessing data for modelling with recipes 👩🍳👨🍳\n",
"\n",
"Activities that reformat predictor values to make them easier for a model to use effectively have been termed `feature engineering`.\n",
"\n",
"Different models have different preprocessing requirements. For instance, least squares requires `encoding categorical variables` such as month, variety and city_name. This simply involves `translating` a column with `categorical values` into one or more `numeric columns` that take the place of the original.\n",
"\n",
"For example, suppose your data includes the following categorical feature:\n",
"\n",
"| city |\n",
"|:-------:|\n",
"| Denver |\n",
"| Nairobi |\n",
"| Tokyo |\n",
"\n",
"You can apply *ordinal encoding* to substitute a unique integer value for each category, like this:\n",
"\n",
"| city |\n",
"|:----:|\n",
"| 0 |\n",
"| 1 |\n",
"| 2 |\n",
"\n",
"And that's what we'll do to our data!\n",
"\n",
"In this section, we'll explore another amazing Tidymodels package: [recipes](https://tidymodels.github.io/recipes/), which is designed to help you preprocess your data **before** training your model. At its core, a recipe is an object that defines what steps should be applied to a data set in order to get it ready for modelling.\n",
"\n",
"Now, let's create a recipe that prepares our data for modelling by substituting a unique integer for every observation in the predictor columns:\n"
],
"metadata": {
"id": "AD5kQbcvt3Xl"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Specify a recipe\n",
"pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>%\n",
"  step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"# Print out the recipe\n",
"pumpkins_recipe"
],
"outputs": [],
"metadata": {
"id": "BNaFKXfRt9TU"
}
},
{
"cell_type": "markdown",
"source": [
"Awesome! 👏 We just created our first recipe. It specifies an outcome (price) and its corresponding predictors, and states that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:\n",
"\n",
"- The call to `recipe()` with a formula tells the recipe the *roles* of the variables, using the `new_pumpkins` data as the reference. For instance, the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.\n",
"\n",
"- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers, with the numbering starting at 0.\n",
"\n",
"We are sure you may be having thoughts such as: \"This is so cool!! But what if I needed to confirm that the recipe is doing exactly what I expect it to do? 🤔\"\n",
"\n",
"That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention, `workflows`, in just a minute), but it can come in handy when you want a quick sanity check that the recipe is doing what you expect.\n",
"\n",
"For that, you'll need two more verbs: `prep()` and `bake()`. As always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you understand this better!\n",
"\n",
"<p >\n",
" <img src=\"../../images/recipes.png\"\n",
" width=\"550\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n"
],
"metadata": {
"id": "KEiO0v7kuC9O"
}
},
{
"cell_type": "markdown",
"source": [
"[`prep()`](https://recipes.tidymodels.org/reference/prep.html): estimates the required parameters from a training set so that they can later be applied to other data sets. For instance, for a given predictor column, it determines which observation is assigned the integer 0, 1, 2 and so on.\n",
"\n",
"[`bake()`](https://recipes.tidymodels.org/reference/bake.html): takes a prepped recipe and applies the operations to any data set.\n",
"\n",
"That said, let's prep and bake our recipe to confirm that, under the hood, the predictor columns are encoded before a model is fit.\n"
],
"metadata": {
"id": "Q1xtzebuuTCP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Prep the recipe\n",
"pumpkins_prep <- prep(pumpkins_recipe)\n",
"\n",
"# Bake the recipe to extract a preprocessed new_pumpkins data\n",
"baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)\n",
"\n",
"# Print out the baked data set\n",
"baked_pumpkins %>%\n",
"  slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "FGBbJbP_uUUn"
}
},
{
"cell_type": "markdown",
"source": [
"Woo-hoo!🥳 The processed data `baked_pumpkins` has all its predictors encoded, confirming that the preprocessing steps defined in our recipe work as expected. This makes it harder for you to read but much more intelligible for Tidymodels! Take some time to find out which observation has been mapped to which integer.\n",
"\n",
"It is also worth mentioning that `baked_pumpkins` is a data frame that we can perform computations on.\n",
"\n",
"For instance, let's try to find a good correlation between two columns of your data to potentially build a good predictive model. We'll use the function `cor()` to do this. Type `?cor` to find out more about the function.\n"
],
"metadata": {
"id": "1dvP0LBUueAW"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the correlation between the city_name and the price\n",
"cor(baked_pumpkins$city_name, baked_pumpkins$price)\n",
"\n",
"# Find the correlation between the package and the price\n",
"cor(baked_pumpkins$package, baked_pumpkins$price)\n"
],
"outputs": [],
"metadata": {
"id": "3bQzXCjFuiSV"
}
},
{
"cell_type": "markdown",
"source": [
"As it turns out, there's only a weak correlation between the city and the price. However, there's a somewhat better correlation between the package and its price. That makes sense, right? Normally, the bigger the produce box, the higher the price.\n",
"\n",
"While we are at it, let's also visualize a correlation matrix of all the columns using the `corrplot` package.\n"
],
"metadata": {
"id": "BToPWbgjuoZw"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the corrplot package\n",
"library(corrplot)\n",
"\n",
"# Obtain correlation matrix\n",
"corr_mat <- cor(baked_pumpkins %>%\n",
"  # Drop columns that are not really informative\n",
"  select(-c(low_price, high_price)))\n",
"\n",
"# Make a correlation plot between the variables\n",
"corrplot(corr_mat, method = \"shade\", shade.col = NA, tl.col = \"black\", tl.srt = 45, addCoef.col = \"black\", cl.pos = \"n\", order = \"original\")"
],
"outputs": [],
"metadata": {
"id": "ZwAL3ksmutVR"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩 Much better.\n",
"\n",
"A good question to now ask of this data is: '`What price can I expect of a given pumpkin package?`' Let's get right into it!\n",
"\n",
"> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how the recipe would preprocess it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**\n",
"\n",
"## 4. Build a linear regression model\n",
"\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n"
],
"metadata": {
"id": "YqXjLuWavNxW"
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have built a recipe and confirmed that the data will be preprocessed appropriately, let's build a regression model to answer the question: `What price can I expect of a given pumpkin package?`\n",
"\n",
"#### Train a linear regression model using the training set\n",
"\n",
"As you may have already figured out, the *price* column is the `outcome` variable while the *package* column is the `predictor` variable.\n",
"\n",
"To do this, we'll first split the data so that 80% goes into a training set and 20% into a test set, then define a recipe that encodes the predictor column into a set of integers, then build a model specification. We won't prep and bake our recipe, since we already know it will preprocess the data as expected.\n"
],
"metadata": {
"id": "Pq0bSzCevW-h"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"set.seed(2056)\n",
"# Split the data into training and test sets\n",
"pumpkins_split <- new_pumpkins %>%\n",
"  initial_split(prop = 0.8)\n",
"\n",
"# Extract training and test data\n",
"pumpkins_train <- training(pumpkins_split)\n",
"pumpkins_test <- testing(pumpkins_split)\n",
"\n",
"# Create a recipe for preprocessing the data\n",
"lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>%\n",
"  step_integer(all_predictors(), zero_based = TRUE)\n",
"\n",
"# Create a linear model specification\n",
"lm_spec <- linear_reg() %>%\n",
"  set_engine(\"lm\") %>%\n",
"  set_mode(\"regression\")"
],
"outputs": [],
"metadata": {
"id": "CyoEh_wuvcLv"
}
},
{
"cell_type": "markdown",
"source": [
"Good job! Now that we have a recipe and a model specification, we need a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities. How's that for your peace of mind!🤩\n",
"\n",
"In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and it holds your modeling components! This is what we'd call *pipelines* in *Python*.\n",
"\n",
"So let's bundle everything up into a workflow!📦\n"
],
"metadata": {
"id": "G3zF_3DqviFJ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Hold modelling components in a workflow\n",
"lm_wf <- workflow() %>%\n",
"  add_recipe(lm_pumpkins_recipe) %>%\n",
"  add_model(lm_spec)\n",
"\n",
"# Print out the workflow\n",
"lm_wf"
],
"outputs": [],
"metadata": {
"id": "T3olroU3v-WX"
}
},
{
"cell_type": "markdown",
"source": [
"In addition, a workflow can be fit/trained in much the same way a model can.\n"
],
"metadata": {
"id": "zd1A5tgOwEPX"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Train the model\n",
"lm_wf_fit <- lm_wf %>%\n",
"  fit(data = pumpkins_train)\n",
"\n",
"# Print the model coefficients learned\n",
"lm_wf_fit"
],
"outputs": [],
"metadata": {
"id": "NhJagFumwFHf"
}
},
{
"cell_type": "markdown",
"source": [
"From the model output, we can see the coefficients learned during training. They are the coefficients of the line of best fit, the line that gives us the lowest overall error between the actual and predicted values.\n",
"\n",
"#### Evaluate model performance using the test set\n",
"\n",
"It's time to see how the model performed 📏! How do we do this?\n",
"\n",
"Now that we've trained the model, we can use it to make predictions for the test set (`pumpkins_test`) using `predict()`. Then we can compare these predictions to the actual label values to evaluate how well (or not!) the model is working.\n",
"\n",
"Let's start by making predictions for the test set, then bind the resulting column to the test set.\n"
],
"metadata": {
"id": "_4QkGtBTwItF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make predictions for the test set\n",
"predictions <- lm_wf_fit %>%\n",
"  predict(new_data = pumpkins_test)\n",
"\n",
"# Bind predictions to the test set\n",
"lm_results <- pumpkins_test %>%\n",
"  select(c(package, price)) %>%\n",
"  bind_cols(predictions)\n",
"\n",
"# Print the first ten rows of the tibble\n",
"lm_results %>%\n",
"  slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "UFZzTG0gwTs9"
}
},
{
"cell_type": "markdown",
"source": [
"Yes, you have just trained a model and used it to make predictions!🔮 Is it any good? Let's evaluate the model's performance!\n",
"\n",
"In Tidymodels, we do this using `yardstick::metrics()`! For linear regression, let's focus on the following metrics:\n",
"\n",
"- `Root Mean Square Error (RMSE)`: The square root of the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error). This yields an absolute metric in the same unit as the label (in this case, the price of a pumpkin). The smaller the value, the better the model (in a simplistic sense, it represents the average price by which the predictions are wrong!)\n",
"\n",
"- `Coefficient of Determination (usually known as R-squared or R2)`: A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance in the actual label values the model is able to explain.\n"
],
"metadata": {
"id": "0A5MjzM7wW9M"
}
},
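{
"cell_type": "markdown",
"source": [
"As a rough illustration of what these two metrics measure, the sketch below computes them by hand in base R. The `truth` and `estimate` vectors are made-up numbers, not output from our model, and the variable names are our own. R-squared is computed here as the squared correlation between actual and predicted values, which, as far as we know, is also how yardstick's `rsq()` defines it.\n",
"\n",
"```r\n",
"# Made-up actual and predicted prices, for illustration only\n",
"truth    <- c(15, 18, 24, 30, 13)\n",
"estimate <- c(16, 17, 26, 27, 15)\n",
"\n",
"# RMSE: square root of the mean squared difference,\n",
"# expressed in the same unit as the label\n",
"rmse <- sqrt(mean((truth - estimate)^2))\n",
"\n",
"# R-squared as the squared correlation between actual and predicted values\n",
"rsq <- cor(truth, estimate)^2\n",
"\n",
"c(rmse = rmse, rsq = rsq)\n",
"```\n",
"\n",
"On real results you would not do this by hand, since `metrics()` reports both values for you."
],
"metadata": {}
},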
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Evaluate performance of linear regression\n",
"metrics(data = lm_results,\n",
"        truth = price,\n",
"        estimate = .pred)"
],
"outputs": [],
"metadata": {
"id": "reJ0UIhQwcEH"
}
},
{
"cell_type": "markdown",
"source": [
"There goes the model performance. Let's see if we can get a better indication by drawing a scatter plot of package against price, then using the predictions to overlay a line of best fit.\n",
"\n",
"This means we'll have to prep and bake the test set in order to encode the package column, then bind it to the predictions made by our model.\n"
],
"metadata": {
"id": "fdgjzjkBwfWt"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Encode package column\n",
"package_encode <- lm_pumpkins_recipe %>%\n",
"  prep() %>%\n",
"  bake(new_data = pumpkins_test) %>%\n",
"  select(package)\n",
"\n",
"# Bind encoded package column to the results\n",
"lm_results <- lm_results %>%\n",
"  bind_cols(package_encode %>%\n",
"              rename(package_integer = package)) %>%\n",
"  relocate(package_integer, .after = package)\n",
"\n",
"# Print new results data frame\n",
"lm_results %>%\n",
"  slice_head(n = 5)\n",
"\n",
"# Make a scatter plot\n",
"lm_results %>%\n",
"  ggplot(mapping = aes(x = package_integer, y = price)) +\n",
"  geom_point(size = 1.6) +\n",
"  # Overlay a line of best fit\n",
"  geom_line(aes(y = .pred), color = \"orange\", size = 1.2) +\n",
"  xlab(\"package\")\n"
],
"outputs": [],
"metadata": {
"id": "R0nw719lwkHE"
}
},
{
"cell_type": "markdown",
"source": [
"Great! As you can see, the linear regression model does not really generalize the relationship between a package and its corresponding price well.\n",
"\n",
"🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!\n",
"\n",
"## 5. Build a polynomial regression model\n",
"\n",
"<p >\n",
" <img src=\"../../images/linear-polynomial.png\"\n",
" width=\"800\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n"
],
"metadata": {
"id": "HOCqJXLTwtWI"
}
},
|
|
{
"cell_type": "markdown",
"source": [
"Sometimes our data does not have a linear relationship, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex, non-linear relationships.\n",
"\n",
"Take, for instance, the relationship between package and price in our pumpkin data set. Sometimes there is a linear relationship between variables (the bigger the pumpkin by volume, the higher the price), but sometimes the relationship cannot be plotted as a plane or a straight line.\n",
"\n",
"> ✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression\n",
">\n",
"> Take another look at the relationship between Variety and Price in the previous plot. Does this scatter plot look as though it should necessarily be analyzed with a straight line? Perhaps not. In this case, you can try polynomial regression.\n",
">\n",
"> ✅ Polynomials are mathematical expressions that may consist of one or more variables and coefficients\n",
"\n",
"#### Train a polynomial regression model using the training set\n",
"\n",
"Polynomial regression creates a *curved line* to better fit non-linear data.\n",
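"\n",
"To make the idea of a curved line concrete, here is a sketch of a degree-4 polynomial in a single predictor. The symbols are generic notation introduced for illustration, not names from this lesson's code: $x$ stands for the integer-encoded package, $y$ for the price, $\\beta_0, \\dots, \\beta_4$ for the coefficients to be estimated, and $\\varepsilon$ for the error term.\n",
"\n",
"$$y = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\beta_3 x^3 + \\beta_4 x^4 + \\varepsilon$$\n",
"\n",
"The curve bends because of the powers of $x$, yet the equation is still linear in the coefficients, which is why a linear regression engine can fit it once the polynomial terms are supplied as extra columns.\n",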
"\n",
"Let's see whether a polynomial model makes better predictions. We'll follow a procedure similar to the one we used before:\n",
"\n",
"- Create a recipe that specifies the preprocessing steps needed to get our data ready for modelling, i.e. encoding predictors and computing polynomials of degree *n*\n",
"\n",
"- Build a model specification\n",
"\n",
"- Bundle the recipe and model specification into a workflow\n",
"\n",
"- Create a model by fitting the workflow\n",
"\n",
"- Evaluate how well the model performs on the test data\n",
"\n",
"Let's get right into it!\n"
],
"metadata": {
"id": "VcEIpRV9wzYr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Specify a recipe\n",
"poly_pumpkins_recipe <-\n",
"  recipe(price ~ package, data = pumpkins_train) %>%\n",
"  step_integer(all_predictors(), zero_based = TRUE) %>% \n",
"  step_poly(all_predictors(), degree = 4)\n",
"\n",
"\n",
"# Create a model specification\n",
"poly_spec <- linear_reg() %>% \n",
"  set_engine(\"lm\") %>% \n",
"  set_mode(\"regression\")\n",
"\n",
"\n",
"# Bundle recipe and model spec into a workflow\n",
"poly_wf <- workflow() %>% \n",
"  add_recipe(poly_pumpkins_recipe) %>% \n",
"  add_model(poly_spec)\n",
"\n",
"\n",
"# Create a model\n",
"poly_wf_fit <- poly_wf %>% \n",
"  fit(data = pumpkins_train)\n",
"\n",
"\n",
"# Print learned model coefficients\n",
"poly_wf_fit\n"
],
"outputs": [],
"metadata": {
"id": "63n_YyRXw3CC"
}
},
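{
"cell_type": "markdown",
"source": [
"`step_poly()` in the recipe above expands the encoded predictor into polynomial basis columns, which the recipes documentation describes as orthogonal polynomials. As a minimal sketch of what such a basis looks like, the snippet below calls base R's `stats::poly()` directly. The vector `x` and the name `basis` are hypothetical stand-ins, not objects from this lesson.\n",
"\n",
"```r\n",
"# Hypothetical zero-based integer codes, similar in spirit to step_integer() output\n",
"x <- 0:9\n",
"\n",
"# Orthogonal polynomial basis of degree 4: one column per degree\n",
"basis <- poly(x, degree = 4)\n",
"dim(basis)\n",
"\n",
"# Cross-products of distinct columns are numerically zero,\n",
"# which is what orthogonal means here\n",
"round(crossprod(basis), 10)\n",
"```\n",
"\n",
"Because the basis columns are orthogonal rather than raw powers of `x`, the coefficients printed for the fitted workflow should not be read as multipliers of `x`, `x^2`, and so on. The fitted curve is still a degree-4 polynomial in `x`.\n"
],
"metadata": {}
},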
{
"cell_type": "markdown",
"source": [
"#### Evaluate model performance\n",
"\n",
"👏👏 You've built a polynomial model, so let's make predictions on the test set!\n"
],
"metadata": {
"id": "-LHZtztSxDP0"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make price predictions on test data\n",
"poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>% \n",
"  bind_cols(pumpkins_test %>% select(c(package, price))) %>% \n",
"  relocate(.pred, .after = last_col())\n",
"\n",
"\n",
"# Print the results\n",
"poly_results %>% \n",
"  slice_head(n = 10)"
],
"outputs": [],
"metadata": {
"id": "YUFpQ_dKxJGx"
}
},
{
"cell_type": "markdown",
"source": [
"Woo-hoo, let's evaluate how the model performed on the test set using `yardstick::metrics()`.\n"
],
"metadata": {
"id": "qxdyj86bxNGZ"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"metrics(data = poly_results, truth = price, estimate = .pred)"
],
"outputs": [],
"metadata": {
"id": "8AW5ltkBxXDm"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩 Much better performance.\n",
"\n",
"The `rmse` decreased from about 7 to about 3, indicating a smaller error between the actual price and the predicted price. You can *loosely* interpret this as meaning that, on average, incorrect predictions are off by around \\$3. The `rsq` increased from about 0.4 to 0.8.\n",
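"\n",
"As a minimal sketch of what these two metrics measure, the snippet below recomputes them by hand in base R. The `truth` and `estimate` vectors are made-up toy values, not results from this lesson, and the `rsq` line assumes the squared-correlation definition that the yardstick documentation gives for `rsq()`.\n",
"\n",
"```r\n",
"# Toy values invented for illustration (not taken from the pumpkin data)\n",
"truth    <- c(15, 18, 24, 30, 34)\n",
"estimate <- c(14, 20, 23, 27, 35)\n",
"\n",
"# Root mean squared error: typical size of a prediction error, in price units\n",
"rmse_manual <- sqrt(mean((truth - estimate)^2))\n",
"\n",
"# R-squared computed as the squared correlation between truth and estimate\n",
"rsq_manual <- cor(truth, estimate)^2\n",
"\n",
"c(rmse = rmse_manual, rsq = rsq_manual)\n",
"```\n",
"\n",
"A lower `rmse` and an `rsq` closer to 1 both point to predictions that track the actual prices more closely.\n",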
"\n",
"All of these metrics indicate that the polynomial model performs better than the linear model. Good job!\n",
"\n",
"Let's see if we can visualize this!\n"
],
"metadata": {
"id": "6gLHNZDwxYaS"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Bind encoded package column to the results\n",
"poly_results <- poly_results %>% \n",
"  bind_cols(package_encode %>% \n",
"              rename(package_integer = package)) %>% \n",
"  relocate(package_integer, .after = package)\n",
"\n",
"\n",
"# Print new results data frame\n",
"poly_results %>% \n",
"  slice_head(n = 5)\n",
"\n",
"\n",
"# Make a scatter plot\n",
"poly_results %>% \n",
"  ggplot(mapping = aes(x = package_integer, y = price)) +\n",
"  geom_point(size = 1.6) +\n",
"  # Overlay a line of best fit\n",
"  geom_line(aes(y = .pred), color = \"midnightblue\", size = 1.2) +\n",
"  xlab(\"package\")\n"
],
"outputs": [],
"metadata": {
"id": "A83U16frxdF1"
}
},
{
"cell_type": "markdown",
"source": [
"You can see a curved line that fits your data better! 🤩\n",
"\n",
"You can make the curve smoother by passing a polynomial formula to `geom_smooth`, like this:\n"
],
"metadata": {
"id": "4U-7aHOVxlGU"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a scatter plot\n",
"poly_results %>% \n",
"  ggplot(mapping = aes(x = package_integer, y = price)) +\n",
"  geom_point(size = 1.6) +\n",
"  # Overlay a line of best fit\n",
"  geom_smooth(method = lm, formula = y ~ poly(x, degree = 4), color = \"midnightblue\", size = 1.2, se = FALSE) +\n",
"  xlab(\"package\")"
],
"outputs": [],
"metadata": {
"id": "5vzNT0Uexm-w"
}
},
{
"cell_type": "markdown",
"source": [
"Much like a smooth curve! 🤩\n",
"\n",
"Here's how you would make a new prediction:\n"
],
"metadata": {
"id": "v9u-wwyLxq4G"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make a hypothetical data frame\n",
"hypo_tibble <- tibble(package = \"bushel baskets\")\n",
"\n",
"# Make predictions using linear model\n",
"lm_pred <- lm_wf_fit %>% predict(new_data = hypo_tibble)\n",
"\n",
"# Make predictions using polynomial model\n",
"poly_pred <- poly_wf_fit %>% predict(new_data = hypo_tibble)\n",
"\n",
"# Return predictions in a list\n",
"list(\"linear model prediction\" = lm_pred, \n",
"     \"polynomial model prediction\" = poly_pred)\n"
],
"outputs": [],
"metadata": {
"id": "jRPSyfQGxuQv"
}
},
{
"cell_type": "markdown",
"source": [
"The `polynomial model` prediction does make sense, given the scatter plots of `price` and `package`! And if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!\n",
"\n",
"🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.\n",
"\n",
"## **🚀Challenge**\n",
"\n",
"Test several different variables in this notebook to see how correlation corresponds to model accuracy.\n",
"\n",
"## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/14/)\n",
"\n",
"## **Review & Self Study**\n",
"\n",
"In this lesson we learned about Linear Regression. There are other important types of Regression. Read about the Stepwise, Ridge, Lasso and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).\n",
"\n",
"If you want to learn more about how to use the amazing Tidymodels framework, please check out the following resources:\n",
"\n",
"- Tidymodels website: [Get started with Tidymodels](https://www.tidymodels.org/start/)\n",
"\n",
"- Max Kuhn and Julia Silge, [*Tidy Modeling with R*](https://www.tmwr.org/)*.*\n",
"\n",
"###### **THANK YOU TO:**\n",
"\n",
"[Allison Horst](https://twitter.com/allison_horst?lang=en) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).\n"
],
"metadata": {
"id": "8zOLOWqMxzk5"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}
]
}