{
"nbformat": 4,
"nbformat_minor": 2,
"metadata": {
"colab": {
"name": "lesson_2-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
},
"coopTranslator": {
"original_hash": "f3c335f9940cfd76528b3ef918b9b342",
"translation_date": "2025-09-06T13:52:52+00:00",
"source_file": "2-Regression/2-Data/solution/R/lesson_2-R.ipynb",
"language_code": "sw"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start building machine learning models with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it is very important to understand how to ask the right question so that you can properly unlock the potential of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine which type of ML algorithm you use. And the quality of the answer you get back will depend heavily on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"\n",
"<p >\n",
" <img src=\"../../images/unruly_data.jpg\"\n",
" width=\"700\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--<br>Artwork by \\@allison_horst-->\n"
],
"metadata": {
"id": "Pg5aexcOPqAZ"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Importing the pumpkins data and using the Tidyverse\n",
"\n",
"We will need the following packages to work through this lesson:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more fun!\n",
"\n",
"You can install them with:\n",
"\n",
"`install.packages(c(\"tidyverse\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you if any are missing.\n"
],
"metadata": {
"id": "dc5WhyVdXAjR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\n",
"pacman::p_load(tidyverse)"
],
"outputs": [],
"metadata": {
"id": "GqPYUZgfXOBt"
}
},
{
"cell_type": "markdown",
"source": [
"Now, let's load the packages and read in the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!\n"
],
"metadata": {
"id": "kvjDTPDSXRr2"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n = 50)"
],
"outputs": [],
"metadata": {
"id": "VMri-t2zXqgD"
}
},
{
"cell_type": "markdown",
"source": [
"A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). `Date` is of type character, and there is also a strange column called `Package` where the data is a mix of `sacks`, `bins`, and other values. The data, in fact, is a bit of a mess 😤.\n",
"\n",
"In practice, it is not very common to be handed a dataset that is completely ready for building an ML model out of the box. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques for visualizing the data. 📈📊\n",
"<br>\n",
"\n",
"> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n"
],
"metadata": {
"id": "REWcIv9yX29v"
}
},
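{
"cell_type": "markdown",
"source": [
"As a small illustration of the pipe, the sketch below computes the same value twice: once with nested calls and once with `%>%`. The vector `x` and the names `nested` and `piped` are made up for this example and are not part of the pumpkins data.\n",
"\n",
"```r\n",
"library(magrittr)  # provides %>% (also attached when you load the tidyverse)\n",
"\n",
"# Hypothetical toy vector, not from the lesson data\n",
"x <- c(4, 9, 16)\n",
"\n",
"# Nested calls read inside-out\n",
"nested <- round(mean(sqrt(x)), 1)\n",
"\n",
"# The pipe reads left to right: take x, and then sqrt, and then mean, and then round\n",
"piped <- x %>% sqrt() %>% mean() %>% round(1)\n",
"\n",
"identical(nested, piped)\n",
"```\n",
"\n",
"Both forms compute the same result; the piped version simply lists the steps in the order they happen.\n"
],
"metadata": {}
},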
{
"cell_type": "markdown",
"source": [
"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists have to deal with is incomplete or missing data. R represents missing, or unknown, values with a special sentinel value: `NA` (Not Available).\n",
"\n",
"So how would we know that the data frame contains missing values?\n",
"<br>\n",
"- One straightforward way is to use the base R function `anyNA`, which returns the logical value `TRUE` or `FALSE`.\n"
],
"metadata": {
"id": "Zxfb3AM5YbUe"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"outputs": [],
"metadata": {
"id": "G--DQutAYltj"
}
},
{
"cell_type": "markdown",
"source": [
"It looks like some data is missing! That's a good place to start.\n",
"\n",
"- Another way is to use the function `is.na()`, which marks each missing element with a logical `TRUE`.\n"
],
"metadata": {
"id": "mU-7-SB6YokF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "W-DxDOR4YxSW"
}
},
{
"cell_type": "markdown",
"source": [
"Okay, that got the job done, but with a data frame this large it would be inefficient and practically impossible to review every row and column individually😴.\n",
"\n",
"- A more intuitive way is to calculate the total number of missing values for each column:\n"
],
"metadata": {
"id": "xUWxipKYY0o7"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"outputs": [],
"metadata": {
"id": "ZRBWV6P9ZArL"
}
},
{
"cell_type": "markdown",
"source": [
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
"\n",
"> Along with its great sets of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.\n"
],
"metadata": {
"id": "9gv-crB6ZD1Y"
}
},
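{
"cell_type": "markdown",
"source": [
"To make the three checks concrete, here is a minimal sketch on a tiny made-up data frame. The object `toy` and its values are assumptions for illustration only; the calls are the same base R functions used above.\n",
"\n",
"```r\n",
"# Hypothetical toy data frame with a few missing values\n",
"toy <- data.frame(\n",
"  Package = c('bins', NA, 'sacks'),\n",
"  Price = c(15, NA, NA)\n",
")\n",
"\n",
"# Is anything missing at all? Returns a single logical value\n",
"any_missing <- anyNA(toy)\n",
"\n",
"# is.na() gives a logical matrix; colSums() counts the TRUEs per column\n",
"na_counts <- colSums(is.na(toy))\n",
"\n",
"any_missing\n",
"na_counts\n",
"```\n",
"\n",
"Counting per column scales to wide data frames, whereas scanning the logical matrix by eye does not.\n"
],
"metadata": {}
},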
{
"cell_type": "markdown",
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"<p >\n",
" <img src=\"../../images/dplyr_wrangling.png\"\n",
" width=\"569\"/>\n",
" <figcaption>Artwork by @allison_horst</figcaption>\n",
"\n",
"\n",
"<!--<br/>Artwork by \\@allison_horst-->\n"
],
"metadata": {
"id": "o4jLY5-VZO2C"
}
},
{
"cell_type": "markdown",
"source": [
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs to help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n"
],
"metadata": {
"id": "i5o33MQBZWWw"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::select()\n",
"\n",
"`select()` is a function in the `dplyr` package that helps you pick columns to keep or exclude.\n",
"\n",
"To make your data frame easier to work with, drop several of its columns using `select()`, keeping only the columns you need.\n",
"\n",
"For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns.\n"
],
"metadata": {
"id": "x3VGMAGBZiUr"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "F_FgxQnVZnM0"
}
},
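{
"cell_type": "markdown",
"source": [
"`select()` can name the columns to keep or, with a minus sign, the columns to drop. The sketch below shows both forms on a one-row made-up tibble; `toy`, `kept` and `dropped` are hypothetical names and the values are invented.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical one-row tibble; backticks are needed for names that contain spaces\n",
"toy <- tibble(\n",
"  City = 'BALTIMORE',\n",
"  Package = '24 inch bins',\n",
"  `Low Price` = 270,\n",
"  `High Price` = 280\n",
")\n",
"\n",
"# Keep columns by naming them\n",
"kept <- toy %>% select(Package, `Low Price`, `High Price`)\n",
"\n",
"# Or drop the ones you do not want\n",
"dropped <- toy %>% select(-City)\n",
"\n",
"names(kept)\n",
"names(dropped)\n",
"```\n",
"\n",
"Here both calls leave the same three columns; which form is more convenient usually depends on whether you are keeping few columns or removing few.\n"
],
"metadata": {}
},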
{
"cell_type": "markdown",
"source": [
"#### dplyr::mutate()\n",
"\n",
"`mutate()` is a function in the `dplyr` package that helps you create or modify columns while keeping the existing columns.\n",
"\n",
"The general structure of `mutate` is:\n",
"\n",
"`data %>% mutate(new_column_name = what_it_contains)`\n",
"\n",
"Let's take `mutate` out for a spin using the `Date` column by doing the following operations:\n",
"\n",
"1. Convert the dates (currently of type character) to date objects (these are US dates, so the format is `MM/DD/YYYY`).\n",
"\n",
"2. Extract the month from the dates into a new column.\n",
"\n",
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()` and `lubridate::month()` and see how to achieve these objectives. We can drop the Date column since we won't need it again in subsequent operations.\n"
],
"metadata": {
"id": "2KKo0Ed9Z1VB"
}
},
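{
"cell_type": "markdown",
"source": [
"Before applying this to the real column, here is a minimal sketch of the two lubridate calls on made-up strings. The vector `dates_chr` is an assumption for illustration, not data from the lesson.\n",
"\n",
"```r\n",
"library(lubridate)\n",
"\n",
"# Hypothetical US-style date strings (MM/DD/YYYY)\n",
"dates_chr <- c('9/24/2016', '10/1/2016', '12/31/2017')\n",
"\n",
"# mdy() parses month-day-year text into Date objects\n",
"dates <- mdy(dates_chr)\n",
"\n",
"# month() extracts the month number from each Date\n",
"months <- month(dates)\n",
"\n",
"class(dates)\n",
"months\n",
"```\n",
"\n",
"Parsing first matters: `month()` needs a date it can interpret, and `mdy()` states explicitly that the first number is the month rather than the day.\n"
],
"metadata": {}
},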
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "5joszIVSZ6xe"
}
},
{
"cell_type": "markdown",
"source": [
"Woohoo! 🤩\n",
"\n",
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. Take the average of the `Low Price` and `High Price` columns to populate the new Price column.\n"
],
"metadata": {
"id": "nIgLjNMCZ-6Y"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "Zo0BsqqtaJw2"
}
},
{
"cell_type": "markdown",
"source": [
"Yes!💪\n",
"\n",
"\"But wait!\", you'll say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\"🤔\n",
"\n",
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes of varying widths.\n",
"\n",
"Let's verify this:\n"
],
"metadata": {
"id": "p77WZr-9aQAR"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"outputs": [],
"metadata": {
"id": "XISGfh0IaUy6"
}
},
{
"cell_type": "markdown",
"source": [
"Amazing!👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put the result in a new data frame, `new_pumpkins`.\n"
],
"metadata": {
"id": "7sMjiVujaZxY"
}
},
{
"cell_type": "markdown",
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data containing only the **rows** that satisfy your conditions, in this case, pumpkins with the string *bushel* in the `Package` column.\n",
"\n",
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.\n"
],
"metadata": {
"id": "L8Qfcs92ageF"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"outputs": [],
"metadata": {
"id": "hy_SGYREampd"
}
},
{
"cell_type": "markdown",
"source": [
"You can see that we have narrowed the data down to 415 or so rows of pumpkins sold by the bushel.🤩\n",
"<br>\n"
],
"metadata": {
"id": "VrDwF031avlR"
}
},
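{
"cell_type": "markdown",
"source": [
"To see how the two functions cooperate, the sketch below runs them on a four-row made-up tibble. `toy` and `bushels_only` are hypothetical names and the prices are invented.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical rows mixing bushel and non-bushel packaging\n",
"toy <- tibble(\n",
"  Package = c('1 1/9 bushel cartons', '36 inch bins', '1/2 bushel cartons', 'each'),\n",
"  Price = c(15, 160, 18, 3)\n",
")\n",
"\n",
"# str_detect() returns one logical value per row\n",
"str_detect(toy$Package, 'bushel')\n",
"\n",
"# filter() keeps only the rows where that value is TRUE\n",
"bushels_only <- toy %>%\n",
"  filter(str_detect(Package, 'bushel'))\n",
"\n",
"bushels_only\n",
"```\n",
"\n",
"Only the two rows whose `Package` text contains *bushel* remain; the bins and per-pumpkin rows are dropped.\n"
],
"metadata": {}
},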
{
"cell_type": "markdown",
"source": [
"#### dplyr::case_when()\n",
"\n",
"**But wait! There's one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the price per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()` statements.\n"
],
"metadata": {
"id": "mLpw2jH4a0tx"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"outputs": [],
"metadata": {
"id": "P68kLVQmbM6I"
}
},
{
"cell_type": "markdown",
"source": [
"Now, we can analyze the pricing per unit based on the bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it is a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with a bushel-to-pound conversion, and instead price by the bushel.\n",
"\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are much pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken up by one big hollow pie pumpkin.\n"
],
"metadata": {
"id": "pS2GNPagbSdb"
}
},
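{
"cell_type": "markdown",
"source": [
"The normalization is plain division by the package size: a price quoted for 1 1/9 bushels is divided by 10/9, and a price quoted for 1/2 bushel is divided by 1/2 (that is, doubled). A worked sketch with invented prices follows; `toy` and `per_bushel` are hypothetical names.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"library(stringr)\n",
"\n",
"# Hypothetical prices, chosen so the arithmetic is easy to check by hand\n",
"toy <- tibble(\n",
"  Package = c('1 1/9 bushel cartons', '1/2 bushel cartons', 'bushel baskets'),\n",
"  Price = c(20, 9, 12)\n",
")\n",
"\n",
"per_bushel <- toy %>%\n",
"  mutate(Price = case_when(\n",
"    str_detect(Package, '1 1/9') ~ Price / (1 + 1/9),  # 20 / (10/9) = 18\n",
"    str_detect(Package, '1/2') ~ Price / (1/2),        # 9 / 0.5 = 18\n",
"    TRUE ~ Price                                       # already priced per bushel\n",
"  ))\n",
"\n",
"per_bushel\n",
"```\n",
"\n",
"`case_when()` evaluates its conditions in order and uses the first one that matches, so the final `TRUE ~ Price` line acts as the default for rows that match neither fraction.\n"
],
"metadata": {}
},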
{
"cell_type": "markdown",
"source": [
"Finally, just for fun 💁‍♀️, let's also move the Month column to the first position, i.e. `before` the `Package` column.\n",
"\n",
"`dplyr::relocate()` is used to change column positions.\n"
],
"metadata": {
"id": "qql1SowfbdnP"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Move the Month column so that it comes before Package\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"# View the first few rows\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"outputs": [],
"metadata": {
"id": "JJ1x6kw8bixF"
}
},
{
"cell_type": "markdown",
"source": [
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n",
"<br>\n"
],
"metadata": {
"id": "y8TJ0Za_bn5Y"
}
},
{
"cell_type": "markdown",
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"<p >\n",
" <img src=\"../../images/data-visualization.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"\n",
"\n",
"<!--{width=\"600\"}-->\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations (plots, graphs, and charts) that show different aspects of the data. In this way, they can visually show relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatter plot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"In this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and an aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
],
"metadata": {
"id": "mYSH6-EtbvNa"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Set a theme for the plots\n",
"theme_set(theme_light())\n",
"\n",
"# Create a scatter plot\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
"p + geom_point()"
],
"outputs": [],
"metadata": {
"id": "g2YjnGeOcLo4"
}
},
{
"cell_type": "markdown",
"source": [
"Is this a useful plot 🤷? Does anything about it surprise you?\n",
"\n",
"It's not particularly useful, since all it does is display your data as a spread of points in a given month. \n",
"<br>\n"
],
"metadata": {
"id": "Ml7SDCLQcPvE"
}
},
{
"cell_type": "markdown",
"source": [
"### **How do we make it useful?**\n",
"\n",
"To get charts to display useful data, you usually need to group the data somehow. For instance, in our case, finding the average price of pumpkins for each month would provide more insight into the underlying patterns in our data. This leads us to one more **dplyr** feature:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be easily computed using\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups, such as per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n",
"For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins by the **Month** column and then find the **mean price** for each month.\n"
],
"metadata": {
"id": "jMakvJZIcVkh"
}
},
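{
"cell_type": "markdown",
"source": [
"On a small made-up table the mechanics look like this. `toy` and `monthly` are hypothetical names, and the extra count column `n` is added only to show that several summaries can be computed at once.\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Hypothetical prices for two months\n",
"toy <- tibble(\n",
"  Month = c(9, 9, 10, 10, 10),\n",
"  Price = c(10, 20, 30, 40, 50)\n",
")\n",
"\n",
"# One output row per month: the mean price and the number of rows in that group\n",
"monthly <- toy %>%\n",
"  group_by(Month) %>%\n",
"  summarise(mean_price = mean(Price), n = n())\n",
"\n",
"monthly\n",
"```\n",
"\n",
"Five input rows collapse to two, one per distinct `Month`, which is the shape a per-month bar chart needs.\n"
],
"metadata": {}
},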
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price))"
],
"outputs": [],
"metadata": {
"id": "6kVSUa2Bcilf"
}
},
{
"cell_type": "markdown",
"source": [
"Succinct!✨\n",
"\n",
"Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's create one!\n"
],
"metadata": {
"id": "Kds48GUBcj3W"
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price)) %>% \n",
" ggplot(aes(x = Month, y = mean_price)) +\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
" ylab(\"Pumpkin Price\")"
],
"outputs": [],
"metadata": {
"id": "VNbU1S3BcrxO"
}
},
{
"cell_type": "markdown",
"source": [
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!\n"
],
"metadata": {
"id": "zDm0VOzzcuzR"
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}
]
}