{
"cells": [
{
"cell_type": "markdown",
"source": [
"## **Nigerian music scraped from Spotify - an analysis**\n",
"\n",
"Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes that a dataset is unlabelled or that its inputs are not matched with predefined outputs. It uses various algorithms to sort through unlabelled data and provide groupings according to the patterns it discerns in the data.\n",
"\n",
"[**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/27/)\n",
"\n",
"### **Introduction**\n",
"\n",
"[Clustering](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_124) is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.\n",
"\n",
"> ✅ Take a minute to think about the uses of clustering. In everyday life, clustering happens whenever you have a pile of laundry and need to sort out your family members' clothes 🧦👕👖🩲. In data science, clustering happens when you try to analyze a user's preferences, or determine the characteristics of any unlabelled dataset. Clustering, in a way, helps make sense of chaos, like a sock drawer.\n",
"\n",
"In a professional setting, clustering can be used for tasks such as market segmentation, for example determining which age groups buy which items. Another use is anomaly detection, perhaps to detect fraud in a dataset of credit card transactions. Or you might use clustering to identify tumors in a batch of medical scans.\n",
"\n",
"✅ Think for a minute about how you might have encountered clustering 'in the wild', in a banking, e-commerce, or business setting.\n",
"\n",
"> 🎓 Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been used?\n",
"\n",
"Alternatively, you could use it for grouping search results - by shopping links, images, or reviews, for example. Clustering is useful when you have a large dataset that you want to reduce and on which you want to perform more granular analysis, so the technique can be used to learn about data before other models are constructed.\n",
"\n",
"✅ Once your data is organized in clusters, you assign it a cluster id. This technique can be useful for preserving a dataset's privacy: you can refer to a data point by its cluster id rather than by more revealing, identifiable data. Can you think of other reasons why you'd refer to a cluster id rather than other elements of the cluster to identify it?\n",
"\n",
"### Getting started with clustering\n",
"\n",
"> 🎓 How we create clusters has a lot to do with how we gather up the data points into groups. Let's unpack some vocabulary:\n",
">\n",
"> 🎓 ['Transductive' vs. 'inductive'](https://wikipedia.org/wiki/Transduction_(machine_learning))\n",
">\n",
"> Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are only then applied to test cases.\n",
">\n",
"> An example: Imagine you have a dataset that is only partially labelled. Some items are 'records', some are 'cds', and some are blank. Your job is to provide labels for the blanks. If you choose an inductive approach, you'd train a model looking for 'records' and 'cds', and apply those labels to your unlabelled data. This approach will have trouble classifying things that are actually 'cassettes'. A transductive approach, on the other hand, handles this unknown data more effectively, as it works to group similar items together and then applies a label to a group. In this case, clusters might reflect 'round musical things' and 'square musical things'.\n",
">\n",
"> 🎓 ['Non-flat' vs. 'flat' geometry](https://datascience.stackexchange.com/questions/52260/terminology-flat-geometry-in-the-context-of-clustering)\n",
">\n",
"> Derived from mathematical terminology, 'non-flat' vs. 'flat' geometry refers to the measure of distances between points by either 'flat' ([Euclidean](https://wikipedia.org/wiki/Euclidean_geometry)) or 'non-flat' (non-Euclidean) geometrical methods.\n",
">\n",
"> 'Flat' in this context refers to Euclidean geometry (parts of which are taught as 'plane' geometry), and 'non-flat' refers to non-Euclidean geometry. What does geometry have to do with machine learning? Well, as two fields that are rooted in mathematics, there must be a common way to measure distances between points in clusters, and that can be done in a 'flat' or 'non-flat' way, depending on the nature of the data. [Euclidean distances](https://wikipedia.org/wiki/Euclidean_distance) are measured as the length of a line segment between two points. [Non-Euclidean distances](https://wikipedia.org/wiki/Non-Euclidean_geometry) are measured along a curve. If your data, when visualized, does not seem to lie on a plane, you might need a specialized algorithm to handle it.\n",
"\n",
"<p >\n",
" <img src=\"../../images/flat-nonflat.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"</p>\n",
"\n",
"\n",
"\n",
"> 🎓 ['Distances'](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)\n",
">\n",
"> Clusters are defined by their distance matrix, i.e. the distances between points. This distance can be measured in a few ways. Euclidean clusters are defined by the average of the point values, and contain a 'centroid' or center point. Distances are thus measured by the distance to that centroid. Non-Euclidean distances refer to 'clustroids', the point closest to the other points. Clustroids in turn can be defined in various ways.\n",
">\n",
"> 🎓 ['Constrained'](https://wikipedia.org/wiki/Constrained_clustering)\n",
">\n",
"> [Constrained Clustering](https://web.cs.ucdavis.edu/~davidson/Publications/ICDMTutorial.pdf) introduces 'semi-supervised' learning into this unsupervised method. The relationships between points are flagged as 'cannot-link' or 'must-link', so some rules are forced on the dataset.\n",
">\n",
"> An example: If an algorithm is set free on a batch of unlabelled or semi-labelled data, the clusters it produces may be of poor quality. In the example above, the clusters might group 'round music things' and 'square music things' and 'triangular things' and 'cookies'. If given some constraints, or rules to follow (\"the item must be made of plastic\", \"the item needs to be able to produce music\"), this can help 'constrain' the algorithm to make better choices.\n",
">\n",
"> 🎓 'Density'\n",
">\n",
"> Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded', and thus this data needs to be analyzed with the appropriate clustering method. [This article](https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html) demonstrates the difference between using K-Means clustering and HDBSCAN to explore a noisy dataset with uneven cluster density.\n",
"\n",
"Deepen your understanding of clustering techniques in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-cluster-models?WT.mc_id=academic-77952-leestott)\n",
"\n",
"### **Clustering algorithms**\n",
"\n",
"There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:\n",
"\n",
"- **Hierarchical clustering**. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Hierarchical clustering is characterized by repeatedly combining two clusters.\n",
"\n",
"\n",
"<p >\n",
" <img src=\"../../images/hierarchical.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"</p>\n",
"\n",
"\n",
"\n",
"- **Centroid clustering**. This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of a cluster and gathers data around that point. [K-means clustering](https://wikipedia.org/wiki/K-means_clustering) is a popular version of centroid clustering which separates a dataset into K pre-defined groups. Each point is assigned to the cluster with the nearest mean, hence the name. The squared distance from each point to its cluster center is minimized.\n",
"\n",
"<p >\n",
" <img src=\"../../images/centroid.png\"\n",
" width=\"600\"/>\n",
" <figcaption>Infographic by Dasani Madipalli</figcaption>\n",
"</p>\n",
"\n",
"\n",
"\n",
"- **Distribution-based clustering**. Based in statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.\n",
"\n",
"- **Density-based clustering**. Data points are assigned to clusters based on their density, or their grouping around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.\n",
"\n",
"- **Grid-based clustering**. For multi-dimensional datasets, a grid is created and the data is divided amongst the grid's cells, thereby creating clusters.\n",
"\n",
"The best way to learn about clustering is to try it for yourself, so that's what you'll do in this exercise.\n",
"\n",
"We'll need some packages to complete this module. You can install them with: `install.packages(c('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork'))`\n",
"\n",
"Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"suppressWarnings(if(!require(\"pacman\")) install.packages(\"pacman\"))\r\n",
"\r\n",
"pacman::p_load('tidyverse', 'tidymodels', 'DataExplorer', 'summarytools', 'plotly', 'paletteer', 'corrplot', 'patchwork')\r\n"
],
"outputs": [],
"metadata": {}
},
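{
"cell_type": "markdown",
"source": [
"To make the 'distance' vocabulary from the introduction a little more concrete, here is a minimal sketch in base R. It assumes a tiny made-up set of 2-D points (not the songs data) and treats the clustroid as the observed point with the smallest total Euclidean distance to the others, which is only one of several possible definitions. The names `pts`, `centroid` and `clustroid` are illustrative.\n",
"\n",
"```r\n",
"# Toy 2-D points (hypothetical values, not from the songs data)\n",
"pts <- matrix(c(1, 1,\n",
"                2, 2,\n",
"                3, 1,\n",
"                10, 10),\n",
"              ncol = 2, byrow = TRUE)\n",
"\n",
"# Centroid: the coordinate-wise mean; it need not be an observed point\n",
"centroid <- colMeans(pts)\n",
"\n",
"# Pairwise Euclidean distances between all points\n",
"d <- as.matrix(dist(pts, method = 'euclidean'))\n",
"\n",
"# Clustroid (assumed definition): the observed point with the\n",
"# smallest total distance to all other points\n",
"clustroid_index <- which.min(rowSums(d))\n",
"clustroid <- pts[clustroid_index, ]\n",
"\n",
"centroid\n",
"clustroid\n",
"```\n",
"\n",
"In this toy example the far-away point pulls the centroid to a location where no data point sits, while the clustroid is always one of the observed points.\n"
],
"metadata": {}
},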
{
"cell_type": "markdown",
"source": [
"## Exercise - cluster your data\n",
"\n",
"Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which clustering method is most effective for the nature of this data.\n",
"\n",
"Let's hit the ground running by importing the data.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Load the core tidyverse and make it available in your current R session\r\n",
"library(tidyverse)\r\n",
"\r\n",
"# Import the data into a tibble\r\n",
"df <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv\")\r\n",
"\r\n",
"# View the first 5 rows of the data set\r\n",
"df %>% \r\n",
" slice_head(n = 5)\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Sometimes, we may want a little more information about our data. We can look at the `data` and `its structure` by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function:\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Glimpse into the data set\r\n",
"df %>% \r\n",
" glimpse()\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Good job!💪\n",
"\n",
"We can observe that `glimpse()` gives you the total number of rows (observations) and columns (variables), followed by the first few entries of each variable in a row after the variable name. In addition, the *data type* of each variable is given immediately after its name inside `< >`.\n",
"\n",
"`DataExplorer::introduce()` can summarize this information neatly:\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Describe basic information for our data\r\n",
"df %>% \r\n",
" introduce()\r\n",
"\r\n",
"# A visual display of the same\r\n",
"df %>% \r\n",
" plot_intro()\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Awesome! We have just learnt that our data has no missing values.\n",
"\n",
"While we're at it, we can explore common central tendency statistics (e.g. [mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [median](https://en.wikipedia.org/wiki/Median)) and measures of dispersion (e.g. [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)) using `summarytools::descr()`\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Describe common statistics\r\n",
"df %>% \r\n",
" descr(stats = \"common\")\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Let's look at the general values of the data. Note that popularity can be `0`, which indicates songs that have no ranking. We'll remove those shortly.\n",
"\n",
"> 🤔 If we are working with clustering, an unsupervised method that does not require labelled data, why are we showing this data with labels? In the data exploration phase they come in handy, but they are not necessary for the clustering algorithms to work.\n",
"\n",
"### 1. Explore popular genres\n",
"\n",
"Let's go ahead and find the most popular genres 🎶 by counting how many times each one appears.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Popular genres\r\n",
"top_genres <- df %>% \r\n",
" count(artist_top_genre, sort = TRUE) %>% \r\n",
"# Encode to categorical and reorder the levels according to count\r\n",
" mutate(artist_top_genre = factor(artist_top_genre) %>% fct_inorder())\r\n",
"\r\n",
"# Print the top genres\r\n",
"top_genres\r\n"
],
"outputs": [],
"metadata": {}
},
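{
"cell_type": "markdown",
"source": [
"The `fct_inorder()` step above matters once the data is plotted, because ggplot2 orders a discrete axis by factor levels. The following is a small sketch of the difference, assuming three hypothetical genre labels that are already sorted by descending count; it needs the forcats package, which is installed with the tidyverse.\n",
"\n",
"```r\n",
"library(forcats)\n",
"\n",
"# Hypothetical genre labels, assumed to be sorted by descending count\n",
"genres <- c('nigerian pop', 'afropop', 'afro dancehall')\n",
"\n",
"# factor() alone sorts the levels alphabetically\n",
"alphabetical_levels <- levels(factor(genres))\n",
"\n",
"# fct_inorder() keeps the levels in order of first appearance,\n",
"# so the count order from count(sort = TRUE) is preserved\n",
"count_order_levels <- levels(fct_inorder(factor(genres)))\n",
"\n",
"alphabetical_levels\n",
"count_order_levels\n",
"```\n",
"\n",
"With the levels in count order, a bar chart built from `top_genres` shows the bars from most to least frequent instead of alphabetically.\n"
],
"metadata": {}
},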
{
"cell_type": "markdown",
"source": [
"That went well! They say a picture is worth a thousand rows of a data frame (actually nobody ever says that 😅). But you get the gist of it, right?\n",
"\n",
"One way to visualize categorical data (character or factor variables) is with bar plots. Let's make a bar plot of the top 10 genres:\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Change the default gray theme\r\n",
"theme_set(theme_light())\r\n",
"\r\n",
"# Visualize popular genres\r\n",
"top_genres %>%\r\n",
" slice(1:10) %>% \r\n",
" ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
" fill = artist_top_genre)) +\r\n",
" geom_col(alpha = 0.8) +\r\n",
" paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n",
" ggtitle(\"Top genres\") +\r\n",
" theme(plot.title = element_text(hjust = 0.5),\r\n",
" # Rotates the X markers (so we can read them)\r\n",
" axis.text.x = element_text(angle = 90))\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Now it's much easier to see that we have `missing` genres 🧐!\n",
"\n",
"> A good visualisation will show you things that you did not expect, or raise new questions about the data - Hadley Wickham and Garrett Grolemund, [R For Data Science](https://r4ds.had.co.nz/introduction.html)\n",
"\n",
"Note that when the top genre is described as `Missing`, it means that Spotify did not classify it, so let's get rid of it.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Visualize popular genres\r\n",
"top_genres %>%\r\n",
" filter(artist_top_genre != \"Missing\") %>% \r\n",
" slice(1:10) %>% \r\n",
" ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
" fill = artist_top_genre)) +\r\n",
" geom_col(alpha = 0.8) +\r\n",
" paletteer::scale_fill_paletteer_d(\"rcartocolor::Vivid\") +\r\n",
" ggtitle(\"Top genres\") +\r\n",
" theme(plot.title = element_text(hjust = 0.5),\r\n",
" # Rotates the X markers (so we can read them)\r\n",
" axis.text.x = element_text(angle = 90))\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"From this brief data exploration, we learn that the top three genres dominate this dataset. Let's concentrate on `afro dancehall`, `afropop`, and `nigerian pop`, and additionally filter the dataset to remove anything with a popularity value of 0 (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"nigerian_songs <- df %>% \r\n",
" # Concentrate on top 3 genres\r\n",
" filter(artist_top_genre %in% c(\"afro dancehall\", \"afropop\",\"nigerian pop\")) %>% \r\n",
" # Remove unclassified observations\r\n",
" filter(popularity != 0)\r\n",
"\r\n",
"\r\n",
"\r\n",
"# Visualize popular genres\r\n",
"nigerian_songs %>%\r\n",
" count(artist_top_genre) %>%\r\n",
" ggplot(mapping = aes(x = artist_top_genre, y = n,\r\n",
" fill = artist_top_genre)) +\r\n",
" geom_col(alpha = 0.8) +\r\n",
" paletteer::scale_fill_paletteer_d(\"ggsci::category10_d3\") +\r\n",
" ggtitle(\"Top genres\") +\r\n",
" theme(plot.title = element_text(hjust = 0.5))\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Let's see whether there is any apparent linear relationship among the numerical variables in our dataset. This relationship is quantified mathematically by the [correlation statistic](https://en.wikipedia.org/wiki/Correlation).\n",
"\n",
"The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship. Values above 0 indicate a *positive* correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a *negative* correlation (high values of one variable tend to coincide with low values of the other).\n"
],
"metadata": {}
},
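{
"cell_type": "markdown",
"source": [
"As a small worked example of the statistic just described, the sketch below computes a Pearson correlation by hand and with `cor()`. The vectors `x`, `y_up` and `y_down` are made-up values chosen for illustration, not columns of the songs data.\n",
"\n",
"```r\n",
"# Toy vectors (hypothetical values, not taken from the songs data)\n",
"x <- c(1, 2, 3, 4, 5)\n",
"y_up <- c(2, 4, 5, 4, 6)     # tends to rise with x\n",
"y_down <- c(9, 7, 6, 4, 1)   # tends to fall as x rises\n",
"\n",
"# Pearson correlation from its definition:\n",
"# covariance scaled by both standard deviations\n",
"r_manual <- cov(x, y_up) / (sd(x) * sd(y_up))\n",
"\n",
"# cor() computes the same statistic (Pearson is its default method)\n",
"r_up <- cor(x, y_up)\n",
"r_down <- cor(x, y_down)\n",
"\n",
"c(manual = r_manual, up = r_up, down = r_down)\n",
"```\n",
"\n",
"The first two values agree and are positive, while the third is negative. Calling `cor()` on a data frame of numeric columns, as in the next cell, applies the same calculation to every pair of columns.\n"
],
"metadata": {}
},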
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Narrow down to numeric variables and find correlation\r\n",
"corr_mat <- nigerian_songs %>% \r\n",
" select(where(is.numeric)) %>% \r\n",
" cor()\r\n",
"\r\n",
"# Visualize correlation matrix\r\n",
"corrplot(corr_mat, order = 'AOE', col = c('white', 'black'), bg = 'gold2') \r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"The data is not strongly correlated except between `energy` and `loudness`, which makes sense, given that loud music is usually pretty energetic. `Popularity` corresponds to `release date`, which also makes sense, as more recent songs are probably more popular. Length and energy seem to be correlated too.\n",
"\n",
"It will be interesting to see what a clustering algorithm can make of this data!\n",
"\n",
"> 🎓 Note that correlation does not imply causation! We have evidence of correlation but no evidence of causation. An [amusing web site](https://tylervigen.com/spurious-correlations) has some visuals that emphasize this point.\n",
"\n",
"### 2. Explore data distribution\n",
"\n",
"Let's ask some more subtle questions. Are the genres significantly different in the perception of their danceability, based on their popularity? Let's examine the distribution of popularity and danceability for our top three genres along the x and y axes using [density plots](https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/density-curves/v/density-curves).\n"
],
"metadata": {}
},
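{
"cell_type": "markdown",
"source": [
"Before plotting, it may help to see what a density curve represents. The sketch below uses base R's `density()` on simulated scores; the values, the `rbeta()` parameters and the variable names are assumptions for illustration only and have nothing to do with the songs data.\n",
"\n",
"```r\n",
"set.seed(42)\n",
"\n",
"# Hypothetical danceability-like scores between 0 and 1\n",
"scores <- rbeta(200, shape1 = 5, shape2 = 2)\n",
"\n",
"# Kernel density estimate (a Gaussian kernel is the default)\n",
"kde <- density(scores)\n",
"\n",
"# The area under the estimated curve should be roughly 1\n",
"area <- sum(kde$y) * diff(kde$x[1:2])\n",
"\n",
"# The x value where the curve peaks approximates the most common score\n",
"peak <- kde$x[which.max(kde$y)]\n",
"\n",
"area\n",
"peak\n",
"```\n",
"\n",
"`geom_density()` draws this kind of curve for one variable, and `geom_density_2d()` draws contour lines of the two-variable analogue.\n"
],
"metadata": {}
},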
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Perform 2D kernel density estimation\r\n",
"density_estimate_2d <- nigerian_songs %>% \r\n",
" ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre)) +\r\n",
" geom_density_2d(bins = 5, size = 1) +\r\n",
" paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
" xlim(-20, 80) +\r\n",
" ylim(0, 1.2)\r\n",
"\r\n",
"# Density plot based on the popularity\r\n",
"density_estimate_pop <- nigerian_songs %>% \r\n",
" ggplot(mapping = aes(x = popularity, fill = artist_top_genre, color = artist_top_genre)) +\r\n",
" geom_density(size = 1, alpha = 0.5) +\r\n",
" paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
" paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
" theme(legend.position = \"none\")\r\n",
"\r\n",
"# Density plot based on the danceability\r\n",
"density_estimate_dance <- nigerian_songs %>% \r\n",
" ggplot(mapping = aes(x = danceability, fill = artist_top_genre, color = artist_top_genre)) +\r\n",
" geom_density(size = 1, alpha = 0.5) +\r\n",
" paletteer::scale_fill_paletteer_d(\"RSkittleBrewer::wildberry\") +\r\n",
" paletteer::scale_color_paletteer_d(\"RSkittleBrewer::wildberry\")\r\n",
"\r\n",
"\r\n",
"# Patch everything together\r\n",
"library(patchwork)\r\n",
"density_estimate_2d / (density_estimate_pop + density_estimate_dance)\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We see concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?\n",
"\n",
"In general, the three genres align in terms of their popularity and danceability. Determining clusters in this loosely aligned data will be a challenge. Let's see whether a scatter plot can support this.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# A scatter plot of popularity and danceability\r\n",
"scatter_plot <- nigerian_songs %>% \r\n",
" ggplot(mapping = aes(x = popularity, y = danceability, color = artist_top_genre, shape = artist_top_genre)) +\r\n",
" geom_point(size = 2, alpha = 0.8) +\r\n",
" paletteer::scale_color_paletteer_d(\"futurevisions::mars\")\r\n",
"\r\n",
"# Add a touch of interactivity\r\n",
"ggplotly(scatter_plot)\r\n"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"A scatter plot of the same axes shows a similar pattern of convergence.\n",
"\n",
"In general, for clustering, you can use scatter plots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that seem to overlap in interesting ways.\n",
"\n",
"## **🚀 Challenge**\n",
"\n",
"In preparation for the next lesson, make a chart of the various clustering algorithms you might discover and use in a production environment. What kinds of problems is clustering trying to address?\n",
"\n",
"## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/28/)\n",
"\n",
"## **Review & Self Study**\n",
"\n",
"Before you apply clustering algorithms, as we have learned, it's a good idea to understand the nature of your dataset. Read more on this topic [here](https://www.kdnuggets.com/2019/10/right-clustering-algorithm.html)\n",
"\n",
"Deepen your understanding of clustering techniques:\n",
"\n",
"- [Train and Evaluate Clustering Models using Tidymodels and friends](https://rpubs.com/eR_ic/clustering)\n",
"\n",
"- Bradley Boehmke & Brandon Greenwell, [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/)*.*\n",
"\n",
"## **Assignment**\n",
"\n",
"[Research other visualizations for clustering](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/assignment.md)\n",
"\n",
"## THANK YOU TO:\n",
"\n",
"[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️\n",
"\n",
"[`Dasani Madipalli`](https://twitter.com/dasani_decoded) for creating the amazing illustrations that make machine learning concepts more interpretable and easier to understand.\n",
"\n",
"Happy Learning,\n",
"\n",
"[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.\n"
],
"metadata": {}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Disclaimer**: \nThis document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.\n"
]
}
],
"metadata": {
"anaconda-cloud": "",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.1"
},
"coopTranslator": {
"original_hash": "99c36449cad3708a435f6798cfa39972",
"translation_date": "2025-09-06T14:15:48+00:00",
"source_file": "5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb",
"language_code": "sw"
}
},
"nbformat": 4,
"nbformat_minor": 1
}