{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "rQ8UhzFpgRra" }, "source": [ "# Data Preparation\n", "\n", "[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n", "\n", "## Exploring `DataFrame` information\n", "\n", "> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n", "\n", "Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the dataset in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame`, in addition to the first few and last few rows.\n", "\n", "In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen many times: British biologist Ronald Fisher's *Iris* dataset, used in his 1936 paper \"The use of multiple measurements in taxonomic problems\":\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "id": "hB1RofhdgRrp", "trusted": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])" ] }, { "cell_type": "markdown", "metadata": { "id": "AGA0A_Y8hMdz" }, "source": [ "### `DataFrame.shape`\n", "We have loaded the Iris dataset into the variable `iris_df`. Before diving into the data, it is valuable to know the number of data points we have and the overall size of the dataset. 
It is useful to look at the volume of data we are dealing with.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LOe5jQohhulf", "outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9" }, "outputs": [ { "data": { "text/plain": [ "(150, 4)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "smE7AGzOhxk2" }, "source": [ "So, we are dealing with 150 rows and 4 columns of data. Each row represents one data point and each column represents a single feature associated with the data frame. In other words, there are 150 data points containing 4 features each.\n", "\n", "`shape` here is an attribute of the data frame and not a function, which is why it doesn't end in a pair of parentheses.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d3AZKs0PinGP" }, "source": [ "### `DataFrame.columns`\n", "Let us now move on to the 4 columns of data. What exactly does each of them represent? The `columns` attribute gives us the names of the columns in the data frame.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YPGh_ziji-CY", "outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0" }, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "TsobcU_VjCC_" }, "source": [ "As we can see, there are four (4) columns. The `columns` attribute tells us the names of the columns and essentially nothing else. 
This attribute becomes important when we want to identify the features a dataset contains.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2UTlvkjmgRrs" }, "source": [ "### `DataFrame.info`\n", "The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Now, we want to dive deeper into the dataset. The `DataFrame.info()` function is quite useful for this.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dHHRyG0_gRrt", "outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5", "trusted": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 sepal length (cm) 150 non-null float64\n", " 1 sepal width (cm) 150 non-null float64\n", " 2 petal length (cm) 150 non-null float64\n", " 3 petal width (cm) 150 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 4.8 KB\n" ] } ], "source": [ "iris_df.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "1XgVMpvigRru" }, "source": [ "From here, we can make a few observations: \n", "1. The data type of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers. \n", "2. Number of non-null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "IYlyxbpWFEF4" }, "source": [ "### `DataFrame.describe()`\n", "Say we have a lot of numerical data in our dataset. Univariate statistical calculations such as the mean, median, quartiles etc. can be done on each of the columns individually. 
The `DataFrame.describe()` function provides us with a statistical summary of the numerical columns of a dataset.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "tWV-CMstFIRA", "outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>count</th><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td></tr>\n",
" <tr><th>mean</th><td>5.843333</td><td>3.057333</td><td>3.758000</td><td>1.199333</td></tr>\n",
" <tr><th>std</th><td>0.828066</td><td>0.435866</td><td>1.765298</td><td>0.762238</td></tr>\n",
" <tr><th>min</th><td>4.300000</td><td>2.000000</td><td>1.000000</td><td>0.100000</td></tr>\n",
" <tr><th>25%</th><td>5.100000</td><td>2.800000</td><td>1.600000</td><td>0.300000</td></tr>\n",
" <tr><th>50%</th><td>5.800000</td><td>3.000000</td><td>4.350000</td><td>1.300000</td></tr>\n",
" <tr><th>75%</th><td>6.400000</td><td>3.300000</td><td>5.100000</td><td>1.800000</td></tr>\n",
" <tr><th>max</th><td>7.900000</td><td>4.400000</td><td>6.900000</td><td>2.500000</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "zjjtW5hPGMuM" }, "source": [ "Matokeo hapo juu yanaonyesha jumla ya idadi ya pointi za data, wastani, upotofu wa kawaida, thamani ya chini, robo ya chini (25%), mediani (50%), robo ya juu (75%) na thamani ya juu ya kila safu.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-lviAu99gRrv" }, "source": [ "### `DataFrame.head`\n", "Kwa kutumia kazi na sifa zote zilizotajwa hapo juu, tumepata muhtasari wa juu wa seti ya data. Tunajua ni alama ngapi za data zipo, ni sifa ngapi zipo, aina ya data ya kila sifa, na idadi ya thamani zisizo tupu kwa kila sifa.\n", "\n", "Sasa ni wakati wa kuangalia data yenyewe. Hebu tuone jinsi safu chache za mwanzo (alama chache za mwanzo za data) za `DataFrame` yetu zinavyoonekana:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DZMJZh0OgRrw", "outputId": "d9393ee5-c106-4797-f815-218f17160e00", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td></tr>\n",
" <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td></tr>\n",
" <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td></tr>\n",
" <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td></tr>\n",
" <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EBHEimZuEFQK" }, "source": [ "Kama matokeo hapa, tunaweza kuona maingizo matano (5) ya seti ya data. Tukitazama faharasa upande wa kushoto, tunagundua kuwa hizi ni safu tano za kwanza.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oj7GkrTdgRry" }, "source": [ "### Zoezi:\n", "\n", "Kutoka kwa mfano uliotolewa hapo juu, ni wazi kwamba, kwa chaguo-msingi, `DataFrame.head` inarudisha safu tano za kwanza za `DataFrame`. Katika seli ya msimbo hapa chini, unaweza kugundua njia ya kuonyesha zaidi ya safu tano?\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "id": "EKRmRFFegRrz", "trusted": false }, "outputs": [], "source": [ "# Hint: Consult the documentation by using iris_df.head?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJ_cpZqNgRr1" }, "source": [ "### `DataFrame.tail`\n", "Njia nyingine ya kuangalia data inaweza kuwa kutoka mwisho (badala ya mwanzo). Kinyume cha `DataFrame.head` ni `DataFrame.tail`, ambayo inarudisha safu tano za mwisho za `DataFrame`:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "heanjfGWgRr2", "outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>145</th><td>6.7</td><td>3.0</td><td>5.2</td><td>2.3</td></tr>\n",
" <tr><th>146</th><td>6.3</td><td>2.5</td><td>5.0</td><td>1.9</td></tr>\n",
" <tr><th>147</th><td>6.5</td><td>3.0</td><td>5.2</td><td>2.0</td></tr>\n",
" <tr><th>148</th><td>6.2</td><td>3.4</td><td>5.4</td><td>2.3</td></tr>\n",
" <tr><th>149</th><td>5.9</td><td>3.0</td><td>5.1</td><td>1.8</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "145 6.7 3.0 5.2 2.3\n", "146 6.3 2.5 5.0 1.9\n", "147 6.5 3.0 5.2 2.0\n", "148 6.2 3.4 5.4 2.3\n", "149 5.9 3.0 5.1 1.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "31kBWfyLgRr3" }, "source": [ "Katika mazoezi, ni muhimu kuwa na uwezo wa kuchunguza kwa urahisi safu chache za mwanzo au safu chache za mwisho za `DataFrame`, hasa unapokuwa unatafuta thamani zisizo za kawaida katika seti za data zilizo na mpangilio.\n", "\n", "Kazi zote na sifa zilizoonyeshwa hapo juu kwa msaada wa mifano ya msimbo, zinatusaidia kupata muonekano na hisia ya data.\n", "\n", "> **Mafunzo:** Hata kwa kuangalia tu metadata kuhusu taarifa katika `DataFrame` au thamani chache za mwanzo na mwisho ndani yake, unaweza kupata wazo la haraka kuhusu ukubwa, umbo, na maudhui ya data unayoshughulikia.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TvurZyLSDxq_" }, "source": [ "### Data Inayokosekana\n", "Hebu tujadili kuhusu data inayokosekana. Data inayokosekana hutokea pale ambapo hakuna thamani iliyohifadhiwa katika baadhi ya safu.\n", "\n", "Hebu tuchukue mfano: tuseme mtu fulani anajali sana kuhusu uzito wake na hawezi kujaza sehemu ya uzito katika dodoso. Basi, thamani ya uzito kwa mtu huyo itakuwa haipo.\n", "\n", "Mara nyingi, katika seti za data za ulimwengu halisi, thamani zinazokosekana hutokea.\n", "\n", "**Jinsi Pandas Inavyoshughulikia Data Inayokosekana**\n", "\n", "Pandas hushughulikia thamani zinazokosekana kwa njia mbili. Ya kwanza umeiona hapo awali katika sehemu zilizopita: `NaN`, au Not a Number. Hii ni thamani maalum ambayo ni sehemu ya vipimo vya IEEE vya namba za desimali na hutumika tu kuonyesha thamani za desimali zinazokosekana.\n", "\n", "Kwa thamani zinazokosekana ambazo si za desimali, pandas hutumia kitu cha Python `None`. 
Ingawa inaweza kuonekana kuchanganya kwamba utakutana na aina mbili tofauti za thamani zinazomaanisha kimsingi kitu kimoja, kuna sababu za kimipangilio za kubuni chaguo hili, na kwa vitendo, njia hii inaiwezesha pandas kutoa suluhisho bora kwa hali nyingi. Hata hivyo, `None` na `NaN` zote zina vizuizi ambavyo unapaswa kuwa makini navyo kuhusiana na jinsi zinavyoweza kutumika.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOHqUlZFgRr5" }, "source": [ "### `None`: data isiyokuwepo isiyo ya aina ya float\n", "Kwa sababu `None` inatoka kwenye Python, haiwezi kutumika katika arrays za NumPy na pandas ambazo si za aina ya data `'object'`. Kumbuka, arrays za NumPy (na miundo ya data katika pandas) zinaweza kuwa na aina moja tu ya data. Hii ndiyo inazipa nguvu kubwa kwa kazi za data na mahesabu makubwa, lakini pia inapunguza uwezo wao wa kubadilika. Arrays kama hizo lazima zibadilishwe hadi kwenye “kiwango cha chini cha kawaida,” yaani aina ya data ambayo itajumuisha kila kitu kilichopo kwenye array. Wakati `None` ipo kwenye array, inamaanisha unafanya kazi na vitu vya Python.\n", "\n", "Ili kuona hili likifanyika, zingatia mfano wa array ifuatayo (angalia `dtype` yake):\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QIoNdY4ngRr7", "outputId": "92779f18-62f4-4a03-eca2-e9a101604336", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "array([2, None, 6, 8], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "example1 = np.array([2, None, 6, 8])\n", "example1" ] }, { "cell_type": "markdown", "metadata": { "id": "pdlgPNbhgRr7" }, "source": [ "Uhalisia wa aina za data zilizopandishwa huleta athari mbili. Kwanza, operesheni zitafanyika katika kiwango cha msimbo wa Python unaotafsiriwa badala ya msimbo wa NumPy uliokusanywa. 
Kimsingi, hii inamaanisha kuwa operesheni yoyote inayohusisha `Series` au `DataFrames` zenye `None` ndani yake itakuwa polepole zaidi. Ingawa huenda usione athari hii ya utendaji, kwa seti kubwa za data inaweza kuwa tatizo.\n", "\n", "Athari ya pili inatokana na ya kwanza. Kwa sababu `None` kimsingi inarudisha `Series` au `DataFrame`s katika ulimwengu wa Python ya kawaida, kutumia mkusanyiko wa NumPy/pandas kama `sum()` au `min()` kwenye safu ambazo zina thamani ya ``None`` kwa ujumla kutasababisha hitilafu:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 292 }, "id": "gWbx-KB9gRr8", "outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50", "trusted": false }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "example1.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "LcEwO8UogRr9" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "pWvVHvETgRr9" }, "source": [ "### `NaN`: thamani za float zinazokosekana\n", "\n", "Tofauti na `None`, NumPy (na kwa hivyo pandas) inasaidia `NaN` kwa ajili ya operesheni zake za haraka, za vekta, na ufuncs. Habari mbaya ni kwamba hesabu yoyote inayofanywa kwenye `NaN` daima husababisha `NaN`. Kwa mfano:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rcFYfMG9gRr9", "outputId": "699e81b7-5c11-4b46-df1d-06071768690f", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan + 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BW3zQD2-gRr-", "outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan * 0" ] }, { "cell_type": "markdown", "metadata": { "id": "fU5IPRcCgRr-" }, "source": [ "Habari njema: mkusanyiko unaoendeshwa kwenye safu zenye `NaN` ndani yake hauleti makosa. 
Habari mbaya: matokeo si ya manufaa kwa usawa:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LCInVgSSgRr_", "outputId": "fa06495a-0930-4867-87c5-6023031ea8b5", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "(nan, nan, nan)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example2 = np.array([2, np.nan, 6, 8]) \n", "example2.sum(), example2.min(), example2.max()" ] }, { "cell_type": "markdown", "metadata": { "id": "nhlnNJT7gRr_" }, "source": [] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "id": "yan3QRaOgRr_", "trusted": false }, "outputs": [], "source": [ "# What happens if you add np.nan and None together?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_iDvIRC8gRsA" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "kj6EKdsAgRsA" }, "source": [ "### `NaN` na `None`: thamani tupu katika pandas\n", "\n", "Ingawa `NaN` na `None` zinaweza kuonyesha tabia tofauti kidogo, pandas imeundwa kushughulikia zote mbili kwa njia inayofanana. 
Ili kuelewa tunachomaanisha, fikiria `Series` ya nambari kamili:\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nji-KGdNgRsA", "outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_series = pd.Series([1, 2, 3], dtype=int)\n", "int_series" ] }, { "cell_type": "markdown", "metadata": { "id": "WklCzqb8gRsB" }, "source": [] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "id": "Cy-gqX5-gRsB", "trusted": false }, "outputs": [], "source": [ "# Now set an element of int_series equal to None.\n", "# How does that element show up in the Series?\n", "# What is the dtype of the Series?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "WjMQwltNgRsB" }, "source": [ "Katika mchakato wa kubadilisha aina za data ili kuanzisha usawa wa data katika `Series` na `DataFrame`s, pandas itabadilisha kwa hiari thamani zilizokosekana kati ya `None` na `NaN`. Kwa sababu ya kipengele hiki cha muundo, inaweza kuwa muhimu kufikiria `None` na `NaN` kama ladha mbili tofauti za \"null\" katika pandas. 
Kwa kweli, baadhi ya mbinu za msingi utakazotumia kushughulikia thamani zilizokosekana katika pandas zinaakisi wazo hili katika majina yao:\n", "\n", "- `isnull()`: Hutengeneza mask ya Boolean inayoonyesha thamani zilizokosekana\n", "- `notnull()`: Kinyume cha `isnull()`\n", "- `dropna()`: Hurejesha toleo lililochujwa la data\n", "- `fillna()`: Hurejesha nakala ya data na thamani zilizokosekana zikiwa zimejazwa au kuwekewa makadirio\n", "\n", "Hizi ni mbinu muhimu za kuzifahamu na kuzoea, kwa hivyo hebu tuzipitie kila moja kwa undani zaidi.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Yh5ifd9FgRsB" }, "source": [ "### Kugundua thamani za null\n", "\n", "Sasa kwa kuwa tumeelewa umuhimu wa thamani zinazokosekana, tunahitaji kuzitambua kwenye seti yetu ya data kabla ya kushughulika nazo. \n", "Zote `isnull()` na `notnull()` ni mbinu zako za msingi za kugundua data ya null. Zote zinarudisha maski za Boolean juu ya data yako.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "id": "e-vFp5lvgRsC", "trusted": false }, "outputs": [], "source": [ "example3 = pd.Series([0, np.nan, '', None])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XdaJJ7PgRsC", "outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "PaSZ0SQygRsC" }, "source": [ "Angalia kwa makini matokeo. Je, kuna lolote linalokushangaza? Ingawa `0` ni null ya hesabu, bado ni nambari kamili nzuri na pandas inaitendea hivyo. `''` ni kidogo zaidi ya hila. 
Ingawa tulitumia katika Sehemu ya 1 kuwakilisha thamani ya mnyororo tupu, bado ni kitu cha mnyororo na si uwakilishi wa null kulingana na pandas.\n", "\n", "Sasa, hebu tugeuze hili na tutumie mbinu hizi kwa namna ambayo utazitumia kwa vitendo. Unaweza kutumia vinyago vya Boolean moja kwa moja kama ``Series`` au ``DataFrame`` index, ambayo inaweza kuwa muhimu unapojaribu kufanya kazi na thamani zilizokosekana (au zilizopo) pekee.\n", "\n", "Ikiwa tunataka jumla ya idadi ya thamani zilizokosekana, tunaweza kufanya jumla juu ya kinyago kinachozalishwa na mbinu ya `isnull()`.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JCcQVoPkHDUv", "outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "PlBqEo3mgRsC" }, "source": [] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "id": "ggDVf5uygRsD", "trusted": false }, "outputs": [], "source": [ "# Try running example3[example3.notnull()].\n", "# Before you do so, what do you expect to see?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D_jWN7mHgRsD" }, "source": [ "**Mambo ya msingi**: Mbinu za `isnull()` na `notnull()` hutoa matokeo yanayofanana unapotumia katika DataFrames: zinaonyesha matokeo na faharasa ya matokeo hayo, ambayo yatakusaidia sana unaposhughulika na data yako.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BvnoojWsgRr4" }, "source": [ "### Kushughulikia data iliyokosekana\n", "\n", "> **Lengo la kujifunza:** Kufikia mwisho wa sehemu hii ndogo, unapaswa kujua jinsi na wakati wa kubadilisha au kuondoa thamani za null kutoka kwa DataFrames.\n", "\n", "Mifano ya Kujifunza Mashine haziwezi kushughulikia data iliyokosekana zenyewe. 
Kwa hivyo, kabla ya kupitisha data kwenye mfano, tunahitaji kushughulikia thamani hizi zilizokosekana.\n", "\n", "Jinsi data iliyokosekana inavyoshughulikiwa huleta maamuzi yenye athari ndogo, inaweza kuathiri uchambuzi wako wa mwisho na matokeo halisi ya ulimwengu.\n", "\n", "Kuna njia mbili kuu za kushughulikia data iliyokosekana:\n", "\n", "1. Kuondoa safu inayojumuisha thamani iliyokosekana\n", "2. Kubadilisha thamani iliyokosekana na thamani nyingine\n", "\n", "Tutajadili njia hizi zote mbili na faida na hasara zake kwa undani.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3VaYC1TvgRsD" }, "source": [ "### Kuondoa Thamani Zisizo na Maana (Null Values)\n", "\n", "Kiasi cha data tunachopitisha kwenye modeli yetu kina athari ya moja kwa moja kwenye utendaji wake. Kuondoa thamani zisizo na maana kunamaanisha tunapunguza idadi ya vipengele vya data, na hivyo kupunguza ukubwa wa seti ya data. Kwa hivyo, inashauriwa kuondoa safu zilizo na thamani zisizo na maana wakati seti ya data ni kubwa sana.\n", "\n", "Mfano mwingine unaweza kuwa kwamba safu fulani au safu wima ina idadi kubwa ya thamani zinazokosekana. Katika hali hiyo, zinaweza kuondolewa kwa sababu hazitachangia sana kwenye uchambuzi wetu kwani data nyingi inakosekana kwa safu/safu wima hiyo.\n", "\n", "Zaidi ya kutambua thamani zinazokosekana, pandas inatoa njia rahisi ya kuondoa thamani zisizo na maana kutoka kwa `Series` na `DataFrame`s. Ili kuona hili likifanyika, hebu turudi kwenye `example3`. 
Kazi ya `DataFrame.dropna()` husaidia kuondoa safu zilizo na thamani zisizo na maana.\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7uIvS097gRsD", "outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "2 \n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example3 = example3.dropna()\n", "example3" ] }, { "cell_type": "markdown", "metadata": { "id": "hil2cr64gRsD" }, "source": [ "Kumbuka kwamba hii inapaswa kuonekana kama matokeo yako kutoka `example3[example3.notnull()]`. Tofauti hapa ni kwamba, badala ya kuorodhesha tu kwenye thamani zilizofichwa, `dropna` imeondoa zile thamani zilizokosekana kutoka kwa `Series` `example3`.\n", "\n", "Kwa sababu DataFrames zina vipimo viwili, zinatoa chaguo zaidi za kuondoa data.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "an-l74sPgRsE", "outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
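<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1.0</td><td>NaN</td><td>7</td></tr>\n",
" <tr><th>1</th><td>2.0</td><td>5.0</td><td>8</td></tr>\n",
" <tr><th>2</th><td>NaN</td><td>6.0</td><td>9</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>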
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 7\n", "1 2.0 5.0 8\n", "2 NaN 6.0 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4 = pd.DataFrame([[1, np.nan, 7], \n", " [2, 5, 8], \n", " [np.nan, 6, 9]])\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "66wwdHZrgRsE" }, "source": [ "(Je, uliona kwamba pandas ilibadilisha safu mbili kuwa nambari za desimali ili kushughulikia `NaN`?)\n", "\n", "Huwezi kuondoa thamani moja pekee kutoka kwa `DataFrame`, kwa hivyo unapaswa kuondoa safu nzima au safu wima. Kulingana na unachofanya, unaweza kutaka kufanya moja au nyingine, na hivyo pandas inakupa chaguo kwa zote mbili. Kwa sababu katika sayansi ya data, safu wima kwa kawaida huwakilisha vigezo na safu huwakilisha uchunguzi, kuna uwezekano mkubwa wa kuondoa safu za data; mpangilio wa chaguo-msingi wa `dropna()` ni kuondoa safu zote ambazo zina thamani yoyote isiyo na maana:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jAVU24RXgRsE", "outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
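<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1</th><td>2.0</td><td>5.0</td><td>8</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>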
" ], "text/plain": [ " 0 1 2\n", "1 2.0 5.0 8" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "TrQRBuTDgRsE" }, "source": [ "Ikiwa ni lazima, unaweza kuondoa thamani za NA kutoka kwenye safu. Tumia `axis=1` kufanya hivyo:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "GrBhxu9GgRsE", "outputId": "ff4001f3-2e61-4509-d60e-0093d1068437", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
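<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>7</td></tr>\n",
" <tr><th>1</th><td>8</td></tr>\n",
" <tr><th>2</th><td>9</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>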
" ], "text/plain": [ " 2\n", "0 7\n", "1 8\n", "2 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "KWXiKTfMgRsF" }, "source": [ "Kumbuka kwamba hii inaweza kuondoa data nyingi ambayo unaweza kutaka kuhifadhi, hasa katika seti ndogo za data. Je, unafanyaje ikiwa unataka tu kuondoa safu au nguzo ambazo zina thamani kadhaa au hata zote zikiwa tupu? Unaweza kubainisha mipangilio hiyo katika `dropna` kwa kutumia vigezo vya `how` na `thresh`.\n", "\n", "Kwa chaguo-msingi, `how='any'` (ikiwa ungependa kuthibitisha mwenyewe au kuona vigezo vingine ambavyo njia hii inayo, endesha `example4.dropna?` katika seli ya msimbo). Vinginevyo, unaweza kubainisha `how='all'` ili kuondoa tu safu au nguzo ambazo zina thamani zote tupu. Hebu tuongeze mfano wetu wa `DataFrame` ili kuona hili likifanyika katika zoezi lijalo.\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "Bcf_JWTsgRsF", "outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
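<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1.0</td><td>NaN</td><td>7</td><td>NaN</td></tr>\n",
" <tr><th>1</th><td>2.0</td><td>5.0</td><td>8</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>NaN</td><td>6.0</td><td>9</td><td>NaN</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>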
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4[3] = np.nan\n", "example4" ] }, { "cell_type": "markdown", "metadata": { "id": "pNZer7q9JPNC" }, "source": [ "> Mambo ya Muhimu: \n", "1. Kuondoa thamani za null ni wazo zuri tu ikiwa seti ya data ni kubwa vya kutosha. \n", "2. Safu nzima au nguzo zinaweza kuondolewa ikiwa zina data nyingi zilizokosekana. \n", "3. Njia ya `DataFrame.dropna(axis=)` husaidia kuondoa thamani za null. Hoja ya `axis` inaonyesha kama safu (rows) zinapaswa kuondolewa au nguzo (columns). \n", "4. Hoja ya `how` pia inaweza kutumika. Kwa chaguo-msingi imewekwa kuwa `any`. Hivyo, inaondoa tu safu/nguzo ambazo zina thamani yoyote ya null. Inaweza kuwekwa kuwa `all` ili kubainisha kwamba tutaondoa tu safu/nguzo ambazo thamani zote ni null. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "oXXSfQFHgRsF" }, "source": [] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true, "id": "ExUwQRxpgRsF", "trusted": false }, "outputs": [], "source": [ "# How might you go about dropping just column 3?\n", "# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "38kwAihWgRsG" }, "source": [ "Kigezo cha `thresh` kinakupa udhibiti wa kina zaidi: unaweka idadi ya thamani *zisizo tupu* ambazo safu au safu wima inahitaji kuwa nazo ili kuhifadhiwa:\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "M9dCNMaagRsG", "outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "1 2.0 5.0 8 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.dropna(axis='rows', thresh=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "fmSFnzZegRsG" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "mCcxLGyUgRsG" }, "source": [ "### Kujaza Thamani Zilizokosekana\n", "\n", "Wakati mwingine inafaa kujaza thamani zilizokosekana na zile ambazo zinaweza kuwa halali. Kuna mbinu kadhaa za kujaza thamani tupu. Ya kwanza ni kutumia Maarifa ya Eneo (maarifa ya mada ambayo dataset inahusu) ili kwa namna fulani kukadiria thamani zilizokosekana.\n", "\n", "Unaweza kutumia `isnull` kufanya hili moja kwa moja, lakini hilo linaweza kuwa kazi ngumu, hasa ikiwa una thamani nyingi za kujaza. Kwa sababu hii ni kazi ya kawaida sana katika sayansi ya data, pandas inatoa `fillna`, ambayo inarudisha nakala ya `Series` au `DataFrame` na thamani zilizokosekana kubadilishwa na zile unazochagua. Hebu tuunde mfano mwingine wa `Series` ili kuona jinsi hii inavyofanya kazi kwa vitendo.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CE8S7louLezV" }, "source": [ "### Data ya Kategoria (Isiyo ya Nambari)\n", "Kwanza, tuzingatie data isiyo ya nambari. Katika seti za data, tuna safu zenye data ya kategoria. Mfano: Jinsia, Kweli au Siyo Kweli, n.k.\n", "\n", "Katika hali nyingi, tunabadilisha thamani zilizokosekana kwa kutumia `mode` ya safu husika. Kwa mfano, tuna alama 100 za data ambapo 90 zimesema Kweli, 8 zimesema Siyo Kweli, na 2 hazijajazwa. Basi, tunaweza kujaza zile 2 kwa Kweli, tukizingatia safu nzima.\n", "\n", "Tena, hapa tunaweza kutumia maarifa ya uwanja. Hebu tuzingatie mfano wa kujaza kwa kutumia mode.\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "MY5faq4yLdpQ", "outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 None\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", " [3,4,None],\n", " [5,6,\"False\"],\n", " [7,8,\"True\"],\n", " [9,10,\"True\"]])\n", "\n", "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "MLAoMQOfNPlA" }, "source": [] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WKy-9Y2tN5jv", "outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf" }, "outputs": [ { "data": { "text/plain": [ "True 3\n", "False 1\n", "Name: 2, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode[2].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "6iNz_zG_OKrx" }, "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "TxPKteRvNPOs" }, "outputs": [], "source": [ "fill_with_mode[2].fillna('True',inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "tvas7c9_OPWE", "outputId": "ec3c8e44-d644-475e-9e22-c65101965850" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 True\n", "1 3 4 True\n", "2 5 6 False\n", "3 7 8 True\n", "4 9 10 True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mode" ] }, { "cell_type": "markdown", "metadata": { "id": "SktitLxxOR16" }, "source": [ "Kama tunavyoona, thamani ya null imebadilishwa. Bila shaka, tungeweza kuandika chochote badala ya `'True'` na kingechukua nafasi hiyo.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "heYe1I0dOmQ_" }, "source": [ "### Takwimu za Kihesabu\n", "Sasa, tukija kwenye takwimu za kihesabu. Hapa, tuna njia mbili za kawaida za kujaza thamani zilizokosekana:\n", "\n", "1. Kujaza kwa Median ya safu\n", "2. Kujaza kwa Mean ya safu\n", "\n", "Tunatumia Median pale ambapo data ina mwelekeo na ina outliers. Hii ni kwa sababu median haiguswi sana na outliers.\n", "\n", "Wakati data imesawazishwa, tunaweza kutumia mean, kwa kuwa katika hali hiyo, mean na median zitakuwa karibu sana.\n", "\n", "Kwanza, hebu tuchukue safu ambayo imegawanyika kawaida na tujaze thamani iliyokosekana kwa kutumia mean ya safu hiyo.\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "09HM_2feOj5Y", "outputId": "7e309013-9acb-411c-9b06-4de795bbeeff" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 NaN 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [np.nan,4,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "ka7-wNfzSxbx" }, "source": [ "Wastani wa safu ni\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XYtYEf5BSxFL", "outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(fill_with_mean[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "oBSRGxKRS39K" }, "source": [] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "FzncQLmuS5jh", "outputId": "00f74fff-01f4-4024-c261-796f50f01d2e" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 -2.0 0 1\n", "1 -1.0 2 3\n", "2 0.0 4 5\n", "3 1.0 6 7\n", "4 2.0 8 9" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n", "fill_with_mean" ] }, { "cell_type": "markdown", "metadata": { "id": "CwpVFCrPTC5z" }, "source": [ "Kama tunavyoona, thamani iliyokosekana imebadilishwa na wastani wake.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jIvF13a1i00Z" }, "source": [ "Sasa hebu tujaribu fremu nyingine ya data, na wakati huu tutabadilisha thamani za None na wastani wa safu.\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "DA59Bqo3jBYZ", "outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 NaN 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median = pd.DataFrame([[-2,0,1],\n", " [-1,2,3],\n", " [0,np.nan,5],\n", " [1,6,7],\n", " [2,8,9]])\n", "\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "mM1GpXYmjHnc" }, "source": [ "Median ya safu ya pili ni\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uiDy5v3xjHHX", "outputId": "564b6b74-2004-4486-90d4-b39330a64b88" }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].median()" ] }, { "cell_type": "markdown", "metadata": { "id": "z9PLF75Jj_1s" }, "source": [] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "lFKbOxCMkBbg", "outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2\n", "0 -2 0.0 1\n", "1 -1 2.0 3\n", "2 0 4.0 5\n", "3 1 6.0 7\n", "4 2 8.0 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n", "fill_with_median" ] }, { "cell_type": "markdown", "metadata": { "id": "8JtQ53GSkKWC" }, "source": [ "Kama tunavyoona, thamani ya NaN imebadilishwa na wastani wa safu hiyo\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0ybtWLDdgRsG", "outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "example5" ] }, { "cell_type": "markdown", "metadata": { "id": "yrsigxRggRsH" }, "source": [ "Unaweza kujaza nafasi zote tupu kwa thamani moja, kama `0`:\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KXMIPsQdgRsH", "outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "id": "RRlI5f_hkfKe" }, "source": [ "> Mambo ya Muhimu:\n", "1. Kujaza thamani zilizokosekana kunapaswa kufanywa pale ambapo kuna data kidogo au kuna mkakati wa kujaza data iliyokosekana.\n", "2. Maarifa ya uwanja yanaweza kutumika kujaza thamani zilizokosekana kwa kuzikadiria.\n", "3. Kwa data ya Kategoria, mara nyingi, thamani zilizokosekana hubadilishwa na hali ya kawaida ya safu husika.\n", "4. 
Kwa data ya namba, thamani zilizokosekana kwa kawaida hujazwa na wastani (kwa seti za data zilizonormalishwa) au median ya safu husika.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FI9MmqFJgRsH" }, "source": [] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "id": "af-ezpXdgRsH", "trusted": false }, "outputs": [], "source": [ "# What happens if you try to fill null values with a string, like ''?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kq3hw1kLgRsI" }, "source": [] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vO3BuNrggRsI", "outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 1.0\n", "c 2.0\n", "d 2.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='ffill')" ] }, { "cell_type": "markdown", "metadata": { "id": "nDXeYuHzgRsI" }, "source": [ "Unaweza pia **kujaza nyuma** ili kusambaza thamani halali inayofuata nyuma kujaza tupu:\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4M5onHcEgRsI", "outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 2.0\n", "d 3.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example5.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "MbBzTom5gRsI" }, "source": [ "Kama unavyoweza kudhani, hii inafanya kazi vivyo hivyo na DataFrames, lakini pia unaweza kubainisha `axis` ambayo utaijaza thamani za null:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": 
"aRpIvo4ZgRsI", "outputId": "905a980a-a808-4eca-d0ba-224bd7d85955", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 NaN 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 NaN 6.0 9 NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "VM1qtACAgRsI", "outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 1.0 7.0 7.0\n", "1 2.0 5.0 8.0 8.0\n", "2 NaN 6.0 9.0 9.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(method='ffill', axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeMc-I1EgRsI" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "eeAoOU0RgRsJ" }, "source": [] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true, "id": "e8S-CjW8gRsJ", "trusted": false }, "outputs": [], "source": [ "# What output does example4.fillna(method='bfill', axis=1) produce?\n", "# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n", "# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YHgy0lIrgRsJ" }, "source": [ "Unaweza kuwa mbunifu kuhusu jinsi unavyotumia `fillna`. Kwa mfano, hebu tuangalie tena `example4`, lakini wakati huu tujaze thamani zilizokosekana kwa wastani wa thamani zote katika `DataFrame`:\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "OtYVErEygRsJ", "outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 5.5 7 NaN\n", "1 2.0 5.0 8 NaN\n", "2 1.5 6.0 9 NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example4.fillna(example4.mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "zpMvCkLSgRsJ" }, "source": [ "Kumbuka kwamba safu ya 3 bado haina thamani: mwelekeo wa chaguo-msingi ni kujaza thamani kwa mpangilio wa safu.\n", "\n", "> **Mafunzo Muhimu:** Kuna njia nyingi za kushughulikia thamani zinazokosekana katika seti zako za data. Mkakati maalum unaotumia (kuondoa, kubadilisha, au hata jinsi unavyobadilisha) unapaswa kuamuliwa na maelezo ya data hiyo. Utapata uelewa bora wa jinsi ya kushughulikia thamani zinazokosekana kadri unavyoshughulikia na kuingiliana na seti za data.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bauDnESIl9FH" }, "source": [ "### Usimbaji wa Data ya Kategoria\n", "\n", "Miundo ya kujifunza kwa mashine hushughulikia tu nambari na aina yoyote ya data ya nambari. Haiwezi kutofautisha kati ya Ndiyo na Hapana, lakini inaweza kutofautisha kati ya 0 na 1. Kwa hivyo, baada ya kujaza thamani zilizokosekana, tunahitaji kusimba data ya kategoria kwa aina fulani ya nambari ili modeli iweze kuelewa.\n", "\n", "Usimbaji unaweza kufanywa kwa njia mbili. Tutazijadili baadaye.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "uDq9SxB7mu5i" }, "source": [ "**KODIFIKESHENI YA LEBELE** \n", "\n", "Kodifikesheni ya lebele ni mchakato wa kubadilisha kila kategoria kuwa namba. Kwa mfano, tuseme tuna seti ya data ya abiria wa ndege na kuna safu inayojumuisha daraja lao miongoni mwa haya ['daraja la biashara', 'daraja la uchumi', 'daraja la kwanza']. Ikiwa kodifikesheni ya lebele itafanyika hapa, itabadilishwa kuwa [0,1,2]. Hebu tuone mfano kupitia msimbo. 
Kwa kuwa tutajifunza `scikit-learn` katika daftari zijazo, hatutaitumia hapa.\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "1vGz7uZyoWHL", "outputId": "9e252855-d193-4103-a54d-028ea7787b34" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "IDHnkwTYov-h" }, "source": [ "Ili kufanya usimbaji wa lebo kwenye safu ya kwanza, tunapaswa kwanza kuelezea ramani kutoka kwa kila darasa hadi nambari, kabla ya kubadilisha\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZC5URJG3o1ES", "outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID class\n", "0 10 0\n", "1 20 2\n", "2 30 1\n", "3 40 1\n", "4 50 1\n", "5 60 0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_labels = {'business class':0,'economy class':1,'first class':2}\n", "label['class'] = label['class'].replace(class_labels)\n", "label" ] }, { "cell_type": "markdown", "metadata": { "id": "ftnF-TyapOPt" }, "source": [ "Kama tunavyoona, matokeo yanalingana na tulivyotarajia. Kwa hivyo, ni lini tunatumia label encoding? Label encoding hutumika katika hali moja au zote zifuatazo: \n", "1. Wakati idadi ya kategoria ni kubwa \n", "2. Wakati kategoria ziko katika mpangilio. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "eQPAPVwsqWT7" }, "source": [ "**UJUMUISHAJI WA ONE HOT**\n", "\n", "Aina nyingine ya ujumishaji ni Ujumishaji wa One Hot. Katika aina hii ya ujumishaji, kila kategoria ya safu huongezwa kama safu tofauti, na kila kipengele cha data hupata 0 au 1 kulingana na kama kina kategoria hiyo au la. Kwa hivyo, ikiwa kuna kategoria n tofauti, safu n zitaongezwa kwenye dataframe.\n", "\n", "Kwa mfano, hebu tuchukue mfano ule ule wa daraja la ndege. Kategoria zilikuwa: ['business class', 'economy class', 'first class']. Kwa hivyo, tukifanya ujumishaji wa one hot, safu tatu zifuatazo zitaongezwa kwenye seti ya data: ['class_business class', 'class_economy class', 'class_first class'].\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "ZM0eVh0ArKUL", "outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID class\n", "0 10 business class\n", "1 20 first class\n", "2 30 economy class\n", "3 40 economy class\n", "4 50 economy class\n", "5 60 business class" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot = pd.DataFrame([\n", " [10,'business class'],\n", " [20,'first class'],\n", " [30, 'economy class'],\n", " [40, 'economy class'],\n", " [50, 'economy class'],\n", " [60, 'business class']\n", "],columns=['ID','class'])\n", "one_hot" ] }, { "cell_type": "markdown", "metadata": { "id": "aVnZ7paDrWmb" }, "source": [ "Tufanye usimbaji wa one hot kwenye safu ya kwanza\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "RUPxf7egrYKr" }, "outputs": [], "source": [ "one_hot_data = pd.get_dummies(one_hot,columns=['class'])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "TM37pHsFr4ge", "outputId": "7be15f53-79b2-447a-979c-822658339a9e" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID class_business class class_economy class class_first class\n", "0 10 1 0 0\n", "1 20 0 0 1\n", "2 30 0 1 0\n", "3 40 0 1 0\n", "4 50 0 1 0\n", "5 60 1 0 0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_data" ] }, { "cell_type": "markdown", "metadata": { "id": "_zXRLOjXujdA" }, "source": [ "Kila safu iliyo na hot encoding ina 0 au 1, ambayo inaonyesha kama jamii hiyo ipo kwa data hiyo.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bDnC4NQOu0qr" }, "source": [ "Tunatumia one hot encoding lini? One hot encoding hutumika katika mojawapo au zote za hali zifuatazo:\n", "\n", "1. Wakati idadi ya kategoria na ukubwa wa seti ya data ni ndogo.\n", "2. Wakati kategoria hazifuati mpangilio maalum.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnUmci_4uvyu" }, "source": [ "> Mambo Muhimu:\n", "1. Usimbaji hufanywa kubadilisha data isiyo ya nambari kuwa data ya nambari.\n", "2. Kuna aina mbili za usimbaji: Usimbaji wa Lebo na Usimbaji wa One Hot, zote zinaweza kufanywa kulingana na mahitaji ya seti ya data.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "K8UXOJYRgRsJ" }, "source": [ "## Kuondoa data inayojirudia\n", "\n", "> **Lengo la kujifunza:** Kufikia mwisho wa sehemu hii, unapaswa kuwa na ujuzi wa kutambua na kuondoa thamani zinazojirudia kutoka kwa DataFrames.\n", "\n", "Mbali na data inayokosekana, mara nyingi utakutana na data inayojirudia katika seti za data za maisha halisi. Kwa bahati nzuri, pandas inatoa njia rahisi ya kugundua na kuondoa maingizo yanayojirudia.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qrEG-Wa0gRsJ" }, "source": [ "### Kutambua nakala: `duplicated`\n", "\n", "Unaweza kutambua kwa urahisi thamani zinazojirudia kwa kutumia mbinu ya `duplicated` katika pandas, ambayo inarudisha maski ya Boolean inayoonyesha kama ingizo katika `DataFrame` ni nakala ya ingizo la awali. 
Hebu tuunde mfano mwingine wa `DataFrame` ili kuona hii ikifanya kazi.\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "ZLu6FEnZgRsJ", "outputId": "376512d1-d842-4db1-aea3-71052aeeecaf", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "2 A 1\n", "3 B 3\n", "4 B 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n", " 'numbers': [1, 2, 1, 3, 3]})\n", "example6" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cIduB5oBgRsK", "outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2", "trusted": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "4 True\n", "dtype: bool" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.duplicated()" ] }, { "cell_type": "markdown", "metadata": { "id": "0eDRJD4SgRsK" }, "source": [ "### Kuondoa nakala: `drop_duplicates`\n", "`drop_duplicates` inarudisha tu nakala ya data ambayo thamani zote za `duplicated` ni `False`:\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "w_YPpqIqgRsK", "outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2\n", "3 B 3" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": { "id": "69AqoCZAgRsK" }, "source": [ "Zote `duplicated` na `drop_duplicates` kwa chaguo-msingi huzingatia safu zote lakini unaweza kubainisha kwamba zichunguze tu sehemu ndogo ya safu katika `DataFrame` yako:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "BILjDs67gRsK", "outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3", "trusted": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " letters numbers\n", "0 A 1\n", "1 B 2" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example6.drop_duplicates(['letters'])" ] }, { "cell_type": "markdown", "metadata": { "id": "GvX4og1EgRsL" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n---\n\n**Kanusho**: \nHati hii imetafsiriwa kwa kutumia huduma ya tafsiri ya AI [Co-op Translator](https://github.com/Azure/co-op-translator). Ingawa tunajitahidi kwa usahihi, tafadhali fahamu kuwa tafsiri za kiotomatiki zinaweza kuwa na makosa au kutokuwa sahihi. Hati ya asili katika lugha yake ya awali inapaswa kuzingatiwa kama chanzo cha mamlaka. Kwa taarifa muhimu, inashauriwa kutumia tafsiri ya kitaalamu ya binadamu. Hatutawajibika kwa maelewano mabaya au tafsiri zisizo sahihi zinazotokana na matumizi ya tafsiri hii.\n" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "name": "notebook.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "coopTranslator": { "original_hash": "8533b3a2230311943339963fc7f04c21", "translation_date": "2025-09-02T08:02:30+00:00", "source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb", "language_code": "sw" } }, "nbformat": 4, "nbformat_minor": 0 }