You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Data-Science-For-Beginners/translations/sk/2-Working-With-Data/08-data-preparation/notebook.ipynb

3695 lines
103 KiB

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "rQ8UhzFpgRra"
},
"source": [
"# Príprava dát\n",
"\n",
"[Originálny zdroj notebooku z *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio od Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n",
"\n",
"## Preskúmanie informácií o `DataFrame`\n",
"\n",
"> **Cieľ učenia:** Na konci tejto podsekcie by ste mali byť schopní pohodlne nájsť všeobecné informácie o dátach uložených v pandas DataFrame.\n",
"\n",
"Keď načítate svoje dáta do pandas, je veľmi pravdepodobné, že budú uložené v `DataFrame`. Ak však má dátová sada vo vašom `DataFrame` 60 000 riadkov a 400 stĺpcov, ako vôbec začať získavať prehľad o tom, s čím pracujete? Našťastie pandas poskytuje niekoľko praktických nástrojov na rýchle získanie celkových informácií o `DataFrame`, ako aj na zobrazenie prvých a posledných niekoľkých riadkov.\n",
"\n",
"Aby sme preskúmali túto funkcionalitu, importujeme knižnicu Python scikit-learn a použijeme ikonickú dátovú sadu, ktorú každý dátový vedec videl už stokrát: dátovú sadu britského biológa Ronalda Fishera *Iris*, použitú v jeho článku z roku 1936 \"Použitie viacerých meraní v taxonomických problémoch\":\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"id": "hB1RofhdgRrp",
"trusted": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AGA0A_Y8hMdz"
},
"source": [
"### `DataFrame.shape`\n",
"Načítali sme dataset Iris do premennej `iris_df`. Predtým, než sa pustíme do analýzy údajov, by bolo užitočné zistiť, koľko dátových bodov máme a aká je celková veľkosť datasetu. Je užitočné pozrieť sa na objem údajov, s ktorými pracujeme.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LOe5jQohhulf",
"outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9"
},
"outputs": [
{
"data": {
"text/plain": [
"(150, 4)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "smE7AGzOhxk2"
},
"source": [
"Takže máme do činenia so 150 riadkami a 4 stĺpcami údajov. Každý riadok predstavuje jeden dátový bod a každý stĺpec predstavuje jednu vlastnosť spojenú s dátovým rámcom. V podstate teda ide o 150 dátových bodov, z ktorých každý obsahuje 4 vlastnosti.\n",
"\n",
"`shape` je tu atribútom dátového rámca, nie funkciou, a preto nekončí dvojicou zátvoriek.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d3AZKs0PinGP"
},
"source": [
"### `DataFrame.columns`\n",
"Pozrime sa teraz na 4 stĺpce údajov. Čo presne každý z nich predstavuje? Atribút `columns` nám poskytne názvy stĺpcov v dataframe.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "YPGh_ziji-CY",
"outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n",
" 'petal width (cm)'],\n",
" dtype='object')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TsobcU_VjCC_"
},
"source": [
"Ako môžeme vidieť, sú tu štyri (4) stĺpce. Atribút `columns` nám hovorí názvy stĺpcov a v podstate nič viac. Tento atribút nadobúda význam, keď chceme identifikovať vlastnosti, ktoré dataset obsahuje.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2UTlvkjmgRrs"
},
"source": [
"### `DataFrame.info`\n",
"Množstvo údajov (dané atribútom `shape`) a názvy vlastností alebo stĺpcov (dané atribútom `columns`) nám poskytujú určitý pohľad na dataset. Teraz by sme chceli ísť hlbšie do datasetu. Funkcia `DataFrame.info()` je na to veľmi užitočná.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dHHRyG0_gRrt",
"outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5",
"trusted": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 150 entries, 0 to 149\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 sepal length (cm) 150 non-null float64\n",
" 1 sepal width (cm) 150 non-null float64\n",
" 2 petal length (cm) 150 non-null float64\n",
" 3 petal width (cm) 150 non-null float64\n",
"dtypes: float64(4)\n",
"memory usage: 4.8 KB\n"
]
}
],
"source": [
"iris_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1XgVMpvigRru"
},
"source": [
"Z tohto môžeme urobiť niekoľko pozorovaní: \n",
"1. Dátový typ každého stĺpca: V tomto datasete sú všetky údaje uložené ako 64-bitové čísla s pohyblivou desatinnou čiarkou. \n",
"2. Počet nenulových hodnôt: Práca s nulovými hodnotami je dôležitým krokom pri príprave údajov. Týmto sa budeme zaoberať neskôr v notebooku. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IYlyxbpWFEF4"
},
"source": [
"### DataFrame.describe()\n",
"Predstavme si, že máme veľa číselných údajov v našej dátovej sade. Jednovariabilné štatistické výpočty, ako napríklad priemer, medián, kvartily a podobne, je možné vykonať na každom stĺpci samostatne. Funkcia `DataFrame.describe()` nám poskytuje štatistický prehľad číselných stĺpcov dátovej sady.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"id": "tWV-CMstFIRA",
"outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>5.843333</td>\n",
" <td>3.057333</td>\n",
" <td>3.758000</td>\n",
" <td>1.199333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.828066</td>\n",
" <td>0.435866</td>\n",
" <td>1.765298</td>\n",
" <td>0.762238</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>4.300000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.100000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>5.100000</td>\n",
" <td>2.800000</td>\n",
" <td>1.600000</td>\n",
" <td>0.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>5.800000</td>\n",
" <td>3.000000</td>\n",
" <td>4.350000</td>\n",
" <td>1.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.400000</td>\n",
" <td>3.300000</td>\n",
" <td>5.100000</td>\n",
" <td>1.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.900000</td>\n",
" <td>4.400000</td>\n",
" <td>6.900000</td>\n",
" <td>2.500000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"count 150.000000 150.000000 150.000000 150.000000\n",
"mean 5.843333 3.057333 3.758000 1.199333\n",
"std 0.828066 0.435866 1.765298 0.762238\n",
"min 4.300000 2.000000 1.000000 0.100000\n",
"25% 5.100000 2.800000 1.600000 0.300000\n",
"50% 5.800000 3.000000 4.350000 1.300000\n",
"75% 6.400000 3.300000 5.100000 1.800000\n",
"max 7.900000 4.400000 6.900000 2.500000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zjjtW5hPGMuM"
},
"source": [
"Výstup vyššie zobrazuje celkový počet dátových bodov, priemer, smerodajnú odchýlku, minimum, dolný kvartil (25 %), medián (50 %), horný kvartil (75 %) a maximálnu hodnotu každého stĺpca.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-lviAu99gRrv"
},
"source": [
"### `DataFrame.head`\n",
"S použitím všetkých vyššie uvedených funkcií a atribútov sme získali prehľad o dátovom súbore. Vieme, koľko dátových bodov obsahuje, koľko má vlastností, aký je dátový typ každej vlastnosti a koľko neprázdnych hodnôt má každá vlastnosť.\n",
"\n",
"Teraz je čas pozrieť sa na samotné dáta. Pozrime sa, ako vyzerajú prvé riadky (prvé dátové body) nášho `DataFrame`:\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "DZMJZh0OgRrw",
"outputId": "d9393ee5-c106-4797-f815-218f17160e00",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EBHEimZuEFQK"
},
"source": [
"Ako výstup tu vidíme päť (5) záznamov z datasetu. Ak sa pozrieme na index vľavo, zistíme, že ide o prvých päť riadkov.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oj7GkrTdgRry"
},
"source": [
"### Cvičenie:\n",
"\n",
"Z vyššie uvedeného príkladu je zrejmé, že predvolene `DataFrame.head` vráti prvých päť riadkov `DataFrame`. V bunkách kódu nižšie, dokážete nájsť spôsob, ako zobraziť viac ako päť riadkov?\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"id": "EKRmRFFegRrz",
"trusted": false
},
"outputs": [],
"source": [
"# Hint: Consult the documentation by using iris_df.head?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BJ_cpZqNgRr1"
},
"source": [
"### `DataFrame.tail`\n",
"Ďalší spôsob, ako sa pozrieť na údaje, môže byť od konca (namiesto od začiatku). Opačnou funkciou k `DataFrame.head` je `DataFrame.tail`, ktorá vráti posledných päť riadkov `DataFrame`:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
},
"id": "heanjfGWgRr2",
"outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" <td>1.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" <td>2.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" <td>1.8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"145 6.7 3.0 5.2 2.3\n",
"146 6.3 2.5 5.0 1.9\n",
"147 6.5 3.0 5.2 2.0\n",
"148 6.2 3.4 5.4 2.3\n",
"149 5.9 3.0 5.1 1.8"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "31kBWfyLgRr3"
},
"source": [
"V praxi je užitočné mať možnosť jednoducho prezrieť prvých pár riadkov alebo posledných pár riadkov `DataFrame`, najmä keď hľadáte odľahlé hodnoty v usporiadaných datasetoch.\n",
"\n",
"Všetky funkcie a atribúty uvedené vyššie pomocou príkladov kódu nám pomáhajú získať pohľad na dáta.\n",
"\n",
"> **Poučenie:** Už len pohľad na metadáta o informáciách v `DataFrame` alebo na prvé a posledné hodnoty v ňom vám môže okamžite poskytnúť predstavu o veľkosti, tvare a obsahu dát, s ktorými pracujete.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TvurZyLSDxq_"
},
"source": [
"### Chýbajúce údaje\n",
"Pozrime sa na chýbajúce údaje. Chýbajúce údaje vznikajú, keď v niektorých stĺpcoch nie je uložená žiadna hodnota.\n",
"\n",
"Uveďme si príklad: povedzme, že niekto je citlivý na svoju váhu a nevyplní pole s váhou v prieskume. Potom bude hodnota váhy pre túto osobu chýbať.\n",
"\n",
"Vo väčšine prípadov sa v reálnych datasetoch vyskytujú chýbajúce hodnoty.\n",
"\n",
"**Ako Pandas spracováva chýbajúce údaje**\n",
"\n",
"Pandas spracováva chýbajúce hodnoty dvoma spôsobmi. Prvý ste už videli v predchádzajúcich sekciách: `NaN`, alebo Not a Number. Toto je špeciálna hodnota, ktorá je súčasťou špecifikácie IEEE pre čísla s pohyblivou desatinnou čiarkou a používa sa iba na označenie chýbajúcich hodnôt typu float.\n",
"\n",
"Pre chýbajúce hodnoty, ktoré nie sú typu float, pandas používa objekt `None` z Pythonu. Aj keď sa môže zdať mätúce, že sa stretnete s dvoma rôznymi typmi hodnôt, ktoré v podstate vyjadrujú to isté, existujú dobré programátorské dôvody pre toto rozhodnutie. V praxi tento prístup umožňuje pandas dosiahnuť dobrý kompromis vo väčšine prípadov. Napriek tomu majú `None` aj `NaN` určité obmedzenia, na ktoré si musíte dávať pozor, pokiaľ ide o ich použitie.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lOHqUlZFgRr5"
},
"source": [
"### `None`: nečíselné chýbajúce údaje\n",
"Keďže `None` pochádza z Pythonu, nemôže byť použitý v poliach NumPy a pandas, ktoré nemajú dátový typ `'object'`. Pamätajte, že polia NumPy (a dátové štruktúry v pandas) môžu obsahovať iba jeden typ údajov. Práve toto im dáva obrovskú silu pri práci s veľkými dátami a výpočtami, no zároveň to obmedzuje ich flexibilitu. Takéto polia musia byť \"povýšené\" na „najnižší spoločný menovateľ“, teda dátový typ, ktorý dokáže obsiahnuť všetko v poli. Ak je v poli `None`, znamená to, že pracujete s Python objektmi.\n",
"\n",
"Aby ste to videli v praxi, zvážte nasledujúce príkladové pole (všimnite si jeho `dtype`):\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QIoNdY4ngRr7",
"outputId": "92779f18-62f4-4a03-eca2-e9a101604336",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, None, 6, 8], dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"example1 = np.array([2, None, 6, 8])\n",
"example1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pdlgPNbhgRr7"
},
"source": [
"Realita povýšených dátových typov prináša so sebou dva vedľajšie účinky. Po prvé, operácie sa budú vykonávať na úrovni interpretovaného Python kódu namiesto skompilovaného NumPy kódu. V podstate to znamená, že akékoľvek operácie zahŕňajúce `Series` alebo `DataFrames`, ktoré obsahujú `None`, budú pomalšie. Aj keď si tento pokles výkonu pravdepodobne nevšimnete, pri veľkých datasetoch by to mohlo predstavovať problém.\n",
"\n",
"Druhý vedľajší účinok vyplýva z prvého. Keďže `None` v podstate vracia `Series` alebo `DataFrame`s späť do sveta základného Pythonu, použitie agregácií NumPy/pandas, ako napríklad `sum()` alebo `min()`, na poliach obsahujúcich hodnotu `None` vo všeobecnosti spôsobí chybu:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 292
},
"id": "gWbx-KB9gRr8",
"outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50",
"trusted": false
},
"outputs": [
{
"ename": "TypeError",
"evalue": "ignored",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-10-ce9901ad18bd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n",
"\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'"
]
}
],
"source": [
"example1.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LcEwO8UogRr9"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "pWvVHvETgRr9"
},
"source": [
"### `NaN`: chýbajúce hodnoty typu float\n",
"\n",
"Na rozdiel od `None`, NumPy (a teda aj pandas) podporuje `NaN` pre svoje rýchle, vektorové operácie a ufuncs. Zlou správou je, že akákoľvek aritmetická operácia vykonaná na `NaN` vždy vedie k výsledku `NaN`. Napríklad:\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rcFYfMG9gRr9",
"outputId": "699e81b7-5c11-4b46-df1d-06071768690f",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan + 1"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BW3zQD2-gRr-",
"outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan * 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fU5IPRcCgRr-"
},
"source": [
"Dobrá správa: agregácie spustené na poliach s `NaN` v nich nespôsobujú chyby. Zlá správa: výsledky nie sú jednotne užitočné:\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LCInVgSSgRr_",
"outputId": "fa06495a-0930-4867-87c5-6023031ea8b5",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"(nan, nan, nan)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example2 = np.array([2, np.nan, 6, 8]) \n",
"example2.sum(), example2.min(), example2.max()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nhlnNJT7gRr_"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true,
"id": "yan3QRaOgRr_",
"trusted": false
},
"outputs": [],
"source": [
"# What happens if you add np.nan and None together?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_iDvIRC8gRsA"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "kj6EKdsAgRsA"
},
"source": [
"### `NaN` a `None`: nulové hodnoty v pandas\n",
"\n",
"Aj keď sa `NaN` a `None` môžu správať trochu odlišne, pandas je navrhnutý tak, aby ich spracovával zameniteľne. Aby sme to ukázali, pozrime sa na `Series` celých čísel:\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Nji-KGdNgRsA",
"outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"int_series = pd.Series([1, 2, 3], dtype=int)\n",
"int_series"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WklCzqb8gRsB"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true,
"id": "Cy-gqX5-gRsB",
"trusted": false
},
"outputs": [],
"source": [
"# Now set an element of int_series equal to None.\n",
"# How does that element show up in the Series?\n",
"# What is the dtype of the Series?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WjMQwltNgRsB"
},
"source": [
"Pri upgradovaní dátových typov na zabezpečenie homogénnosti údajov v `Series` a `DataFrame`s, pandas ochotne prepína chýbajúce hodnoty medzi `None` a `NaN`. Kvôli tejto dizajnovej vlastnosti môže byť užitočné vnímať `None` a `NaN` ako dva rôzne typy \"null\" hodnôt v pandas. V skutočnosti niektoré základné metódy, ktoré budete používať na prácu s chýbajúcimi hodnotami v pandas, túto myšlienku odrážajú vo svojich názvoch:\n",
"\n",
"- `isnull()`: Generuje Booleovskú masku označujúcu chýbajúce hodnoty\n",
"- `notnull()`: Opak metódy `isnull()`\n",
"- `dropna()`: Vráti filtrovanú verziu údajov\n",
"- `fillna()`: Vráti kópiu údajov s vyplnenými alebo imputovanými chýbajúcimi hodnotami\n",
"\n",
"Tieto metódy sú dôležité na zvládnutie a získanie istoty pri ich používaní, preto si ich prejdime podrobnejšie.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yh5ifd9FgRsB"
},
"source": [
"### Detekcia nulových hodnôt\n",
"\n",
"Teraz, keď sme pochopili dôležitosť chýbajúcich hodnôt, musíme ich v našej dátovej sade najskôr odhaliť, než s nimi začneme pracovať. \n",
"Metódy `isnull()` a `notnull()` sú vaše hlavné nástroje na detekciu nulových údajov. Obe vracajú Booleovské masky nad vašimi údajmi.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true,
"id": "e-vFp5lvgRsC",
"trusted": false
},
"outputs": [],
"source": [
"example3 = pd.Series([0, np.nan, '', None])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1XdaJJ7PgRsC",
"outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 True\n",
"2 False\n",
"3 True\n",
"dtype: bool"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example3.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PaSZ0SQygRsC"
},
"source": [
"Pozorne sa pozrite na výstup. Prekvapilo vás niečo? Aj keď `0` je aritmetická nula, stále je to úplne platné celé číslo a pandas ho tak aj považuje. `''` je o niečo jemnejšie. Aj keď sme ho v časti 1 použili na reprezentáciu prázdnej textovej hodnoty, stále ide o objekt typu string a nie o reprezentáciu nulovej hodnoty z pohľadu pandas.\n",
"\n",
"Teraz to otočme a použime tieto metódy spôsobom, akým ich budete používať v praxi. Boolean masky môžete použiť priamo ako index ``Series`` alebo ``DataFrame``, čo môže byť užitočné pri práci s izolovanými chýbajúcimi (alebo prítomnými) hodnotami.\n",
"\n",
"Ak chceme zistiť celkový počet chýbajúcich hodnôt, stačí nám urobiť súčet nad maskou, ktorú vytvorí metóda `isnull()`.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JCcQVoPkHDUv",
"outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d"
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example3.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PlBqEo3mgRsC"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true,
"id": "ggDVf5uygRsD",
"trusted": false
},
"outputs": [],
"source": [
"# Try running example3[example3.notnull()].\n",
"# Before you do so, what do you expect to see?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D_jWN7mHgRsD"
},
"source": [
"**Hlavný bod**: Metódy `isnull()` a `notnull()` prinášajú podobné výsledky, keď ich použijete v DataFrame: zobrazujú výsledky a index týchto výsledkov, čo vám nesmierne pomôže pri práci s vašimi údajmi.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BvnoojWsgRr4"
},
"source": [
"### Riešenie chýbajúcich údajov\n",
"\n",
"> **Cieľ učenia:** Na konci tejto podsekcie by ste mali vedieť, ako a kedy nahradiť alebo odstrániť nulové hodnoty z DataFrames.\n",
"\n",
"Modely strojového učenia nedokážu samy pracovať s chýbajúcimi údajmi. Preto je potrebné pred odovzdaním údajov do modelu vyriešiť tieto chýbajúce hodnoty.\n",
"\n",
"Spôsob, akým sa chýbajúce údaje riešia, prináša jemné kompromisy a môže ovplyvniť vašu konečnú analýzu a výsledky v reálnom svete.\n",
"\n",
"Existujú dva hlavné spôsoby riešenia chýbajúcich údajov:\n",
"\n",
"1. Odstrániť riadok obsahujúci chýbajúcu hodnotu\n",
"2. Nahradiť chýbajúcu hodnotu inou hodnotou\n",
"\n",
"Obe tieto metódy a ich výhody a nevýhody si podrobne rozoberieme.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3VaYC1TvgRsD"
},
"source": [
"### Odstraňovanie nulových hodnôt\n",
"\n",
"Množstvo údajov, ktoré posúvame do nášho modelu, má priamy vplyv na jeho výkon. Odstraňovanie nulových hodnôt znamená, že znižujeme počet dátových bodov, a tým aj veľkosť datasetu. Preto je vhodné odstrániť riadky s nulovými hodnotami, keď je dataset pomerne veľký.\n",
"\n",
"Ďalším prípadom môže byť situácia, keď určitý riadok alebo stĺpec obsahuje veľa chýbajúcich hodnôt. V takom prípade ich možno odstrániť, pretože by nepriniesli veľkú hodnotu do našej analýzy, keďže väčšina údajov pre daný riadok/stĺpec chýba.\n",
"\n",
"Okrem identifikácie chýbajúcich hodnôt poskytuje pandas pohodlný spôsob na odstránenie nulových hodnôt zo `Series` a `DataFrame`. Aby sme to videli v praxi, vráťme sa k `example3`. Funkcia `DataFrame.dropna()` pomáha pri odstraňovaní riadkov s nulovými hodnotami.\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7uIvS097gRsD",
"outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"2 \n",
"dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example3 = example3.dropna()\n",
"example3"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hil2cr64gRsD"
},
"source": [
"Všimnite si, že toto by malo vyzerať ako váš výstup z `example3[example3.notnull()]`. Rozdiel je v tom, že namiesto indexovania na základe maskovaných hodnôt, `dropna` odstránil tieto chýbajúce hodnoty zo `Series` `example3`.\n",
"\n",
"Keďže DataFrames majú dve dimenzie, ponúkajú viac možností na odstraňovanie údajov.\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "an-l74sPgRsE",
"outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 7\n",
"1 2.0 5.0 8\n",
"2 NaN 6.0 9"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4 = pd.DataFrame([[1, np.nan, 7], \n",
" [2, 5, 8], \n",
" [np.nan, 6, 9]])\n",
"example4"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "66wwdHZrgRsE"
},
"source": [
"(Všimli ste si, že pandas zmenil typ dvoch stĺpcov na float, aby mohli obsahovať hodnoty `NaN`?)\n",
"\n",
"Nemôžete odstrániť jednu hodnotu z `DataFrame`, takže musíte odstrániť celé riadky alebo stĺpce. V závislosti od toho, čo robíte, môžete chcieť vykonať jedno alebo druhé, a preto vám pandas poskytuje možnosti pre obe. Keďže v dátovej vede stĺpce zvyčajne predstavujú premenné a riadky predstavujú pozorovania, je pravdepodobnejšie, že budete odstraňovať riadky dát; predvolené nastavenie pre `dropna()` je odstrániť všetky riadky, ktoré obsahujú akékoľvek nulové hodnoty:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "jAVU24RXgRsE",
"outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"1 2.0 5.0 8"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TrQRBuTDgRsE"
},
"source": [
"Ak je to potrebné, môžete odstrániť hodnoty NA zo stĺpcov. Použite `axis=1` na to:\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "GrBhxu9GgRsE",
"outputId": "ff4001f3-2e61-4509-d60e-0093d1068437",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 2\n",
"0 7\n",
"1 8\n",
"2 9"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4.dropna(axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KWXiKTfMgRsF"
},
"source": [
"Upozorňujeme, že toto môže odstrániť veľa údajov, ktoré by ste mohli chcieť zachovať, najmä v menších datasetoch. Čo ak chcete odstrániť iba riadky alebo stĺpce, ktoré obsahujú niekoľko alebo dokonca všetky nulové hodnoty? Tieto nastavenia môžete špecifikovať v `dropna` pomocou parametrov `how` a `thresh`.\n",
"\n",
"Predvolene je nastavené `how='any'` (ak by ste si to chceli overiť sami alebo zistiť, aké ďalšie parametre táto metóda má, spustite `example4.dropna?` v bunkách kódu). Alternatívne môžete špecifikovať `how='all'`, aby sa odstránili iba riadky alebo stĺpce, ktoré obsahujú všetky nulové hodnoty. Rozšírme náš príklad `DataFrame`, aby sme si to ukázali v praxi v nasledujúcom cvičení.\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "Bcf_JWTsgRsF",
"outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 NaN 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 NaN 6.0 9 NaN"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4[3] = np.nan\n",
"example4"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pNZer7q9JPNC"
},
"source": [
"> Kľúčové poznatky: \n",
"1. Odstraňovanie nulových hodnôt je dobrý nápad iba v prípade, že dataset je dostatočne veľký. \n",
"2. Celé riadky alebo stĺpce môžu byť odstránené, ak im chýba väčšina údajov. \n",
"3. Metóda `DataFrame.dropna(axis=)` pomáha pri odstraňovaní nulových hodnôt. Argument `axis` určuje, či sa majú odstrániť riadky alebo stĺpce. \n",
"4. Môže sa použiť aj argument `how`. Predvolene je nastavený na hodnotu `any`. To znamená, že odstráni iba tie riadky/stĺpce, ktoré obsahujú akékoľvek nulové hodnoty. Môže byť nastavený na hodnotu `all`, aby sa špecifikovalo, že odstránime iba tie riadky/stĺpce, kde sú všetky hodnoty nulové. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oXXSfQFHgRsF"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true,
"id": "ExUwQRxpgRsF",
"trusted": false
},
"outputs": [],
"source": [
"# How might you go about dropping just column 3?\n",
"# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "38kwAihWgRsG"
},
"source": [
"Parameter `thresh` vám poskytuje jemnejšiu kontrolu: nastavíte počet *nenulových* hodnôt, ktoré musí riadok alebo stĺpec obsahovať, aby bol zachovaný:\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "M9dCNMaagRsG",
"outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"1 2.0 5.0 8 NaN"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4.dropna(axis='rows', thresh=3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fmSFnzZegRsG"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "mCcxLGyUgRsG"
},
"source": [
"### Vyplnenie nulových hodnôt\n",
"\n",
"Niekedy má zmysel vyplniť chýbajúce hodnoty takými, ktoré by mohli byť platné. Existuje niekoľko techník na vyplnenie nulových hodnôt. Prvou je použitie doménových znalostí (znalostí o téme, na ktorej je dataset založený) na približné určenie chýbajúcich hodnôt.\n",
"\n",
"Môžete použiť `isnull` na vykonanie tejto operácie priamo, ale to môže byť zdĺhavé, najmä ak máte veľa hodnôt na vyplnenie. Keďže ide o tak bežnú úlohu v dátovej vede, pandas poskytuje funkciu `fillna`, ktorá vráti kópiu `Series` alebo `DataFrame` s chýbajúcimi hodnotami nahradenými hodnotami podľa vášho výberu. Vytvorme si ďalší príklad `Series`, aby sme videli, ako to funguje v praxi.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CE8S7louLezV"
},
"source": [
"### Kategorické údaje (Nenumerické)\n",
"Najprv sa pozrime na nenumerické údaje. V datasetoch máme stĺpce s kategorizovanými údajmi, napr. Pohlavie, Pravda alebo Nepravda atď.\n",
"\n",
"Vo väčšine týchto prípadov nahrádzame chýbajúce hodnoty `modom` stĺpca. Povedzme, že máme 100 dátových bodov, z ktorých 90 uviedlo Pravda, 8 uviedlo Nepravda a 2 nevyplnili. Potom môžeme tieto 2 nahradiť hodnotou Pravda, berúc do úvahy celý stĺpec.\n",
"\n",
"Opäť tu môžeme využiť znalosti z danej oblasti. Pozrime sa na príklad vyplnenia pomocou modu.\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "MY5faq4yLdpQ",
"outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 2 True\n",
"1 3 4 None\n",
"2 5 6 False\n",
"3 7 8 True\n",
"4 9 10 True"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n",
" [3,4,None],\n",
" [5,6,\"False\"],\n",
" [7,8,\"True\"],\n",
" [9,10,\"True\"]])\n",
"\n",
"fill_with_mode"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MLAoMQOfNPlA"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WKy-9Y2tN5jv",
"outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf"
},
"outputs": [
{
"data": {
"text/plain": [
"True 3\n",
"False 1\n",
"Name: 2, dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_mode[2].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6iNz_zG_OKrx"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "TxPKteRvNPOs"
},
"outputs": [],
"source": [
"fill_with_mode[2].fillna('True',inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "tvas7c9_OPWE",
"outputId": "ec3c8e44-d644-475e-9e22-c65101965850"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 2 True\n",
"1 3 4 True\n",
"2 5 6 False\n",
"3 7 8 True\n",
"4 9 10 True"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_mode"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SktitLxxOR16"
},
"source": [
"Ako vidíme, nulová hodnota bola nahradená. Netreba dodávať, že sme mohli napísať čokoľvek namiesto `'True'` a bolo by to nahradené.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "heYe1I0dOmQ_"
},
"source": [
"### Číselné údaje\n",
"Teraz sa pozrime na číselné údaje. Tu máme dva bežné spôsoby nahrádzania chýbajúcich hodnôt:\n",
"\n",
"1. Nahradiť mediánom riadku\n",
"2. Nahradiť priemerom riadku\n",
"\n",
"Medián používame v prípade, že údaje sú skreslené a obsahujú odľahlé hodnoty. Je to preto, že medián je odolný voči odľahlým hodnotám.\n",
"\n",
"Keď sú údaje normalizované, môžeme použiť priemer, pretože v takom prípade budú priemer a medián veľmi podobné.\n",
"\n",
"Najprv si vezmime stĺpec, ktorý má normálne rozdelenie, a vyplňme chýbajúce hodnoty priemerom stĺpca.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "09HM_2feOj5Y",
"outputId": "7e309013-9acb-411c-9b06-4de795bbeeff"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-2.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.0</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 -2.0 0 1\n",
"1 -1.0 2 3\n",
"2 NaN 4 5\n",
"3 1.0 6 7\n",
"4 2.0 8 9"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_mean = pd.DataFrame([[-2,0,1],\n",
" [-1,2,3],\n",
" [np.nan,4,5],\n",
" [1,6,7],\n",
" [2,8,9]])\n",
"\n",
"fill_with_mean"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ka7-wNfzSxbx"
},
"source": [
"Priemer stĺpca je\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XYtYEf5BSxFL",
"outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70"
},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(fill_with_mean[0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oBSRGxKRS39K"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "FzncQLmuS5jh",
"outputId": "00f74fff-01f4-4024-c261-796f50f01d2e"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-2.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.0</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 -2.0 0 1\n",
"1 -1.0 2 3\n",
"2 0.0 4 5\n",
"3 1.0 6 7\n",
"4 2.0 8 9"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n",
"fill_with_mean"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CwpVFCrPTC5z"
},
"source": [
"Ako môžeme vidieť, chýbajúca hodnota bola nahradená jej priemerom.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jIvF13a1i00Z"
},
"source": [
"Teraz vyskúšajme ďalší dataframe a tentokrát nahradíme hodnoty None mediánom stĺpca.\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "DA59Bqo3jBYZ",
"outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-2</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>8.0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 -2 0.0 1\n",
"1 -1 2.0 3\n",
"2 0 NaN 5\n",
"3 1 6.0 7\n",
"4 2 8.0 9"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_median = pd.DataFrame([[-2,0,1],\n",
" [-1,2,3],\n",
" [0,np.nan,5],\n",
" [1,6,7],\n",
" [2,8,9]])\n",
"\n",
"fill_with_median"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mM1GpXYmjHnc"
},
"source": [
"Medián druhého stĺpca je\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uiDy5v3xjHHX",
"outputId": "564b6b74-2004-4486-90d4-b39330a64b88"
},
"outputs": [
{
"data": {
"text/plain": [
"4.0"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_median[1].median()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "z9PLF75Jj_1s"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "lFKbOxCMkBbg",
"outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-2</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>4.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>8.0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 -2 0.0 1\n",
"1 -1 2.0 3\n",
"2 0 4.0 5\n",
"3 1 6.0 7\n",
"4 2 8.0 9"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n",
"fill_with_median"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8JtQ53GSkKWC"
},
"source": [
"Ako môžeme vidieť, hodnota NaN bola nahradená mediánom stĺpca\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0ybtWLDdgRsG",
"outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.0\n",
"b NaN\n",
"c 2.0\n",
"d NaN\n",
"e 3.0\n",
"dtype: float64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n",
"example5"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yrsigxRggRsH"
},
"source": [
"Všetky prázdne položky môžete vyplniť jednou hodnotou, napríklad `0`:\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KXMIPsQdgRsH",
"outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.0\n",
"b 0.0\n",
"c 2.0\n",
"d 0.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example5.fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RRlI5f_hkfKe"
},
"source": [
"> Hlavné body:\n",
"1. Dopĺňanie chýbajúcich hodnôt by sa malo vykonávať buď v prípade, že je málo údajov, alebo ak existuje stratégia na doplnenie chýbajúcich údajov.\n",
"2. Na doplnenie chýbajúcich hodnôt je možné využiť znalosti z danej oblasti a odhadnúť ich.\n",
"3. Pri kategóriálnych údajoch sa chýbajúce hodnoty najčastejšie nahrádzajú módou stĺpca.\n",
"4. Pri číselných údajoch sa chýbajúce hodnoty zvyčajne dopĺňajú priemerom (pre normalizované datasety) alebo mediánom stĺpcov.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FI9MmqFJgRsH"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true,
"id": "af-ezpXdgRsH",
"trusted": false
},
"outputs": [],
"source": [
"# What happens if you try to fill null values with a string, like ''?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kq3hw1kLgRsI"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vO3BuNrggRsI",
"outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.0\n",
"b 1.0\n",
"c 2.0\n",
"d 2.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example5.fillna(method='ffill')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nDXeYuHzgRsI"
},
"source": [
"Môžete tiež **spätne doplniť**, aby ste spätne propagovali ďalšiu platnú hodnotu na vyplnenie null:\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4M5onHcEgRsI",
"outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.0\n",
"b 2.0\n",
"c 2.0\n",
"d 3.0\n",
"e 3.0\n",
"dtype: float64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example5.fillna(method='bfill')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"id": "MbBzTom5gRsI"
},
"source": [
"Ako môžete hádať, toto funguje rovnako s DataFrames, ale môžete tiež špecifikovať `axis`, pozdĺž ktorého chcete vyplniť nulové hodnoty:\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "aRpIvo4ZgRsI",
"outputId": "905a980a-a808-4eca-d0ba-224bd7d85955",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 NaN 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 NaN 6.0 9 NaN"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "VM1qtACAgRsI",
"outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 1.0 7.0 7.0\n",
"1 2.0 5.0 8.0 8.0\n",
"2 NaN 6.0 9.0 9.0"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4.fillna(method='ffill', axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZeMc-I1EgRsI"
},
"source": [
"Všimnite si, že keď nie je k dispozícii predchádzajúca hodnota na doplnenie, zostáva nulová hodnota.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eeAoOU0RgRsJ"
},
"source": []
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true,
"id": "e8S-CjW8gRsJ",
"trusted": false
},
"outputs": [],
"source": [
"# What output does example4.fillna(method='bfill', axis=1) produce?\n",
"# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n",
"# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YHgy0lIrgRsJ"
},
"source": [
"Môžete byť kreatívni pri používaní `fillna`. Napríklad sa pozrime znova na `example4`, ale tentokrát vyplňme chýbajúce hodnoty priemerom všetkých hodnôt v `DataFrame`:\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "OtYVErEygRsJ",
"outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>5.5</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5</td>\n",
" <td>6.0</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1.0 5.5 7 NaN\n",
"1 2.0 5.0 8 NaN\n",
"2 1.5 6.0 9 NaN"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example4.fillna(example4.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zpMvCkLSgRsJ"
},
"source": [
"Všimnite si, že stĺpec 3 je stále bez hodnoty: predvolený smer je vyplniť hodnoty po riadkoch.\n",
"\n",
"> **Poučenie:** Existuje viacero spôsobov, ako sa vysporiadať s chýbajúcimi hodnotami vo vašich dátových súboroch. Konkrétna stratégia, ktorú použijete (odstránenie, nahradenie alebo spôsob, akým ich nahradíte), by mala byť určená špecifikami daných dát. Lepší zmysel pre prácu s chýbajúcimi hodnotami si vyviniete čím viac budete pracovať a interagovať s dátovými súbormi.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bauDnESIl9FH"
},
"source": [
"### Kódovanie kategóriálnych údajov\n",
"\n",
"Modely strojového učenia pracujú iba s číslami a akoukoľvek formou číselných údajov. Nebudú schopné rozlíšiť medzi Áno a Nie, ale dokážu rozlíšiť medzi 0 a 1. Preto po doplnení chýbajúcich hodnôt musíme kategóriálne údaje zakódovať do nejakej číselnej formy, aby ich model pochopil.\n",
"\n",
"Kódovanie je možné vykonať dvoma spôsobmi. Budeme ich diskutovať ďalej.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uDq9SxB7mu5i"
},
"source": [
"**KÓDOVANIE ŠTÍTKA**\n",
"\n",
"Kódovanie štítka v podstate znamená prevod každej kategórie na číslo. Napríklad, povedzme, že máme dataset leteckých pasažierov a je tu stĺpec obsahujúci ich triedu medzi nasledujúcimi ['business class', 'economy class', 'first class']. Ak sa vykoná kódovanie štítka, bude to transformované na [0,1,2]. Pozrime sa na príklad pomocou kódu. Keďže sa budeme učiť `scikit-learn` v nadchádzajúcich poznámkových blokoch, tu ho nepoužijeme.\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "1vGz7uZyoWHL",
"outputId": "9e252855-d193-4103-a54d-028ea7787b34"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10</td>\n",
" <td>business class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>first class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>40</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>50</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>60</td>\n",
" <td>business class</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID class\n",
"0 10 business class\n",
"1 20 first class\n",
"2 30 economy class\n",
"3 40 economy class\n",
"4 50 economy class\n",
"5 60 business class"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"label = pd.DataFrame([\n",
" [10,'business class'],\n",
" [20,'first class'],\n",
" [30, 'economy class'],\n",
" [40, 'economy class'],\n",
" [50, 'economy class'],\n",
" [60, 'business class']\n",
"],columns=['ID','class'])\n",
"label"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IDHnkwTYov-h"
},
"source": [
"Na vykonanie kódovania štítkov v 1. stĺpci musíme najprv opísať mapovanie z každej triedy na číslo, predtým ako nahradíme\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "ZC5URJG3o1ES",
"outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>40</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>60</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID class\n",
"0 10 0\n",
"1 20 2\n",
"2 30 1\n",
"3 40 1\n",
"4 50 1\n",
"5 60 0"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class_labels = {'business class':0,'economy class':1,'first class':2}\n",
"label['class'] = label['class'].replace(class_labels)\n",
"label"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ftnF-TyapOPt"
},
"source": [
"Ako vidíme, výsledok zodpovedá tomu, čo sme očakávali. Takže, kedy používame kódovanie štítkov? Kódovanie štítkov sa používa v jednej alebo oboch z nasledujúcich situácií:\n",
"1. Keď je počet kategórií veľký\n",
"2. Keď sú kategórie usporiadané.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eQPAPVwsqWT7"
},
"source": [
"**JEDNOHORÚCE KÓDOVANIE**\n",
"\n",
"Ďalším typom kódovania je Jednohorúce kódovanie. Pri tomto type kódovania sa každá kategória stĺpca pridá ako samostatný stĺpec a každému dátovému bodu sa priradí 0 alebo 1 podľa toho, či obsahuje danú kategóriu. Ak teda existuje n rôznych kategórií, do dátového rámca sa pridá n stĺpcov.\n",
"\n",
"Napríklad, vezmime si ten istý príklad triedy lietadla. Kategórie boli: ['business class', 'economy class', 'first class']. Ak vykonáme jednohorúce kódovanie, do datasetu sa pridajú nasledujúce tri stĺpce: ['class_business class', 'class_economy class', 'class_first class'].\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "ZM0eVh0ArKUL",
"outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10</td>\n",
" <td>business class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>first class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>40</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>50</td>\n",
" <td>economy class</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>60</td>\n",
" <td>business class</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID class\n",
"0 10 business class\n",
"1 20 first class\n",
"2 30 economy class\n",
"3 40 economy class\n",
"4 50 economy class\n",
"5 60 business class"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_hot = pd.DataFrame([\n",
" [10,'business class'],\n",
" [20,'first class'],\n",
" [30, 'economy class'],\n",
" [40, 'economy class'],\n",
" [50, 'economy class'],\n",
" [60, 'business class']\n",
"],columns=['ID','class'])\n",
"one_hot"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aVnZ7paDrWmb"
},
"source": [
"Poďme vykonať one hot encoding na 1. stĺpci\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"id": "RUPxf7egrYKr"
},
"outputs": [],
"source": [
"one_hot_data = pd.get_dummies(one_hot,columns=['class'])"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "TM37pHsFr4ge",
"outputId": "7be15f53-79b2-447a-979c-822658339a9e"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>class_business class</th>\n",
" <th>class_economy class</th>\n",
" <th>class_first class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>40</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>50</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>60</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID class_business class class_economy class class_first class\n",
"0 10 1 0 0\n",
"1 20 0 0 1\n",
"2 30 0 1 0\n",
"3 40 0 1 0\n",
"4 50 0 1 0\n",
"5 60 1 0 0"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_hot_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_zXRLOjXujdA"
},
"source": [
"Každý stĺpec s jedným horúcim kódovaním obsahuje 0 alebo 1, čo určuje, či daná kategória existuje pre daný dátový bod.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bDnC4NQOu0qr"
},
"source": [
"Kedy používame one hot encoding? One hot encoding sa používa v jednej alebo oboch z nasledujúcich situácií:\n",
"\n",
"1. Keď je počet kategórií a veľkosť datasetu menšia.\n",
"2. Keď kategórie nemajú žiadne konkrétne poradie.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XnUmci_4uvyu"
},
"source": [
"> Kľúčové poznatky: \n",
"1. Kódovanie sa vykonáva na prevod nenumerických údajov na numerické údaje. \n",
"2. Existujú dva typy kódovania: Label encoding a One Hot encoding, pričom oba môžu byť použité podľa požiadaviek datasetu. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K8UXOJYRgRsJ"
},
"source": [
"## Odstraňovanie duplicitných údajov\n",
"\n",
"> **Cieľ učenia:** Na konci tejto podsekcie by ste mali byť schopní identifikovať a odstrániť duplicitné hodnoty z DataFrames.\n",
"\n",
"Okrem chýbajúcich údajov sa často stretnete s duplicitnými údajmi v reálnych datasetoch. Našťastie, pandas poskytuje jednoduchý spôsob na detekciu a odstránenie duplicitných záznamov.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qrEG-Wa0gRsJ"
},
"source": [
"### Identifikácia duplicitných hodnôt: `duplicated`\n",
"\n",
"Duplicitné hodnoty môžete jednoducho identifikovať pomocou metódy `duplicated` v pandas, ktorá vracia Booleovskú masku označujúcu, či je záznam v `DataFrame` duplicitou skoršieho záznamu. Vytvorme ďalší príklad `DataFrame`, aby sme si to ukázali v praxi.\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "ZLu6FEnZgRsJ",
"outputId": "376512d1-d842-4db1-aea3-71052aeeecaf",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2\n",
"2 A 1\n",
"3 B 3\n",
"4 B 3"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n",
" 'numbers': [1, 2, 1, 3, 3]})\n",
"example6"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cIduB5oBgRsK",
"outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2",
"trusted": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 True\n",
"dtype: bool"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example6.duplicated()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0eDRJD4SgRsK"
},
"source": [
"### Odstránenie duplicitných hodnôt: `drop_duplicates`\n",
"`drop_duplicates` jednoducho vráti kópiu údajov, pre ktoré sú všetky hodnoty označené ako `duplicated` nastavené na `False`:\n"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "w_YPpqIqgRsK",
"outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2\n",
"3 B 3"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example6.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "69AqoCZAgRsK"
},
"source": [
"Oba `duplicated` a `drop_duplicates` štandardne zohľadňujú všetky stĺpce, ale môžete špecifikovať, aby skúmali iba podmnožinu stĺpcov vo vašom `DataFrame`:\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "BILjDs67gRsK",
"outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3",
"trusted": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letters</th>\n",
" <th>numbers</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letters numbers\n",
"0 A 1\n",
"1 B 2"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example6.drop_duplicates(['letters'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GvX4og1EgRsL"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n---\n\n**Upozornenie**: \nTento dokument bol preložený pomocou služby na automatický preklad [Co-op Translator](https://github.com/Azure/co-op-translator). Hoci sa snažíme o presnosť, upozorňujeme, že automatické preklady môžu obsahovať chyby alebo nepresnosti. Pôvodný dokument v jeho pôvodnom jazyku by mal byť považovaný za autoritatívny zdroj. Pre kritické informácie sa odporúča profesionálny ľudský preklad. Nezodpovedáme za akékoľvek nedorozumenia alebo nesprávne interpretácie vyplývajúce z použitia tohto prekladu.\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"colab": {
"name": "notebook.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
},
"coopTranslator": {
"original_hash": "8533b3a2230311943339963fc7f04c21",
"translation_date": "2025-09-02T07:49:02+00:00",
"source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb",
"language_code": "sk"
}
},
"nbformat": 4,
"nbformat_minor": 0
}