You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
3715 lines
103 KiB
3715 lines
103 KiB
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "rQ8UhzFpgRra"
|
|
},
|
|
"source": [
|
|
"# Příprava dat\n",
|
|
"\n",
|
|
"[Originální zdroj notebooku z *Data Science: Úvod do strojového učení pro datovou vědu Python a Machine Learning Studio od Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)\n",
|
|
"\n",
|
|
"## Zkoumání informací o `DataFrame`\n",
|
|
"\n",
|
|
"> **Cíl učení:** Na konci této podsekce byste měli být schopni pohodlně najít obecné informace o datech uložených v pandas DataFrames.\n",
|
|
"\n",
|
|
"Jakmile načtete svá data do pandas, je velmi pravděpodobné, že budou uložena v `DataFrame`. Ale pokud má datová sada ve vašem `DataFrame` 60 000 řádků a 400 sloupců, jak vůbec začít získávat přehled o tom, s čím pracujete? Naštěstí pandas poskytuje několik praktických nástrojů, které umožňují rychle získat celkový přehled o `DataFrame`, kromě zobrazení prvních a posledních několika řádků.\n",
|
|
"\n",
|
|
"Abychom prozkoumali tuto funkcionalitu, importujeme knihovnu Python scikit-learn a použijeme ikonickou datovou sadu, kterou každý datový vědec viděl už stokrát: datovou sadu *Iris* britského biologa Ronalda Fishera, použitou v jeho článku z roku 1936 \"Použití více měření v taxonomických problémech\":\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "hB1RofhdgRrp",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"from sklearn.datasets import load_iris\n",
|
|
"\n",
|
|
"iris = load_iris()\n",
|
|
"iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "AGA0A_Y8hMdz"
|
|
},
|
|
"source": [
|
|
"### `DataFrame.shape`\n",
|
|
"Načetli jsme dataset Iris do proměnné `iris_df`. Než se pustíme do analýzy dat, bylo by užitečné zjistit počet datových bodů, které máme, a celkovou velikost datasetu. Je dobré mít přehled o objemu dat, se kterými pracujeme.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "LOe5jQohhulf",
|
|
"outputId": "fb0577ac-3b4a-4623-cb41-20e1b264b3e9"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(150, 4)"
|
|
]
|
|
},
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "smE7AGzOhxk2"
|
|
},
|
|
"source": [
|
|
"Takže máme 150 řádků a 4 sloupce dat. Každý řádek představuje jeden datový bod a každý sloupec představuje jednu vlastnost spojenou s datovým rámcem. V podstatě tedy máme 150 datových bodů, z nichž každý obsahuje 4 vlastnosti.\n",
|
|
"\n",
|
|
"`shape` je zde atributem datového rámce, nikoliv funkcí, což je důvod, proč nekončí párem závorek.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "d3AZKs0PinGP"
|
|
},
|
|
"source": [
|
|
"### `DataFrame.columns`\n",
|
|
"Nyní se podíváme na 4 sloupce dat. Co přesně každý z nich představuje? Atribut `columns` nám poskytne názvy sloupců v datovém rámci.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "YPGh_ziji-CY",
|
|
"outputId": "74e7a43a-77cc-4c80-da56-7f50767c37a0"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n",
|
|
" 'petal width (cm)'],\n",
|
|
" dtype='object')"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "TsobcU_VjCC_"
|
|
},
|
|
"source": [
|
|
"Jak vidíme, jsou zde čtyři (4) sloupce. Atribut `columns` nám říká názvy sloupců a v podstatě nic jiného. Tento atribut nabývá na důležitosti, když chceme identifikovat vlastnosti, které datová sada obsahuje.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "2UTlvkjmgRrs"
|
|
},
|
|
"source": [
|
|
"### `DataFrame.info`\n",
|
|
"Množství dat (určené atributem `shape`) a názvy vlastností nebo sloupců (určené atributem `columns`) nám poskytují určité informace o datové sadě. Nyní bychom chtěli jít hlouběji do datové sady. Funkce `DataFrame.info()` je pro tento účel velmi užitečná.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "dHHRyG0_gRrt",
|
|
"outputId": "d8fb0c40-4f18-4e19-da48-c8db77d1d3a5",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|
"RangeIndex: 150 entries, 0 to 149\n",
|
|
"Data columns (total 4 columns):\n",
|
|
" # Column Non-Null Count Dtype \n",
|
|
"--- ------ -------------- ----- \n",
|
|
" 0 sepal length (cm) 150 non-null float64\n",
|
|
" 1 sepal width (cm) 150 non-null float64\n",
|
|
" 2 petal length (cm) 150 non-null float64\n",
|
|
" 3 petal width (cm) 150 non-null float64\n",
|
|
"dtypes: float64(4)\n",
|
|
"memory usage: 4.8 KB\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.info()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "1XgVMpvigRru"
|
|
},
|
|
"source": [
|
|
"Zde můžeme udělat několik pozorování:\n",
|
|
"1. Datový typ každého sloupce: V tomto datasetu jsou všechna data uložena jako 64bitová čísla s plovoucí desetinnou čárkou.\n",
|
|
"2. Počet nenulových hodnot: Práce s nulovými hodnotami je důležitým krokem při přípravě dat. Tím se budeme zabývat později v notebooku.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "IYlyxbpWFEF4"
|
|
},
|
|
"source": [
|
|
"### DataFrame.describe()\n",
|
|
"Řekněme, že máme v našem datasetu hodně číselných dat. Jednovariabilní statistické výpočty, jako je průměr, medián, kvartily atd., lze provádět na jednotlivých sloupcích. Funkce `DataFrame.describe()` nám poskytuje statistický přehled číselných sloupců datasetu.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 297
|
|
},
|
|
"id": "tWV-CMstFIRA",
|
|
"outputId": "4fc49941-bc13-4b0c-a412-cb39e7d3f289"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>sepal length (cm)</th>\n",
|
|
" <th>sepal width (cm)</th>\n",
|
|
" <th>petal length (cm)</th>\n",
|
|
" <th>petal width (cm)</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>count</th>\n",
|
|
" <td>150.000000</td>\n",
|
|
" <td>150.000000</td>\n",
|
|
" <td>150.000000</td>\n",
|
|
" <td>150.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>mean</th>\n",
|
|
" <td>5.843333</td>\n",
|
|
" <td>3.057333</td>\n",
|
|
" <td>3.758000</td>\n",
|
|
" <td>1.199333</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>std</th>\n",
|
|
" <td>0.828066</td>\n",
|
|
" <td>0.435866</td>\n",
|
|
" <td>1.765298</td>\n",
|
|
" <td>0.762238</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>min</th>\n",
|
|
" <td>4.300000</td>\n",
|
|
" <td>2.000000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.100000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>25%</th>\n",
|
|
" <td>5.100000</td>\n",
|
|
" <td>2.800000</td>\n",
|
|
" <td>1.600000</td>\n",
|
|
" <td>0.300000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>50%</th>\n",
|
|
" <td>5.800000</td>\n",
|
|
" <td>3.000000</td>\n",
|
|
" <td>4.350000</td>\n",
|
|
" <td>1.300000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>75%</th>\n",
|
|
" <td>6.400000</td>\n",
|
|
" <td>3.300000</td>\n",
|
|
" <td>5.100000</td>\n",
|
|
" <td>1.800000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>max</th>\n",
|
|
" <td>7.900000</td>\n",
|
|
" <td>4.400000</td>\n",
|
|
" <td>6.900000</td>\n",
|
|
" <td>2.500000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
|
|
"count 150.000000 150.000000 150.000000 150.000000\n",
|
|
"mean 5.843333 3.057333 3.758000 1.199333\n",
|
|
"std 0.828066 0.435866 1.765298 0.762238\n",
|
|
"min 4.300000 2.000000 1.000000 0.100000\n",
|
|
"25% 5.100000 2.800000 1.600000 0.300000\n",
|
|
"50% 5.800000 3.000000 4.350000 1.300000\n",
|
|
"75% 6.400000 3.300000 5.100000 1.800000\n",
|
|
"max 7.900000 4.400000 6.900000 2.500000"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "zjjtW5hPGMuM"
|
|
},
|
|
"source": [
|
|
"Výstup výše ukazuje celkový počet datových bodů, průměr, směrodatnou odchylku, minimum, dolní kvartil (25 %), medián (50 %), horní kvartil (75 %) a maximální hodnotu každého sloupce.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "-lviAu99gRrv"
|
|
},
|
|
"source": [
|
|
"### `DataFrame.head`\n",
|
|
"S využitím všech výše uvedených funkcí a atributů jsme získali přehled na nejvyšší úrovni o datové sadě. Víme, kolik datových bodů je k dispozici, kolik je vlastností, jaký je datový typ každé vlastnosti a kolik hodnot není nulových pro každou vlastnost.\n",
|
|
"\n",
|
|
"Teď je čas podívat se na samotná data. Podívejme se, jak vypadají první řádky (první datové body) našeho `DataFrame`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "DZMJZh0OgRrw",
|
|
"outputId": "d9393ee5-c106-4797-f815-218f17160e00",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>sepal length (cm)</th>\n",
|
|
" <th>sepal width (cm)</th>\n",
|
|
" <th>petal length (cm)</th>\n",
|
|
" <th>petal width (cm)</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>5.1</td>\n",
|
|
" <td>3.5</td>\n",
|
|
" <td>1.4</td>\n",
|
|
" <td>0.2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>4.9</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>1.4</td>\n",
|
|
" <td>0.2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>4.7</td>\n",
|
|
" <td>3.2</td>\n",
|
|
" <td>1.3</td>\n",
|
|
" <td>0.2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>4.6</td>\n",
|
|
" <td>3.1</td>\n",
|
|
" <td>1.5</td>\n",
|
|
" <td>0.2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>3.6</td>\n",
|
|
" <td>1.4</td>\n",
|
|
" <td>0.2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
|
|
"0 5.1 3.5 1.4 0.2\n",
|
|
"1 4.9 3.0 1.4 0.2\n",
|
|
"2 4.7 3.2 1.3 0.2\n",
|
|
"3 4.6 3.1 1.5 0.2\n",
|
|
"4 5.0 3.6 1.4 0.2"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "EBHEimZuEFQK"
|
|
},
|
|
"source": [
|
|
"Jak vidíme v tomto výstupu, máme pět (5) záznamů datové sady. Pokud se podíváme na index vlevo, zjistíme, že se jedná o prvních pět řádků.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "oj7GkrTdgRry"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n",
|
|
"\n",
|
|
"Z výše uvedeného příkladu je zřejmé, že `DataFrame.head` ve výchozím nastavení vrací prvních pět řádků `DataFrame`. V níže uvedené buňce kódu, dokážete najít způsob, jak zobrazit více než pět řádků?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "EKRmRFFegRrz",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Hint: Consult the documentation by using iris_df.head?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "BJ_cpZqNgRr1"
|
|
},
|
|
"source": [
|
|
"### `DataFrame.tail`\n",
|
|
"Další způsob, jak se podívat na data, může být od konce (namísto od začátku). Opakem `DataFrame.head` je `DataFrame.tail`, který vrací posledních pět řádků `DataFrame`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 0
|
|
},
|
|
"id": "heanjfGWgRr2",
|
|
"outputId": "6ae09a21-fe09-4110-b0d7-1a1fbf34d7f3",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>sepal length (cm)</th>\n",
|
|
" <th>sepal width (cm)</th>\n",
|
|
" <th>petal length (cm)</th>\n",
|
|
" <th>petal width (cm)</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>145</th>\n",
|
|
" <td>6.7</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5.2</td>\n",
|
|
" <td>2.3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>146</th>\n",
|
|
" <td>6.3</td>\n",
|
|
" <td>2.5</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>1.9</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>147</th>\n",
|
|
" <td>6.5</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5.2</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>148</th>\n",
|
|
" <td>6.2</td>\n",
|
|
" <td>3.4</td>\n",
|
|
" <td>5.4</td>\n",
|
|
" <td>2.3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>149</th>\n",
|
|
" <td>5.9</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5.1</td>\n",
|
|
" <td>1.8</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
|
|
"145 6.7 3.0 5.2 2.3\n",
|
|
"146 6.3 2.5 5.0 1.9\n",
|
|
"147 6.5 3.0 5.2 2.0\n",
|
|
"148 6.2 3.4 5.4 2.3\n",
|
|
"149 5.9 3.0 5.1 1.8"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"iris_df.tail()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "31kBWfyLgRr3"
|
|
},
|
|
"source": [
|
|
"V praxi je užitečné mít možnost snadno prohlédnout prvních několik řádků nebo posledních několik řádků `DataFrame`, zejména když hledáte odlehlé hodnoty v uspořádaných datových sadách.\n",
|
|
"\n",
|
|
"Všechny funkce a atributy uvedené výše pomocí příkladů kódu nám pomáhají získat přehled o datech.\n",
|
|
"\n",
|
|
"> **Shrnutí:** Už jen pohledem na metadata o informacích v `DataFrame` nebo na prvních a posledních několik hodnot v něm můžete okamžitě získat představu o velikosti, tvaru a obsahu dat, se kterými pracujete.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "TvurZyLSDxq_"
|
|
},
|
|
"source": [
|
|
"### Chybějící data\n",
|
|
"Pojďme se podívat na chybějící data. Chybějící data nastávají, když v některých sloupcích nejsou uloženy žádné hodnoty.\n",
|
|
"\n",
|
|
"Uveďme si příklad: řekněme, že někdo je citlivý na svou váhu a nevyplní pole s váhou v dotazníku. Potom bude hodnota váhy pro tuto osobu chybět.\n",
|
|
"\n",
|
|
"Ve většině případů se v reálných datových sadách chybějící hodnoty vyskytují.\n",
|
|
"\n",
|
|
"**Jak Pandas zpracovává chybějící data**\n",
|
|
"\n",
|
|
"Pandas zpracovává chybějící hodnoty dvěma způsoby. První způsob jste již viděli v předchozích sekcích: `NaN`, neboli Not a Number. Toto je ve skutečnosti speciální hodnota, která je součástí specifikace IEEE pro čísla s plovoucí desetinnou čárkou, a používá se pouze k označení chybějících hodnot typu float.\n",
|
|
"\n",
|
|
"Pro chybějící hodnoty jiného typu než float používá pandas objekt `None` z Pythonu. I když se může zdát matoucí, že se setkáte se dvěma různými typy hodnot, které v podstatě znamenají totéž, existují logické programátorské důvody pro toto rozhodnutí. V praxi tento přístup umožňuje pandas nabídnout dobrý kompromis pro drtivou většinu případů. Přesto však jak `None`, tak `NaN` mají určitá omezení, na která je třeba dávat pozor, pokud jde o jejich použití.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "lOHqUlZFgRr5"
|
|
},
|
|
"source": [
|
|
"### `None`: nečíselná chybějící data\n",
|
|
"Protože `None` pochází z Pythonu, nemůže být použito v polích NumPy a pandas, která nemají datový typ `'object'`. Pamatujte, že NumPy pole (a datové struktury v pandas) mohou obsahovat pouze jeden typ dat. To jim dává obrovskou sílu pro práci s velkými objemy dat a výpočty, ale zároveň omezuje jejich flexibilitu. Taková pole musí být převedena na „nejnižší společný jmenovatel“, tedy datový typ, který zahrne vše v poli. Když je v poli `None`, znamená to, že pracujete s Python objekty.\n",
|
|
"\n",
|
|
"Pro ukázku, jak to funguje, zvažte následující příklad pole (všimněte si jeho `dtype`):\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "QIoNdY4ngRr7",
|
|
"outputId": "92779f18-62f4-4a03-eca2-e9a101604336",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([2, None, 6, 8], dtype=object)"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"example1 = np.array([2, None, 6, 8])\n",
|
|
"example1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "pdlgPNbhgRr7"
|
|
},
|
|
"source": [
|
|
"Realita upcast datových typů s sebou nese dva vedlejší efekty. Za prvé, operace budou prováděny na úrovni interpretovaného Python kódu namísto zkompilovaného NumPy kódu. To v podstatě znamená, že jakékoli operace zahrnující `Series` nebo `DataFrames` s hodnotou `None` budou pomalejší. I když si tohoto zpomalení pravděpodobně nevšimnete, u velkých datových sad by to mohlo představovat problém.\n",
|
|
"\n",
|
|
"Druhý vedlejší efekt vyplývá z prvního. Protože `None` v podstatě vrací `Series` nebo `DataFrame` zpět do světa základního Pythonu, použití agregací NumPy/pandas, jako je `sum()` nebo `min()`, na polích obsahujících hodnotu ``None`` obvykle vyvolá chybu:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 292
|
|
},
|
|
"id": "gWbx-KB9gRr8",
|
|
"outputId": "ecba710a-22ec-41d5-a39c-11f67e645b50",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "TypeError",
|
|
"evalue": "ignored",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
|
|
"\u001b[0;32m<ipython-input-10-ce9901ad18bd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mexample1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
|
|
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 45\u001b[0m def _sum(a, axis=None, dtype=None, out=None, keepdims=False,\n\u001b[1;32m 46\u001b[0m initial=_NoValue, where=True):\n\u001b[0;32m---> 47\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minitial\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 48\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 49\u001b[0m def _prod(a, axis=None, dtype=None, out=None, keepdims=False,\n",
|
|
"\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"example1.sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "LcEwO8UogRr9"
|
|
},
|
|
"source": [
|
|
"**Klíčová myšlenka**: Sčítání (a jiné operace) mezi celými čísly a hodnotami `None` není definováno, což může omezit možnosti práce s datovými sadami, které je obsahují.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "pWvVHvETgRr9"
|
|
},
|
|
"source": [
|
|
"### `NaN`: chybějící hodnoty typu float\n",
|
|
"\n",
|
|
"Na rozdíl od `None` podporuje NumPy (a tím pádem i pandas) `NaN` pro své rychlé, vektorové operace a ufuncs. Špatnou zprávou je, že jakákoli aritmetická operace provedená na `NaN` vždy vrátí `NaN`. Například:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "rcFYfMG9gRr9",
|
|
"outputId": "699e81b7-5c11-4b46-df1d-06071768690f",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"nan"
|
|
]
|
|
},
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"np.nan + 1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "BW3zQD2-gRr-",
|
|
"outputId": "4525b6c4-495d-4f7b-a979-efce1dae9bd0",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"nan"
|
|
]
|
|
},
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"np.nan * 0"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "fU5IPRcCgRr-"
|
|
},
|
|
"source": [
|
|
"Dobrá zpráva: agregace prováděné na polích obsahujících `NaN` nevyvolávají chyby. Špatná zpráva: výsledky nejsou jednotně užitečné:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "LCInVgSSgRr_",
|
|
"outputId": "fa06495a-0930-4867-87c5-6023031ea8b5",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(nan, nan, nan)"
|
|
]
|
|
},
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example2 = np.array([2, np.nan, 6, 8]) \n",
|
|
"example2.sum(), example2.min(), example2.max()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "nhlnNJT7gRr_"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "yan3QRaOgRr_",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# What happens if you add np.nan and None together?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "_iDvIRC8gRsA"
|
|
},
|
|
"source": [
|
|
"Pamatujte: `NaN` je pouze pro chybějící hodnoty s plovoucí desetinnou čárkou; neexistuje žádný ekvivalent `NaN` pro celá čísla, řetězce nebo Booleovské hodnoty.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "kj6EKdsAgRsA"
|
|
},
|
|
"source": [
|
|
"### `NaN` a `None`: nulové hodnoty v pandas\n",
|
|
"\n",
|
|
"I když se `NaN` a `None` mohou chovat poněkud odlišně, pandas je navrženo tak, aby s nimi zacházelo zaměnitelně. Pro lepší pochopení si vezměme `Series` celých čísel:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "Nji-KGdNgRsA",
|
|
"outputId": "36aa14d2-8efa-4bfd-c0ed-682991288822",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 1\n",
|
|
"1 2\n",
|
|
"2 3\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"int_series = pd.Series([1, 2, 3], dtype=int)\n",
|
|
"int_series"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "WklCzqb8gRsB"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "Cy-gqX5-gRsB",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Now set an element of int_series equal to None.\n",
|
|
"# How does that element show up in the Series?\n",
|
|
"# What is the dtype of the Series?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "WjMQwltNgRsB"
|
|
},
|
|
"source": [
|
|
"Při procesu převodu datových typů za účelem dosažení homogenity dat v `Series` a `DataFrame`, pandas ochotně přepíná chybějící hodnoty mezi `None` a `NaN`. Díky této vlastnosti návrhu může být užitečné považovat `None` a `NaN` za dvě různé podoby \"null\" v pandas. Ve skutečnosti některé základní metody, které budete používat k práci s chybějícími hodnotami v pandas, tuto myšlenku odrážejí ve svých názvech:\n",
|
|
"\n",
|
|
"- `isnull()`: Vytváří Booleovskou masku označující chybějící hodnoty\n",
|
|
"- `notnull()`: Opak metody `isnull()`\n",
|
|
"- `dropna()`: Vrací filtrovanou verzi dat\n",
|
|
"- `fillna()`: Vrací kopii dat s vyplněnými nebo imputovanými chybějícími hodnotami\n",
|
|
"\n",
|
|
"Tyto metody jsou důležité zvládnout a osvojit si, takže si je podrobněji projdeme jednu po druhé.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "Yh5ifd9FgRsB"
|
|
},
|
|
"source": [
|
|
"### Detekce nulových hodnot\n",
|
|
"\n",
|
|
"Nyní, když jsme pochopili důležitost chybějících hodnot, musíme je v našem datovém souboru nejprve detekovat, než s nimi začneme pracovat. \n",
|
|
"Metody `isnull()` a `notnull()` jsou vaše hlavní nástroje pro detekci nulových dat. Obě vracejí Booleovské masky nad vašimi daty.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "e-vFp5lvgRsC",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"example3 = pd.Series([0, np.nan, '', None])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "1XdaJJ7PgRsC",
|
|
"outputId": "92fc363a-1874-471f-846d-f4f9ce1f51d0",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 False\n",
|
|
"1 True\n",
|
|
"2 False\n",
|
|
"3 True\n",
|
|
"dtype: bool"
|
|
]
|
|
},
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example3.isnull()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "PaSZ0SQygRsC"
|
|
},
|
|
"source": [
|
|
"Podívejte se pozorně na výstup. Překvapuje vás něco? Zatímco `0` je aritmetická nula, je to přesto zcela platné celé číslo a pandas ho takto i považuje. `''` je o něco jemnější. Zatímco jsme ho v sekci 1 použili k reprezentaci prázdné hodnoty řetězce, je to přesto objekt typu řetězec a nikoli reprezentace nulové hodnoty, pokud jde o pandas.\n",
|
|
"\n",
|
|
"Teď to otočíme a použijeme tyto metody způsobem, jakým je pravděpodobně budete používat v praxi. Boolean masky můžete použít přímo jako index ``Series`` nebo ``DataFrame``, což může být užitečné při práci s izolovanými chybějícími (nebo přítomnými) hodnotami.\n",
|
|
"\n",
|
|
"Pokud chceme celkový počet chybějících hodnot, stačí provést součet přes masku vytvořenou metodou `isnull()`.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "JCcQVoPkHDUv",
|
|
"outputId": "001daa72-54f8-4bd5-842a-4df627a79d4d"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"2"
|
|
]
|
|
},
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example3.isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "PlBqEo3mgRsC"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "ggDVf5uygRsD",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Try running example3[example3.notnull()].\n",
|
|
"# Before you do so, what do you expect to see?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "D_jWN7mHgRsD"
|
|
},
|
|
"source": [
|
|
"**Klíčové poznatky**: Metody `isnull()` a `notnull()` produkují podobné výsledky, když je použijete v DataFramech: zobrazují výsledky a index těchto výsledků, což vám může nesmírně pomoci při práci s vašimi daty.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "BvnoojWsgRr4"
|
|
},
|
|
"source": [
|
|
"### Práce s chybějícími daty\n",
|
|
"\n",
|
|
"> **Cíl učení:** Na konci této podkapitoly byste měli vědět, jak a kdy nahradit nebo odstranit nulové hodnoty z DataFrames.\n",
|
|
"\n",
|
|
"Modely strojového učení si samy nedokážou poradit s chybějícími daty. Proto je potřeba tyto chybějící hodnoty zpracovat ještě předtím, než data předáme modelu.\n",
|
|
"\n",
|
|
"Způsob, jakým se chybějící data zpracovávají, s sebou nese jemné kompromisy a může ovlivnit vaši finální analýzu i reálné výsledky.\n",
|
|
"\n",
|
|
"Existují především dva způsoby, jak se vypořádat s chybějícími daty:\n",
|
|
"\n",
|
|
"1. Odstranit řádek obsahující chybějící hodnotu \n",
|
|
"2. Nahradit chybějící hodnotu jinou hodnotou \n",
|
|
"\n",
|
|
"Oba tyto přístupy a jejich výhody a nevýhody si podrobně rozebereme.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "3VaYC1TvgRsD"
|
|
},
|
|
"source": [
|
|
"### Odstranění nulových hodnot\n",
|
|
"\n",
|
|
"Množství dat, které předáváme našemu modelu, má přímý vliv na jeho výkon. Odstranění nulových hodnot znamená, že snižujeme počet datových bodů, a tím i velikost datové sady. Proto je vhodné odstranit řádky s nulovými hodnotami, pokud je datová sada poměrně velká.\n",
|
|
"\n",
|
|
"Dalším případem může být situace, kdy určitý řádek nebo sloupec obsahuje mnoho chybějících hodnot. V takovém případě je možné je odstranit, protože by nepřidaly příliš hodnotu naší analýze, jelikož většina dat pro daný řádek/sloupec chybí.\n",
|
|
"\n",
|
|
"Kromě identifikace chybějících hodnot poskytuje pandas pohodlný způsob, jak odstranit nulové hodnoty z `Series` a `DataFrame`. Abychom si to ukázali v praxi, vraťme se k `example3`. Funkce `DataFrame.dropna()` pomáhá odstranit řádky s nulovými hodnotami.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "7uIvS097gRsD",
|
|
"outputId": "c13fc117-4ca1-4145-a0aa-42ac89e6e218",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 0\n",
|
|
"2 \n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example3 = example3.dropna()\n",
|
|
"example3"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "hil2cr64gRsD"
|
|
},
|
|
"source": [
|
|
"Všimněte si, že by to mělo vypadat jako váš výstup z `example3[example3.notnull()]`. Rozdíl zde spočívá v tom, že místo pouhého indexování na základě maskovaných hodnot metoda `dropna` odstranila tyto chybějící hodnoty ze `Series` `example3`.\n",
|
|
"\n",
|
|
"Protože DataFrames mají dvě dimenze, nabízejí více možností pro odstranění dat.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "an-l74sPgRsE",
|
|
"outputId": "340876a0-63ad-40f6-bd54-6240cdae50ab",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 7\n",
|
|
"1 2.0 5.0 8\n",
|
|
"2 NaN 6.0 9"
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4 = pd.DataFrame([[1, np.nan, 7], \n",
|
|
" [2, 5, 8], \n",
|
|
" [np.nan, 6, 9]])\n",
|
|
"example4"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "66wwdHZrgRsE"
|
|
},
|
|
"source": [
|
|
"(Všimli jste si, že pandas převedl dva sloupce na typ float, aby mohl pracovat s hodnotami `NaN`?)\n",
|
|
"\n",
|
|
"Nemůžete odstranit jednotlivou hodnotu z `DataFrame`, takže musíte odstranit celé řádky nebo sloupce. V závislosti na tom, co děláte, můžete chtít provést jedno nebo druhé, a proto vám pandas nabízí možnosti pro obě varianty. Protože ve vědě o datech sloupce obvykle představují proměnné a řádky představují pozorování, je pravděpodobnější, že budete odstraňovat řádky dat; výchozí nastavení pro `dropna()` je odstranit všechny řádky, které obsahují jakékoli nulové hodnoty:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 80
|
|
},
|
|
"id": "jAVU24RXgRsE",
|
|
"outputId": "0b5e5aee-7187-4d3f-b583-a44136ae5f80",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"1 2.0 5.0 8"
|
|
]
|
|
},
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4.dropna()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "TrQRBuTDgRsE"
|
|
},
|
|
"source": [
|
|
"Pokud je to nutné, můžete odstranit hodnoty NA ze sloupců. Použijte `axis=1` k tomu:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "GrBhxu9GgRsE",
|
|
"outputId": "ff4001f3-2e61-4509-d60e-0093d1068437",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>8</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 2\n",
|
|
"0 7\n",
|
|
"1 8\n",
|
|
"2 9"
|
|
]
|
|
},
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4.dropna(axis='columns')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "KWXiKTfMgRsF"
|
|
},
|
|
"source": [
|
|
"Všimněte si, že to může odstranit mnoho dat, která byste mohli chtít zachovat, zejména u menších datových sad. Co když chcete odstranit pouze řádky nebo sloupce, které obsahují několik nebo dokonce všechny nulové hodnoty? Tyto nastavení určíte v `dropna` pomocí parametrů `how` a `thresh`.\n",
|
|
"\n",
|
|
"Ve výchozím nastavení je `how='any'` (pokud si to chcete ověřit sami nebo zjistit, jaké další parametry metoda má, spusťte `example4.dropna?` v buňce kódu). Alternativně můžete specifikovat `how='all`, abyste odstranili pouze řádky nebo sloupce, které obsahují všechny nulové hodnoty. Rozšíříme náš příklad `DataFrame`, abychom to viděli v praxi v následujícím cvičení.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "Bcf_JWTsgRsF",
|
|
"outputId": "72e0b1b8-52fa-4923-98ce-b6fbed6e44b1",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" <th>3</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>7</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2 3\n",
|
|
"0 1.0 NaN 7 NaN\n",
|
|
"1 2.0 5.0 8 NaN\n",
|
|
"2 NaN 6.0 9 NaN"
|
|
]
|
|
},
|
|
"execution_count": 25,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4[3] = np.nan\n",
|
|
"example4"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "pNZer7q9JPNC"
|
|
},
|
|
"source": [
|
|
"> Klíčové poznatky: \n",
|
|
"1. Odstranění nulových hodnot je dobrý nápad pouze tehdy, pokud je datová sada dostatečně velká. \n",
|
|
"2. Celé řádky nebo sloupce lze odstranit, pokud jim chybí většina dat. \n",
|
|
"3. Metoda `DataFrame.dropna(axis=)` pomáhá při odstraňování nulových hodnot. Argument `axis` určuje, zda mají být odstraněny řádky nebo sloupce. \n",
|
|
"4. Lze použít také argument `how`. Ve výchozím nastavení je nastaven na `any`. To znamená, že odstraní pouze ty řádky/sloupce, které obsahují jakékoli nulové hodnoty. Může být nastaven na `all`, což specifikuje, že odstraníme pouze ty řádky/sloupce, kde jsou všechny hodnoty nulové. \n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "oXXSfQFHgRsF"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "ExUwQRxpgRsF",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# How might you go about dropping just column 3?\n",
|
|
"# Hint: remember that you will need to supply both the axis parameter and the how parameter.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "38kwAihWgRsG"
|
|
},
|
|
"source": [
|
|
"Parametr `thresh` vám poskytuje jemnější kontrolu: nastavíte počet *ne-null* hodnot, které řádek nebo sloupec musí mít, aby byl zachován:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 27,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 80
|
|
},
|
|
"id": "M9dCNMaagRsG",
|
|
"outputId": "8093713a-54d2-4e54-c73f-4eea315cb6f2",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" <th>3</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2 3\n",
|
|
"1 2.0 5.0 8 NaN"
|
|
]
|
|
},
|
|
"execution_count": 27,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4.dropna(axis='rows', thresh=3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "fmSFnzZegRsG"
|
|
},
|
|
"source": [
|
|
"Zde byly první a poslední řádek odstraněny, protože obsahují pouze dvě nenulové hodnoty.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "mCcxLGyUgRsG"
|
|
},
|
|
"source": [
|
|
"### Vyplňování nulových hodnot\n",
|
|
"\n",
|
|
"Někdy má smysl doplnit chybějící hodnoty takovými, které by mohly být platné. Existuje několik technik, jak vyplnit nulové hodnoty. První z nich je použití znalostí domény (znalostí o tématu, na kterém je datová sada založena) k přibližnému odhadu chybějících hodnot.\n",
|
|
"\n",
|
|
"Můžete použít `isnull` k provedení této operace přímo, ale to může být zdlouhavé, zejména pokud máte mnoho hodnot k vyplnění. Protože se jedná o tak běžný úkol v datové vědě, pandas poskytuje funkci `fillna`, která vrací kopii `Series` nebo `DataFrame` s chybějícími hodnotami nahrazenými hodnotami podle vašeho výběru. Pojďme vytvořit další příklad `Series`, abychom viděli, jak to funguje v praxi.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "CE8S7louLezV"
|
|
},
|
|
"source": [
|
|
"### Kategorická data (Nenumerická)\n",
|
|
"Nejprve se podívejme na nenumerická data. V datových sadách máme sloupce s kategorickými daty. Například pohlaví, Pravda nebo Nepravda atd.\n",
|
|
"\n",
|
|
"Ve většině těchto případů nahrazujeme chybějící hodnoty `modem` sloupce. Řekněme, že máme 100 datových bodů, z nichž 90 uvedlo Pravda, 8 uvedlo Nepravda a 2 nevyplnili. Poté můžeme tyto 2 nahradit hodnotou Pravda, pokud zohledníme celý sloupec.\n",
|
|
"\n",
|
|
"I zde můžeme využít znalosti z dané oblasti. Podívejme se na příklad vyplnění pomocí modu.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 28,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "MY5faq4yLdpQ",
|
|
"outputId": "19ab472e-1eed-4de8-f8a7-db2a3af3cb1a"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>None</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>False</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>7</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>9</td>\n",
|
|
" <td>10</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1 2 True\n",
|
|
"1 3 4 None\n",
|
|
"2 5 6 False\n",
|
|
"3 7 8 True\n",
|
|
"4 9 10 True"
|
|
]
|
|
},
|
|
"execution_count": 28,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n",
|
|
" [3,4,None],\n",
|
|
" [5,6,\"False\"],\n",
|
|
" [7,8,\"True\"],\n",
|
|
" [9,10,\"True\"]])\n",
|
|
"\n",
|
|
"fill_with_mode"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "MLAoMQOfNPlA"
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 29,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "WKy-9Y2tN5jv",
|
|
"outputId": "8da9fa16-e08c-447e-dea1-d4b1db2feebf"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"True 3\n",
|
|
"False 1\n",
|
|
"Name: 2, dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 29,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_mode[2].value_counts()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "6iNz_zG_OKrx"
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 30,
|
|
"metadata": {
|
|
"id": "TxPKteRvNPOs"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"fill_with_mode[2].fillna('True',inplace=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 31,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "tvas7c9_OPWE",
|
|
"outputId": "ec3c8e44-d644-475e-9e22-c65101965850"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>False</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>7</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>9</td>\n",
|
|
" <td>10</td>\n",
|
|
" <td>True</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1 2 True\n",
|
|
"1 3 4 True\n",
|
|
"2 5 6 False\n",
|
|
"3 7 8 True\n",
|
|
"4 9 10 True"
|
|
]
|
|
},
|
|
"execution_count": 31,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_mode"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "SktitLxxOR16"
|
|
},
|
|
"source": [
|
|
"Jak vidíme, nulová hodnota byla nahrazena. Netřeba dodávat, že jsme mohli napsat cokoli místo `'True'` a bylo by to nahrazeno.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "heYe1I0dOmQ_"
|
|
},
|
|
"source": [
|
|
"### Číselná data\n",
|
|
"Nyní se podíváme na číselná data. Zde máme dva běžné způsoby nahrazování chybějících hodnot:\n",
|
|
"\n",
|
|
"1. Nahrazení mediánem řádku\n",
|
|
"2. Nahrazení průměrem řádku\n",
|
|
"\n",
|
|
"Medián používáme v případě, že data jsou zkreslená a obsahují odlehlé hodnoty. Je to proto, že medián je vůči odlehlým hodnotám odolný.\n",
|
|
"\n",
|
|
"Když jsou data normalizovaná, můžeme použít průměr, protože v takovém případě jsou průměr a medián velmi blízko.\n",
|
|
"\n",
|
|
"Nejprve si vezměme sloupec, který má normální rozdělení, a vyplňme chybějící hodnoty průměrem sloupce.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "09HM_2feOj5Y",
|
|
"outputId": "7e309013-9acb-411c-9b06-4de795bbeeff"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>-2.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-1.0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 -2.0 0 1\n",
|
|
"1 -1.0 2 3\n",
|
|
"2 NaN 4 5\n",
|
|
"3 1.0 6 7\n",
|
|
"4 2.0 8 9"
|
|
]
|
|
},
|
|
"execution_count": 32,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_mean = pd.DataFrame([[-2,0,1],\n",
|
|
" [-1,2,3],\n",
|
|
" [np.nan,4,5],\n",
|
|
" [1,6,7],\n",
|
|
" [2,8,9]])\n",
|
|
"\n",
|
|
"fill_with_mean"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ka7-wNfzSxbx"
|
|
},
|
|
"source": [
|
|
"Průměr sloupce je\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 33,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "XYtYEf5BSxFL",
|
|
"outputId": "68a78d18-f0e5-4a9a-a959-2c3676a57c70"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0.0"
|
|
]
|
|
},
|
|
"execution_count": 33,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"np.mean(fill_with_mean[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "oBSRGxKRS39K"
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 34,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "FzncQLmuS5jh",
|
|
"outputId": "00f74fff-01f4-4024-c261-796f50f01d2e"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>-2.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-1.0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 -2.0 0 1\n",
|
|
"1 -1.0 2 3\n",
|
|
"2 0.0 4 5\n",
|
|
"3 1.0 6 7\n",
|
|
"4 2.0 8 9"
|
|
]
|
|
},
|
|
"execution_count": 34,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_mean[0].fillna(np.mean(fill_with_mean[0]),inplace=True)\n",
|
|
"fill_with_mean"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "CwpVFCrPTC5z"
|
|
},
|
|
"source": [
|
|
"Jak vidíme, chybějící hodnota byla nahrazena jejím průměrem.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "jIvF13a1i00Z"
|
|
},
|
|
"source": [
|
|
"Nyní zkusme další dataframe a tentokrát nahradíme hodnoty None mediánem sloupce.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 35,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "DA59Bqo3jBYZ",
|
|
"outputId": "85dae6ec-7394-4c36-fda0-e04769ec4a32"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>-2</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-1</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>8.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 -2 0.0 1\n",
|
|
"1 -1 2.0 3\n",
|
|
"2 0 NaN 5\n",
|
|
"3 1 6.0 7\n",
|
|
"4 2 8.0 9"
|
|
]
|
|
},
|
|
"execution_count": 35,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_median = pd.DataFrame([[-2,0,1],\n",
|
|
" [-1,2,3],\n",
|
|
" [0,np.nan,5],\n",
|
|
" [1,6,7],\n",
|
|
" [2,8,9]])\n",
|
|
"\n",
|
|
"fill_with_median"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "mM1GpXYmjHnc"
|
|
},
|
|
"source": [
|
|
"Medián druhého sloupce je\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "uiDy5v3xjHHX",
|
|
"outputId": "564b6b74-2004-4486-90d4-b39330a64b88"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"4.0"
|
|
]
|
|
},
|
|
"execution_count": 36,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_median[1].median()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "z9PLF75Jj_1s"
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "lFKbOxCMkBbg",
|
|
"outputId": "a8bd18fb-2765-47d4-e5fe-e965f57ed1f4"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>-2</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-1</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>8.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 -2 0.0 1\n",
|
|
"1 -1 2.0 3\n",
|
|
"2 0 4.0 5\n",
|
|
"3 1 6.0 7\n",
|
|
"4 2 8.0 9"
|
|
]
|
|
},
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"fill_with_median[1].fillna(fill_with_median[1].median(),inplace=True)\n",
|
|
"fill_with_median"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "8JtQ53GSkKWC"
|
|
},
|
|
"source": [
|
|
"Jak vidíme, hodnota NaN byla nahrazena mediánem sloupce\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 38,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "0ybtWLDdgRsG",
|
|
"outputId": "b8c238ef-6024-4ee2-be2b-aa1f0fcac61d",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 1.0\n",
|
|
"b NaN\n",
|
|
"c 2.0\n",
|
|
"d NaN\n",
|
|
"e 3.0\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 38,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n",
|
|
"example5"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "yrsigxRggRsH"
|
|
},
|
|
"source": [
|
|
"Můžete vyplnit všechny prázdné položky jednou hodnotou, například `0`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 39,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "KXMIPsQdgRsH",
|
|
"outputId": "aeedfa0a-a421-4c2f-cb0d-183ce8f0c91d",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 1.0\n",
|
|
"b 0.0\n",
|
|
"c 2.0\n",
|
|
"d 0.0\n",
|
|
"e 3.0\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 39,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example5.fillna(0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "RRlI5f_hkfKe"
|
|
},
|
|
"source": [
|
|
"> Klíčové poznatky: \n",
|
|
"1. Vyplňování chybějících hodnot by mělo být prováděno buď tehdy, když je méně dat, nebo když existuje strategie pro doplnění chybějících dat. \n",
|
|
"2. K doplnění chybějících hodnot lze využít znalosti z dané oblasti tím, že se hodnoty přibližně odhadnou. \n",
|
|
"3. U kategoriálních dat se chybějící hodnoty většinou nahrazují módem sloupce. \n",
|
|
"4. U číselných dat se chybějící hodnoty obvykle doplňují průměrem (pro normalizované datové sady) nebo mediánem sloupců. \n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "FI9MmqFJgRsH"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 40,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "af-ezpXdgRsH",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# What happens if you try to fill null values with a string, like ''?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "kq3hw1kLgRsI"
|
|
},
|
|
"source": [
|
|
"Můžete **doplnit** nulové hodnoty, což znamená použít poslední platnou hodnotu k vyplnění nulové hodnoty:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 41,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "vO3BuNrggRsI",
|
|
"outputId": "e2bc591b-0b48-4e88-ee65-754f2737c196",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 1.0\n",
|
|
"b 1.0\n",
|
|
"c 2.0\n",
|
|
"d 2.0\n",
|
|
"e 3.0\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 41,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example5.fillna(method='ffill')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "nDXeYuHzgRsI"
|
|
},
|
|
"source": [
|
|
"Můžete také **zpětně vyplnit**, abyste zpětně propagovali další platnou hodnotu a vyplnili null:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 42,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "4M5onHcEgRsI",
|
|
"outputId": "8f32b185-40dd-4a9f-bd85-54d6b6a414fe",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 1.0\n",
|
|
"b 2.0\n",
|
|
"c 2.0\n",
|
|
"d 3.0\n",
|
|
"e 3.0\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 42,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example5.fillna(method='bfill')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "MbBzTom5gRsI"
|
|
},
|
|
"source": [
|
|
"Jak můžete tušit, toto funguje stejně s DataFrames, ale můžete také specifikovat `axis`, podél kterého chcete vyplnit nulové hodnoty:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 43,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "aRpIvo4ZgRsI",
|
|
"outputId": "905a980a-a808-4eca-d0ba-224bd7d85955",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" <th>3</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>7</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2 3\n",
|
|
"0 1.0 NaN 7 NaN\n",
|
|
"1 2.0 5.0 8 NaN\n",
|
|
"2 NaN 6.0 9 NaN"
|
|
]
|
|
},
|
|
"execution_count": 43,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 44,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "VM1qtACAgRsI",
|
|
"outputId": "71f2ad28-9b4e-4ff4-f5c3-e731eb489ade",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" <th>3</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>7.0</td>\n",
|
|
" <td>7.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8.0</td>\n",
|
|
" <td>8.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2 3\n",
|
|
"0 1.0 1.0 7.0 7.0\n",
|
|
"1 2.0 5.0 8.0 8.0\n",
|
|
"2 NaN 6.0 9.0 9.0"
|
|
]
|
|
},
|
|
"execution_count": 44,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4.fillna(method='ffill', axis=1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ZeMc-I1EgRsI"
|
|
},
|
|
"source": [
|
|
"Všimněte si, že když není k dispozici předchozí hodnota pro doplnění, zůstává nulová hodnota.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "eeAoOU0RgRsJ"
|
|
},
|
|
"source": [
|
|
"### Cvičení:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 45,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"id": "e8S-CjW8gRsJ",
|
|
"trusted": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# What output does example4.fillna(method='bfill', axis=1) produce?\n",
|
|
"# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?\n",
|
|
"# Can you think of a longer code snippet to write that can fill all of the null values in example4?\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "YHgy0lIrgRsJ"
|
|
},
|
|
"source": [
|
|
"Můžete být kreativní při používání `fillna`. Například se podívejme znovu na `example4`, ale tentokrát vyplňme chybějící hodnoty průměrem všech hodnot v `DataFrame`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 46,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "OtYVErEygRsJ",
|
|
"outputId": "708b1e67-45ca-44bf-a5ee-8b2de09ece73",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" <th>3</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>5.5</td>\n",
|
|
" <td>7</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>1.5</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>9</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2 3\n",
|
|
"0 1.0 5.5 7 NaN\n",
|
|
"1 2.0 5.0 8 NaN\n",
|
|
"2 1.5 6.0 9 NaN"
|
|
]
|
|
},
|
|
"execution_count": 46,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example4.fillna(example4.mean())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "zpMvCkLSgRsJ"
|
|
},
|
|
"source": [
|
|
"Všimněte si, že sloupec 3 je stále bez hodnoty: výchozí směr je vyplňovat hodnoty po řádcích.\n",
|
|
"\n",
|
|
"> **Shrnutí:** Existuje několik způsobů, jak se vypořádat s chybějícími hodnotami ve vašich datových sadách. Konkrétní strategie, kterou použijete (odstranění, nahrazení nebo způsob, jakým je nahradíte), by měla být určena specifiky těchto dat. Čím více budete s datovými sadami pracovat a interagovat, tím lepší smysl pro řešení chybějících hodnot si vyvinete.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "bauDnESIl9FH"
|
|
},
|
|
"source": [
|
|
"### Kódování kategorických dat\n",
|
|
"\n",
|
|
"Modely strojového učení pracují pouze s čísly a jakoukoli formou číselných dat. Nejsou schopny rozlišit mezi Ano a Ne, ale dokážou rozlišit mezi 0 a 1. Proto je po doplnění chybějících hodnot potřeba zakódovat kategorická data do číselné podoby, aby jim model porozuměl.\n",
|
|
"\n",
|
|
"Kódování lze provést dvěma způsoby. Ty si nyní rozebereme.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "uDq9SxB7mu5i"
|
|
},
|
|
"source": [
|
|
"**KÓDOVÁNÍ ŠTÍTKŮ**\n",
|
|
"\n",
|
|
"Kódování štítků spočívá v převodu každé kategorie na číslo. Například, řekněme, že máme dataset cestujících letecké společnosti a je zde sloupec obsahující jejich třídu mezi následujícími ['business class', 'economy class', 'first class']. Pokud by bylo provedeno kódování štítků, bylo by to převedeno na [0,1,2]. Podívejme se na příklad pomocí kódu. Protože se budeme učit `scikit-learn` v nadcházejících poznámkových blocích, zde ho nepoužijeme.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 235
|
|
},
|
|
"id": "1vGz7uZyoWHL",
|
|
"outputId": "9e252855-d193-4103-a54d-028ea7787b34"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ID</th>\n",
|
|
" <th>class</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>10</td>\n",
|
|
" <td>business class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>20</td>\n",
|
|
" <td>first class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>40</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>50</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>60</td>\n",
|
|
" <td>business class</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ID class\n",
|
|
"0 10 business class\n",
|
|
"1 20 first class\n",
|
|
"2 30 economy class\n",
|
|
"3 40 economy class\n",
|
|
"4 50 economy class\n",
|
|
"5 60 business class"
|
|
]
|
|
},
|
|
"execution_count": 47,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"label = pd.DataFrame([\n",
|
|
" [10,'business class'],\n",
|
|
" [20,'first class'],\n",
|
|
" [30, 'economy class'],\n",
|
|
" [40, 'economy class'],\n",
|
|
" [50, 'economy class'],\n",
|
|
" [60, 'business class']\n",
|
|
"],columns=['ID','class'])\n",
|
|
"label"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "IDHnkwTYov-h"
|
|
},
|
|
"source": [
|
|
"Pro provedení kódování štítků na prvním sloupci musíme nejprve popsat mapování každé třídy na číslo, než provedeme nahrazení\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 48,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 235
|
|
},
|
|
"id": "ZC5URJG3o1ES",
|
|
"outputId": "aab0f1e7-e0f3-4c14-8459-9f9168c85437"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ID</th>\n",
|
|
" <th>class</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>10</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>20</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>40</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>50</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>60</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ID class\n",
|
|
"0 10 0\n",
|
|
"1 20 2\n",
|
|
"2 30 1\n",
|
|
"3 40 1\n",
|
|
"4 50 1\n",
|
|
"5 60 0"
|
|
]
|
|
},
|
|
"execution_count": 48,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"class_labels = {'business class':0,'economy class':1,'first class':2}\n",
|
|
"label['class'] = label['class'].replace(class_labels)\n",
|
|
"label"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ftnF-TyapOPt"
|
|
},
|
|
"source": [
|
|
"Jak vidíme, výsledek odpovídá tomu, co jsme očekávali. Kdy tedy použít label encoding? Label encoding se používá v jednom nebo obou z následujících případů:\n",
|
|
"1. Když je počet kategorií velký\n",
|
|
"2. Když jsou kategorie v pořadí.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "eQPAPVwsqWT7"
|
|
},
|
|
"source": [
|
|
"**JEDNOHOTOVÉ KÓDOVÁNÍ**\n",
|
|
"\n",
|
|
"Dalším typem kódování je Jednohotové kódování (One Hot Encoding). Při tomto typu kódování se každá kategorie ve sloupci přidá jako samostatný sloupec a každý datový bod dostane hodnotu 0 nebo 1 podle toho, zda obsahuje danou kategorii. Pokud tedy existuje n různých kategorií, do datového rámce bude přidáno n sloupců.\n",
|
|
"\n",
|
|
"Například vezměme stejný příklad s třídami v letadle. Kategorie byly: ['business class', 'economy class', 'first class']. Pokud tedy provedeme jednohotové kódování, do datové sady budou přidány následující tři sloupce: ['class_business class', 'class_economy class', 'class_first class'].\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 49,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 235
|
|
},
|
|
"id": "ZM0eVh0ArKUL",
|
|
"outputId": "83238a76-b3a5-418d-c0b6-605b02b6891b"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ID</th>\n",
|
|
" <th>class</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>10</td>\n",
|
|
" <td>business class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>20</td>\n",
|
|
" <td>first class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>40</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>50</td>\n",
|
|
" <td>economy class</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>60</td>\n",
|
|
" <td>business class</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ID class\n",
|
|
"0 10 business class\n",
|
|
"1 20 first class\n",
|
|
"2 30 economy class\n",
|
|
"3 40 economy class\n",
|
|
"4 50 economy class\n",
|
|
"5 60 business class"
|
|
]
|
|
},
|
|
"execution_count": 49,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"one_hot = pd.DataFrame([\n",
|
|
" [10,'business class'],\n",
|
|
" [20,'first class'],\n",
|
|
" [30, 'economy class'],\n",
|
|
" [40, 'economy class'],\n",
|
|
" [50, 'economy class'],\n",
|
|
" [60, 'business class']\n",
|
|
"],columns=['ID','class'])\n",
|
|
"one_hot"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "aVnZ7paDrWmb"
|
|
},
|
|
"source": [
|
|
"Proveďme one hot encoding na 1. sloupci\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 50,
|
|
"metadata": {
|
|
"id": "RUPxf7egrYKr"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"one_hot_data = pd.get_dummies(one_hot,columns=['class'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 51,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 235
|
|
},
|
|
"id": "TM37pHsFr4ge",
|
|
"outputId": "7be15f53-79b2-447a-979c-822658339a9e"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ID</th>\n",
|
|
" <th>class_business class</th>\n",
|
|
" <th>class_economy class</th>\n",
|
|
" <th>class_first class</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>10</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>20</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>40</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>50</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>60</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ID class_business class class_economy class class_first class\n",
|
|
"0 10 1 0 0\n",
|
|
"1 20 0 0 1\n",
|
|
"2 30 0 1 0\n",
|
|
"3 40 0 1 0\n",
|
|
"4 50 0 1 0\n",
|
|
"5 60 1 0 0"
|
|
]
|
|
},
|
|
"execution_count": 51,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"one_hot_data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "_zXRLOjXujdA"
|
|
},
|
|
"source": [
|
|
"Každý sloupec s jedním horkým kódováním obsahuje 0 nebo 1, což určuje, zda daná kategorie existuje pro daný datový bod.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "bDnC4NQOu0qr"
|
|
},
|
|
"source": [
|
|
"Kdy používáme one hot encoding? One hot encoding se používá v jednom nebo obou z následujících případů:\n",
|
|
"\n",
|
|
"1. Když je počet kategorií a velikost datového souboru menší.\n",
|
|
"2. Když kategorie nemají žádné konkrétní pořadí.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "XnUmci_4uvyu"
|
|
},
|
|
"source": [
|
|
"> Klíčové poznatky:\n",
|
|
"1. Kódování se provádí za účelem převodu nenumerických dat na numerická data.\n",
|
|
"2. Existují dva typy kódování: Label encoding a One Hot encoding, které lze použít podle požadavků datové sady.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "K8UXOJYRgRsJ"
|
|
},
|
|
"source": [
|
|
"## Odstranění duplicitních dat\n",
|
|
"\n",
|
|
"> **Cíl učení:** Na konci této podsekce byste měli být schopni identifikovat a odstranit duplicitní hodnoty z DataFrames.\n",
|
|
"\n",
|
|
"Kromě chybějících dat se v reálných datových sadách často setkáte s duplicitními daty. Naštěstí pandas nabízí jednoduchý způsob, jak detekovat a odstranit duplicitní záznamy.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "qrEG-Wa0gRsJ"
|
|
},
|
|
"source": [
|
|
"### Identifikace duplicit: `duplicated`\n",
|
|
"\n",
|
|
"Duplicitní hodnoty můžete snadno identifikovat pomocí metody `duplicated` v pandas, která vrací Booleovskou masku označující, zda je záznam v `DataFrame` duplicitou dřívějšího záznamu. Vytvořme další příklad `DataFrame`, abychom si to ukázali v praxi.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 52,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 204
|
|
},
|
|
"id": "ZLu6FEnZgRsJ",
|
|
"outputId": "376512d1-d842-4db1-aea3-71052aeeecaf",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>letters</th>\n",
|
|
" <th>numbers</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" letters numbers\n",
|
|
"0 A 1\n",
|
|
"1 B 2\n",
|
|
"2 A 1\n",
|
|
"3 B 3\n",
|
|
"4 B 3"
|
|
]
|
|
},
|
|
"execution_count": 52,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],\n",
|
|
" 'numbers': [1, 2, 1, 3, 3]})\n",
|
|
"example6"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 53,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "cIduB5oBgRsK",
|
|
"outputId": "3da27b3d-4d69-4e1d-bb52-0af21bae87f2",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 False\n",
|
|
"1 False\n",
|
|
"2 True\n",
|
|
"3 False\n",
|
|
"4 True\n",
|
|
"dtype: bool"
|
|
]
|
|
},
|
|
"execution_count": 53,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example6.duplicated()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "0eDRJD4SgRsK"
|
|
},
|
|
"source": [
|
|
"### Odstranění duplicit: `drop_duplicates`\n",
|
|
"`drop_duplicates` jednoduše vrátí kopii dat, u kterých jsou všechny hodnoty označené jako `duplicated` nastaveny na `False`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 54,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 142
|
|
},
|
|
"id": "w_YPpqIqgRsK",
|
|
"outputId": "ac66bd2f-8671-4744-87f5-8b8d96553dea",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>letters</th>\n",
|
|
" <th>numbers</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" letters numbers\n",
|
|
"0 A 1\n",
|
|
"1 B 2\n",
|
|
"3 B 3"
|
|
]
|
|
},
|
|
"execution_count": 54,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example6.drop_duplicates()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "69AqoCZAgRsK"
|
|
},
|
|
"source": [
|
|
"Oba `duplicated` a `drop_duplicates` standardně zvažují všechny sloupce, ale můžete specifikovat, že mají zkoumat pouze podmnožinu sloupců ve vašem `DataFrame`:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 55,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 111
|
|
},
|
|
"id": "BILjDs67gRsK",
|
|
"outputId": "ef6dcc08-db8b-4352-c44e-5aa9e2bec0d3",
|
|
"trusted": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>letters</th>\n",
|
|
" <th>numbers</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>B</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" letters numbers\n",
|
|
"0 A 1\n",
|
|
"1 B 2"
|
|
]
|
|
},
|
|
"execution_count": 55,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"example6.drop_duplicates(['letters'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "GvX4og1EgRsL"
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n---\n\n**Prohlášení**: \nTento dokument byl přeložen pomocí služby pro automatický překlad [Co-op Translator](https://github.com/Azure/co-op-translator). I když se snažíme o přesnost, mějte prosím na paměti, že automatické překlady mohou obsahovat chyby nebo nepřesnosti. Původní dokument v jeho původním jazyce by měl být považován za autoritativní zdroj. Pro důležité informace se doporučuje profesionální lidský překlad. Neodpovídáme za žádné nedorozumění nebo nesprávné interpretace vyplývající z použití tohoto překladu.\n"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"anaconda-cloud": {},
|
|
"colab": {
|
|
"name": "notebook.ipynb",
|
|
"provenance": []
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.4"
|
|
},
|
|
"coopTranslator": {
|
|
"original_hash": "8533b3a2230311943339963fc7f04c21",
|
|
"translation_date": "2025-09-01T21:26:59+00:00",
|
|
"source_file": "2-Working-With-Data/08-data-preparation/notebook.ipynb",
|
|
"language_code": "cs"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
} |