diff --git a/2-Working-With-Data/08-data-preparation/notebook.ipynb b/2-Working-With-Data/08-data-preparation/notebook.ipynb index ac9bab8..3e8ae01 100644 --- a/2-Working-With-Data/08-data-preparation/notebook.ipynb +++ b/2-Working-With-Data/08-data-preparation/notebook.ipynb @@ -1614,6 +1614,300 @@ "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." ] }, + { + "cell_type": "markdown", + "metadata": { + "id": "CE8S7louLezV" + }, + "source": [ + "First, let us consider non-numeric data. Datasets often contain columns with categorical data, e.g. gender, or True/False values.\n", + "\n", + "In most of these cases, we replace missing values with the `mode` of the column, that is, its most frequent value. Say we have 100 data points: 90 say True, 8 say False and 2 are missing. Then we can fill the 2 missing values with True, the most common value in the column.\n", + "\n", + "Again, we can use domain knowledge here. Let us consider an example of filling with the mode." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MY5faq4yLdpQ", + "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", + " [3,4,None],\n", + " [5,6,\"False\"],\n", + " [7,8,\"True\"],\n", + " [9,10,\"True\"]])\n", + "\n", + "fill_with_mode" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1 2 True\n", + "1 3 4 None\n", + "2 5 6 False\n", + "3 7 8 True\n", + "4 9 10 True" + ] + }, + "metadata": {}, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MLAoMQOfNPlA" + }, + "source": [ + "Now, lets first find the mode before filling the `None` value with the mode." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "WKy-9Y2tN5jv", + "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "fill_with_mode[2].value_counts()" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "True 3\n", + "False 1\n", + "Name: 2, dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 29 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6iNz_zG_OKrx" + }, + "source": [ + "So, we will replace None with True" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "TxPKteRvNPOs" + }, + "source": [ + "fill_with_mode[2].fillna('True',inplace=True)" + ], + "execution_count": 30, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tvas7c9_OPWE", + "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "fill_with_mode" + ], + "execution_count": 31, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 0 1 2\n", + "0 1 2 True\n", + "1 3 4 True\n", + "2 5 6 False\n", + "3 7 8 True\n", + "4 9 10 True" + ] + }, + "metadata": {}, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SktitLxxOR16" + }, + "source": [ + "As we can see, the null value has been replaced. Needless to say, we could have written anything in place or `'True'` and it would have got substituted." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "heYe1I0dOmQ_" + }, + "source": [ + "Now, coming to numeric data. Here, we have a two common ways of replacing missing values:\n", + "\n", + "1. Replace with Median of the row\n", + "2. Replace with Mean of the row \n", + "\n", + "We replace with Median, in case of skewed data with outliers. This is beacuse median is robust to outliers.\n", + "\n", + "When the data is normalized, we can use mean, as in that case, mean and median would be pretty close." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "09HM_2feOj5Y" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + }, { "cell_type": "code", "metadata": {