{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Challenge: Analyzing Text about Data Science\r\n",
"\r\n",
"In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code; you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data.\r\n",
"\r\n",
"## Goal\r\n",
"\r\n",
"In this lesson, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\r\n",
"\r\n",
"As the source text, we will use the Wikipedia page on Data Science:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 62,
"source": [
"url = 'https://en.wikipedia.org/wiki/Data_science'"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 1: Getting the Data\r\n",
"\r\n",
"The first step in every data science process is getting the data. We will use the `requests` library to do that:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 63,
"source": [
"import requests\r\n",
"\r\n",
"text = requests.get(url).content.decode('utf-8')\r\n",
"print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<!DOCTYPE html>\n",
"<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n",
"<head>\n",
"<meta charset=\"UTF-8\"/>\n",
"<title>Data science - Wikipedia</title>\n",
"<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":!1,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"1a104647-90de-485a-b88a-1406e889a5d1\",\"wgCSPNonce\":!1,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"Data_science\",\"wgTitle\":\"Data science\",\"wgCurRevisionId\":1038046078,\"wgRevisionId\":1038046078,\"wgArticleId\":35458904,\"wgIsArticle\":!0,\"wgIsRedirect\":!1,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 maint: others\",\"Articles with short description\",\"Short description matches Wikidata\",\"Use dmy dates from December 2012\",\"Information science\",\"Computer occupations\"\n"
]
}
],
"metadata": {}
},
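{
"cell_type": "markdown",
"source": [
"The one-liner above assumes that the request succeeds. As a minimal sketch (not part of the original notebook), the download could be wrapped in a small helper that fails loudly on HTTP errors instead of silently returning an error page. The helper name `fetch_html`, the variable `html`, and the 10-second timeout are assumptions chosen for illustration:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import requests\r\n",
"\r\n",
"# Hypothetical helper for illustration; its name and the timeout are assumptions.\r\n",
"def fetch_html(page_url, timeout=10):\r\n",
"    response = requests.get(page_url, timeout=timeout)\r\n",
"    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses\r\n",
"    return response.text  # body decoded with the encoding that requests infers\r\n",
"\r\n",
"html = fetch_html('https://en.wikipedia.org/wiki/Data_science')\r\n",
"print(len(html))"
],
"outputs": [],
"metadata": {}
},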
{
"cell_type": "markdown",
"source": [
"## Step 2: Transforming the Data\r\n",
"\r\n",
"The next step is to convert the data into a form suitable for processing. In our case, we have downloaded the HTML source code of the page, and we need to convert it into plain text.\r\n",
"\r\n",
"There are many ways this can be done. We will use the simplest option, the built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) class from Python. We need to subclass `HTMLParser` and define the code that will collect all text inside HTML tags, except for the contents of `<script>` and `<style>` tags."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 64,
"source": [
"from html.parser import HTMLParser\r\n",
"\r\n",
"class MyHTMLParser(HTMLParser):\r\n",
" script = False\r\n",
" res = \"\"\r\n",
" def handle_starttag(self, tag, attrs):\r\n",
" if tag.lower() in [\"script\",\"style\"]:\r\n",
" self.script = True\r\n",
" def handle_endtag(self, tag):\r\n",
" if tag.lower() in [\"script\",\"style\"]:\r\n",
" self.script = False\r\n",
" def handle_data(self, data):\r\n",
" if str.strip(data)==\"\" or self.script:\r\n",
" return\r\n",
" self.res += ' '+data.replace('[ edit ]','')\r\n",
"\r\n",
"parser = MyHTMLParser()\r\n",
"parser.feed(text)\r\n",
"text = parser.res\r\n",
"print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Data science - Wikipedia Data science From Wikipedia, the free encyclopedia Jump to navigation Jump to search Interdisciplinary field of study focused on deriving knowledge and insights from data Not to be confused with information science . The existence of Comet NEOWISE (here depicted as a series of red dots) was discovered by analyzing astronomical survey data acquired by a space telescope , the Wide-field Infrared Survey Explorer . Part of a series on Machine learning and data mining Problems Classification Clustering Regression Anomaly detection AutoML Association rules Reinforcement learning Structured prediction Feature engineering Feature learning Online learning Semi-supervised learning Unsupervised learning Learning to rank Grammar induction Supervised learning ( classification  • regression ) Decision trees Ensembles Bagging Boosting Random forest k -NN Linear regression Naive Bayes Artificial neural networks Logistic regression Perceptron Relevance vector machine \n"
]
}
],
"metadata": {}
},
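{
"cell_type": "markdown",
"source": [
"To see what the parser keeps and what it drops, it can help to feed it a tiny hand-written document first. The HTML snippet below is made up for illustration, and the names `sample` and `demo` are assumptions; the cell reuses the `MyHTMLParser` class defined above:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Made-up HTML snippet for illustration; it is not taken from the Wikipedia page.\r\n",
"sample = '<html><head><style>p {color: red}</style></head><body><p>Hello <b>data</b> science</p><script>var x = 1;</script></body></html>'\r\n",
"\r\n",
"demo = MyHTMLParser()  # a fresh parser, so the text extracted from the real page is untouched\r\n",
"demo.feed(sample)\r\n",
"print(demo.res)  # the contents of <style> and <script> should not appear"
],
"outputs": [],
"metadata": {}
},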
{
"cell_type": "markdown",
"source": [
"## Step 3: Getting Insights\r\n",
"\r\n",
"The most important step is to turn our data into some form from which we can draw insights. In our case, we want to extract keywords from the text and see which keywords are more meaningful.\r\n",
"\r\n",
"We will use a Python library called [RAKE](https://github.com/aneesha/RAKE) for keyword extraction. First, let's install this library in case it is not present:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 65,
"source": [
"import sys\r\n",
"!{sys.executable} -m pip install nlp_rake"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: nlp_rake in c:\\winapp\\miniconda3\\lib\\site-packages (0.0.2)\n",
"Requirement already satisfied: numpy>=1.14.4 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (1.19.5)\n",
"Requirement already satisfied: pyrsistent>=0.14.2 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (0.17.3)\n",
"Requirement already satisfied: regex>=2018.6.6 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (2021.8.3)\n",
"Requirement already satisfied: langdetect>=1.0.8 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (1.0.9)\n",
"Requirement already satisfied: six in c:\\winapp\\miniconda3\\lib\\site-packages (from langdetect>=1.0.8->nlp_rake) (1.16.0)\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"C:\\winapp\\Miniconda3\\lib\\site-packages\\secretstorage\\dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"C:\\winapp\\Miniconda3\\lib\\site-packages\\secretstorage\\util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"The main functionality is available from the `Rake` object, which we can customize using some parameters. In our case, we will set the minimum length of a keyword to 5 characters, the minimum frequency of a keyword in the document to 3, and the maximum number of words in a keyword to 2. Feel free to play around with other values and observe the result."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 66,
"source": [
"import nlp_rake\r\n",
"extractor = nlp_rake.Rake(max_words=2,min_freq=3,min_chars=5)\r\n",
"res = extractor.apply(text)\r\n",
"res"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('machine learning', 4.0),\n",
" ('big data', 4.0),\n",
" ('data scientist', 4.0),\n",
" ('21st century', 4.0),\n",
" ('data science', 3.909090909090909),\n",
" ('computer science', 3.909090909090909),\n",
" ('information science', 3.797979797979798),\n",
" ('data analysis', 3.666666666666667),\n",
" ('application domains', 3.6),\n",
" ('science', 1.9090909090909092),\n",
" ('field', 1.25),\n",
" ('statistics', 1.2272727272727273),\n",
" ('classification', 1.2),\n",
" ('techniques', 1.1666666666666667),\n",
" ('datasets', 1.0),\n",
" ('education', 1.0),\n",
" ('archived', 1.0),\n",
" ('original', 1.0),\n",
" ('chikio', 1.0),\n",
" ('forbes', 1.0)]"
]
},
"metadata": {},
"execution_count": 66
}
],
"metadata": {}
},
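{
"cell_type": "markdown",
"source": [
"The result is a plain list of `(keyword, score)` tuples, so it can be post-processed with ordinary Python. As a small sketch (the variable name `phrases` is an assumption for illustration), this keeps only the multi-word keywords:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Illustration only: keep keywords that consist of more than one word.\r\n",
"# `res` is the list of (keyword, score) tuples produced above.\r\n",
"phrases = [(kw, score) for kw, score in res if ' ' in kw]\r\n",
"phrases"
],
"outputs": [],
"metadata": {}
},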
{
"cell_type": "markdown",
"source": [
"\r\n",
"We obtained a list of terms together with their associated degree of importance. As you can see, the most relevant disciplines, such as machine learning and big data, are present in the list at the top positions.\r\n",
"\r\n",
"## Step 4: Visualizing the Result\r\n",
"\r\n",
"People can interpret data best in visual form. Thus it often makes sense to visualize the data in order to draw some insights. We can use the `matplotlib` library in Python to plot a simple distribution of the keywords with their relevance:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 67,
"source": [
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"def plot(pair_list):\r\n",
" k,v = zip(*pair_list)\r\n",
" plt.bar(range(len(k)),v)\r\n",
" plt.xticks(range(len(k)),k,rotation='vertical')\r\n",
" plt.show()\r\n",
"\r\n",
"plot(res)"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"metadata": {}
},
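{
"cell_type": "markdown",
"source": [
"Vertical tick labels can be hard to read when there are many keywords. One possible alternative, shown here only as a sketch, is a horizontal bar chart. The function name `plot_horizontal` is an assumption for illustration:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"# Hypothetical variant of plot() above: same data, horizontal bars.\r\n",
"def plot_horizontal(pair_list):\r\n",
"    k, v = zip(*pair_list)\r\n",
"    plt.barh(range(len(k)), v)\r\n",
"    plt.yticks(range(len(k)), k)\r\n",
"    plt.gca().invert_yaxis()  # put the first (highest-scoring) keyword at the top\r\n",
"    plt.show()\r\n",
"\r\n",
"plot_horizontal(res)"
],
"outputs": [],
"metadata": {}
},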
{
"cell_type": "markdown",
"source": [
"There is, however, an even better way to visualize word frequencies: a **Word Cloud**. We will need to install another library to plot the word cloud from our keyword list."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 71,
"source": [
"!{sys.executable} -m pip install wordcloud"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"The `WordCloud` object is responsible for taking in either the original text or a pre-computed list of words with their frequencies, and it returns an image, which can then be displayed using `matplotlib`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 69,
"source": [
"from wordcloud import WordCloud\r\n",
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"wc = WordCloud(background_color='white',width=800,height=600)\r\n",
"plt.figure(figsize=(15,7))\r\n",
"plt.imshow(wc.generate_from_frequencies({ k:v for k,v in res }))"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<matplotlib.image.AxesImage at 0x224b8677400>"
]
},
"metadata": {},
"execution_count": 69
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x504 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We can also pass the original text to `WordCloud`. Let's see if we are able to get a similar result:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 70,
"source": [
"plt.figure(figsize=(15,7))\r\n",
"plt.imshow(wc.generate(text))"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<matplotlib.image.AxesImage at 0x224b9e5fbb0>"
]
},
"metadata": {},
"execution_count": 70
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x504 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 61,
"source": [
"wc.generate(text).to_file('images/ds_wordcloud.png')"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<wordcloud.wordcloud.WordCloud at 0x224b99d76a0>"
]
},
"metadata": {},
"execution_count": 61
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"You can see that the word cloud now looks more impressive, but it also contains a lot of noise (e.g. unrelated words such as `Retrieved on`). We also get fewer keywords that consist of two words, such as *data scientist* or *computer science*. This is because the RAKE algorithm does a much better job of selecting good keywords from the text. This example illustrates the importance of data pre-processing and cleaning, because a clear picture at the end will allow us to make better decisions.\r\n",
"\r\n",
"In this exercise we have gone through a simple process of extracting some meaning from Wikipedia text, in the form of keywords and a word cloud. This example is quite simple, but it demonstrates all the typical steps a data scientist takes when working with data, from data acquisition to visualization.\r\n",
"\r\n",
"In our course we will discuss all those steps in detail. "
],
"metadata": {}
},
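{
"cell_type": "markdown",
"source": [
"As an optional follow-up to the noise mentioned above, one possible cleaning step is to give `WordCloud` a list of words to ignore. The sketch below is not part of the original exercise: the names `extra_stopwords` and `wc_clean` and the particular words chosen are assumptions for illustration, and the cell reuses the `text` variable from the earlier steps:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"from wordcloud import WordCloud, STOPWORDS\r\n",
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"# Illustrative choice of noisy words; adjust after inspecting your own cloud.\r\n",
"extra_stopwords = {'retrieved', 'archived', 'original', 'edit'}\r\n",
"\r\n",
"# STOPWORDS is the default English stop word set shipped with wordcloud.\r\n",
"wc_clean = WordCloud(background_color='white', width=800, height=600,\r\n",
"                     stopwords=STOPWORDS | extra_stopwords)\r\n",
"plt.figure(figsize=(15,7))\r\n",
"plt.imshow(wc_clean.generate(text))\r\n",
"plt.axis('off')  # hide the pixel axes around the image\r\n",
"plt.show()"
],
"outputs": [],
"metadata": {}
},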
{
"cell_type": "markdown",
"source": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python",
"version": "3.8.11",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.11 64-bit ('base': conda)"
},
"interpreter": {
"hash": "c28e7b6bf4e5b397b8288a85bf0a94ea8d3585ce2b01919feb195678ec71581b"
}
},
"nbformat": 4,
"nbformat_minor": 2
}