{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Challenge: Analyzing Text about Data Science\r\n",
"\r\n",
"> *In this notebook, we experiment with a different URL: the Wikipedia article on Machine Learning. You can see that, unlike the Data Science article, this one contains a lot of terms, which makes the analysis more problematic. We need to come up with another way to clean up the data after keyword extraction, to get rid of frequent but not meaningful word combinations.*\r\n",
"\r\n",
"In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code; you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data. \r\n",
"\r\n",
"## Goal\r\n",
"\r\n",
"In this lesson, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\r\n",
"\r\n",
"As the text, we will use a Wikipedia page. The cell below first sets `url` to the Data Science article and then overrides it with the Machine Learning article, which is the one analyzed in this notebook:"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {}
},
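{
"cell_type": "code",
"execution_count": 2,
"source": [
"# URL from the original lesson, kept for reference\r\n",
"url = 'https://en.wikipedia.org/wiki/Data_science'\r\n",
"# This assignment overrides the one above: the challenge analyzes the Machine Learning article\r\n",
"url = 'https://en.wikipedia.org/wiki/Machine_learning'"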
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 1: Getting the Data\r\n",
"\r\n",
"The first step in every data science process is getting the data. We will use the `requests` library to do that:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 3,
"source": [
"import requests\r\n",
"\r\n",
"text = requests.get(url).content.decode('utf-8')\r\n",
"print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<!DOCTYPE html>\n",
"<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n",
"<head>\n",
"<meta charset=\"UTF-8\"/>\n",
"<title>Machine learning - Wikipedia</title>\n",
"<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":!1,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"77162785-16e9-4d7f-a175-7f3fcf502a66\",\"wgCSPNonce\":!1,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"Machine_learning\",\"wgTitle\":\"Machine learning\",\"wgCurRevisionId\":1041247229,\"wgRevisionId\":1041247229,\"wgArticleId\":233488,\"wgIsArticle\":!0,\"wgIsRedirect\":!1,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 errors: missing periodical\",\"Harv and Sfn no-target errors\",\"CS1 maint: uses authors parameter\",\"Articles with short description\",\"Short description is dif\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 2: Transforming the Data\r\n",
"\r\n",
"The next step is to convert the data into a form suitable for processing. In our case, we have downloaded the HTML source code of the page, and we need to convert it into plain text.\r\n",
"\r\n",
"There are many ways this can be done. We will use the simple built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `<script>` and `<style>` tags."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"from html.parser import HTMLParser\r\n",
"\r\n",
"class MyHTMLParser(HTMLParser):\r\n",
"    # True while the parser is inside a script or style element\r\n",
"    script = False\r\n",
"    # Plain text collected so far\r\n",
"    res = \"\"\r\n",
"    def handle_starttag(self, tag, attrs):\r\n",
"        if tag.lower() in [\"script\",\"style\"]:\r\n",
"            self.script = True\r\n",
"    def handle_endtag(self, tag):\r\n",
"        if tag.lower() in [\"script\",\"style\"]:\r\n",
"            self.script = False\r\n",
"    def handle_data(self, data):\r\n",
"        # Skip whitespace-only chunks and anything inside script or style elements\r\n",
"        if str.strip(data)==\"\" or self.script:\r\n",
"            return\r\n",
"        self.res += ' '+data.replace('[ edit ]','')\r\n",
"\r\n",
"parser = MyHTMLParser()\r\n",
"parser.feed(text)\r\n",
"text = parser.res\r\n",
"print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Machine learning - Wikipedia Machine learning From Wikipedia, the free encyclopedia Jump to navigation Jump to search Study of algorithms that improve automatically through experience For the journal, see Machine Learning (journal) . \"Statistical learning\" redirects here. For statistical learning in linguistics, see statistical learning in language acquisition . Part of a series on Artificial intelligence Major goals Artificial general intelligence Planning Computer vision General game playing Knowledge reasoning Machine learning Natural language processing Robotics Approaches Symbolic Deep learning Bayesian networks Evolutionary algorithms Philosophy Ethics Existential risk Turing test Chinese room Control problem Friendly AI History Timeline Progress AI winter Technology Applications Projects Programming languages Glossary Glossary v t e Part of a series on Machine learning and data mining Problems Classification Clustering Regression Anomaly detection Data Cleaning AutoML Associ\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 3: Getting Insights\r\n",
"\r\n",
"The most important step is to turn our data into some form from which we can draw insights. In our case, we want to extract keywords from the text and see which keywords are more meaningful.\r\n",
"\r\n",
"We will use a Python library called [RAKE](https://github.com/aneesha/RAKE) for keyword extraction. First, let's install this library in case it is not present: "
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"import sys\r\n",
"!{sys.executable} -m pip install nlp_rake"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: nlp_rake in c:\\winapp\\miniconda3\\lib\\site-packages (0.0.2)\n",
"Requirement already satisfied: langdetect>=1.0.8 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (1.0.9)\n",
"Requirement already satisfied: pyrsistent>=0.14.2 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (0.17.3)\n",
"Requirement already satisfied: numpy>=1.14.4 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (1.19.5)\n",
"Requirement already satisfied: regex>=2018.6.6 in c:\\winapp\\miniconda3\\lib\\site-packages (from nlp_rake) (2021.8.3)\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"C:\\winapp\\Miniconda3\\lib\\site-packages\\secretstorage\\dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"C:\\winapp\\Miniconda3\\lib\\site-packages\\secretstorage\\util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n",
"WARNING: Ignoring invalid distribution -umpy (c:\\winapp\\miniconda3\\lib\\site-packages)\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: six in c:\\winapp\\miniconda3\\lib\\site-packages (from langdetect>=1.0.8->nlp_rake) (1.16.0)\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"The main functionality is available from the `Rake` object, which we can customize using some parameters. In our case, we will set the minimum length of a keyword to 5 characters, the minimum frequency of a keyword in the document to 3, and the maximum number of words in a keyword to 2. Feel free to play around with other values and observe the result."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
"source": [
"import nlp_rake\r\n",
"extractor = nlp_rake.Rake(max_words=2,min_freq=3,min_chars=5)\r\n",
"res = extractor.apply(text)\r\n",
"res"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('data mining', 4.0),\n",
" ('polynomial time', 4.0),\n",
" ('dimensionality reduction', 4.0),\n",
" ('anomaly detection', 4.0),\n",
" ('data set', 4.0),\n",
" ('bayesian networks', 4.0),\n",
" ('language corpora', 4.0),\n",
" ('mcgraw hill', 4.0),\n",
" ('mit press', 4.0),\n",
" ('retrieved 2018-08-20', 4.0),\n",
" ('artificial neuron', 3.9642857142857144),\n",
" ('statistical learning', 3.9470198675496686),\n",
" ('feature learning', 3.9470198675496686),\n",
" ('reinforcement learning', 3.9470198675496686),\n",
" ('deep learning', 3.9470198675496686),\n",
" ('main article', 3.9411764705882355),\n",
" ('machine learning', 3.9144111718974948),\n",
" ('pattern recognition', 3.9),\n",
" ('neural networks', 3.875),\n",
" ('artificial intelligence', 3.864285714285714),\n",
" ('supervised learning', 3.835908756438558),\n",
" ('speech recognition', 3.833333333333333),\n",
" ('bayesian network', 3.833333333333333),\n",
" ('explicitly programmed', 3.8),\n",
" ('biological brain', 3.8),\n",
" ('unsupervised learning', 3.780353200883002),\n",
" ('outlier detection', 3.75),\n",
" ('ieee transactions', 3.75),\n",
" ('isbn 978-0-262-01243-0', 3.7391304347826084),\n",
" ('training data', 3.7222222222222223),\n",
" ('training set', 3.7222222222222223),\n",
" ('artificial neurons', 3.7142857142857144),\n",
" ('make predictions', 3.666666666666667),\n",
" ('international conference', 3.666666666666667),\n",
" ('computer vision', 3.645833333333333),\n",
" ('mathematical model', 3.642857142857143),\n",
" ('genetic algorithms', 3.6233766233766236),\n",
" ('modern approach', 3.5555555555555554),\n",
" ('knowledge discovery', 3.5128205128205128),\n",
" ('genetic algorithm', 3.5090909090909093),\n",
" ('information theory', 3.4272727272727272),\n",
" ('training examples', 3.3222222222222224),\n",
" ('machine', 1.9673913043478262),\n",
" ('learning', 1.9470198675496688),\n",
" ('computer', 1.8125),\n",
" ('research', 1.7692307692307692),\n",
" ('regression', 1.75),\n",
" ('theory', 1.7272727272727273),\n",
" ('training', 1.7222222222222223),\n",
" ('algorithms', 1.7142857142857142),\n",
" ('related', 1.7),\n",
" ('information', 1.7),\n",
" ('representation', 1.6666666666666667),\n",
" ('discovery', 1.6666666666666667),\n",
" ('predictions', 1.6666666666666667),\n",
" ('model', 1.6428571428571428),\n",
" ('based', 1.625),\n",
" ('methods', 1.6111111111111112),\n",
" ('algorithm', 1.6),\n",
" ('examples', 1.6),\n",
" ('systems', 1.5625),\n",
" ('approach', 1.5555555555555556),\n",
" ('biases', 1.5454545454545454),\n",
" ('classification', 1.5333333333333334),\n",
" ('application', 1.5),\n",
" ('tasks', 1.5),\n",
" ('models', 1.5),\n",
" ('environment', 1.5),\n",
" ('compute', 1.5),\n",
" ('input', 1.4545454545454546),\n",
" ('racist', 1.4285714285714286),\n",
" ('perform', 1.4166666666666667),\n",
" ('journal', 1.4),\n",
" ('decisions', 1.4),\n",
" ('computers', 1.4),\n",
" ('improve', 1.4),\n",
" ('researchers', 1.4),\n",
" ('program', 1.4),\n",
" ('represented', 1.4),\n",
" ('accuracy', 1.4),\n",
" ('typically', 1.4),\n",
" ('system', 1.375),\n",
" ('optimization', 1.375),\n",
" ('field', 1.3529411764705883),\n",
" ('databases', 1.3333333333333333),\n",
" ('terms', 1.3333333333333333),\n",
" ('learned', 1.3333333333333333),\n",
" ('outputs', 1.3333333333333333),\n",
" ('called', 1.3333333333333333),\n",
" ('observations', 1.3333333333333333),\n",
" ('applied', 1.3333333333333333),\n",
" ('target', 1.3333333333333333),\n",
" ('original', 1.3333333333333333),\n",
" ('study', 1.2857142857142858),\n",
" ('trained', 1.2857142857142858),\n",
" ('approaches', 1.25),\n",
" ('complexity', 1.25),\n",
" ('signal', 1.25),\n",
" ('inputs', 1.25),\n",
" ('output', 1.25),\n",
" ('process', 1.25),\n",
" ('people', 1.25),\n",
" ('peter', 1.25),\n",
" ('christopher', 1.25),\n",
" ('problem', 1.2),\n",
" ('performance', 1.2),\n",
" ('difference', 1.2),\n",
" ('michael', 1.2),\n",
" ('predict', 1.2),\n",
" ('norvig', 1.2),\n",
" ('learn', 1.1666666666666667),\n",
" ('class', 1.1666666666666667),\n",
" ('applications', 1.125),\n",
" ('found', 1.125),\n",
" ('features', 1.1111111111111112),\n",
" ('introduction', 1.0909090909090908),\n",
" ('statistics', 1.0714285714285714),\n",
" ('experience', 1.0),\n",
" ('series', 1.0),\n",
" ('order', 1.0),\n",
" ('medicine', 1.0),\n",
" ('respect', 1.0),\n",
" ('question', 1.0),\n",
" ('subfield', 1.0),\n",
" ('leading', 1.0),\n",
" ('actions', 1.0),\n",
" ('maximize', 1.0),\n",
" ('evaluated', 1.0),\n",
" ('instances', 1.0),\n",
" ('generalization', 1.0),\n",
" ('context', 1.0),\n",
" ('hypothesis', 1.0),\n",
" ('addition', 1.0),\n",
" ('types', 1.0),\n",
" ('members', 1.0),\n",
" ('bioinformatics', 1.0),\n",
" ('connection', 1.0),\n",
" ('layers', 1.0),\n",
" ('ethics', 1.0),\n",
" ('david', 1.0),\n",
" ('martin', 1.0),\n",
" ('springer', 1.0),\n",
" ('citeseerx 10', 1.0),\n",
" ('github', 1.0),\n",
" ('andrew', 1.0),\n",
" ('cybernetics', 1.0),\n",
" ('proceedings', 1.0),\n",
" ('arxiv', 1.0),\n",
" ('archived', 1.0),\n",
" ('survey', 1.0),\n",
" ('bibcode', 1.0),\n",
" ('cambridge', 1.0)]"
]
},
"metadata": {},
"execution_count": 6
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"\r\n",
"We obtained a list of terms together with their associated degree of importance. As you can see, relevant concepts such as data mining, dimensionality reduction, and machine learning appear near the top of the list, but so do citation artifacts such as `mcgraw hill`, `mit press`, and `retrieved 2018-08-20`.\r\n",
"\r\n",
"## Step 4: Visualizing the Result\r\n",
"\r\n",
"People can interpret data best in visual form. Thus it often makes sense to visualize the data in order to draw some insights. We can use the `matplotlib` library in Python to plot a simple distribution of the keywords with their relevance:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 9,
"source": [
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"def plot(pair_list):\r\n",
"    # Split the (keyword, score) pairs into two parallel tuples\r\n",
"    k,v = zip(*pair_list)\r\n",
"    plt.bar(range(len(k)),v)\r\n",
"    plt.xticks(range(len(k)),k,rotation='vertical')\r\n",
"    plt.show()\r\n",
"\r\n",
"plot(res[:30])"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"There is, however, an even better way to visualize word frequencies: a **Word Cloud**. We will need to install another library to plot the word cloud from our keyword list."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 71,
"source": [
"!{sys.executable} -m pip install wordcloud"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"The `WordCloud` object takes in either the original text or a pre-computed list of words with their frequencies, and returns an image, which can then be displayed using `matplotlib`:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 12,
"source": [
"from wordcloud import WordCloud\r\n",
"import matplotlib.pyplot as plt\r\n",
"\r\n",
"wc = WordCloud(background_color='white',width=800,height=600)\r\n",
"plt.figure(figsize=(15,7))\r\n",
"plt.imshow(wc.generate_from_frequencies({ k:v for k,v in res }))\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1500x700 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We can also pass in the original text to `WordCloud`. Let's see if we are able to get a similar result:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 13,
"source": [
"plt.figure(figsize=(15,7))\r\n",
"plt.imshow(wc.generate(text))\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1500x700 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 61,
"source": [
"wc.generate(text).to_file('images/ds_wordcloud.png')"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<wordcloud.wordcloud.WordCloud at 0x224b99d76a0>"
]
},
"metadata": {},
"execution_count": 61
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"You can see that the word cloud now looks more impressive, but it also contains a lot of noise (e.g., unrelated words such as `Retrieved on`). Also, we get fewer keywords that consist of two words, such as *data scientist* or *computer science*. This is because the RAKE algorithm does a much better job of selecting good keywords from text. This example illustrates the importance of data pre-processing and cleaning, because a clear picture at the end will allow us to make better decisions.\r\n",
"\r\n",
"In this exercise we have gone through a simple process of extracting some meaning from Wikipedia text, in the form of keywords and a word cloud. This example is quite simple, but it demonstrates all the typical steps a data scientist takes when working with data, from data acquisition to visualization.\r\n",
"\r\n",
"In our course we will discuss all those steps in detail. "
],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python",
"version": "3.8.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.8 64-bit (conda)"
},
"interpreter": {
"hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}