{ "cells": [ { "cell_type": "markdown", "source": [ "# Challenge: Analyzing Text about Data Science\n", "\n", "> *In this notebook, we experiment with a different URL: the Wikipedia article on Machine Learning. Unlike the Data Science article, this one contains many more terms, which makes the analysis harder. After keyword extraction, we need another way to clean up the data so that frequent but insignificant word combinations are eliminated.*\n", "\n", "In this example, let's do a simple exercise that covers all the steps of a traditional data science process. You don't need to write any code; you can simply click on the cells below to execute them and observe the results. As a challenge, you are encouraged to try this code with different data.\n", "\n", "## Goal\n", "\n", "In this lesson, we have been discussing various concepts related to Data Science. Let's try to uncover more related concepts by performing **text mining**. We will start with a text about Data Science, extract keywords from it, and then attempt to visualize the results.\n", "\n", "The original lesson uses the Wikipedia page on Data Science; in this challenge, the second assignment below overrides it with the Machine Learning page:\n" ], "metadata": {} }, { "cell_type": "markdown", "source": [], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "# Original lesson URL; the assignment below overrides it for this challenge\r\n", "url = 'https://en.wikipedia.org/wiki/Data_science'\r\n", "url = 'https://en.wikipedia.org/wiki/Machine_learning'" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Step 1: Obtaining the Data\n", "\n", "The first step in any data science process is acquiring the data. We'll use the `requests` library to accomplish this:\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "import requests\r\n", "\r\n", "# Download the page and decode the raw response bytes as UTF-8\r\n", "text = requests.get(url).content.decode('utf-8')\r\n", "# Preview the first 1000 characters\r\n", "print(text[:1000])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "\n", "\n", "Machine learning - Wikipedia\n", "