"The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n",
"\r\n",
"There are many ways this can be done. We will use the simplest built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `<script>` and `<style>` tags."
"## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can focus on the main article content from Wikipedia and reduce some navigation menus, sidebars, footers, and other irrelevant content (though some boilerplate text may still remain)."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"First, we need to install the BeautifulSoup library for HTML parsing:"
"from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\ndef clean_wikipedia_content(content_node):\r\n \"\"\"Remove common non-article elements from a Wikipedia content node.\"\"\"\r\n # Strip jump links, navboxes, reference lists/superscripts, edit sections, TOC, sidebars, etc.\r\n selectors = [\r\n '.mw-jump-link',\r\n '.navbox',\r\n '.reflist',\r\n 'sup.reference',\r\n '.mw-editsection',\r\n '.hatnote',\r\n '.metadata',\r\n '.infobox',\r\n '#toc',\r\n '.toc',\r\n '.sidebar',\r\n ]\r\n for selector in selectors:\r\n for el in content_node.select(selector):\r\n el.decompose()\r\n\r\nif content:\r\n # Clean the content node to better approximate article text only.\r\n clean_wikipedia_content(content)\r\n text = content.get_text(separator=' ', strip=True)\r\n print(text[:1000])\r\nelse:\r\n print(\"Could not find main content. Using full page text.\")\r\n text = soup.get_text(separator=' ', strip=True)\r\n print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Data science - Wikipedia Data science From Wikipedia, the free encyclopedia Jump to navigation Jump to search Interdisciplinary field of study focused on deriving knowledge and insights from data Not to be confused with information science . The existence of Comet NEOWISE (here depicted as a series of red dots) was discovered by analyzing astronomical survey data acquired by a space telescope , the Wide-field Infrared Survey Explorer . Part of a series on Machine learning and data mining Problems Classification Clustering Regression Anomaly detection AutoML Association rules Reinforcement learning Structured prediction Feature engineering Feature learning Online learning Semi-supervised learning Unsupervised learning Learning to rank Grammar induction Supervised learning ( classification • regression ) Decision trees Ensembles Bagging Boosting Random forest k -NN Linear regression Naive Bayes Artificial neural networks Logistic regression Perceptron Relevance vector machine \n"
"Data science From Wikipedia, the free encyclopedia Interdisciplinary field of study focused on deriving knowledge and insights from data Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data. Data science also integrates domain knowledge from the underlying application domain. Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.\n"
"The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n",
"\r\n",
"There are many ways this can be done. We will use the simplest build-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `<script>` and `<style>` tags."
"## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can focus on the main article content from Wikipedia and reduce some navigation menus, sidebars, footers, and other irrelevant content (though some boilerplate text may still remain)."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"First, we need to install the BeautifulSoup library for HTML parsing:"
"from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\ndef clean_wikipedia_content(content_node):\r\n \"\"\"Remove common non-article elements from a Wikipedia content node.\"\"\"\r\n # Strip jump links, navboxes, reference lists/superscripts, edit sections, TOC, sidebars, etc.\r\n selectors = [\r\n '.mw-jump-link',\r\n '.navbox',\r\n '.reflist',\r\n 'sup.reference',\r\n '.mw-editsection',\r\n '.hatnote',\r\n '.metadata',\r\n '.infobox',\r\n '#toc',\r\n '.toc',\r\n '.sidebar',\r\n ]\r\n for selector in selectors:\r\n for el in content_node.select(selector):\r\n el.decompose()\r\n\r\nif content:\r\n # Clean the content node to better approximate article text only.\r\n clean_wikipedia_content(content)\r\n text = content.get_text(separator=' ', strip=True)\r\n print(text[:1000])\r\nelse:\r\n print(\"Could not find main content. Using full page text.\")\r\n text = soup.get_text(separator=' ', strip=True)\r\n print(text[:1000])"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" Machine learning - Wikipedia Machine learning From Wikipedia, the free encyclopedia Jump to navigation Jump to search Study of algorithms that improve automatically through experience For the journal, see Machine Learning (journal) . \"Statistical learning\" redirects here. For statistical learning in linguistics, see statistical learning in language acquisition . Part of a series on Artificial intelligence Major goals Artificial general intelligence Planning Computer vision General game playing Knowledge reasoning Machine learning Natural language processing Robotics Approaches Symbolic Deep learning Bayesian networks Evolutionary algorithms Philosophy Ethics Existential risk Turing test Chinese room Control problem Friendly AI History Timeline Progress AI winter Technology Applications Projects Programming languages Glossary Glossary v t e Part of a series on Machine learning and data mining Problems Classification Clustering Regression Anomaly detection Data Cleaning AutoML Associ\n"
"Machine learning From Wikipedia, the free encyclopedia Study of algorithms that improve automatically through experience Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance. ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.\n"