From 0fcfd8b3232c7c88702b41338edeb93ed519a781 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 16 Jan 2026 08:49:20 +0000
Subject: [PATCH] Replace HTMLParser with BeautifulSoup for extracting only
 relevant Wikipedia content

Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
---
 .../01-defining-data-science/notebook.ipynb   | 46 ++++++++-----------
 .../solution/notebook.ipynb                   | 44 ++++++++----------
 2 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/notebook.ipynb b/1-Introduction/01-defining-data-science/notebook.ipynb
index cf3988e8..4648caf0 100644
--- a/1-Introduction/01-defining-data-science/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/notebook.ipynb
@@ -66,38 +66,32 @@
   {
    "cell_type": "markdown",
    "source": [
-    "## Step 2: Transforming the Data\r\n",
-    "\r\n",
-    "The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n",
-    "\r\n",
-    "There are many ways this can be done. We will use the simplest built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `
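
The added ("+") lines of the diff are truncated above, so the replacement code is not visible here. Based on the patch subject, a minimal sketch of what a BeautifulSoup-based extraction could look like follows; the `get_wikipedia_text` helper name and the `mw-content-text` container selector are illustrative assumptions, not the patch's verbatim code:

```python
import requests
from bs4 import BeautifulSoup

def get_wikipedia_text(url: str) -> str:
    """Download a Wikipedia page and return only the article body text.

    Sketch under stated assumptions: the patch's actual helper name and
    selectors are not shown in the truncated diff.
    """
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents are not article text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Wikipedia renders the article body inside div#mw-content-text;
    # fall back to the whole page if that container is missing.
    content = soup.find("div", {"id": "mw-content-text"}) or soup
    return content.get_text(separator=" ", strip=True)

text = get_wikipedia_text("https://en.wikipedia.org/wiki/Data_science")
print(text[:200])
```

Compared with the subclassed `HTMLParser` approach the patch removes, BeautifulSoup lets the notebook target a specific container rather than collecting all text on the page, which is presumably what "extracting only relevant Wikipedia content" in the subject refers to.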