Apply reviewer feedback: add content cleaning function and update documentation

Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
6 months ago · c8299dabb2
parent 3a34115701
2 changed files with 2 additions and 2 deletions
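
For review purposes, the cleaning step added to the solution notebook can be tried outside the notebook. The following is a minimal sketch, not the notebook cell itself: `SAMPLE_HTML` is hypothetical markup that only imitates a Wikipedia article body, `clean_sample` is an illustrative helper name, and the selector list is a subset of the one in the commit. It assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating a Wikipedia article body (not real page data).
SAMPLE_HTML = """
<div class="mw-parser-output">
  <div class="hatnote">For other uses, see Data (disambiguation).</div>
  <p>Data science is an interdisciplinary field.<sup class="reference">[1]</sup></p>
  <table class="infobox"><tr><td>Infobox text</td></tr></table>
  <h2>Foundations <span class="mw-editsection">[edit]</span></h2>
  <p>It uses statistics and computing.</p>
  <div class="navbox">Navigation links</div>
</div>
"""

# Subset of the CSS selectors used by the commit's cleaning function.
SELECTORS = ['.hatnote', 'sup.reference', '.infobox', '.mw-editsection', '.navbox']


def clean_sample(html):
    """Parse html, drop non-article elements, and return the remaining text."""
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('div', class_='mw-parser-output')
    if content is None:
        # Same fallback idea as the notebook: use the full page text.
        return soup.get_text(separator=' ', strip=True)
    for selector in SELECTORS:
        for el in content.select(selector):
            # decompose() removes the element and its children from the tree.
            el.decompose()
    return content.get_text(separator=' ', strip=True)


cleaned = clean_sample(SAMPLE_HTML)
print(cleaned)
```

Because the elements are removed from the parsed tree before `get_text` is called, the reference marker, edit link, infobox, and navbox text should no longer appear in the result, while the paragraph text remains. Real Wikipedia pages may use classes not covered here, so some boilerplate can still get through.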
--- a/1-Introduction/01-defining-data-science/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/notebook.ipynb
@@ -66,7 +66,7 @@
  {
   "cell_type": "markdown",
   "source": [
-    "## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can extract only the main article content from Wikipedia, avoiding navigation menus, sidebars, footers, and other irrelevant content."
+    "## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can focus on the main article content from Wikipedia and reduce some navigation menus, sidebars, footers, and other irrelevant content (though some boilerplate text may still remain)."
   ],
   "metadata": {}
  },
--- a/1-Introduction/01-defining-data-science/solution/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/solution/notebook.ipynb
@@ -94,7 +94,7 @@
   "cell_type": "code",
   "execution_count": 4,
   "source": [
-    "from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\nif content:\r\n    # Get text from the content, excluding navigation, references, etc.\r\n    text = content.get_text(separator=' ', strip=True)\r\n    print(text[:1000])\r\nelse:\r\n    print(\"Could not find main content. Using full page text.\")\r\n    text = soup.get_text(separator=' ', strip=True)\r\n    print(text[:1000])"
+    "from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\ndef _clean_wikipedia_content(content_node):\r\n    \"\"\"Remove common non-article elements from a Wikipedia content node.\"\"\"\r\n    # Strip jump links, navboxes, reference lists/superscripts, edit sections, TOC, sidebars, etc.\r\n    selectors = [\r\n        '.mw-jump-link',\r\n        '.navbox',\r\n        '.reflist',\r\n        'sup.reference',\r\n        '.mw-editsection',\r\n        '.hatnote',\r\n        '.metadata',\r\n        '.infobox',\r\n        '#toc',\r\n        '.toc',\r\n        '.sidebar',\r\n    ]\r\n    for selector in selectors:\r\n        for el in content_node.select(selector):\r\n            el.decompose()\r\n\r\nif content:\r\n    # Clean the content node to better approximate article text only.\r\n    _clean_wikipedia_content(content)\r\n    text = content.get_text(separator=' ', strip=True)\r\n    print(text[:1000])\r\nelse:\r\n    print(\"Could not find main content. Using full page text.\")\r\n    text = soup.get_text(separator=' ', strip=True)\r\n    print(text[:1000])"
   ],
   "outputs": [
    {