Update 1-Introduction/01-defining-data-science/solution/notebook.ipynb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
copilot/fix-relevant-content-extraction
Lee Stott 1 week ago committed by GitHub
parent 0fcfd8b323
commit 0634419bce
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -69,7 +69,7 @@
{
"cell_type": "markdown",
"source": [
"## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can extract only the main article content from Wikipedia, avoiding navigation menus, sidebars, footers, and other irrelevant content."
"## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can focus on the main article content from Wikipedia and reduce some navigation menus, sidebars, footers, and other irrelevant content (though some boilerplate text may still remain)."
],
"metadata": {}
},

Loading…
Cancel
Save