Merge branch 'main' into classifiation-intro

pull/41/head
softchris 4 years ago
commit f43f8dcb57

@ -4,56 +4,70 @@
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)
### Introduction
## Introduction
In this final lesson on Regression, one of the basic 'classic' ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this final lesson on Regression, one of the basic _classic_ ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this lesson, you will learn:
- A new library for data visualization
- Techniques for logistic regression
Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
## Prerequisite
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: Color. Let's build a logistic regression model to predict that, given some variables, what color a given pumpkin is likely to be (orange 🎃 or white 👻).
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.
Let's build a logistic regression model to predict that, given some variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).
> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. Learn about other ways to classify data in the next lesson group.
## Define the question
For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!
## About logistic regression
Logistic regression differs from linear regression, which you learned about previously, in a few important ways.
### Binary classification
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, how much its price will rise.
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, _how much its price will rise_.
![Pumpkin classification Model](./images/pumpkin-classifier.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### Other classifications
There are other types of logistic regression, including multinomial and ordinal. Multinomial involves having more than one categories - "Orange, White, and Striped". Ordinal involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
There are other types of logistic regression, including multinomial and ordinal:
- **Multinomial**, which involves having more than one category - "Orange, White, and Striped".
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).
![Multinomial vs ordinal regression](./images/multinomial-ordinal.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### It's still linear
Even though this type of Regression is all about category predictions, it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
### Variables DO NOT have to correlate
Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.
### You need a lot of clean data
Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.
✅ Think about the types of data that would lend themselves well to logistic regression
## Tidy the data
## Exercise - tidy the data
First, clean the data a bit, dropping null values and selecting only some of the columns:
1. Add the following code:
```python
from sklearn.preprocessing import LabelEncoder
@ -71,9 +85,14 @@ You can always take a peek at your new dataframe:
```python
new_pumpkins.info
```
### Visualization
By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib which we used earlier. Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by side grid.
### Visualization - side-by-side grid
By now you have loaded up the [starter notebook](./notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib which we used earlier.
Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for each point in a side-by-side grid.
1. Create such a grid by instantiating a `PairGrid`, using our pumpkin data `new_pumpkins`, followed by calling `map()`:
```python
import seaborn as sns
@ -88,7 +107,13 @@ By observing data side-by-side, you can see how the Color data relates to the ot
✅ Given this scatterplot grid, what are some interesting explorations you can envision?
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables. You can visualize variables side-by-side with Seaborn plots. Try a 'swarm' plot to show the distribution of values:
### Use a swarm plot
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables.
You can visualize variables side-by-side with Seaborn plots.
1. Try a 'swarm' plot to show the distribution of values:
```python
sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
@ -96,12 +121,17 @@ sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
![A swarm of visualized data](images/swarm.png)
### Violin plot
A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
1. As parameters `x=Color`, `kind="violin"` and call `catplot()`:
```python
sns.catplot(x="Color", y="Item Size",
kind="violin", data=new_pumpkins)
```
![a violin type chart](images/violin.png)
✅ Try creating this plot, and other Seaborn plots, using other variables.
@ -120,7 +150,7 @@ Now that we have an idea of the relationship between the binary categories of co
Building a model to find these binary classification is surprisingly straightforward in Scikit-learn.
Select the variables you want to use in your classification model and split the training and test sets:
1. Select the variables you want to use in your classification model and split the training and test sets calling `train_test_split()`:
```python
from sklearn.model_selection import train_test_split
@ -134,7 +164,7 @@ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_
```
Now you can train your model and print out its result:
1. Now you can train your model, by calling `fit()` with your training data, and print out its result:
```python
from sklearn.model_selection import train_test_split
@ -152,7 +182,7 @@ print('Accuracy: ', accuracy_score(y_test, predictions))
Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
```
```output
precision recall f1-score support
0 0.85 0.95 0.90 166
@ -176,6 +206,8 @@ While you can get a scoreboard report [terms](https://scikit-learn.org/stable/mo
> 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.
1. To use a confusion metrics, call `confusin_matrix()`:
```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
@ -183,7 +215,7 @@ confusion_matrix(y_test, predictions)
Take a look at your model's confusion matrix:
```
```output
array([[162, 4],
[ 33, 0]])
```

@ -1,19 +1,39 @@
# Build a Web App to use a ML Model
In this lesson, you will train a ML model on a dataset that's out of this world: UFO sightings over the past century, sourced from [NUFORC's database](https://www.nuforc.org). We will continue our use of notebooks to clean data and train our model, but you can take the process one step further by exploring using a model 'in the wild', so to speak: in a web app. To do this, you need to build a web app using Flask.
In this lesson, you will train an ML model on a data set that's out of this world: _UFO sightings over the past century_, sourced from [NUFORC's database](https://www.nuforc.org).
You will learn:
- How to 'pickle' a trained model
- How to use that model in a Flask app
We will continue our use of notebooks to clean data and train our model, but you can take the process one step further by exploring using a model 'in the wild', so to speak: in a web app.
To do this, you need to build a web app using Flask.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/17/)
There are several ways to build web apps to consume machine learning models. Your web architecture may influence the way your model is trained. Imagine that you are working in a business where the data science group has trained a model that they want you to use in an app. There are many questions you need to ask: Is it a web app, or a mobile app? Where will the model reside, in the cloud or locally? Does the app have to work offline? And what technology was used to train the model, because that may influence the tooling you need to use?
## Building an app
There are several ways to build web apps to consume machine learning models. Your web architecture may influence the way your model is trained. Imagine that you are working in a business where the data science group has trained a model that they want you to use in an app.
If you are training a model using TensorFlow, for example, that ecosystem provides the ability to convert a TensorFlow model for use in a web app by using [TensorFlow.js](https://www.tensorflow.org/js/). If you are building a mobile app or need to use the model in an IoT context, you could use [TensorFlow Lite](https://www.tensorflow.org/lite/) and use the model in an Android or iOS app.
### Considerations
If you are building a model using a library such as [PyTorch](https://pytorch.org/), you have the option to export it in [ONNX](https://onnx.ai/) (Open Neural Network Exchange) format for use in JavaScript web apps that can use the [Onnx Runtime](https://www.onnxruntime.ai/). This option will be explored in a future lesson for a Scikit-learn-trained model.
There are many questions you need to ask:
If you are using an ML SaaS (Software as a Service) system such as [Lobe.ai](https://lobe.ai/) or [Azure Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-15963-cxa) to train a model, this type of software provides ways to export the model for many platforms, including building a bespoke API to be queried in the cloud by your online application.
- **Is it a web app or a mobile app?** If you are building a mobile app or need to use the model in an IoT context, you could use [TensorFlow Lite](https://www.tensorflow.org/lite/) and use the model in an Android or iOS app.
- **Where will the model reside**? In the cloud or locally?
- **Offline support**. Does the app have to work offline?
- **What technology was used to train the model?** The chosen technology may influence the tooling you need to use.
- **Using Tensor flow**. If you are training a model using TensorFlow, for example, that ecosystem provides the ability to convert a TensorFlow model for use in a web app by using [TensorFlow.js](https://www.tensorflow.org/js/).
- **Using PyTorch**. If you are building a model using a library such as [PyTorch](https://pytorch.org/), you have the option to export it in [ONNX](https://onnx.ai/) (Open Neural Network Exchange) format for use in JavaScript web apps that can use the [Onnx Runtime](https://www.onnxruntime.ai/). This option will be explored in a future lesson for a Scikit-learn-trained model.
- **Using Lobe.ai or Azure Custom vision**. If you are using an ML SaaS (Software as a Service) system such as [Lobe.ai](https://lobe.ai/) or [Azure Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-15963-cxa) to train a model, this type of software provides ways to export the model for many platforms, including building a bespoke API to be queried in the cloud by your online application.
You also have the opportunity to build an entire Flask web app that would be able to train the model itself in a web browser. This can also be done using TensorFlow.js in a JavaScript context. For our purposes, since we have been working with Python-based notebooks, let's explore the steps you need to take to export a trained model from such a notebook to a format readable by a Python-built web app.
You also have the opportunity to build an entire Flask web app that would be able to train the model itself in a web browser. This can also be done using TensorFlow.js in a JavaScript context.
## Tools
For our purposes, since we have been working with Python-based notebooks, let's explore the steps you need to take to export a trained model from such a notebook to a format readable by a Python-built web app.
## Tool
For this task, you need two tools: Flask and Pickle, both of which run on Python.
@ -21,11 +41,18 @@ For this task, you need two tools: Flask and Pickle, both of which run on Python
✅ What's [Pickle](https://docs.python.org/3/library/pickle.html)? Pickle 🥒 is a Python module that serializes and de-serializes a Python object structure. When you 'pickle' a model, you serialize or flatten its structure for use on the web. Be careful: pickle is not intrinsically secure, so be careful if prompted to 'un-pickle' a file. A pickled file has the suffix `.pkl`.
## Clean your data
## Exercise - clean your data
In this lesson you'll use data from 80,000 UFO sightings, gathered by [NUFORC](https://nuforc.org) (The National UFO Reporting Center). This data has some interesting descriptions of UFO sightings, for example:
In this lesson you'll use data from 80,000 UFO sightings, gathered by [NUFORC](https://nuforc.org) (The National UFO Reporting Center). This data has some interesting descriptions of UFO sightings, for example "A man emerges from a beam of light that shines on a grassy field at night and he runs towards the Texas Instruments parking lot" or simply "the lights chased us". The [ufos.csv](./data/ufos.csv) spreadsheet includes columns about the city, state and country where the sighting occurred, the object's shape and its latitude and longitude.
- **Long example description**. "A man emerges from a beam of light that shines on a grassy field at night and he runs towards the Texas Instruments parking lot".
- **Short example description**. "the lights chased us".
In the blank [notebook](notebook.ipynb) included in this lesson, import pandas, matplotlib, and numpy as you did in previous lessons and import the ufos spreadsheet. You can take a look at a sample data set:
The [ufos.csv](./data/ufos.csv) spreadsheet includes columns about the `city`, `state` and `country` where the sighting occurred, the object's `shape` and its `latitude` and `longitude`.
In the blank [notebook](notebook.ipynb) included in this lesson:
1. import `pandas`, `matplotlib`, and `numpy` as you did in previous lessons and import the ufos spreadsheet. You can take a look at a sample data set:
```python
import pandas as pd
@ -34,7 +61,8 @@ import numpy as np
ufos = pd.read_csv('../data/ufos.csv')
ufos.head()
```
Convert the ufos data to a small dataframe with fresh titles. Check the unique values in the Country field.
1. Convert the ufos data to a small dataframe with fresh titles. Check the unique values in the `Country` field.
```python
ufos = pd.DataFrame({'Seconds': ufos['duration (seconds)'], 'Country': ufos['country'],'Latitude': ufos['latitude'],'Longitude': ufos['longitude']})
@ -42,7 +70,7 @@ ufos = pd.DataFrame({'Seconds': ufos['duration (seconds)'], 'Country': ufos['cou
ufos.Country.unique()
```
Now, you can reduce the amount of data we need to deal with by dropping any null values and only importing sightings between 1-60 seconds:
1. Now, you can reduce the amount of data we need to deal with by dropping any null values and only importing sightings between 1-60 seconds:
```python
ufos.dropna(inplace=True)
@ -52,7 +80,7 @@ ufos = ufos[(ufos['Seconds'] >= 1) & (ufos['Seconds'] <= 60)]
ufos.info()
```
Next, import Scikit-learn's LabelEncoder library to convert the text values for countries to a number.
1. Import Scikit-learn's `LabelEncoder` library to convert the text values for countries to a number:
✅ LabelEncoder encodes data alphabetically
@ -66,7 +94,7 @@ ufos.head()
Your data should look like this:
```
```output
Seconds Country Latitude Longitude
2 20.0 3 53.200000 -2.916667
3 20.0 4 28.978333 -96.645833
@ -74,9 +102,12 @@ Your data should look like this:
23 60.0 4 45.582778 -122.352222
24 3.0 3 51.783333 -0.783333
```
## Build your model
Now you can get ready to train a model by diving the data into the training and testing group. Select the three features you want to train on as your X vector, and the y vector will be the Country. You want to be able to input seconds, latitude and longitude and get a country id to return.
## Exercise - build your model
Now you can get ready to train a model by diving the data into the training and testing group.
1. Select the three features you want to train on as your X vector, and the y vector will be the `Country`. You want to be able to input `Seconds`, `Latitude` and `Longitude` and get a country id to return.
```python
from sklearn.model_selection import train_test_split
@ -89,7 +120,7 @@ y = ufos['Country']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
Finally, train your model using logistic regression:
1. Train your model using logistic regression:
```python
from sklearn.metrics import accuracy_score, classification_report
@ -103,10 +134,13 @@ print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
The accuracy isn't bad (around 95%), unsurprisingly, as country and latitude/longitude correlate. The model you created isn't very revolutionary as it's obvious you should be able to infer a country from its latitude and longitude, but it's a good exercise to try to train from raw data that you cleaned, exported, and then use this model in a web app.
## Pickle your model
The accuracy isn't bad **(around 95%)**, unsurprisingly, as `Country` and `Latitude/Longitude` correlate.
The model you created isn't very revolutionary as you should be able to infer a `Country` from its `Latitude` and `Longitude`, but it's a good exercise to try to train from raw data that you cleaned, exported, and then use this model in a web app.
Now, it's time to pickle your model! You can do that in just a few lines of code. Once it's pickled, load your pickled model and test it against a sample data array containing values for seconds, latitude and longitude,
## Exercise - 'pickle' your model
Now, it's time to _pickle_ your model! You can do that in a few lines of code. Once it's _pickled_, load your pickled model and test it against a sample data array containing values for seconds, latitude and longitude,
```python
import pickle
@ -116,17 +150,29 @@ pickle.dump(model, open(model_filename,'wb'))
model = pickle.load(open('ufo-model.pkl','rb'))
print(model.predict([[50,44,-12]]))
```
The model returns '3', which is the country code for the UK. Wild! 👽
## Build a Flask app
The model returns **'3'**, which is the country code for the UK. Wild! 👽
## Exercise - build a Flask app
Now you can build a Flask app to call your model and return similar results, but in a more visually pleasing way.
Start by creating a folder called web-app next to the _notebook.ipynb_ file where your _ufo-model.pkl_ file resides. In that folder create three more folders: `static`, with a folder `css` inside it, and `templates`.
1. Start by creating a folder called **web-app** next to the _notebook.ipynb_ file where your _ufo-model.pkl_ file resides.
1. In that folder create three more folders: **static**, with a folder **css** inside it, and **templates`**. You should now have the following files and directories:
```output
web-app/
static/
css/
templates/
notebook.ipynb
ufo-model.pk1
```
✅ Refer to the solution folder for a view of the finished app
The first file to create in `web-app` is a `requirements.txt` file. Like `package.json` in a JavaScript app, this file lists dependencies required by the app. In `requirements.txt` add the lines:
1. The first file to create in _web-app_ folder is **requirements.txt** file. Like _package.json_ in a JavaScript app, this file lists dependencies required by the app. In **requirements.txt** add the lines:
```text
scikit-learn
@ -134,15 +180,26 @@ pandas
numpy
flask
```
Now, run this file by navigating to `web-app` (`cd web-app`) in your terminal and typing `pip install -r requirements.txt`.
Now, you're ready to create three more files to finish the app:
1. Now, run this file by navigating to _web-app_:
```bash
cd web-app
```
1. Create `app.py` in the root
2. Create `index.html` in `templates`
3. Create `styles.css` in `static/css`
1. In your terminal type `pip install`, to install the libraries listed in _reuirements.txt_:
Build out the styles.css file with a few styles:
```bash
pip install -r requirements.txt
```
1. Now, you're ready to create three more files to finish the app:
1. Create **app.py** in the root
2. Create **index.html** in _templates_ directory.
3. Create **styles.css** in _static/css_ directory.
1. Build out the _styles.css__ file with a few styles:
```css
body {
@ -175,7 +232,8 @@ input {
display: inline-block;
}
```
Next, build out the `index.html` file:
1. Next, build out the _index.html_ file:
```html
<!DOCTYPE html>
@ -209,11 +267,12 @@ Next, build out the `index.html` file:
</body>
</html>
```
Take a look at the templating in this file. Notice the 'mustache' syntax around variables that will be provided by the app, like the prediction text: `{{}}`. There's also a form that posts a prediction to the `/predict` route.
Finally, you're ready to build the python file that drives the consumption of the model and the display of predictions:
In `app.py` add:
1. In `app.py` add:
```python
import numpy as np
@ -254,9 +313,13 @@ if __name__ == "__main__":
If you run `python app.py` or `python3 app.py` - your web server starts up, locally, and you can fill out a short form to get an answer to your burning question about where UFOs have been sighted!
Before doing that, take a look at the parts of `app.py`.
Before doing that, take a look at the parts of `app.py`:
1. First, dependencies are loaded and the app starts.
1. Then, the model is imported.
1. Then, index.html is rendered on the home route.
First, dependencies are loaded and the app starts. Then, the model is imported. Then, index.html is rendered on the home route. On the `/predict` route, several things happen when the form is posted:
On the `/predict` route, several things happen when the form is posted:
1. The form variables are gathered and converted to a numpy array. They are then sent to the model and a prediction is returned.
2. The Countries that we want displayed are re-rendered as readable text from their predicted country code, and that value is sent back to index.html to be rendered in the template.
@ -266,6 +329,7 @@ Using a model this way, with Flask and a pickled model, is relatively straightfo
In a professional setting, you can see how good communication is necessary between the folks who train the model and those who consume it in a web or mobile app. In our case, it's only one person, you!
---
## 🚀 Challenge:
Instead of working in a notebook and importing the model to the Flask app, you could train the model right within the Flask app! Try converting your Python code in the notebook, perhaps after your data is cleaned, to train the model from within the app on a route called `train`. What are the pros and cons of pursuing this method?

@ -1,55 +1,501 @@
# [Lesson Topic]
# Sentiment Analysis
Add a sketchnote if possible/appropriate
In this section you will use the techniques in the previous lessons to do some exploratory data analysis of a large dataset. Once you have a good understanding of the usefulness of the various columns, you will learn how to remove the unneeded columns, calculate some new data based on the existing columns, and save the resulting dataset for use in the final challenge.
![Embed a video here if available](video-url)
### Introduction
## [Pre-lecture quiz](link-to-quiz-app) 37
So far you've learned about how text data is quite unlike numerical types of data. If it's text that was written or spoken by a human, if can be analysed to find patterns and frequencies, sentiment and meaning. This final lesson takes you into a real data set with a real challenge. This lesson is a lot of code and analysis of a data set, it is quite dense but very amenable to experimentation in your favourite IDE or Notebook.
Describe what we will learn
> This lesson uses the data set **515K Hotel Reviews Data in Europe**, CC0: Public Domain license, scraped from Booking.com from public sources. The creator of the dataset was Jiashen Liu.
### Introduction
### Preparation
Describe what will be covered
You will need:
> Notes
* Python 3
### Prerequisite
* pandas
* **TODO install NTLK details**
What steps should have been covered before this lesson?
* The data set is available on Kaggle [515K Hotel Reviews Data in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe), it is around 230 MB unzipped.
### Preparation
## Exploratory Data Analysis
This challenge assumes you are building a hotel recommendation bot using sentiment analysis and guest reviews scores. The dataset you will be starting from has over 515,000 rows reviewing 1493 different hotels in 6 cities.
Using Python, a dataset of hotel reviews, and NLTK's sentiment analysis you could find out:
* what are the most frequently used words and phrases in reviews?
* do the official *tags* describing a hotel correlate with review scores (e.g. are the more negative reviews for a particular hotel for *Family with young children* than by *Solo traveller*, perhaps indicating it is better for *Solo travellers*?)
* do the NLTK sentiment scores 'agree' with the hotel reviewer's numerical score?
#### Dataset
Let's explore the dataset first. Remember to download and save the CSV file here: https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe.
The dataset was created by **Jiashen Liu** 4 years ago (as of writing) and is licensed [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/).
> "This dataset contains 515,000 customer reviews and scoring of 1493 luxury hotels across Europe. Meanwhile, the geographical location of hotels are also provided for further analysis."
You could open the file in an editor like VS Code or even Excel, and as it's a text CSV file, any editor that can handle large text files should be able to open it.
The headers in the dataset are as follows:
*Hotel_Address, Additional_Number_of_Scoring, Review_Date, Average_Score, Hotel_Name, Reviewer_Nationality, Negative_Review, Review_Total_Negative_Word_Counts, Total_Number_of_Reviews, Positive_Review, Review_Total_Positive_Word_Counts, Total_Number_of_Reviews_Reviewer_Has_Given, Reviewer_Score, Tags, days_since_review, lat, lng*
and Jiashen provides the description of each item on Kaggle.
Here they are grouped in a way that might be easier to examine:
##### Hotel columns
* `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
* Using *lat* and *lng* you could plot a map with Python showing the hotel locations (perhaps colour coded for negative and positive reviews)
* Hotel_Address is not obviously useful to us, and we'll probably replace that with a country for easier sorting & searching
**Hotel Meta-review columns**
* `Average_Score`
* According to the dataset creator, this column is *Average Score of the hotel, calculated based on the latest comment in the last year*. This seems like an unusual way to calculate the score, but it is the data scraped so we may take it as face value for now. Based on the other columns in this data, can you think of another way to calculate the average score?
* `Total_Number_of_Reviews`
* The total number of reviews this hotel has received - it is not clear (without writing some code) if this refers to the reviews in the dataset. More on this discrepancy below in the **Average hotel score** section.
* `Additional_Number_of_Scoring`
* This means a review score was given but no positive or negative review was written by the reviewer
**Review columns**
- `Reviewer_Score`
- This is a numerical value with at most 1 decimal place between the min and max values 2.5 and 10
- It is not explained why 2.5 is the lowest score possible
- `Negative_Review`
- If a reviewer wrote nothing, this field will have "**No Negative**"
- Note that a reviewer may write a positive review in the Negative review column (e.g. "there is nothing bad about this hotel")
- `Review_Total_Negative_Word_Counts`
- Are higher negative word counts indicative of a lower score (without checking the sentimentality)
- `Positive_Review`
- If a reviewer wrote nothing, this field will have "**No Positive**"
- Note that a reviewer may write a negative review in the Positive review column (e.g. "there is nothing good about this hotel at all")
- `Review_Total_Positive_Word_Counts`
- Are higher positive word counts indicative of a higher score (without checking the sentimentality)
- `Review_Date` and `days_since_review`
- A freshness or staleness measure might be applied to a review (older reviews might not be as accurate as newer ones because hotel management changed, or renovations have been done, or a pool was added etc.)
- `Tags`
- These are short descriptors that a reviewer may select to describe the type of guest they were (e.g. solo or family), the type of room they had, the length of stay and how the review was submitted.
- Unfortunately, using these tags is problematic, check the section below which discusses their usefulness
**Reviewer columns**
- `Total_Number_of_Reviews_Reviewer_Has_Given`
- This might be an factor in a recommendation model, for instance, if you could determine that more prolific reviewers with hundreds of reviews were more likely to be negative rather than positive. However, the reviewer of any particular review is not identified with a unique code, and therefore cannot be linked to a set of reviews. There are 30 reviewers with 100 or more reviews, but hard to see how this can aid the recommendation model.
- `Reviewer_Nationality`
- Some people might think that certain nationalities are more likely to give a positive or negative review because of a national inclination. Be careful building such anecdotal views into your models. These are national (and sometimes racial) stereotypes, and each reviewer was an individual who wrote a review based on their experience. It may have been filtered through many lens, such as their previous hotel stays, the distance travelled, and their personal temperament - but thinking that their nationality was the reason for a review score is a hard to justify assumption.
##### Examples
| Average Score | Total Number Reviews | Reviewer Score | Negative <br />Review | Positive Review | Tags |
| -------------- | ---------------------- | ---------------- | :----------------------------------------------------------- | --------------------------------- | ------------------------------------------------------------ |
| 7.8 | 1945 | 2.5 | This is currently not a hotel but a construction site I was terroized from early morning and all day with unacceptable building noise while resting after a long trip and working in the room People were working all day i e with jackhammers in the adjacent rooms I asked for a room change but no silent room was available To make thinks worse I was overcharged I checked out in the evening since I had to leave very early flight and received an appropiate bill A day later the hotel made another charge without my concent in excess of booked price It s a terrible place Don t punish yourself by booking here | Nothing Terrible place Stay away | Business trip Couple Standard Double Room Stayed 2 nights |
As you can see from this guest, they did not have a happy stay at this hotel. The hotel has a good average score of 7.8 and 1945 reviews, but this reviewer gave it 2.5 and wrote 115 words about how negative their stay was. If they wrote nothing at all in the Positive_Review column, you might surmise there was nothing positive, but alas they wrote 7 words of warning. If we just counted words instead of the meaning, or sentiment of the words, we might have a skewed view of the reviewers intent. Strangely, their score of 2.5 is confusing, because if that hotel stay was so bad, why give it any points at all? Investigating the dataset closely, you'll see that the lowest possible score is 2.5, not 0. The highest possible score is 10.
##### Tags
As mentioned above, at first glance, the idea to use `Tags` to categorise the data makes sense. Unfortunately these tags are not standardised, which means in one hotel, the options might be *Single room*, *Twin room*, and *Double room*, but in the next hotel, they are *Deluxe Single Room*, *Classic Queen Room*, and *Executive King Room*. These might be the same things, but there are so many variations, the choice becomes:
1. Attempt to change all terms to a single standard, which is very difficult, because it is not clear what the conversion path would be in each case (e.g. *Classic single room* maps to *Single room* but *Superior Queen Room with Courtyard Garden or City View* is much harder to map)
2. We can take an NLP approach and measure the frequency of certain terms like *Solo*, *Business Traveller*, or *Family with young kids* as they apply to each hotel, and factor that into the recommendation
Tags are usually (but not always) a single field containing a list of 5 to 6 comma separated values aligning to *Type of trip*, *Type of guests*, *Type of room*, *Number of nights*, and *Type of device review was submitted on*. However, because some reviewers don't fill in each field (they might leave one blank), the values are not always in the same order.
As an example, take *Type of group*. There are 1025 unique possibilities in this field in the `Tags` column, and unfortunately only some of them refer to a group (some are the type of room etc.). If you filter only the ones that mention family, the results contain many *Family room* type results. If you include the term *with*, i.e. count the *Family with* values, the results are better, with over 80,000 of the 515,000 results containing the phrase "Family with young children" or "Family with older children".
This means the tags column is not completely useless to us, but will take some work to make it useful.
##### Average Hotel Score
There are a number of oddities or discrepancies with the data set that I can't figure out, but are illustrated here so you are aware of them when building your models. If you figure it out, please let us know!
The dataset has the following columns relating to the average score and number of reviews:
1. Hotel_Name
2. Additional_Number_of_Scoring
3. Average_Score
4. Total_Number_of_Reviews
5. Reviewer_Score
If we take a single hotel and count the reviews, we see that the single hotel with the most reviews in this dataset is *Britannia International Hotel Canary Wharf* with 4789 reviews out of 515,000. But if we look at the `Total_Number_of_Reviews` value for this hotel, it is 9086. You might surmise that there are many more scores without reviews, so perhaps we should add in the `Additional_Number_of_Scoring` column value. That value is 2682, and adding it to 4789 gets us 7,471 which is still 1615 short of the `Total_Number_of_Reviews`.
If you take the `Average_Score` columns, you might surmise it is the average of the reviews in the dataset, but the description from Kaggle is "*Average Score of the hotel, calculated based on the latest comment in the last year*". That doesn't seem that useful, but we can calculate our own average based on the reviews scores in the data set. Using the same hotel as an example, the average hotel score is given as 7.1 but the calculated score (average reviewer score *in* the dataset) is 6.8. This is close, but not the same value, and we can only guess that the scores given in the `Additional_Number_of_Scoring` reviews increased the average to 7.1. Unfortunately with no way to test or prove that assertion, it is difficult to use or trust `Average_Score`, `Additional_Number_of_Scoring` and `Total_Number_of_Reviews` when they are based on, or refer to, data we do not have.
To complicate things further, the hotel with the second highest number of reviews has a calculated average score of 8.12 and the dataset `Average_Score` is 8.1. Is this correct score a coincidence or is the first hotel a discrepancy?
On the possibility that these hotel might be an outlier, and that maybe most of the values tally up (but some do not for some reason) we will write a short programs next to explore the values in the dataset and determine the correct usage (or non-usage) of the values.
##### A note of caution when working with datasets with human written reviews
Most of the time working with this dataset, you will write code that calculates something from the text, without having to read or analyse the text yourself. This is the essence of NLP, interpreting meaning or sentiment without having to have a human do it. However, it is possible you will read some of the negative reviews. I would urge you not to, because you don't have to. However they were written by humans, hotel guests who decided to write a review. Some of them are silly, or irrelevant negative hotel reviews, such as "The weather wasn't great", something beyond the control of the hotel, or indeed, anyone. But there is a dark side to some reviews too. Sometimes the negative reviews are racist, sexist, or ageist. This is unfortunate but to be expected in a dataset scraped off a public website. Some reviewers leave reviews that you would find distasteful, uncomfortable, or upsetting. Better to let the code measure the sentiment, than read them yourself and be upset. That said, it is a minority that write such things, but they exist all the same.
#### Loading the CSV data into a pandas DataFrame
That's enough examining the data visually, now you'll write some code and get some answers! This section is focused on the pandas library. Your very first task is to ensure you can load and read the CSV data. The pandas library has a fast CSV loader, and the result is placed in a *DataFrame*. If you've never used a DataFrame before, imagine it's a 2D structure with rows and columns. The CSV we are loading has over half a million rows, but only 17 columns. pandas gives you lots of powerful ways to interact with a DataFrame, including the ability to perform operations on every row.
Learning pandas is hard but very worth while, it is a great library to be a master of. For this lesson, you need to understand the following items like DataFrames, Series, value_count(), apply(), groupBy(), and transform().
There are some great guides and docs at the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) and it's worth following the *Getting started* and *User guide*.
From here on in this lesson, there will be code snippets and some explanations of the code and some discussion about what the results mean. Try to do each section in turn, and you may find the Juypter notebook useful as it contains all the sections. **TODO: clean and upload notebook too**
Let's start with loading the data file you be using:
```python
# Load the hotel reviews from CSV
import pandas as pd
import time
# importing time so the start and end time can be used to calculate file loading time
print("Loading data file now, this could take a while depending on file size")
start = time.time()
# df is 'DataFrame'
df = pd.read_csv('Hotel_Reviews.csv')
end = time.time()
print("Loading took " + str(round(end - start, 2)) + " seconds")
```
Now that the data is loaded, we can perform some operations on it. Keep this code at the top of your program for the next part.
#### Exploring the data
In this case, the data is already *clean*, that means that it is ready to work with, and does not have characters in other languages that might trip up the algorithms expecting only English characters. You might have to work with data that required some initial processing to format it before applying NLP techniques, but not this time.
However, you should take a moment to ensure you that once loaded, you can explore the data with code. It's very easy to want to focus on the `Negative_Review` and `Positive_Review` columns. They are filled with natural text for your NLP algorithms to process. But wait! Before you jump into the NLP and sentiment, you should follow the code below, to get used to working with DataFrames and also to ascertain if the values given in the dataset match the values you calculate with *pandas*.
#### DataFrame operations
The first task in this lesson is to check if the following assertions are correct by writing some code that examines the data frame (without changing it). The first is below as an example and the others are similar, but this is a great way to learn how to work with a DataFrame (if this is your first time encountering them, you should definitely try to complete them before the next section).
> Like many programming tasks, there are several ways to complete this, but good advice is to do it in the simplest, easiest way you can, especially if it will be easier to understand when you come back this code in the future. With DataFrames, there is a comprehensive API that will often have a way to do what you want efficiently.
If you prefer, you can treat these as coding tasks and attempt to answer them without looking at the solution. If you are new to DataFrames, try following and executing the code of each step, paying attention to methods you do not recognise.
With each of these questions, you can build on the previous answer by adding each solution beneath the previous answer (you don't have to create a new Python file for each answer). Remember to include the code in the *Loading the CSV file* above, that code is *required* before your code.
Here are the questions on their own, followed by the code and explanations:
1. Print out the *shape* of the data frame you have just loaded (the shape is the number of rows and columns)
2. Calculate the frequency count for reviewer nationalities:
1. How many distinct values are there for the column `Reviewer_Nationality` and what are they?
2. What reviewer nationality is the most common in the dataset (print country and number of reviews)?
3. What are the next top 10 most frequently found nationalities, and their frequency count?
3. What was the most frequently reviewed hotel for each of the top 10 most reviewer nationalities?
4. How many reviews are there per hotel (frequency count of hotel) in the dataset?
5. While there is an `Average_Score` column for each hotel in the dataset, you can also calculate an average score (getting the average of all reviewer scores in the dataset for each hotel). Add a new column to your dataframe with the column header `Calc_Average_Score` that contains that calculated average.
6. Do any hotels have the same (rounded to 1 decimal place) `Average_Score` and `Calc_Average_Score`?
1. Try writing a Python function that takes a Series (row) as an argument and compares the values, printing out a message when the values are not equal. Then use the `.apply()` method to process every row with the function.
7. Calculate and print out how many rows have column `Negative_Review` values of "No Negative"
8. Calculate and print out how many rows have column `Positive_Review` values of "No Positive"
9. Calculate and print out how many rows have column `Positive_Review` values of "No Positive" **and** `Negative_Review` values of "No Negative"
### Code
1. Print out the *shape* of the data frame you have just loaded (the shape is the number of rows and columns)
```python
print("The shape of the data (rows, cols) is " + str(df.shape))
> The shape of the data (rows, cols) is (515738, 17)
```
2. Calculate the frequency count for reviewer nationalities:
Preparatory steps to start this lesson
1. How many distinct values are there for the column `Reviewer_Nationality` and what are they?
2. What reviewer nationality is the most common in the dataset (print country and number of reviews)?
---
```python
# value_counts() creates a Series object that has index and values in this case, the country and the frequency they occur in reviewer nationality
nationality_freq = df["Reviewer_Nationality"].value_counts()
print("There are " + str(nationality_freq.size) + " different nationalities")
# print first and last rows of the Series. Change to nationality_freq.to_string() to print all of the data
print(nationality_freq)
[Step through content in blocks]
There are 227 different nationalities
United Kingdom 245246
United States of America 35437
Australia 21686
Ireland 14827
United Arab Emirates 10235
...
Comoros 1
Palau 1
Northern Mariana Islands 1
Cape Verde 1
Guinea 1
Name: Reviewer_Nationality, Length: 227, dtype: int64
```
3. What are the next top 10 most frequently found nationalities, and their frequency count?
```python
print("The highest frequency reviewer nationality is " + str(nationality_freq.index[0]).strip() + " with " + str(nationality_freq[0]) + " reviews.")
# Notice there is a leading space on the values, strip() removes that for printing
# What is the top 10 most common nationalities and their frequencies?
print("The next 10 highest frequency reviewer nationalities are:")
print(nationality_freq[1:11].to_string())
The highest frequency reviewer nationality is United Kingdom with 245246 reviews.
The next 10 highest frequency reviewer nationalities are:
United States of America 35437
Australia 21686
Ireland 14827
United Arab Emirates 10235
Saudi Arabia 8951
Netherlands 8772
Switzerland 8678
Germany 7941
Canada 7894
France 7296
```
3. What was the most frequently reviewed hotel for each of the top 10 most reviewer nationalities?
```python
# What was the most frequently reviewed hotel for the top 10 nationalities
# Normally with pandas you will avoid an explicit loop, but wanted to show creating a new dataframe using criteria (don't do this with large amounts of data because it could be very slow)
for nat in nationality_freq[:10].index:
# First, extract all the rows that match the criteria into a new dataframe
nat_df = df[df["Reviewer_Nationality"] == nat]
# Now get the hotel freq
freq = nat_df["Hotel_Name"].value_counts()
print("The most reviewed hotel for " + str(nat).strip() + " was " + str(freq.index[0]) + " with " + str(freq[0]) + " reviews.")
The most reviewed hotel for United Kingdom was Britannia International Hotel Canary Wharf with 3833 reviews.
The most reviewed hotel for United States of America was Hotel Esther a with 423 reviews.
The most reviewed hotel for Australia was Park Plaza Westminster Bridge London with 167 reviews.
The most reviewed hotel for Ireland was Copthorne Tara Hotel London Kensington with 239 reviews.
The most reviewed hotel for United Arab Emirates was Millennium Hotel London Knightsbridge with 129 reviews.
The most reviewed hotel for Saudi Arabia was The Cumberland A Guoman Hotel with 142 reviews.
The most reviewed hotel for Netherlands was Jaz Amsterdam with 97 reviews.
The most reviewed hotel for Switzerland was Hotel Da Vinci with 97 reviews.
The most reviewed hotel for Germany was Hotel Da Vinci with 86 reviews.
The most reviewed hotel for Canada was St James Court A Taj Hotel London with 61 reviews.
```
4. How many reviews are there per hotel (frequency count of hotel) in the dataset?
```python
# First create a new dataframe based on the old one, removing the uneeded columns
hotel_freq_df = df.drop(["Hotel_Address", "Additional_Number_of_Scoring", "Review_Date", "Average_Score", "Reviewer_Nationality", "Negative_Review", "Review_Total_Negative_Word_Counts", "Positive_Review", "Review_Total_Positive_Word_Counts", "Total_Number_of_Reviews_Reviewer_Has_Given", "Reviewer_Score", "Tags", "days_since_review", "lat", "lng"], axis = 1)
# Group the rows by Hotel_Name, count them and put the result in a new column Total_Reviews_Found
hotel_freq_df['Total_Reviews_Found'] = hotel_freq_df.groupby('Hotel_Name').transform('count')
# Get rid of all the duplicated rows
hotel_freq_df = hotel_freq_df.drop_duplicates(subset = ["Hotel_Name"])
display(hotel_freq_df)
Hotel_Name Total_Number_of_Reviews Total_Reviews_Found
Britannia International Hotel Canary Wharf 9086 4789
Park Plaza Westminster Bridge London 12158 4169
Copthorne Tara Hotel London Kensington 7105 3578
...
Mercure Paris Porte d Orleans 110 10
Hotel Wagner 135 10
Hotel Gallitzinberg 173 8
```
You may notice that the *counted in the dataset* results do not match the value in `Total_Number_of_Reviews`. It is unclear if this value in the dataset represented the total number of reviews the hotel had, but not all were scraped, or some other calculation. `Total_Number_of_Reviews` is not used in the model because of this unclarity.
5. While there is an `Average_Score` column for each hotel in the dataset, you can also calculate an average score (getting the average of all reviewer scores in the dataset for each hotel). Add a new column to your dataframe with the column header `Calc_Average_Score` that contains that calculated average. Print out the columns `Hotel_Name`, `Average_Score`, and `Calc_Average_Score`.
## [Topic 1]
```python
# define a function that takes a row and performs some calculation with it
def get_difference_review_avg(row):
return row["Average_Score"] - row["Calc_Average_Score"]
### Task:
# 'mean' is mathematical word for 'average'
df['Calc_Average_Score'] = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
Work together to progressively enhance your codebase to build the project with shared code:
# Add a new column with the difference between the two average scores
df["Average_Score_Difference"] = df.apply(get_difference_review_avg, axis = 1)
```html
code blocks
# Create a df without all the duplicates of Hotel_Name (so only 1 row per hotel)
review_scores_df = df.drop_duplicates(subset = ["Hotel_Name"])
# Sort the dataframe to find the lowest and highest average score difference
review_scores_df = review_scores_df.sort_values(by=["Average_Score_Difference"])
display(review_scores_df[["Average_Score_Difference", "Average_Score", "Calc_Average_Score", "Hotel_Name"]])
```
✅ Knowledge Check - use this moment to stretch students' knowledge with open questions
You may also wonder about the supplied in dataset `Average_Score` value and why it is sometimes different from the calculated average score. As we can't know why some of the values match, but others have a difference, it's safest in this case to use the review scores that we have to calculate the average ourselves. That said, the differences are usually very small, here are the hotels with the greatest deviation from the dataset average and the calculated average:
| Average_Score_Difference | Average_Score | Calc_Average_Score | Hotel_Name |
| :----------------------: | :-----------: | :----------------: | ------------------------------------------: |
| -0.8 | 7.7 | 8.5 | Best Western Hotel Astoria |
| -0.7 | 8.8 | 9.5 | Hotel Stendhal Place Vend me Paris MGallery |
| -0.7 | 7.5 | 8.2 | Mercure Paris Porte d Orleans |
| -0.7 | 7.9 | 8.6 | Renaissance Paris Vendome Hotel |
| -0.5 | 7.0 | 7.5 | Hotel Royal Elys es |
| ... | ... | ... | ... |
| 0.7 | 7.5 | 6.8 | Mercure Paris Op ra Faubourg Montmartre |
| 0.8 | 7.1 | 6.3 | Holiday Inn Paris Montparnasse Pasteur |
| 0.9 | 6.8 | 5.9 | Villa Eugenie |
| 0.9 | 8.6 | 7.7 | MARQUIS Faubourg St Honor Relais Ch teaux |
| 1.3 | 7.2 | 5.9 | Kube Hotel Ice Bar |
With only 1 hotel having a difference of score greater than 1, it means we can probably ignore the difference and use the calculated average score.
6. Calculate and print out how many rows have column `Negative_Review` values of "No Negative"
7. Calculate and print out how many rows have column `Positive_Review` values of "No Positive"
8. Calculate and print out how many rows have column `Positive_Review` values of "No Positive" **and** `Negative_Review` values of "No Negative"
```python
# with lambdas:
start = time.time()
no_negative_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" else False , axis=1)
print("Number of No Negative reviews: " + str(len(no_negative_reviews[no_negative_reviews == True].index)))
no_positive_reviews = df.apply(lambda x: True if x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of No Positive reviews: " + str(len(no_positive_reviews[no_positive_reviews == True].index)))
## [Topic 2]
both_no_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" and x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of both No Negative and No Positive reviews: " + str(len(both_no_reviews[both_no_reviews == True].index)))
end = time.time()
print("Lamdas took " + str(round(end - start, 2)) + " seconds")
## [Topic 3]
Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Lamdas took 9.64 seconds
```
Another way to do that one is without Lambdas, and use sum to count the rows:
```python
# without lambdas (using a mixture of notations to show you can use both)
start = time.time()
no_negative_reviews = sum(df.Negative_Review == "No Negative")
print("Number of No Negative reviews: " + str(no_negative_reviews))
no_positive_reviews = sum(df["Positive_Review"] == "No Positive")
print("Number of No Positive reviews: " + str(no_positive_reviews))
both_no_reviews = sum((df.Negative_Review == "No Negative") & (df.Positive_Review == "No Positive"))
print("Number of both No Negative and No Positive reviews: " + str(both_no_reviews))
end = time.time()
print("Sum took " + str(round(end - start, 2)) + " seconds")
Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Sum took 0.19 seconds
```
## 🚀Challenge
You may have noticed that there are 127 rows that have both "No Negative" and "No Positive" values for the columns `Negative_Review` and `Positive_Review` respectively. That means that the reviewer gave the hotel a numerical score, but declined to write either a positive or negative review. Luckily this is a small amount of rows (127 out of 515738, or 0.02%), so it probably won't skew our model or results in any particular direction, but you might not have expected a data set of reviews to have rows with no reviews, so worth exploring the data to discover rows like this.
Add a challenge for students to work on collaboratively in class to enhance the project
### Modifying the DataFrame
Optional: add a screenshot of the completed lesson's UI if appropriate
Now that you've explored the dataset, you can see some issues with it. Some columns are are filled with useless information, others are just incorrect, or if they are correct, it's unclear how to they were calculated, and answers cannot be independently verified by your own calculations.
## [Post-lecture quiz](link-to-quiz-app) 38
Next, you will add columns that will be useful later, change the values in other columns, and drop certain columns completely.
Follow these steps in order:
1. `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
1. Drop lat and lng
2. Replace Hotel_Address values with the following values (if the address contains the same of the city and the country, change it to just the city and the country).
These are the only cities and countries in the dataset:
Amsterdam, Netherlands
Barcelona, Spain
London, United Kingdom
Milan, Italy
Paris, France
Vienna, Austria
```python
def replace_address(row):
if "Netherlands" in row["Hotel_Address"]:
return "Amsterdam, Netherlands"
elif "Barcelona" in row["Hotel_Address"]:
return "Barcelona, Spain"
elif "United Kingdom" in row["Hotel_Address"]:
return "London, United Kingdom"
elif "Milan" in row["Hotel_Address"]:
return "Milan, Italy"
elif "France" in row["Hotel_Address"]:
return "Paris, France"
elif "Vienna" in row["Hotel_Address"]:
return "Vienna, Austria"
# Replace all the addresses with a shortened, more useful form
df["Hotel_Address"] = df.apply(replace_address, axis = 1)
# The sum of the value_counts() should add up to the total number of reviews
print(df["Hotel_Address"].value_counts())
```
Now you can query country level data:
```python
display(df.groupby("Hotel_Address").agg({"Hotel_Name": "nunique"}))
```
| Hotel_Address | Hotel_Name |
| ---------------------: | ---------: |
| Amsterdam, Netherlands | 105 |
| Barcelona, Spain | 211 |
| London, United Kingdom | 400 |
| Milan, Italy | 162 |
| Paris, France | 458 |
| Vienna, Austria | 158 |
2. Hotel Meta-review columns: `Average_Score`, `Total_Number_of_Reviews`, `Additional_Number_of_Scoring`
* Drop `Additional_Number_of_Scoring`
* Replace `Total_Number_of_Reviews` with the total number of reviews for that hotel that are actually in the dataset and
* Replace `Average_Score` with our own calculated score
```python
# Drop `Additional_Number_of_Scoring`
df.drop(["Additional_Number_of_Scoring"], axis = 1, inplace=True)
# Replace `Total_Number_of_Reviews` and `Average_Score` with our own calculated values
df.Total_Number_of_Reviews = df.groupby('Hotel_Name').transform('count')
df.Average_Score = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
```
**Review columns**
- Drop `Review_Total_Negative_Word_Counts`, `Review_Total_Positive_Word_Counts`, `Review_Date` and `days_since_review`
- Keep `Reviewer_Score`, `Negative_Review`, and `Positive_Review` as they are,
- Keep `Tags`
- We'll be doing some NLP operations on the tags in the next section.
**Reviewer columns**
- Drop `Total_Number_of_Reviews_Reviewer_Has_Given`
- Keep `Reviewer_Nationality`
Finally, save the dataset as it is now with a new name, then proceed to the NLP section.
```python
df.drop(["Review_Total_Negative_Word_Counts", "Review_Total_Positive_Word_Counts", "days_since_review", "Total_Number_of_Reviews_Reviewer_Has_Given"], axis = 1, inplace=True)
# Saving new data file with calculated columns
print("Saving results to Hotel_Reviews_Filtered.csv")
df.to_csv(r'Hotel_Reviews_Filtered.csv', index = False)
```
## Review & Self Study
### NLP & Sentiment Analysis Operations
## Assignment [Assignment Name](assignment.md)
*I'm currently editing this final section*

@ -0,0 +1,9 @@
# [Assignment Name]
## Instructions
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | | | |

@ -14,7 +14,8 @@ In these lessons we'll learn the basics of NLP by building small conversational
1. [Introduction to natural language processing](1-Introduction-to-NLP/README.md)
2. [Common NLP tasks and techniques](2-Tasks/README.md)
3. [Translation and sentiment analysis with machine learning](3-Translation-Sentiment/README.md)
4. TBD
4. [NLTK for Sentiment Analysis](4-Hotel-Reviews-1/README.md)
5. TBD
## Credits

@ -222,7 +222,7 @@ for epoch in range(5000):
v = probs(Q[x,y])
a = random.choices(list(actions),weights=v)[0]
dpos = actions[a]
m.move(dpos)
m.move(dpos,check_correctness=False) # we allow player to move outside the board, which terminates episode
r = reward(m)
cum_reward += r
if r==end_reward or cum_reward < -1000:

@ -108,8 +108,9 @@ class Board:
def move_pos(self, pos, dpos):
return (pos[0] + dpos[0], pos[1] + dpos[1])
def move(self,dpos):
def move(self,dpos,check_correctness=True):
new_pos = self.move_pos(self.human,dpos)
if self.is_valid(new_pos) or not check_correctness:
self.human = new_pos
def random_pos(self):

File diff suppressed because one or more lines are too long

@ -108,9 +108,9 @@ class Board:
def move_pos(self, pos, dpos):
return (pos[0] + dpos[0], pos[1] + dpos[1])
def move(self,dpos):
def move(self,dpos,check_correctness=True):
new_pos = self.move_pos(self.human,dpos)
if self.is_valid(new_pos):
if self.is_valid(new_pos) or not check_correctness:
self.human = new_pos
def random_pos(self):

@ -44,24 +44,12 @@
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting gym\n",
" Downloading gym-0.18.3.tar.gz (1.6 MB)\n",
"\u001b[K |████████████████████████████████| 1.6 MB 2.3 MB/s \n",
"\u001b[?25hRequirement already satisfied: scipy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.4.1)\n",
"Requirement already satisfied: gym in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.18.3)\n",
"Requirement already satisfied: Pillow<=8.2.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (7.0.0)\n",
"Requirement already satisfied: scipy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.4.1)\n",
"Requirement already satisfied: numpy>=1.10.4 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.19.2)\n",
"Collecting pyglet<=1.5.15,>=1.4.0\n",
" Downloading pyglet-1.5.15-py3-none-any.whl (1.1 MB)\n",
"\u001b[K |████████████████████████████████| 1.1 MB 3.7 MB/s \n",
"\u001b[?25hRequirement already satisfied: Pillow<=8.2.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (7.0.0)\n",
"Collecting cloudpickle<1.7.0,>=1.2.0\n",
" Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)\n",
"Building wheels for collected packages: gym\n",
" Building wheel for gym (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for gym: filename=gym-0.18.3-py3-none-any.whl size=1657514 sha256=578c789ab75e603e58dd1152b2bd60d9a5adc6a057559cf8b5bdd6ee8b80abf2\n",
" Stored in directory: /Users/jenlooper/Library/Caches/pip/wheels/1a/ec/6d/705d53925f481ab70fd48ec7728558745eeae14dfda3b49c99\n",
"Successfully built gym\n",
"Installing collected packages: pyglet, cloudpickle, gym\n",
"Successfully installed cloudpickle-1.6.0 gym-0.18.3 pyglet-1.5.15\n",
"Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.6.0)\n",
"Requirement already satisfied: pyglet<=1.5.15,>=1.4.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.5.15)\n",
"\u001b[33mWARNING: You are using pip version 20.2.3; however, version 21.1.2 is available.\n",
"You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -m pip install --upgrade pip' command.\u001b[0m\n"
]
@ -99,7 +87,7 @@
"output_type": "stream",
"name": "stdout",
"text": [
"Discrete(2)\nBox(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)\n1\n"
"Discrete(2)\nBox(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)\n0\n"
]
}
]
@ -159,32 +147,25 @@
"output_type": "stream",
"name": "stdout",
"text": [
"[-0.035025 0.21201857 -0.010404 -0.3300738 ] -> 1.0\n",
"[-0.03078463 0.40728707 -0.01700547 -0.62601941] -> 1.0\n",
"[-0.02263889 0.21240657 -0.02952586 -0.3387403 ] -> 1.0\n",
"[-0.01839076 0.01771693 -0.03630067 -0.05551247] -> 1.0\n",
"[-0.01803642 0.21334007 -0.03741092 -0.35942391] -> 1.0\n",
"[-0.01376962 0.40897331 -0.0445994 -0.66366469] -> 1.0\n",
"[-0.00559015 0.21449925 -0.05787269 -0.38535156] -> 1.0\n",
"[-0.00130017 0.410393 -0.06557972 -0.69570532] -> 1.0\n",
"[ 0.00690769 0.21623893 -0.07949383 -0.42436686] -> 1.0\n",
"[ 0.01123247 0.02232776 -0.08798116 -0.15776523] -> 1.0\n",
"[ 0.01167903 0.21859198 -0.09113647 -0.47685598] -> 1.0\n",
"[ 0.01605087 0.02486705 -0.10067359 -0.21423159] -> 1.0\n",
"[ 0.01654821 0.22127341 -0.10495822 -0.53689749] -> 1.0\n",
"[ 0.02097368 0.02777162 -0.11569617 -0.27904318] -> 1.0\n",
"[ 0.02152911 -0.16552613 -0.12127703 -0.02497378] -> 1.0\n",
"[ 0.01821859 -0.35871897 -0.12177651 0.22711886] -> 1.0\n",
"[ 0.01104421 -0.16208621 -0.11723413 -0.10135989] -> 1.0\n",
"[ 0.00780248 -0.35534992 -0.11926133 0.15215788] -> 1.0\n",
"[ 0.00069548 -0.15874009 -0.11621817 -0.17564179] -> 1.0\n",
"[-0.00247932 0.03783674 -0.11973101 -0.50260923] -> 1.0\n",
"[-0.00172258 0.23442478 -0.12978319 -0.83049704] -> 1.0\n",
"[ 0.00296591 0.04129289 -0.14639313 -0.58128481] -> 1.0\n",
"[ 0.00379177 0.23812991 -0.15801883 -0.9162682 ] -> 1.0\n",
"[ 0.00855437 0.04545686 -0.17634419 -0.67712384] -> 1.0\n",
"[ 0.00946351 0.24253346 -0.18988667 -1.01973114] -> 1.0\n",
"[ 0.01431417 0.05037919 -0.21028129 -0.7921723 ] -> 1.0\n"
"[ 0.03044442 -0.19543914 -0.04496216 0.28125618] -> 1.0\n",
"[ 0.02653564 -0.38989186 -0.03933704 0.55942606] -> 1.0\n",
"[ 0.0187378 -0.19424049 -0.02814852 0.25461393] -> 1.0\n",
"[ 0.01485299 -0.38894946 -0.02305624 0.53828712] -> 1.0\n",
"[ 0.007074 -0.19351108 -0.0122905 0.23842953] -> 1.0\n",
"[ 0.00320378 0.00178427 -0.00752191 -0.05810469] -> 1.0\n",
"[ 0.00323946 0.19701326 -0.008684 -0.35315131] -> 1.0\n",
"[ 0.00717973 0.00201587 -0.01574703 -0.06321931] -> 1.0\n",
"[ 0.00722005 0.19736001 -0.01701141 -0.36082863] -> 1.0\n",
"[ 0.01116725 0.39271958 -0.02422798 -0.65882671] -> 1.0\n",
"[ 0.01902164 0.19794307 -0.03740452 -0.37387001] -> 1.0\n",
"[ 0.0229805 0.39357584 -0.04488192 -0.67810827] -> 1.0\n",
"[ 0.03085202 0.58929164 -0.05844408 -0.98457719] -> 1.0\n",
"[ 0.04263785 0.78514572 -0.07813563 -1.2950295 ] -> 1.0\n",
"[ 0.05834076 0.98116859 -0.10403622 -1.61111521] -> 1.0\n",
"[ 0.07796413 0.78741784 -0.13625852 -1.35259196] -> 1.0\n",
"[ 0.09371249 0.98396202 -0.16331036 -1.68461179] -> 1.0\n",
"[ 0.11339173 0.79106371 -0.1970026 -1.44691436] -> 1.0\n",
"[ 0.12921301 0.59883361 -0.22594088 -1.22169133] -> 1.0\n"
]
}
]
@ -281,7 +262,7 @@
"output_type": "stream",
"name": "stdout",
"text": [
"(0, 0, -1, -3)\n(0, 0, -2, 0)\n(0, 0, -2, -3)\n(0, 1, -3, -6)\n(0, 0, -4, -3)\n(0, 1, -5, -6)\n(0, 2, -6, -9)\n(0, 1, -8, -6)\n(0, 2, -9, -9)\n(0, 2, -11, -13)\n(0, 2, -14, -10)\n(0, 1, -16, -8)\n(0, 2, -18, -11)\n(0, 3, -20, -15)\n(0, 2, -23, -12)\n"
"(0, 0, -1, -3)\n(0, 0, -2, 0)\n(0, 0, -2, -3)\n(0, 1, -3, -6)\n(0, 2, -4, -9)\n(0, 3, -6, -12)\n(0, 2, -8, -9)\n(0, 3, -10, -13)\n(0, 4, -13, -16)\n(0, 4, -16, -19)\n(0, 4, -20, -17)\n(0, 4, -24, -20)\n"
]
}
],

Loading…
Cancel
Save