31 KiB

Raw Permalink Blame History

Sentiment analysis with hotel reviews - processing the data

In this section, you'll apply techniques from previous lessons to perform exploratory data analysis on a large dataset. Once you understand the relevance of the various columns, you'll learn:

How to remove unnecessary columns
How to calculate new data based on existing columns
How to save the resulting dataset for use in the final challenge

Pre-lecture quiz

Introduction

Up to this point, you've learned that text data is quite different from numerical data. If the text is written or spoken by humans, it can be analyzed to uncover patterns, frequencies, sentiment, and meaning. This lesson introduces you to a real dataset with a real challenge: 515K Hotel Reviews Data in Europe, which is licensed under CC0: Public Domain license. The dataset was scraped from Booking.com using public sources and created by Jiashen Liu.

Preparation

You will need:

The ability to run .ipynb notebooks using Python 3
pandas
NLTK, which you should install locally
The dataset, available on Kaggle: 515K Hotel Reviews Data in Europe. It is approximately 230 MB unzipped. Download it to the root /data folder associated with these NLP lessons.

Exploratory data analysis

This challenge assumes you're building a hotel recommendation bot using sentiment analysis and guest review scores. The dataset includes reviews of 1493 hotels across 6 cities.

Using Python, the hotel reviews dataset, and NLTK's sentiment analysis, you could explore:

What are the most frequently used words and phrases in reviews?
Do the official tags describing a hotel correlate with review scores (e.g., are negative reviews more common for Family with young children than for Solo traveler, possibly indicating the hotel is better suited for Solo travelers?)
Do the NLTK sentiment scores align with the numerical scores given by hotel reviewers?

Dataset

Let's examine the dataset you've downloaded and saved locally. Open the file in an editor like VS Code or Excel.

The dataset headers are as follows:

Hotel_Address, Additional_Number_of_Scoring, Review_Date, Average_Score, Hotel_Name, Reviewer_Nationality, Negative_Review, Review_Total_Negative_Word_Counts, Total_Number_of_Reviews, Positive_Review, Review_Total_Positive_Word_Counts, Total_Number_of_Reviews_Reviewer_Has_Given, Reviewer_Score, Tags, days_since_review, lat, lng

Here they are grouped for easier analysis:

Hotel columns

Hotel_Name, Hotel_Address, lat (latitude), lng (longitude)
- Using lat and lng, you could plot a map with Python showing hotel locations, perhaps color-coded for negative and positive reviews.
- Hotel_Address is not particularly useful and could be replaced with a country for easier sorting and searching.

Hotel Meta-review columns

Average_Score
- According to the dataset creator, this column represents the Average Score of the hotel, calculated based on the latest comment in the last year. While this calculation method seems unusual, we'll take it at face value for now.
✅ Can you think of another way to calculate the average score using the other columns in this dataset?
Total_Number_of_Reviews
- The total number of reviews the hotel has received. It's unclear (without writing code) whether this refers to the reviews in the dataset.
Additional_Number_of_Scoring
- Indicates a review score was given without a positive or negative review being written.

Review columns

Reviewer_Score
- A numerical value with at most one decimal place, ranging between 2.5 and 10.
- It's unclear why 2.5 is the lowest possible score.
Negative_Review
- If no negative review was written, this field contains "No Negative."
- Sometimes, reviewers write positive comments in the Negative_Review column (e.g., "there is nothing bad about this hotel").
Review_Total_Negative_Word_Counts
- Higher negative word counts generally indicate a lower score (without sentiment analysis).
Positive_Review
- If no positive review was written, this field contains "No Positive."
- Sometimes, reviewers write negative comments in the Positive_Review column (e.g., "there is nothing good about this hotel at all").
Review_Total_Positive_Word_Counts
- Higher positive word counts generally indicate a higher score (without sentiment analysis).
Review_Date and days_since_review
- You could apply a freshness or staleness measure to reviews (e.g., older reviews might be less accurate due to changes in hotel management, renovations, or new amenities like a pool).
Tags
- Short descriptors selected by reviewers to describe their type of stay (e.g., solo or family), room type, length of stay, and how the review was submitted.
- Unfortunately, these tags are problematic; see the section below discussing their usefulness.

Reviewer columns

Total_Number_of_Reviews_Reviewer_Has_Given
- This could be a factor in a recommendation model. For example, prolific reviewers with hundreds of reviews might tend to be more negative. However, reviewers are not identified with unique codes, so their reviews cannot be linked. There are 30 reviewers with 100 or more reviews, but it's unclear how this could aid the recommendation model.
Reviewer_Nationality
- Some might assume certain nationalities are more likely to give positive or negative reviews due to cultural tendencies. Be cautious about incorporating such anecdotal views into your models. These are stereotypes, and each reviewer is an individual whose review reflects their personal experience, filtered through factors like previous hotel stays, travel distance, and temperament. It's hard to justify attributing review scores solely to nationality.

Examples

Average Score	Total Number Reviews	Reviewer Score	Negative Review	Positive Review	Tags
7.8	1945	2.5	This is currently not a hotel but a construction site. I was terrorized from early morning and all day with unacceptable building noise while resting after a long trip and working in the room. People were working all day, i.e., with jackhammers in the adjacent rooms. I asked for a room change, but no silent room was available. To make things worse, I was overcharged. I checked out in the evening since I had to leave for an early flight and received an appropriate bill. A day later, the hotel made another charge without my consent in excess of the booked price. It's a terrible place. Don't punish yourself by booking here.	Nothing. Terrible place. Stay away.	Business trip, Couple, Standard Double Room, Stayed 2 nights

As you can see, this guest had a very negative experience. The hotel has a decent average score of 7.8 and 1945 reviews, but this reviewer gave it a 2.5 and wrote 115 words about their dissatisfaction. They wrote only 7 words in the Positive_Review column, warning others to avoid the hotel. If we only counted words instead of analyzing their sentiment, we might misinterpret the reviewer's intent. Interestingly, the lowest possible score is 2.5, not 0, which raises questions about the scoring system.

Average hotel score

There are some oddities in the dataset that are worth noting:

The dataset includes the following columns related to average scores and reviews:

Hotel_Name
Additional_Number_of_Scoring
Average_Score
Total_Number_of_Reviews
Reviewer_Score

The hotel with the most reviews in the dataset is Britannia International Hotel Canary Wharf, with 4789 reviews out of 515,000. However, its Total_Number_of_Reviews value is 9086. Adding the Additional_Number_of_Scoring value (2682) to 4789 gives 7471, which is still 1615 short of the Total_Number_of_Reviews.

The Average_Score column description from Kaggle states it is "Average Score of the hotel, calculated based on the latest comment in the last year." This calculation method seems unhelpful, but we can calculate our own average based on the review scores in the dataset. For example, the average score for this hotel is listed as 7.1, but the calculated average (based on reviewer scores in the dataset) is 6.8. This discrepancy might be due to scores from Additional_Number_of_Scoring reviews, but there's no way to confirm this.

To further complicate matters, the hotel with the second-highest number of reviews has a calculated average score of 8.12, while the dataset lists it as 8.1. Is this coincidence or a discrepancy?

Given these inconsistencies, we'll write a short program to explore the dataset and determine the best way to use (or not use) these values.

🚨 A note of caution

When working with this dataset, you'll be writing code to calculate something based on the text without needing to read or analyze the text yourself. This is the core of NLP—interpreting meaning or sentiment without requiring human intervention. However, you might come across some negative reviews. I strongly recommend avoiding reading them, as it's unnecessary. Some of these reviews are trivial or irrelevant, like "The weather wasn't great," which is beyond the hotel's control—or anyone's, for that matter. But there is also a darker side to some reviews. Occasionally, negative reviews may contain racist, sexist, or ageist remarks. This is unfortunate but not surprising, given that the dataset is scraped from a public website. Some reviewers leave comments that you might find offensive, uncomfortable, or upsetting. It's better to let the code assess the sentiment rather than exposing yourself to potentially distressing content. That said, such reviews are written by a minority, but they do exist nonetheless.

Exercise - Data exploration

Load the data

Enough of visually examining the data—now it's time to write some code and get answers! This section uses the pandas library. Your first task is to ensure you can load and read the CSV data. The pandas library has a fast CSV loader, and the result is stored in a dataframe, as you've seen in previous lessons. The CSV file we are loading contains over half a million rows but only 17 columns. Pandas provides many powerful ways to interact with a dataframe, including the ability to perform operations on every row.

From this point onward in the lesson, you'll encounter code snippets, explanations of the code, and discussions about the results. Use the included notebook.ipynb for your code.

Let's start by loading the data file you'll be working with:

# Load the hotel reviews from CSV
import pandas as pd
import time
# importing time so the start and end time can be used to calculate file loading time
print("Loading data file now, this could take a while depending on file size")
start = time.time()
# df is 'DataFrame' - make sure you downloaded the file to the data folder
df = pd.read_csv('../../data/Hotel_Reviews.csv')
end = time.time()
print("Loading took " + str(round(end - start, 2)) + " seconds")

Now that the data is loaded, you can perform some operations on it. Keep this code at the top of your program for the next part.

Explore the data

In this case, the data is already clean, meaning it is ready to work with and does not contain characters in other languages that might confuse algorithms expecting only English characters.

✅ You might encounter data that requires initial processing to format it before applying NLP techniques, but not this time. If you had to, how would you handle non-English characters?

Take a moment to ensure that once the data is loaded, you can explore it with code. It's tempting to focus on the Negative_Review and Positive_Review columns, as they contain natural text for your NLP algorithms to process. But wait! Before diving into NLP and sentiment analysis, follow the code below to verify whether the values in the dataset match the values you calculate using pandas.

Dataframe operations

Your first task in this lesson is to check whether the following assertions are correct by writing code to examine the dataframe (without modifying it).

As with many programming tasks, there are multiple ways to approach this, but a good rule of thumb is to choose the simplest and easiest method, especially if it will be easier to understand when revisiting the code later. With dataframes, the comprehensive API often provides efficient ways to achieve your goals.

Treat the following questions as coding tasks and try to answer them without looking at the solution.

Print the shape of the dataframe you just loaded (the shape refers to the number of rows and columns).
Calculate the frequency count for reviewer nationalities:
1. How many distinct values exist in the Reviewer_Nationality column, and what are they?
2. Which reviewer nationality is the most common in the dataset (print the country and the number of reviews)?
3. What are the next top 10 most frequently found nationalities, along with their frequency counts?
What is the most frequently reviewed hotel for each of the top 10 reviewer nationalities?
How many reviews are there per hotel (frequency count of hotels) in the dataset?
While the dataset includes an Average_Score column for each hotel, you can also calculate an average score (by averaging all reviewer scores in the dataset for each hotel). Add a new column to your dataframe called Calc_Average_Score that contains this calculated average.
Do any hotels have the same (rounded to 1 decimal place) Average_Score and Calc_Average_Score?
1. Try writing a Python function that takes a Series (row) as an argument and compares the values, printing a message when the values are not equal. Then use the .apply() method to process every row with the function.
Calculate and print how many rows have Negative_Review values of "No Negative."
Calculate and print how many rows have Positive_Review values of "No Positive."
Calculate and print how many rows have Positive_Review values of "No Positive" and Negative_Review values of "No Negative."

Code answers

Print the shape of the dataframe you just loaded (the shape refers to the number of rows and columns).

print("The shape of the data (rows, cols) is " + str(df.shape))
> The shape of the data (rows, cols) is (515738, 17)

Calculate the frequency count for reviewer nationalities:

How many distinct values exist in the Reviewer_Nationality column, and what are they?
Which reviewer nationality is the most common in the dataset (print the country and the number of reviews)?

# value_counts() creates a Series object that has index and values in this case, the country and the frequency they occur in reviewer nationality
nationality_freq = df["Reviewer_Nationality"].value_counts()
print("There are " + str(nationality_freq.size) + " different nationalities")
# print first and last rows of the Series. Change to nationality_freq.to_string() to print all of the data
print(nationality_freq) 

There are 227 different nationalities
 United Kingdom               245246
 United States of America      35437
 Australia                     21686
 Ireland                       14827
 United Arab Emirates          10235
                               ...  
 Comoros                           1
 Palau                             1
 Northern Mariana Islands          1
 Cape Verde                        1
 Guinea                            1
Name: Reviewer_Nationality, Length: 227, dtype: int64

What are the next top 10 most frequently found nationalities, along with their frequency counts?

print("The highest frequency reviewer nationality is " + str(nationality_freq.index[0]).strip() + " with " + str(nationality_freq[0]) + " reviews.")
# Notice there is a leading space on the values, strip() removes that for printing
# What is the top 10 most common nationalities and their frequencies?
print("The next 10 highest frequency reviewer nationalities are:")
print(nationality_freq[1:11].to_string())

The highest frequency reviewer nationality is United Kingdom with 245246 reviews.
The next 10 highest frequency reviewer nationalities are:
 United States of America     35437
 Australia                    21686
 Ireland                      14827
 United Arab Emirates         10235
 Saudi Arabia                  8951
 Netherlands                   8772
 Switzerland                   8678
 Germany                       7941
 Canada                        7894
 France                        7296

What is the most frequently reviewed hotel for each of the top 10 reviewer nationalities?

# What was the most frequently reviewed hotel for the top 10 nationalities
# Normally with pandas you will avoid an explicit loop, but wanted to show creating a new dataframe using criteria (don't do this with large amounts of data because it could be very slow)
for nat in nationality_freq[:10].index:
   # First, extract all the rows that match the criteria into a new dataframe
   nat_df = df[df["Reviewer_Nationality"] == nat]   
   # Now get the hotel freq
   freq = nat_df["Hotel_Name"].value_counts()
   print("The most reviewed hotel for " + str(nat).strip() + " was " + str(freq.index[0]) + " with " + str(freq[0]) + " reviews.") 

The most reviewed hotel for United Kingdom was Britannia International Hotel Canary Wharf with 3833 reviews.
The most reviewed hotel for United States of America was Hotel Esther a with 423 reviews.
The most reviewed hotel for Australia was Park Plaza Westminster Bridge London with 167 reviews.
The most reviewed hotel for Ireland was Copthorne Tara Hotel London Kensington with 239 reviews.
The most reviewed hotel for United Arab Emirates was Millennium Hotel London Knightsbridge with 129 reviews.
The most reviewed hotel for Saudi Arabia was The Cumberland A Guoman Hotel with 142 reviews.
The most reviewed hotel for Netherlands was Jaz Amsterdam with 97 reviews.
The most reviewed hotel for Switzerland was Hotel Da Vinci with 97 reviews.
The most reviewed hotel for Germany was Hotel Da Vinci with 86 reviews.
The most reviewed hotel for Canada was St James Court A Taj Hotel London with 61 reviews.

How many reviews are there per hotel (frequency count of hotels) in the dataset?

# First create a new dataframe based on the old one, removing the uneeded columns
hotel_freq_df = df.drop(["Hotel_Address", "Additional_Number_of_Scoring", "Review_Date", "Average_Score", "Reviewer_Nationality", "Negative_Review", "Review_Total_Negative_Word_Counts", "Positive_Review", "Review_Total_Positive_Word_Counts", "Total_Number_of_Reviews_Reviewer_Has_Given", "Reviewer_Score", "Tags", "days_since_review", "lat", "lng"], axis = 1)

# Group the rows by Hotel_Name, count them and put the result in a new column Total_Reviews_Found
hotel_freq_df['Total_Reviews_Found'] = hotel_freq_df.groupby('Hotel_Name').transform('count')

# Get rid of all the duplicated rows
hotel_freq_df = hotel_freq_df.drop_duplicates(subset = ["Hotel_Name"])
display(hotel_freq_df)

Hotel_Name	Total_Number_of_Reviews	Total_Reviews_Found
Britannia International Hotel Canary Wharf	9086	4789
Park Plaza Westminster Bridge London	12158	4169
Copthorne Tara Hotel London Kensington	7105	3578
...	...	...
Mercure Paris Porte d Orleans	110	10
Hotel Wagner	135	10
Hotel Gallitzinberg	173	8

You may notice that the counted in the dataset results do not match the value in Total_Number_of_Reviews. It is unclear whether this value represents the total number of reviews the hotel received but not all were scraped, or some other calculation. Total_Number_of_Reviews is not used in the model due to this ambiguity.

While the dataset includes an Average_Score column for each hotel, you can also calculate an average score (by averaging all reviewer scores in the dataset for each hotel). Add a new column to your dataframe called Calc_Average_Score that contains this calculated average. Print the columns Hotel_Name, Average_Score, and Calc_Average_Score.

# define a function that takes a row and performs some calculation with it
def get_difference_review_avg(row):
  return row["Average_Score"] - row["Calc_Average_Score"]

# 'mean' is mathematical word for 'average'
df['Calc_Average_Score'] = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)

# Add a new column with the difference between the two average scores
df["Average_Score_Difference"] = df.apply(get_difference_review_avg, axis = 1)

# Create a df without all the duplicates of Hotel_Name (so only 1 row per hotel)
review_scores_df = df.drop_duplicates(subset = ["Hotel_Name"])

# Sort the dataframe to find the lowest and highest average score difference
review_scores_df = review_scores_df.sort_values(by=["Average_Score_Difference"])

display(review_scores_df[["Average_Score_Difference", "Average_Score", "Calc_Average_Score", "Hotel_Name"]])

You may also wonder about the Average_Score value and why it sometimes differs from the calculated average score. Since we can't determine why some values match while others differ, it's safest to use the review scores we have to calculate the average ourselves. That said, the differences are usually very small. Here are the hotels with the greatest deviation between the dataset average and the calculated average:

Average_Score_Difference	Average_Score	Calc_Average_Score	Hotel_Name
-0.8	7.7	8.5	Best Western Hotel Astoria
-0.7	8.8	9.5	Hotel Stendhal Place Vend me Paris MGallery
-0.7	7.5	8.2	Mercure Paris Porte d Orleans
-0.7	7.9	8.6	Renaissance Paris Vendome Hotel
-0.5	7.0	7.5	Hotel Royal Elys es
...	...	...	...
0.7	7.5	6.8	Mercure Paris Op ra Faubourg Montmartre
0.8	7.1	6.3	Holiday Inn Paris Montparnasse Pasteur
0.9	6.8	5.9	Villa Eugenie
0.9	8.6	7.7	MARQUIS Faubourg St Honor Relais Ch teaux
1.3	7.2	5.9	Kube Hotel Ice Bar

With only one hotel having a score difference greater than 1, we can likely ignore the difference and use the calculated average score.

Calculate and print how many rows have Negative_Review values of "No Negative."
Calculate and print how many rows have Positive_Review values of "No Positive."

Calculate and print how many rows have Positive_Review values of "No Positive" and Negative_Review values of "No Negative."

# with lambdas:
start = time.time()
no_negative_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" else False , axis=1)
print("Number of No Negative reviews: " + str(len(no_negative_reviews[no_negative_reviews == True].index)))

no_positive_reviews = df.apply(lambda x: True if x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of No Positive reviews: " + str(len(no_positive_reviews[no_positive_reviews == True].index)))

both_no_reviews = df.apply(lambda x: True if x['Negative_Review'] == "No Negative" and x['Positive_Review'] == "No Positive" else False , axis=1)
print("Number of both No Negative and No Positive reviews: " + str(len(both_no_reviews[both_no_reviews == True].index)))
end = time.time()
print("Lambdas took " + str(round(end - start, 2)) + " seconds")

Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Lambdas took 9.64 seconds

Another way

Another way to count items without using Lambdas is to use the sum function to count rows:

# without lambdas (using a mixture of notations to show you can use both)
start = time.time()
no_negative_reviews = sum(df.Negative_Review == "No Negative")
print("Number of No Negative reviews: " + str(no_negative_reviews))

no_positive_reviews = sum(df["Positive_Review"] == "No Positive")
print("Number of No Positive reviews: " + str(no_positive_reviews))

both_no_reviews = sum((df.Negative_Review == "No Negative") & (df.Positive_Review == "No Positive"))
print("Number of both No Negative and No Positive reviews: " + str(both_no_reviews))

end = time.time()
print("Sum took " + str(round(end - start, 2)) + " seconds")

Number of No Negative reviews: 127890
Number of No Positive reviews: 35946
Number of both No Negative and No Positive reviews: 127
Sum took 0.19 seconds

You may have noticed that there are 127 rows with both "No Negative" and "No Positive" values for the columns Negative_Review and Positive_Review, respectively. This means the reviewer gave the hotel a numerical score but chose not to write either a positive or negative review. Fortunately, this is a small number of rows (127 out of 515,738, or 0.02%), so it likely won't skew the model or results significantly. However, you might not have expected a dataset of reviews to include rows without any reviews, so it's worth exploring the data to uncover such anomalies.

Now that you've explored the dataset, the next lesson will focus on filtering the data and adding sentiment analysis.

🚀Challenge

This lesson demonstrates, as we've seen in previous lessons, how critically important it is to understand your data and its quirks before performing operations on it. Text-based data, in particular, requires careful scrutiny. Explore various text-heavy datasets and see if you can identify areas that might introduce bias or skewed sentiment into a model.

Post-lecture quiz

Review & Self Study

Take this Learning Path on NLP to discover tools to try when building speech and text-heavy models.

Assignment

NLTK

Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.

31 KiB Raw Permalink Blame History