Moving last section to start of next lesson for flow

pull/55/head
Stephen Howell (MSFT) 4 years ago committed by GitHub
parent 2e960fb1a2
commit 3594659ca5

@@ -380,108 +380,7 @@ Treat the following questions as coding tasks and attempt to answer them without
You may have noticed that there are 127 rows that have both "No Negative" and "No Positive" values for the columns `Negative_Review` and `Positive_Review` respectively. That means that the reviewer gave the hotel a numerical score, but declined to write either a positive or negative review. Luckily this is a small number of rows (127 out of 515738, or 0.02%), so it probably won't skew our model or results in any particular direction, but you might not have expected a data set of reviews to have rows with no reviews, so it's worth exploring the data to discover rows like this.
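One way to find such rows is a boolean mask over the two review columns. The sketch below is a minimal illustration on a made-up four-row dataframe, not the real dataset; the variable names `no_text`, `empty_count`, and `empty_share` are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for the hotel reviews dataframe (all values are made up)
df = pd.DataFrame({
    "Negative_Review": ["No Negative", "Too noisy", "No Negative", "Small room"],
    "Positive_Review": ["No Positive", "Great staff", "Lovely view", "No Positive"],
    "Reviewer_Score": [7.5, 8.0, 9.2, 5.4],
})

# True only where BOTH text columns hold their placeholder value
no_text = (df["Negative_Review"] == "No Negative") & (df["Positive_Review"] == "No Positive")

empty_count = int(no_text.sum())
empty_share = empty_count / len(df)
print(f"{empty_count} of {len(df)} rows ({empty_share:.2%}) have no written review")
```

Run against the full dataframe, the same mask should reproduce the row count quoted above.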
Now that you have explored the dataset, in the next lesson you will filter the data and add some sentiment analysis.

### Modifying the dataframe
Now that you've explored the dataset, you can see some issues with it. Some columns are filled with useless information, and others are simply incorrect. Even where the values are correct, it's unclear how they were calculated, so you cannot verify them independently with your own calculations.
Next, you will add columns that will be useful later, change the values in other columns, and drop certain columns completely.
Follow these steps in order:
1. `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
   1. Drop `lat` and `lng`
   2. Replace `Hotel_Address` values with the following values (if the address contains the name of the city and the country, change it to just the city and the country).
These are the only cities and countries in the dataset:
Amsterdam, Netherlands
Barcelona, Spain
London, United Kingdom
Milan, Italy
Paris, France
Vienna, Austria
```python
def replace_address(row):
    if "Netherlands" in row["Hotel_Address"]:
        return "Amsterdam, Netherlands"
    elif "Barcelona" in row["Hotel_Address"]:
        return "Barcelona, Spain"
    elif "United Kingdom" in row["Hotel_Address"]:
        return "London, United Kingdom"
    elif "Milan" in row["Hotel_Address"]:
        return "Milan, Italy"
    elif "France" in row["Hotel_Address"]:
        return "Paris, France"
    elif "Vienna" in row["Hotel_Address"]:
        return "Vienna, Austria"

# Replace all the addresses with a shortened, more useful form
df["Hotel_Address"] = df.apply(replace_address, axis = 1)
# The sum of the value_counts() should add up to the total number of reviews
print(df["Hotel_Address"].value_counts())
```
Now you can query country level data:
```python
display(df.groupby("Hotel_Address").agg({"Hotel_Name": "nunique"}))
```
| Hotel_Address | Hotel_Name |
| ---------------------: | ---------: |
| Amsterdam, Netherlands | 105 |
| Barcelona, Spain | 211 |
| London, United Kingdom | 400 |
| Milan, Italy | 162 |
| Paris, France | 458 |
| Vienna, Austria | 158 |
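The address replacement and the per-city query can also be sketched with a lookup table instead of an `if`/`elif` chain. This is an alternative illustration, not the lesson's own method: the dictionary `CITY_BY_KEYWORD`, the helper `shorten_address`, and the toy addresses and hotel names are all hypothetical.

```python
import pandas as pd

# Keyword -> shortened label; the keywords mirror those tested in replace_address
CITY_BY_KEYWORD = {
    "Netherlands": "Amsterdam, Netherlands",
    "Barcelona": "Barcelona, Spain",
    "United Kingdom": "London, United Kingdom",
    "Milan": "Milan, Italy",
    "France": "Paris, France",
    "Vienna": "Vienna, Austria",
}

def shorten_address(address):
    """Return the 'City, Country' label for an address, or None if no keyword matches."""
    for keyword, label in CITY_BY_KEYWORD.items():
        if keyword in address:
            return label
    return None

# Made-up addresses and hotel names, only for illustration
toy = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B", "Hotel C"],
    "Hotel_Address": [
        "1 Example Straat 1000 AA Amsterdam Netherlands",
        "1 Example Straat 1000 AA Amsterdam Netherlands",
        "2 Example Road London W1 United Kingdom",
        "3 Rue Exemple 75001 Paris France",
    ],
})

toy["Hotel_Address"] = toy["Hotel_Address"].map(shorten_address)

# Addresses that matched no keyword would appear here as missing values
unmatched = int(toy["Hotel_Address"].isna().sum())

# Number of distinct hotels per shortened address
hotels_per_city = toy.groupby("Hotel_Address")["Hotel_Name"].nunique()
print(hotels_per_city)
```

Checking `unmatched` after the replacement is a cheap guard: a non-zero value would mean some address fell through every keyword test.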
2. Hotel Meta-review columns: `Average_Score`, `Total_Number_of_Reviews`, `Additional_Number_of_Scoring`
* Drop `Additional_Number_of_Scoring`
* Replace `Total_Number_of_Reviews` with the total number of reviews for that hotel that are actually in the dataset
* Replace `Average_Score` with our own calculated score
```python
# Drop `Additional_Number_of_Scoring`
df.drop(["Additional_Number_of_Scoring"], axis = 1, inplace=True)
# Replace `Total_Number_of_Reviews` and `Average_Score` with our own calculated values
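# Select a single column before transform so the result is a Series, not a whole dataframe
df.Total_Number_of_Reviews = df.groupby('Hotel_Name')['Hotel_Name'].transform('count')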
df.Average_Score = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
```
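To see what `transform` does here, the following sketch recomputes both columns on a made-up dataframe with two hotels. It assumes only the column names used in this lesson; the data and the `by_hotel` variable are hypothetical.

```python
import pandas as pd

# Made-up reviews: three for "Hotel A", one for "Hotel B"
toy = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel A", "Hotel A", "Hotel B"],
    "Reviewer_Score": [8.0, 9.0, 7.1, 6.5],
})

by_hotel = toy.groupby("Hotel_Name")["Reviewer_Score"]

# transform returns one value per original row, aligned with toy's index,
# so each review row receives its hotel's aggregate
toy["Total_Number_of_Reviews"] = by_hotel.transform("count")
toy["Average_Score"] = by_hotel.transform("mean").round(1)
print(toy)
```

Unlike `agg`, which would return one row per hotel, `transform` keeps the original shape, which is why the result can be assigned straight back to a column.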
**Review columns**
- Drop `Review_Total_Negative_Word_Counts`, `Review_Total_Positive_Word_Counts`, `Review_Date` and `days_since_review`
- Keep `Reviewer_Score`, `Negative_Review`, and `Positive_Review` as they are
- Keep `Tags`
- We'll be doing some additional filtering operations on the tags in the next lesson.
**Reviewer columns**
- Drop `Total_Number_of_Reviews_Reviewer_Has_Given`
- Keep `Reviewer_Nationality`
Finally, save the dataset as it is now with a new name.
```python
# Drop the review and reviewer columns listed above
df.drop(
    [
        "Review_Total_Negative_Word_Counts",
        "Review_Total_Positive_Word_Counts",
        "Review_Date",
        "days_since_review",
        "Total_Number_of_Reviews_Reviewer_Has_Given",
    ],
    axis = 1,
    inplace=True,
)
# Saving new data file with calculated columns
print("Saving results to Hotel_Reviews_Filtered.csv")
df.to_csv(r'Hotel_Reviews_Filtered.csv', index = False)
```
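A quick way to confirm that the dropped columns are really gone is to write the dataframe out and read it back. The sketch below does this on a made-up two-row dataframe and uses an in-memory buffer in place of `Hotel_Reviews_Filtered.csv`, so it touches no files; `to_drop`, `filtered`, and `reloaded` are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows carrying two of the columns that get dropped
toy = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel B"],
    "Reviewer_Score": [8.0, 6.5],
    "days_since_review": ["3 days", "10 days"],
    "Total_Number_of_Reviews_Reviewer_Has_Given": [7, 1],
})

to_drop = ["days_since_review", "Total_Number_of_Reviews_Reviewer_Has_Given"]
filtered = toy.drop(columns=to_drop)

# Round-trip through CSV text, using a buffer where the lesson uses a file path
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```

Passing `index=False`, as the lesson does, keeps the row index out of the file, so the reloaded dataframe has no extra unnamed column.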
---
## 🚀Challenge
