Moving last section to start of next lesson for flow

4 years ago · 3594659ca5
parent 2e960fb1a2
commit 3594659ca5
1 changed files with 1 additions and 102 deletions
--- a/6-NLP/4-Hotel-Reviews-1/README.md
+++ b/6-NLP/4-Hotel-Reviews-1/README.md
@@ -380,108 +380,7 @@ Treat the following questions as coding tasks and attempt to answer them without

   You may have noticed that there are 127 rows that have both "No Negative" and "No Positive" values for the columns `Negative_Review` and `Positive_Review` respectively. That means that the reviewer gave the hotel a numerical score but declined to write either a positive or negative review. Luckily this is a small number of rows (127 out of 515738, or 0.02%), so it probably won't skew our model or results in any particular direction. Still, you might not have expected a dataset of reviews to contain rows with no reviews, so it's worth exploring the data to discover rows like this.

-### Modifying the dataframe
-
-Now that you've explored the dataset, you can see some issues with it. Some columns are filled with useless information; others are simply incorrect. Even where they are correct, it's unclear how they were calculated, so the values cannot be independently verified by your own calculations.
-
-Next, you will add columns that will be useful later, change the values in other columns, and drop certain columns completely.
-
-Follow these steps in order:
-
-1. `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
-
-   1. Drop `lat` and `lng`
-
-   2. Replace `Hotel_Address` values with the following values (if the address contains the name of the city and the country, change it to just the city and the country).
-
-      These are the only cities and countries in the dataset:
-
-      Amsterdam, Netherlands
-
-      Barcelona, Spain
-
-      London, United Kingdom
-
-      Milan, Italy
-
-      Paris, France
-
-      Vienna, Austria 
-
-      ```python
-      def replace_address(row):
-          if "Netherlands" in row["Hotel_Address"]:
-              return "Amsterdam, Netherlands"
-          elif "Barcelona" in row["Hotel_Address"]:
-              return "Barcelona, Spain"
-          elif "United Kingdom" in row["Hotel_Address"]:
-              return "London, United Kingdom"
-          elif "Milan" in row["Hotel_Address"]:        
-              return "Milan, Italy"
-          elif "France" in row["Hotel_Address"]:
-              return "Paris, France"
-          elif "Vienna" in row["Hotel_Address"]:
-              return "Vienna, Austria" 
-      
-      # Replace all the addresses with a shortened, more useful form
-      df["Hotel_Address"] = df.apply(replace_address, axis = 1)
-      # The sum of the value_counts() should add up to the total number of reviews
-      print(df["Hotel_Address"].value_counts())
-      ```
-
-      Now you can query country-level data:
-
-      ```python
-      display(df.groupby("Hotel_Address").agg({"Hotel_Name": "nunique"}))
-      ```
-
-      |          Hotel_Address | Hotel_Name |
-      | ---------------------: | ---------: |
-      | Amsterdam, Netherlands |        105 |
-      |       Barcelona, Spain |        211 |
-      | London, United Kingdom |        400 |
-      |           Milan, Italy |        162 |
-      |          Paris, France |        458 |
-      |        Vienna, Austria |        158 |
-
-
-
-2. Hotel Meta-review columns: `Average_Score`, `Total_Number_of_Reviews`, `Additional_Number_of_Scoring`
-
-* Drop `Additional_Number_of_Scoring`
-* Replace `Total_Number_of_Reviews` with the total number of reviews for that hotel that are actually in the dataset 
-
-* Replace `Average_Score` with our own calculated score
-
-  ```python
-  # Drop `Additional_Number_of_Scoring`
-  df.drop(["Additional_Number_of_Scoring"], axis = 1, inplace=True)
-  # Replace `Total_Number_of_Reviews` and `Average_Score` with our own calculated values
-  df.Total_Number_of_Reviews = df.groupby('Hotel_Name').Reviewer_Score.transform('count')
-  df.Average_Score = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
-  ```
-
-**Review columns**
-
-- Drop `Review_Total_Negative_Word_Counts`, `Review_Total_Positive_Word_Counts`, `Review_Date` and `days_since_review`
-- Keep `Reviewer_Score`, `Negative_Review`, and `Positive_Review` as they are
-- Keep `Tags`
-  - We'll be doing some additional filtering operations on the tags in the next lesson.
-
-**Reviewer columns**
-
-- Drop `Total_Number_of_Reviews_Reviewer_Has_Given`
-- Keep `Reviewer_Nationality`
-
-Finally, save the dataset as it is now with a new name.
-
-```python
-df.drop(["Review_Total_Negative_Word_Counts", "Review_Total_Positive_Word_Counts", "Review_Date", "days_since_review", "Total_Number_of_Reviews_Reviewer_Has_Given"], axis = 1, inplace=True)
-
-# Saving new data file with calculated columns
-print("Saving results to Hotel_Reviews_Filtered.csv")
-df.to_csv(r'Hotel_Reviews_Filtered.csv', index = False)
-```
+Now that you have explored the dataset, in the next lesson you will filter the data and add some sentiment analysis.

 ---
 ## 🚀Challenge