From 5f1497ce6b38a9e0a7ca08e2d6db76618e1669ea Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Sun, 14 Feb 2021 21:37:37 +0800
Subject: [PATCH] Add. The visualization display
---
.../酒店推荐-checkpoint.ipynb | 2742 ++++++++++++++++-
.../酒店推荐.ipynb | 396 ++-
2 files changed, 3120 insertions(+), 18 deletions(-)
diff --git a/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/.ipynb_checkpoints/酒店推荐-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/.ipynb_checkpoints/酒店推荐-checkpoint.ipynb
index 2fd6442..4c5ff63 100644
--- a/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/.ipynb_checkpoints/酒店推荐-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/.ipynb_checkpoints/酒店推荐-checkpoint.ipynb
@@ -1,6 +1,2744 @@
{
- "cells": [],
- "metadata": {},
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
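+ "### Background\n",
+ "When a new user arrives, the system has no history to base recommendations on. It can instead recommend items related to what the user is currently viewing, for example hotels near transit or scenic spots, or hotels with specific features such as included breakfast or an elevator. This notebook shows how to make recommendations based on the similarity between hotels.\n",
+ "\n",
+ "#### Recommending similar hotels based on their text descriptions"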
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ " \n",
+ " "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from nltk.corpus import stopwords\n",
+ "from sklearn.metrics.pairwise import linear_kernel\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+ "import re\n",
+ "import random\n",
+ "import cufflinks # pip install cufflinks\n",
+ "import matplotlib.pyplot as plt\n",
+ "from plotly.offline import iplot\n",
+ "cufflinks.go_offline()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>name</th>\n",
+ " <th>address</th>\n",
+ " <th>desc</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>Hilton Garden Seattle Downtown</td>\n",
+ " <td>1821 Boren Avenue, Seattle Washington 98101 USA</td>\n",
+ " <td>Located on the southern tip of Lake Union, the...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>Sheraton Grand Seattle</td>\n",
+ " <td>1400 6th Avenue, Seattle, Washington 98101 USA</td>\n",
+ " <td>Located in the city's vibrant core, the Sherat...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>Crowne Plaza Seattle Downtown</td>\n",
+ " <td>1113 6th Ave, Seattle, WA 98101</td>\n",
+ " <td>Located in the heart of downtown Seattle, the ...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>Kimpton Hotel Monaco Seattle</td>\n",
+ " <td>1101 4th Ave, Seattle, WA98101</td>\n",
+ " <td>What?s near our hotel downtown Seattle locatio...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>The Westin Seattle</td>\n",
+ " <td>1900 5th Avenue, Seattle, Washington 98101 USA</td>\n",
+ " <td>Situated amid incredible shopping and iconic a...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " name \\\n",
+ "0 Hilton Garden Seattle Downtown \n",
+ "1 Sheraton Grand Seattle \n",
+ "2 Crowne Plaza Seattle Downtown \n",
+ "3 Kimpton Hotel Monaco Seattle \n",
+ "4 The Westin Seattle \n",
+ "\n",
+ " address \\\n",
+ "0 1821 Boren Avenue, Seattle Washington 98101 USA \n",
+ "1 1400 6th Avenue, Seattle, Washington 98101 USA \n",
+ "2 1113 6th Ave, Seattle, WA 98101 \n",
+ "3 1101 4th Ave, Seattle, WA98101 \n",
+ "4 1900 5th Avenue, Seattle, Washington 98101 USA \n",
+ "\n",
+ " desc \n",
+ "0 Located on the southern tip of Lake Union, the... \n",
+ "1 Located in the city's vibrant core, the Sherat... \n",
+ "2 Located in the heart of downtown Seattle, the ... \n",
+ "3 What?s near our hotel downtown Seattle locatio... \n",
+ "4 Situated amid incredible shopping and iconic a... "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv(\"data/Seattle_Hotels.csv\", encoding=\"latin-1\") # Seattle hotel recommendation data\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The columns above are the hotel name, address, and description."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(152, 3)"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"Located on the southern tip of Lake Union, the Hilton Garden Inn Seattle Downtown hotel is perfectly located for business and leisure. \\nThe neighborhood is home to numerous major international companies including Amazon, Google and the Bill & Melinda Gates Foundation. A wealth of eclectic restaurants and bars make this area of Seattle one of the most sought out by locals and visitors. Our proximity to Lake Union allows visitors to take in some of the Pacific Northwest's majestic scenery and enjoy outdoor activities like kayaking and sailing. over 2,000 sq. ft. of versatile space and a complimentary business center. State-of-the-art A/V technology and our helpful staff will guarantee your conference, cocktail reception or wedding is a success. Refresh in the sparkling saltwater pool, or energize with the latest equipment in the 24-hour fitness center. Tastefully decorated and flooded with natural light, our guest rooms and suites offer everything you need to relax and stay productive. Unwind in the bar, and enjoy American cuisine for breakfast, lunch and dinner in our restaurant. The 24-hour Pavilion Pantry? stocks a variety of snacks, drinks and sundries.\""
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df['desc'][0] # Inspect one example hotel description"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Word frequency statistics\n",
+ "Count which words appear most often across the hotel descriptions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vec = CountVectorizer().fit(df['desc']) # Fit the vectorizer to learn the vocabulary\n",
+ "bag_of_words = vec.transform(df['desc']) # Convert the text into a sparse matrix of token counts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0, 1, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " ...,\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 0, 0, 0],\n",
+ " [0, 0, 0, ..., 1, 0, 0]], dtype=int64)"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "bag_of_words.toarray()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(152, 3200)"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "bag_of_words.shape # 152 rows, one per hotel in the data above, and 3200 distinct words"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "matrix([[ 1, 11, 11, ..., 2, 6, 2]], dtype=int64)"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sum_words = bag_of_words.sum(axis=0) # Total occurrences of each word across all descriptions\n",
+ "sum_words"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[('located', 108),\n",
+ " ('on', 129),\n",
+ " ('the', 1258),\n",
+ " ('southern', 1),\n",
+ " ('tip', 1),\n",
+ " ('of', 536),\n",
+ " ('lake', 41),\n",
+ " ('union', 33),\n",
+ " ('hilton', 12),\n",
+ " ('garden', 11),\n",
+ " ('inn', 89),\n",
+ " ('seattle', 533),\n",
+ " ('downtown', 133),\n",
+ " ('hotel', 295),\n",
+ " ('is', 271),\n",
+ " ('perfectly', 6),\n",
+ " ('for', 216),\n",
+ " ('business', 87),\n",
+ " ('and', 1062),\n",
+ " ('leisure', 18),\n",
+ " ('neighborhood', 35),\n",
+ " ('home', 57),\n",
+ " ('to', 471),\n",
+ " ('numerous', 1),\n",
+ " ('major', 12),\n",
+ " ('international', 32),\n",
+ " ('companies', 6),\n",
+ " ('including', 47),\n",
+ " ('amazon', 19),\n",
+ " ('google', 6),\n",
+ " ('bill', 4),\n",
+ " ('melinda', 4),\n",
+ " ('gates', 5),\n",
+ " ('foundation', 4),\n",
+ " ('wealth', 1),\n",
+ " ('eclectic', 8),\n",
+ " ('restaurants', 35),\n",
+ " ('bars', 7),\n",
+ " ('make', 43),\n",
+ " ('this', 63),\n",
+ " ('area', 51),\n",
+ " ('one', 75),\n",
+ " ('most', 40),\n",
+ " ('sought', 1),\n",
+ " ('out', 23),\n",
+ " ('by', 71),\n",
+ " ('locals', 5),\n",
+ " ('visitors', 4),\n",
+ " ('our', 359),\n",
+ " ('proximity', 8),\n",
+ " ('allows', 3),\n",
+ " ('take', 31),\n",
+ " ('in', 449),\n",
+ " ('some', 22),\n",
+ " ('pacific', 42),\n",
+ " ('northwest', 42),\n",
+ " ('majestic', 4),\n",
+ " ('scenery', 2),\n",
+ " ('enjoy', 93),\n",
+ " ('outdoor', 23),\n",
+ " ('activities', 8),\n",
+ " ('like', 46),\n",
+ " ('kayaking', 3),\n",
+ " ('sailing', 1),\n",
+ " ('over', 14),\n",
+ " ('000', 11),\n",
+ " ('sq', 4),\n",
+ " ('ft', 4),\n",
+ " ('versatile', 3),\n",
+ " ('space', 97),\n",
+ " ('complimentary', 62),\n",
+ " ('center', 151),\n",
+ " ('state', 30),\n",
+ " ('art', 44),\n",
+ " ('technology', 4),\n",
+ " ('helpful', 2),\n",
+ " ('staff', 9),\n",
+ " ('will', 46),\n",
+ " ('guarantee', 3),\n",
+ " ('your', 186),\n",
+ " ('conference', 6),\n",
+ " ('cocktail', 6),\n",
+ " ('reception', 7),\n",
+ " ('or', 161),\n",
+ " ('wedding', 4),\n",
+ " ('success', 4),\n",
+ " ('refresh', 4),\n",
+ " ('sparkling', 2),\n",
+ " ('saltwater', 1),\n",
+ " ('pool', 37),\n",
+ " ('energize', 2),\n",
+ " ('with', 280),\n",
+ " ('latest', 4),\n",
+ " ('equipment', 3),\n",
+ " ('24', 42),\n",
+ " ('hour', 32),\n",
+ " ('fitness', 42),\n",
+ " ('tastefully', 4),\n",
+ " ('decorated', 4),\n",
+ " ('flooded', 1),\n",
+ " ('natural', 8),\n",
+ " ('light', 26),\n",
+ " ('guest', 57),\n",
+ " ('rooms', 106),\n",
+ " ('suites', 67),\n",
+ " ('offer', 59),\n",
+ " ('everything', 18),\n",
+ " ('you', 304),\n",
+ " ('need', 25),\n",
+ " ('relax', 25),\n",
+ " ('stay', 105),\n",
+ " ('productive', 4),\n",
+ " ('unwind', 11),\n",
+ " ('bar', 34),\n",
+ " ('american', 5),\n",
+ " ('cuisine', 11),\n",
+ " ('breakfast', 68),\n",
+ " ('lunch', 4),\n",
+ " ('dinner', 7),\n",
+ " ('restaurant', 32),\n",
+ " ('pavilion', 1),\n",
+ " ('pantry', 2),\n",
+ " ('stocks', 1),\n",
+ " ('variety', 12),\n",
+ " ('snacks', 9),\n",
+ " ('drinks', 6),\n",
+ " ('sundries', 2),\n",
+ " ('city', 79),\n",
+ " ('vibrant', 14),\n",
+ " ('core', 5),\n",
+ " ('sheraton', 8),\n",
+ " ('grand', 13),\n",
+ " ('provides', 9),\n",
+ " ('gateway', 4),\n",
+ " ('diverse', 5),\n",
+ " ('sights', 2),\n",
+ " ('sounds', 2),\n",
+ " ('step', 11),\n",
+ " ('front', 11),\n",
+ " ('doors', 8),\n",
+ " ('find', 31),\n",
+ " ('gourmet', 1),\n",
+ " ('dining', 36),\n",
+ " ('world', 24),\n",
+ " ('class', 13),\n",
+ " ('shopping', 31),\n",
+ " ('exciting', 7),\n",
+ " ('entertainment', 11),\n",
+ " ('iconic', 15),\n",
+ " ('local', 45),\n",
+ " ('attractions', 59),\n",
+ " ('pike', 90),\n",
+ " ('place', 102),\n",
+ " ('market', 97),\n",
+ " ('needle', 68),\n",
+ " ('chihuly', 3),\n",
+ " ('glass', 10),\n",
+ " ('museum', 43),\n",
+ " ('as', 117),\n",
+ " ('only', 34),\n",
+ " ('seven', 6),\n",
+ " ('hotels', 28),\n",
+ " ('north', 14),\n",
+ " ('america', 5),\n",
+ " ('earn', 3),\n",
+ " ('esteemed', 1),\n",
+ " ('designation', 1),\n",
+ " ('guests', 54),\n",
+ " ('can', 55),\n",
+ " ('book', 11),\n",
+ " ('confidently', 1),\n",
+ " ('knowing', 1),\n",
+ " ('they', 11),\n",
+ " ('re', 64),\n",
+ " ('receiving', 1),\n",
+ " ('highest', 2),\n",
+ " ('benchmark', 1),\n",
+ " ('product', 1),\n",
+ " ('service', 53),\n",
+ " ('offerings', 2),\n",
+ " ('available', 36),\n",
+ " ('experience', 52),\n",
+ " ('recently', 7),\n",
+ " ('completed', 1),\n",
+ " ('multimillion', 1),\n",
+ " ('dollar', 1),\n",
+ " ('transformation', 1),\n",
+ " ('featuring', 26),\n",
+ " ('all', 100),\n",
+ " ('new', 17),\n",
+ " ('an', 91),\n",
+ " ('expanded', 5),\n",
+ " ('club', 17),\n",
+ " ('lounge', 20),\n",
+ " ('modern', 34),\n",
+ " ('meeting', 27),\n",
+ " ('event', 29),\n",
+ " ('spaces', 11),\n",
+ " ('gather', 1),\n",
+ " ('stylish', 17),\n",
+ " ('lobby', 26),\n",
+ " ('private', 29),\n",
+ " ('collection', 6),\n",
+ " ('artists', 2),\n",
+ " ('while', 34),\n",
+ " ('enjoying', 5),\n",
+ " ('favorite', 7),\n",
+ " ('beverage', 5),\n",
+ " ('from', 224),\n",
+ " ('starbucks', 12),\n",
+ " ('features', 27),\n",
+ " ('several', 5),\n",
+ " ('options', 14),\n",
+ " ('loulay', 1),\n",
+ " ('kitchen', 17),\n",
+ " ('james', 1),\n",
+ " ('beard', 1),\n",
+ " ('award', 13),\n",
+ " ('winning', 11),\n",
+ " ('chef', 3),\n",
+ " ('thierry', 1),\n",
+ " ('rautureau', 1),\n",
+ " ('heart', 35),\n",
+ " ('crowne', 5),\n",
+ " ('plaza', 4),\n",
+ " ('offers', 43),\n",
+ " ('exceptional', 5),\n",
+ " ('blend', 3),\n",
+ " ('style', 27),\n",
+ " ('comfort', 24),\n",
+ " ('ll', 48),\n",
+ " ('notice', 1),\n",
+ " ('cool', 5),\n",
+ " ('comfortable', 30),\n",
+ " ('unconventional', 1),\n",
+ " ('touches', 4),\n",
+ " ('that', 65),\n",
+ " ('set', 4),\n",
+ " ('us', 21),\n",
+ " ('apart', 1),\n",
+ " ('soon', 3),\n",
+ " ('inside', 7),\n",
+ " ('marvel', 2),\n",
+ " ('at', 231),\n",
+ " ('stunning', 7),\n",
+ " ('views', 39),\n",
+ " ('lights', 4),\n",
+ " ('relaxing', 12),\n",
+ " ('sleep', 6),\n",
+ " ('advantage', 12),\n",
+ " ('beds', 17),\n",
+ " ('wireless', 9),\n",
+ " ('internet', 26),\n",
+ " ('throughout', 7),\n",
+ " ('amenities', 60),\n",
+ " ('help', 11),\n",
+ " ('temple', 1),\n",
+ " ('spa', 13),\n",
+ " ('tight', 1),\n",
+ " ('amenity', 1),\n",
+ " ('kits', 1),\n",
+ " ('lavender', 1),\n",
+ " ('spray', 1),\n",
+ " ('lotions', 1),\n",
+ " ('rejuvenate', 1),\n",
+ " ('invigorating', 2),\n",
+ " ('workout', 7),\n",
+ " ('get', 11),\n",
+ " ('suggestions', 1),\n",
+ " ('expert', 2),\n",
+ " ('concierge', 3),\n",
+ " ('savor', 6),\n",
+ " ('sumptuous', 1),\n",
+ " ('regatta', 1),\n",
+ " ('grille', 1),\n",
+ " ('where', 27),\n",
+ " ('happy', 6),\n",
+ " ('daily', 10),\n",
+ " ('4pm', 1),\n",
+ " ('7pm', 1),\n",
+ " ('monthly', 1),\n",
+ " ('drink', 5),\n",
+ " ('specials', 2),\n",
+ " ('come', 17),\n",
+ " ('emerald', 17),\n",
+ " ('has', 41),\n",
+ " ('what', 11),\n",
+ " ('near', 48),\n",
+ " ('location', 33),\n",
+ " ('better', 5),\n",
+ " ('question', 2),\n",
+ " ('might', 5),\n",
+ " ('be', 43),\n",
+ " ('not', 20),\n",
+ " ('nearby', 16),\n",
+ " ('addition', 4),\n",
+ " ('being', 1),\n",
+ " ('here', 19),\n",
+ " ('just', 82),\n",
+ " ('small', 8),\n",
+ " ('sampling', 2),\n",
+ " ('rest', 4),\n",
+ " ('columbia', 2),\n",
+ " ('whose', 1),\n",
+ " ('sky', 2),\n",
+ " ('view', 11),\n",
+ " ('observatory', 1),\n",
+ " ('73rd', 1),\n",
+ " ('floor', 11),\n",
+ " ('tallest', 1),\n",
+ " ('public', 6),\n",
+ " ('viewing', 1),\n",
+ " ('west', 15),\n",
+ " ('mississippi', 1),\n",
+ " ('historic', 24),\n",
+ " ('5th', 4),\n",
+ " ('avenue', 15),\n",
+ " ('theatre', 2),\n",
+ " ('musical', 1),\n",
+ " ('productions', 1),\n",
+ " ('central', 12),\n",
+ " ('library', 7),\n",
+ " ('architectural', 2),\n",
+ " ('within', 36),\n",
+ " ('half', 5),\n",
+ " ('mile', 14),\n",
+ " ('must', 5),\n",
+ " ('see', 15),\n",
+ " ('which', 14),\n",
+ " ('houses', 5),\n",
+ " ('original', 11),\n",
+ " ('pioneer', 15),\n",
+ " ('square', 28),\n",
+ " ('fantastic', 3),\n",
+ " ('flagship', 1),\n",
+ " ('nordstrom', 6),\n",
+ " ('rack', 1),\n",
+ " ('macy', 1),\n",
+ " ('sportswear', 1),\n",
+ " ('louis', 1),\n",
+ " ('vuitton', 1),\n",
+ " ('arcteryx', 1),\n",
+ " ('oodles', 1),\n",
+ " ('independent', 2),\n",
+ " ('boutiques', 5),\n",
+ " ('great', 39),\n",
+ " ('wheel', 2),\n",
+ " ('washington', 67),\n",
+ " ('convention', 24),\n",
+ " ('about', 11),\n",
+ " ('bell', 2),\n",
+ " ('street', 26),\n",
+ " ('pier', 5),\n",
+ " ('cruise', 11),\n",
+ " ('terminal', 6),\n",
+ " ('66', 1),\n",
+ " ('sports', 13),\n",
+ " ('stadiums', 5),\n",
+ " ('centurylink', 17),\n",
+ " ('field', 34),\n",
+ " ('safeco', 20),\n",
+ " ('seahawks', 10),\n",
+ " ('mariners', 9),\n",
+ " ('sounders', 4),\n",
+ " ('situated', 14),\n",
+ " ('amid', 3),\n",
+ " ('incredible', 3),\n",
+ " ('westin', 1),\n",
+ " ('contemporary', 12),\n",
+ " ('haven', 3),\n",
+ " ('prime', 3),\n",
+ " ('recharge', 2),\n",
+ " ('accommodations', 15),\n",
+ " ('comforts', 14),\n",
+ " ('signature', 14),\n",
+ " ('heavenly', 2),\n",
+ " ('gorgeous', 7),\n",
+ " ('skyline', 8),\n",
+ " ('puget', 16),\n",
+ " ('sound', 21),\n",
+ " ('cascade', 1),\n",
+ " ('mountain', 5),\n",
+ " ('range', 9),\n",
+ " ('newly', 5),\n",
+ " ('renovated', 10),\n",
+ " ('1900', 1),\n",
+ " ('fifth', 2),\n",
+ " ('offering', 21),\n",
+ " ('carefully', 2),\n",
+ " ('curated', 4),\n",
+ " ('wine', 10),\n",
+ " ('crafted', 3),\n",
+ " ('explore', 20),\n",
+ " ('spectacular', 6),\n",
+ " ('celebrated', 3),\n",
+ " ('waterfront', 38),\n",
+ " ('host', 9),\n",
+ " ('unforgettable', 3),\n",
+ " ('meetings', 16),\n",
+ " ('social', 11),\n",
+ " ('engagements', 1),\n",
+ " ('more', 29),\n",
+ " ('than', 19),\n",
+ " ('70', 2),\n",
+ " ('feet', 16),\n",
+ " ('enhanced', 1),\n",
+ " ('planning', 8),\n",
+ " ('custom', 6),\n",
+ " ('catering', 8),\n",
+ " ('mind', 8),\n",
+ " ('body', 1),\n",
+ " ('sleek', 5),\n",
+ " ('westinworkout', 1),\n",
+ " ('studio', 11),\n",
+ " ('designed', 21),\n",
+ " ('reflect', 3),\n",
+ " ('substance', 1),\n",
+ " ('welcoming', 5),\n",
+ " ('best', 50),\n",
+ " ('paramount', 4),\n",
+ " ('summons', 1),\n",
+ " ('feel', 11),\n",
+ " ('cozy', 10),\n",
+ " ('elegant', 5),\n",
+ " ('luxurious', 4),\n",
+ " ('residence', 5),\n",
+ " ('friendly', 29),\n",
+ " ('hosts', 3),\n",
+ " ('asian', 2),\n",
+ " ('right', 15),\n",
+ " ('downstairs', 1),\n",
+ " ('fall', 1),\n",
+ " ('love', 9),\n",
+ " ('simple', 6),\n",
+ " ('luxury', 18),\n",
+ " ('charm', 10),\n",
+ " ('boutique', 8),\n",
+ " ('warm', 8),\n",
+ " ('inviting', 10),\n",
+ " ('wood', 7),\n",
+ " ('finishes', 2),\n",
+ " ('comfy', 4),\n",
+ " ('seating', 7),\n",
+ " ('areas', 12),\n",
+ " ('fireplace', 6),\n",
+ " ('classically', 1),\n",
+ " ('appointed', 9),\n",
+ " ('dash', 1),\n",
+ " ('urban', 15),\n",
+ " ('flair', 1),\n",
+ " ('puts', 5),\n",
+ " ('good', 4),\n",
+ " ('company', 3),\n",
+ " ('block', 10),\n",
+ " ('walking', 21),\n",
+ " ('distance', 23),\n",
+ " ('cafes', 1),\n",
+ " ('there', 15),\n",
+ " ('are', 136),\n",
+ " ('many', 31),\n",
+ " ('reasons', 1),\n",
+ " ('annually', 1),\n",
+ " ('ranked', 1),\n",
+ " ('among', 7),\n",
+ " ('top', 17),\n",
+ " ('five', 12),\n",
+ " ('why', 4),\n",
+ " ('yours', 2),\n",
+ " ('shops', 14),\n",
+ " ('sightseeing', 4),\n",
+ " ('tour', 5),\n",
+ " ('rent', 1),\n",
+ " ('car', 7),\n",
+ " ('if', 19),\n",
+ " ('town', 13),\n",
+ " ('walk', 28),\n",
+ " ('via', 5),\n",
+ " ('underground', 3),\n",
+ " ('concourse', 1),\n",
+ " ('hungry', 1),\n",
+ " ('visit', 21),\n",
+ " ('redtrees', 1),\n",
+ " ('wide', 6),\n",
+ " ('satisfy', 2),\n",
+ " ('any', 12),\n",
+ " ('foodie', 1),\n",
+ " ('destination', 5),\n",
+ " ('steps', 20),\n",
+ " ('everywhere', 3),\n",
+ " ('want', 11),\n",
+ " ('motif', 1),\n",
+ " ('welcome', 20),\n",
+ " ('libation', 1),\n",
+ " ('rooftop', 8),\n",
+ " ('across', 16),\n",
+ " ('touchstones', 1),\n",
+ " ('sweeping', 2),\n",
+ " ('landscape', 3),\n",
+ " ('rich', 8),\n",
+ " ('arts', 7),\n",
+ " ('music', 16),\n",
+ " ('culture', 6),\n",
+ " ('infuse', 1),\n",
+ " ('surroundings', 1),\n",
+ " ('residences', 1),\n",
+ " ('hardwoods', 1),\n",
+ " ('colors', 2),\n",
+ " ('inspired', 12),\n",
+ " ('region', 3),\n",
+ " ('culinary', 5),\n",
+ " ('bounty', 3),\n",
+ " ('reflected', 1),\n",
+ " ('menus', 1),\n",
+ " ('frolik', 1),\n",
+ " ('cocktails', 5),\n",
+ " ('adjoining', 2),\n",
+ " ('patio', 6),\n",
+ " ('join', 6),\n",
+ " ('between', 5),\n",
+ " ('monorail', 5),\n",
+ " ('rail', 19),\n",
+ " ('airport', 99),\n",
+ " ('stroll', 2),\n",
+ " ('away', 59),\n",
+ " ('known', 8),\n",
+ " ('setting', 6),\n",
+ " ('trends', 2),\n",
+ " ('warwick', 5),\n",
+ " ('leading', 4),\n",
+ " ('way', 5),\n",
+ " ('upbeat', 1),\n",
+ " ('belltown', 13),\n",
+ " ('district', 27),\n",
+ " ('blocks', 21),\n",
+ " ('blends', 1),\n",
+ " ('classic', 8),\n",
+ " ('expected', 1),\n",
+ " ('name', 2),\n",
+ " ('styling', 1),\n",
+ " ('boasting', 1),\n",
+ " ('unique', 16),\n",
+ " ('staying', 12),\n",
+ " ('truly', 3),\n",
+ " ('finding', 2),\n",
+ " ('pleasant', 3),\n",
+ " ('surprises', 1),\n",
+ " ('along', 13),\n",
+ " ('refreshing', 4),\n",
+ " ('seaborne', 1),\n",
+ " ('mists', 1),\n",
+ " ('breeze', 3),\n",
+ " ('evergreen', 2),\n",
+ " ('covered', 4),\n",
+ " ('hills', 1),\n",
+ " ('lining', 1),\n",
+ " ('horizon', 2),\n",
+ " ('doorstep', 2),\n",
+ " ('anything', 2),\n",
+ " ('possible', 5),\n",
+ " ('surrounded', 4),\n",
+ " ('snow', 2),\n",
+ " ('capped', 2),\n",
+ " ('peaks', 1),\n",
+ " ('deep', 3),\n",
+ " ('blue', 1),\n",
+ " ('waters', 3),\n",
+ " ('swaths', 1),\n",
+ " ('forests', 1),\n",
+ " ('wild', 2),\n",
+ " ('it', 55),\n",
+ " ('trendy', 4),\n",
+ " ('side', 4),\n",
+ " ('another', 2),\n",
+ " ('elliott', 10),\n",
+ " ('bay', 14),\n",
+ " ('gleaming', 1),\n",
+ " ('wake', 6),\n",
+ " ('fresh', 18),\n",
+ " ('cup', 6),\n",
+ " ('coffee', 38),\n",
+ " ('delivered', 1),\n",
+ " ('straight', 3),\n",
+ " ('room', 77),\n",
+ " ('then', 10),\n",
+ " ('head', 5),\n",
+ " ('neighbourhoods', 1),\n",
+ " ('craft', 5),\n",
+ " ('breweries', 3),\n",
+ " ('spend', 9),\n",
+ " ('day', 39),\n",
+ " ('hiking', 6),\n",
+ " ('up', 44),\n",
+ " ('mount', 3),\n",
+ " ('rainer', 1),\n",
+ " ('nightfall', 1),\n",
+ " ('meet', 12),\n",
+ " ('goldfinch', 1),\n",
+ " ('tavern', 1),\n",
+ " ('ethan', 1),\n",
+ " ('stowell', 1),\n",
+ " ('let', 4),\n",
+ " ('chefs', 2),\n",
+ " ('show', 4),\n",
+ " ('flavours', 1),\n",
+ " ('favourite', 1),\n",
+ " ('soak', 3),\n",
+ " ('scene', 6),\n",
+ " ('living', 22),\n",
+ " ('mix', 2),\n",
+ " ('live', 10),\n",
+ " ('dj', 1),\n",
+ " ('series', 1),\n",
+ " ('before', 3),\n",
+ " ('heading', 3),\n",
+ " ('memorable', 4),\n",
+ " ('trace', 1),\n",
+ " ('seasonal', 10),\n",
+ " ('fare', 4),\n",
+ " ('atmosphere', 7),\n",
+ " ('missed', 1),\n",
+ " ('work', 25),\n",
+ " ('off', 19),\n",
+ " ('next', 20),\n",
+ " ('morning', 18),\n",
+ " ('fit', 7),\n",
+ " ('wandering', 1),\n",
+ " ('always', 7),\n",
+ " ('we', 128),\n",
+ " ('ve', 7),\n",
+ " ('got', 4),\n",
+ " ('during', 9),\n",
+ " ('time', 24),\n",
+ " ('whatever', 2),\n",
+ " ('whenever', 1),\n",
+ " ('wish', 2),\n",
+ " ('command', 1),\n",
+ " ('upscale', 11),\n",
+ " ('getaway', 6),\n",
+ " ('hyatt', 15),\n",
+ " ('theater', 6),\n",
+ " ('four', 9),\n",
+ " ('diamond', 4),\n",
+ " ('landmarks', 6),\n",
+ " ('destinations', 3),\n",
+ " ('scenic', 4),\n",
+ " ('luxuriate', 1),\n",
+ " ('opt', 2),\n",
+ " ('decadent', 1),\n",
+ " ('suite', 25),\n",
+ " ('bath', 10),\n",
+ " ('upgraded', 2),\n",
+ " ('access', 59),\n",
+ " ('ever', 3),\n",
+ " ('had', 1),\n",
+ " ('reading', 3),\n",
+ " ('couldn', 3),\n",
+ " ('put', 5),\n",
+ " ('down', 6),\n",
+ " ('but', 12),\n",
+ " ('never', 4),\n",
+ " ('wanted', 2),\n",
+ " ('end', 5),\n",
+ " ('know', 8),\n",
+ " ('kimpton', 5),\n",
+ " ('alexis', 3),\n",
+ " ('1901', 1),\n",
+ " ('building', 19),\n",
+ " ('close', 34),\n",
+ " ('enough', 5),\n",
+ " ('smell', 1),\n",
+ " ('sea', 23),\n",
+ " ('air', 11),\n",
+ " ('plot', 1),\n",
+ " ('peaceful', 2),\n",
+ " ('sanctuary', 2),\n",
+ " ('den', 1),\n",
+ " ('mixed', 1),\n",
+ " ('matched', 3),\n",
+ " ('characters', 1),\n",
+ " ('attentive', 3),\n",
+ " ('members', 1),\n",
+ " ('who', 5),\n",
+ " ('seem', 1),\n",
+ " ('plus', 15),\n",
+ " ('fellow', 2),\n",
+ " ('interesting', 3),\n",
+ " ('stories', 2),\n",
+ " ('tell', 4),\n",
+ " ('ending', 1),\n",
+ " ('without', 4),\n",
+ " ('giving', 3),\n",
+ " ('easy', 44),\n",
+ " ('perennial', 1),\n",
+ " ('seller', 1),\n",
+ " ('positioned', 1),\n",
+ " ('edge', 3),\n",
+ " ('borders', 1),\n",
+ " ('retail', 4),\n",
+ " ('guestrooms', 15),\n",
+ " ('turntables', 1),\n",
+ " ('vinyl', 1),\n",
+ " ('max', 1),\n",
+ " ('dedicated', 5),\n",
+ " ('lovers', 2),\n",
+ " ('indulge', 3),\n",
+ " ('provenance', 1),\n",
+ " ('locally', 7),\n",
+ " ('influenced', 1),\n",
+ " ('honor', 1),\n",
+ " ('beer', 5),\n",
+ " ('miller', 1),\n",
+ " ('guild', 1),\n",
+ " ('fell', 1),\n",
+ " ('former', 1),\n",
+ " ('maritime', 1),\n",
+ " ('workers', 1),\n",
+ " ('started', 1),\n",
+ " ('first', 19),\n",
+ " ('1999', 1),\n",
+ " ('roots', 1),\n",
+ " ('unfussy', 1),\n",
+ " ('intentional', 1),\n",
+ " ('design', 5),\n",
+ " ('ethos', 1),\n",
+ " ('drive', 12),\n",
+ " ('loft', 3),\n",
+ " ('ceilings', 3),\n",
+ " ('hardwood', 1),\n",
+ " ('floors', 7),\n",
+ " ('wherever', 1),\n",
+ " ('could', 5),\n",
+ " ('preserve', 1),\n",
+ " ('them', 3),\n",
+ " ('friends', 11),\n",
+ " ('kaws', 1),\n",
+ " ('shepard', 1),\n",
+ " ('fairey', 1),\n",
+ " ('were', 2),\n",
+ " ('elements', 3),\n",
+ " ('hotelier', 1),\n",
+ " ('map', 1),\n",
+ " ('still', 3),\n",
+ " ('touch', 1),\n",
+ " ('point', 1),\n",
+ " ('ace', 1),\n",
+ " ('today', 16),\n",
+ " ('when', 38),\n",
+ " ('marriott', 5),\n",
+ " ('reveals', 1),\n",
+ " ('mountains', 9),\n",
+ " ('famous', 15),\n",
+ " ('elevator', 2),\n",
+ " ('ride', 9),\n",
+ " ('sit', 3),\n",
+ " ('adjacent', 4),\n",
+ " ('harbor', 4),\n",
+ " ('also', 62),\n",
+ " ('port', 4),\n",
+ " ('aquarium', 10),\n",
+ " ('westlake', 5),\n",
+ " ('olympic', 13),\n",
+ " ('sculpture', 4),\n",
+ " ('park', 35),\n",
+ " ('outfitted', 3),\n",
+ " ('plush', 13),\n",
+ " ('bedding', 9),\n",
+ " ('mini', 7),\n",
+ " ('refrigerators', 5),\n",
+ " ('large', 11),\n",
+ " ('desks', 5),\n",
+ " ('wi', 37),\n",
+ " ('fi', 37),\n",
+ " ('balconies', 3),\n",
+ " ('junior', 1),\n",
+ " ('provide', 12),\n",
+ " ('perfect', 30),\n",
+ " ('extended', 15),\n",
+ " ('stays', 7),\n",
+ " ('reimagined', 2),\n",
+ " ('its', 21),\n",
+ " ('special', 13),\n",
+ " ('perks', 3),\n",
+ " ('indoor', 16),\n",
+ " ('gym', 3),\n",
+ " ('delicious', 8),\n",
+ " ('gastropub', 1),\n",
+ " ('tempting', 1),\n",
+ " ('libations', 1),\n",
+ " ('found', 5),\n",
+ " ('popular', 12),\n",
+ " ('look', 4),\n",
+ " ('no', 15),\n",
+ " ('further', 2),\n",
+ " ('10', 11),\n",
+ " ('redesigned', 1),\n",
+ " ('venues', 8),\n",
+ " ('well', 33),\n",
+ " ('supported', 1),\n",
+ " ('edgewater', 2),\n",
+ " ('reported', 1),\n",
+ " ('cnbc', 1),\n",
+ " ('amazing', 1),\n",
+ " ('breathtaking', 4),\n",
+ " ('sunset', 3),\n",
+ " ('ship', 5),\n",
+ " ('terminals', 5),\n",
+ " ('sites', 4),\n",
+ " ('decide', 1),\n",
+ " ('turn', 2),\n",
+ " ('tub', 8),\n",
+ " ('water', 10),\n",
+ " ('67', 1),\n",
+ " ('dynamic', 2),\n",
+ " ('soul', 1),\n",
+ " ('lodging', 6),\n",
+ " ('river', 1),\n",
+ " ('rock', 3),\n",
+ " ('fireplaces', 3),\n",
+ " ('wilderness', 2),\n",
+ " ('landscapes', 1),\n",
+ " ('outside', 13),\n",
+ " ('window', 1),\n",
+ " ('treat', 1),\n",
+ " ('yourself', 8),\n",
+ " ('rewarding', 1),\n",
+ " ('springhill', 2),\n",
+ " ('south', 19),\n",
+ " ('goal', 4),\n",
+ " ('whether', 23),\n",
+ " ('night', 13),\n",
+ " ('weekend', 2),\n",
+ " ('each', 28),\n",
+ " ('spacious', 29),\n",
+ " ('separate', 6),\n",
+ " ('kitchenette', 3),\n",
+ " ('fridge', 9),\n",
+ " ('maker', 7),\n",
+ " ('microwave', 12),\n",
+ " ('convenience', 18),\n",
+ " ('every', 29),\n",
+ " ('onsite', 8),\n",
+ " ('bistro', 3),\n",
+ " ('yale', 1),\n",
+ " ('innovative', 5),\n",
+ " ('additional', 4),\n",
+ " ('highlights', 1),\n",
+ " ('include', 19),\n",
+ " ('swimming', 10),\n",
+ " ('so', 12),\n",
+ " ('pamper', 2),\n",
+ " ('last', 2),\n",
+ " ('least', 1),\n",
+ " ('makes', 5),\n",
+ " ('mall', 11),\n",
+ " ('other', 21),\n",
+ " ('plan', 6),\n",
+ " ('forward', 1),\n",
+ " ('seeing', 1),\n",
+ " ('premier', 2),\n",
+ " ('fairmont', 3),\n",
+ " ('captures', 1),\n",
+ " ('old', 7),\n",
+ " ('elegance', 4),\n",
+ " ('italian', 2),\n",
+ " ('renaissance', 2),\n",
+ " ('built', 14),\n",
+ " ('1924', 1),\n",
+ " ('legendary', 2),\n",
+ " ('architecture', 1),\n",
+ " ('acclaimed', 3),\n",
+ " ('impeccable', 1),\n",
+ " ('corridors', 2),\n",
+ " ('full', 27),\n",
+ " ('shines', 1),\n",
+ " ('named', 4),\n",
+ " ('news', 3),\n",
+ " ('report', 3),\n",
+ " ('2018', 3),\n",
+ " ('hub', 6),\n",
+ " ('retreat', 6),\n",
+ " ('cozily', 2),\n",
+ " ('activity', 2),\n",
+ " ('streets', 2),\n",
+ " ('lined', 2),\n",
+ " ('diversified', 1),\n",
+ " ('sophisticated', 4),\n",
+ " ('chic', 5),\n",
+ " ('excursion', 2),\n",
+ " ('afternoon', 4),\n",
+ " ('gasworks', 1),\n",
+ " ('quiet', 11),\n",
+ " ('beauty', 5),\n",
+ " ('have', 35),\n",
+ " ('instant', 2),\n",
+ " ('both', 15),\n",
+ " ('worlds', 1),\n",
+ " ('travel', 25),\n",
+ " ('pleasure', 8),\n",
+ " ('trips', 1),\n",
+ " ('few', 16),\n",
+ " ('corporate', 10),\n",
+ " ('vacationers', 2),\n",
+ " ('museums', 6),\n",
+ " ('less', 11),\n",
+ " ('landmark', 9),\n",
+ " ('distinctly', 1),\n",
+ " ('charming', 1),\n",
+ " ('unmistakable', 1),\n",
+ " ('sprawling', 2),\n",
+ " ('system', 5),\n",
+ " ('shows', 3),\n",
+ " ('pristine', 2),\n",
+ " ('outdoors', 3),\n",
+ " ('performing', 1),\n",
+ " ('cultural', 7),\n",
+ " ('thriving', 6),\n",
+ " ('visitor', 1),\n",
+ " ('metro', 4),\n",
+ " ('attracts', 1),\n",
+ " ('deal', 1),\n",
+ " ('professional', 5),\n",
+ " ('travelers', 23),\n",
+ " ('booming', 1),\n",
+ " ('fortune', 2),\n",
+ " ('500', 4),\n",
+ " ('costco', 1),\n",
+ " ('wholesale', 1),\n",
+ " ('microsoft', 10),\n",
+ " ('facebook', 5),\n",
+ " ('furthermore', 1),\n",
+ " ('fans', 3),\n",
+ " ('athletic', 3),\n",
+ " ('three', 11),\n",
+ " ('teams', 2),\n",
+ " ('nestled', 4),\n",
+ " ('embassy', 2),\n",
+ " ('sleeping', 5),\n",
+ " ('queen', 28),\n",
+ " ('size', 6),\n",
+ " ('sofa', 7),\n",
+ " ('bed', 41),\n",
+ " ('50', 2),\n",
+ " ('inch', 11),\n",
+ " ('hdtv', 7),\n",
+ " ('kitchenettes', 6),\n",
+ " ('dine', 6),\n",
+ " ('institution', 1),\n",
+ " ('13', 1),\n",
+ " ('coins', 1),\n",
+ " ('hand', 6),\n",
+ " ('zephyr', 1),\n",
+ " ('stop', 12),\n",
+ " ('health', 4),\n",
+ " ('includes', 8),\n",
+ " ('heated', 12),\n",
+ " ('hot', 25),\n",
+ " ('sun', 2),\n",
+ " ('deck', 5),\n",
+ " ('begin', 3),\n",
+ " ('free', 123),\n",
+ " ('made', 5),\n",
+ " ('order', 3),\n",
+ " ('evening', 8),\n",
+ " ('55', 4),\n",
+ " ('cheer', 2),\n",
+ " ('team', 6),\n",
+ " ('football', 3),\n",
+ " ('fc', 2),\n",
+ " ('baseball', 2),\n",
+ " ('game', 7),\n",
+ " ('mobile', 4),\n",
+ " ('trip', 11),\n",
+ " ('around', 14),\n",
+ " ('boat', 2),\n",
+ " ('wamu', 1),\n",
+ " ('pan', 1),\n",
+ " ('trust', 2),\n",
+ " ('settle', 4),\n",
+ " ('into', 13),\n",
+ " ('epicentre', 1),\n",
+ " ('wondering', 1),\n",
+ " ('vintage', 2),\n",
+ " ('spot', 2),\n",
+ " ('corner', 6),\n",
+ " ('spring', 3),\n",
+ " ('financial', 2),\n",
+ " ('spots', 3),\n",
+ " ('benaroya', 2),\n",
+ " ('hall', 5),\n",
+ " ('trade', 2),\n",
+ " ('transportation', 7),\n",
+ " ('easily', 7),\n",
+ " ('walkable', 3),\n",
+ " ('ferry', 3),\n",
+ " ('dozens', 2),\n",
+ " ('buses', 2),\n",
+ " ('driving', 3),\n",
+ " ('those', 4),\n",
+ " ('coming', 2),\n",
+ " ('tac', 15),\n",
+ " ...]"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # Pair each word with its total count\n",
+ "words_freq"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[('the', 1258),\n",
+ " ('and', 1062),\n",
+ " ('of', 536),\n",
+ " ('seattle', 533),\n",
+ " ('to', 471),\n",
+ " ('in', 449),\n",
+ " ('our', 359),\n",
+ " ('you', 304),\n",
+ " ('hotel', 295),\n",
+ " ('with', 280),\n",
+ " ('is', 271),\n",
+ " ('at', 231),\n",
+ " ('from', 224),\n",
+ " ('for', 216),\n",
+ " ('your', 186),\n",
+ " ('or', 161),\n",
+ " ('center', 151),\n",
+ " ('are', 136),\n",
+ " ('downtown', 133),\n",
+ " ('on', 129),\n",
+ " ('we', 128),\n",
+ " ('free', 123),\n",
+ " ('as', 117),\n",
+ " ('located', 108),\n",
+ " ('rooms', 106),\n",
+ " ('stay', 105),\n",
+ " ('place', 102),\n",
+ " ('all', 100),\n",
+ " ('airport', 99),\n",
+ " ('space', 97),\n",
+ " ('market', 97),\n",
+ " ('enjoy', 93),\n",
+ " ('an', 91),\n",
+ " ('pike', 90),\n",
+ " ('inn', 89),\n",
+ " ('business', 87),\n",
+ " ('just', 82),\n",
+ " ('city', 79),\n",
+ " ('room', 77),\n",
+ " ('one', 75),\n",
+ " ('by', 71),\n",
+ " ('breakfast', 68),\n",
+ " ('needle', 68),\n",
+ " ('suites', 67),\n",
+ " ('washington', 67),\n",
+ " ('that', 65),\n",
+ " ('re', 64),\n",
+ " ('this', 63),\n",
+ " ('complimentary', 62),\n",
+ " ('also', 62),\n",
+ " ('amenities', 60),\n",
+ " ('offer', 59),\n",
+ " ('attractions', 59),\n",
+ " ('away', 59),\n",
+ " ('access', 59),\n",
+ " ('home', 57),\n",
+ " ('guest', 57),\n",
+ " ('can', 55),\n",
+ " ('it', 55),\n",
+ " ('guests', 54),\n",
+ " ('service', 53),\n",
+ " ('experience', 52),\n",
+ " ('area', 51),\n",
+ " ('best', 50),\n",
+ " ('ll', 48),\n",
+ " ('near', 48),\n",
+ " ('including', 47),\n",
+ " ('like', 46),\n",
+ " ('will', 46),\n",
+ " ('local', 45),\n",
+ " ('art', 44),\n",
+ " ('up', 44),\n",
+ " ('easy', 44),\n",
+ " ('make', 43),\n",
+ " ('museum', 43),\n",
+ " ('offers', 43),\n",
+ " ('be', 43),\n",
+ " ('minutes', 43),\n",
+ " ('university', 43),\n",
+ " ('pacific', 42),\n",
+ " ('northwest', 42),\n",
+ " ('24', 42),\n",
+ " ('fitness', 42),\n",
+ " ('lake', 41),\n",
+ " ('has', 41),\n",
+ " ('bed', 41),\n",
+ " ('most', 40),\n",
+ " ('views', 39),\n",
+ " ('great', 39),\n",
+ " ('day', 39),\n",
+ " ('waterfront', 38),\n",
+ " ('coffee', 38),\n",
+ " ('when', 38),\n",
+ " ('pool', 37),\n",
+ " ('wi', 37),\n",
+ " ('fi', 37),\n",
+ " ('dining', 36),\n",
+ " ('available', 36),\n",
+ " ('within', 36),\n",
+ " ('neighborhood', 35),\n",
+ " ('restaurants', 35),\n",
+ " ('heart', 35),\n",
+ " ('park', 35),\n",
+ " ('have', 35),\n",
+ " ('bar', 34),\n",
+ " ('only', 34),\n",
+ " ('modern', 34),\n",
+ " ('while', 34),\n",
+ " ('field', 34),\n",
+ " ('close', 34),\n",
+ " ('union', 33),\n",
+ " ('location', 33),\n",
+ " ('well', 33),\n",
+ " ('international', 32),\n",
+ " ('hour', 32),\n",
+ " ('restaurant', 32),\n",
+ " ('high', 32),\n",
+ " ('parking', 32),\n",
+ " ('take', 31),\n",
+ " ('find', 31),\n",
+ " ('shopping', 31),\n",
+ " ('many', 31),\n",
+ " ('shuttle', 31),\n",
+ " ('state', 30),\n",
+ " ('comfortable', 30),\n",
+ " ('perfect', 30),\n",
+ " ('event', 29),\n",
+ " ('private', 29),\n",
+ " ('more', 29),\n",
+ " ('friendly', 29),\n",
+ " ('spacious', 29),\n",
+ " ('every', 29),\n",
+ " ('hotels', 28),\n",
+ " ('square', 28),\n",
+ " ('walk', 28),\n",
+ " ('each', 28),\n",
+ " ('queen', 28),\n",
+ " ('meeting', 27),\n",
+ " ('features', 27),\n",
+ " ('style', 27),\n",
+ " ('where', 27),\n",
+ " ('district', 27),\n",
+ " ('full', 27),\n",
+ " ('light', 26),\n",
+ " ('featuring', 26),\n",
+ " ('lobby', 26),\n",
+ " ('internet', 26),\n",
+ " ('street', 26),\n",
+ " ('two', 26),\n",
+ " ('such', 26),\n",
+ " ('speed', 26),\n",
+ " ('tacoma', 26),\n",
+ " ('need', 25),\n",
+ " ('relax', 25),\n",
+ " ('work', 25),\n",
+ " ('suite', 25),\n",
+ " ('travel', 25),\n",
+ " ('hot', 25),\n",
+ " ('wifi', 25),\n",
+ " ('family', 25),\n",
+ " ('km', 25),\n",
+ " ('world', 24),\n",
+ " ('comfort', 24),\n",
+ " ('historic', 24),\n",
+ " ('convention', 24),\n",
+ " ('time', 24),\n",
+ " ('miles', 24),\n",
+ " ('out', 23),\n",
+ " ('outdoor', 23),\n",
+ " ('distance', 23),\n",
+ " ('sea', 23),\n",
+ " ('whether', 23),\n",
+ " ('travelers', 23),\n",
+ " ('some', 22),\n",
+ " ('living', 22),\n",
+ " ('flat', 22),\n",
+ " ('us', 21),\n",
+ " ('sound', 21),\n",
+ " ('offering', 21),\n",
+ " ('designed', 21),\n",
+ " ('walking', 21),\n",
+ " ('visit', 21),\n",
+ " ('blocks', 21),\n",
+ " ('its', 21),\n",
+ " ('other', 21),\n",
+ " ('hill', 21),\n",
+ " ('lounge', 20),\n",
+ " ('not', 20),\n",
+ " ('safeco', 20),\n",
+ " ('explore', 20),\n",
+ " ('steps', 20),\n",
+ " ('welcome', 20),\n",
+ " ('next', 20),\n",
+ " ('feature', 20),\n",
+ " ('site', 20),\n",
+ " ('sure', 20),\n",
+ " ('amazon', 19),\n",
+ " ('here', 19),\n",
+ " ('than', 19),\n",
+ " ('if', 19),\n",
+ " ('rail', 19),\n",
+ " ('off', 19),\n",
+ " ('building', 19),\n",
+ " ('first', 19),\n",
+ " ('south', 19),\n",
+ " ('include', 19),\n",
+ " ('property', 19),\n",
+ " ('wa', 19),\n",
+ " ('motel', 19),\n",
+ " ('leisure', 18),\n",
+ " ('everything', 18),\n",
+ " ('luxury', 18),\n",
+ " ('fresh', 18),\n",
+ " ('morning', 18),\n",
+ " ('convenience', 18),\n",
+ " ('through', 18),\n",
+ " ('minute', 18),\n",
+ " ('new', 17),\n",
+ " ('club', 17),\n",
+ " ('stylish', 17),\n",
+ " ('kitchen', 17),\n",
+ " ('beds', 17),\n",
+ " ('come', 17),\n",
+ " ('emerald', 17),\n",
+ " ('centurylink', 17),\n",
+ " ('top', 17),\n",
+ " ('short', 17),\n",
+ " ('equipped', 17),\n",
+ " ('tv', 17),\n",
+ " ('food', 17),\n",
+ " ('screen', 17),\n",
+ " ('nearby', 16),\n",
+ " ('puget', 16),\n",
+ " ('meetings', 16),\n",
+ " ('feet', 16),\n",
+ " ('across', 16),\n",
+ " ('music', 16),\n",
+ " ('unique', 16),\n",
+ " ('today', 16),\n",
+ " ('indoor', 16),\n",
+ " ('few', 16),\n",
+ " ('convenient', 16),\n",
+ " ('capitol', 16),\n",
+ " ('fun', 16),\n",
+ " ('traveling', 16),\n",
+ " ('flight', 16),\n",
+ " ('iconic', 15),\n",
+ " ('west', 15),\n",
+ " ('avenue', 15),\n",
+ " ('see', 15),\n",
+ " ('pioneer', 15),\n",
+ " ('accommodations', 15),\n",
+ " ('right', 15),\n",
+ " ('urban', 15),\n",
+ " ('there', 15),\n",
+ " ('hyatt', 15),\n",
+ " ('plus', 15),\n",
+ " ('guestrooms', 15),\n",
+ " ('famous', 15),\n",
+ " ('extended', 15),\n",
+ " ('no', 15),\n",
+ " ('both', 15),\n",
+ " ('tac', 15),\n",
+ " ('anne', 15),\n",
+ " ('over', 14),\n",
+ " ('vibrant', 14),\n",
+ " ('north', 14),\n",
+ " ('options', 14),\n",
+ " ('mile', 14),\n",
+ " ('which', 14),\n",
+ " ('situated', 14),\n",
+ " ('comforts', 14),\n",
+ " ('signature', 14),\n",
+ " ('shops', 14),\n",
+ " ('bay', 14),\n",
+ " ('built', 14),\n",
+ " ('around', 14),\n",
+ " ('start', 14),\n",
+ " ('conveniently', 14),\n",
+ " ('events', 14),\n",
+ " ('continental', 14),\n",
+ " ('grand', 13),\n",
+ " ('class', 13),\n",
+ " ('award', 13),\n",
+ " ('spa', 13),\n",
+ " ('sports', 13),\n",
+ " ('town', 13),\n",
+ " ('belltown', 13),\n",
+ " ('along', 13),\n",
+ " ('olympic', 13),\n",
+ " ('plush', 13),\n",
+ " ('special', 13),\n",
+ " ('outside', 13),\n",
+ " ('night', 13),\n",
+ " ('into', 13),\n",
+ " ('after', 13),\n",
+ " ('facilities', 13),\n",
+ " ('long', 13),\n",
+ " ('pet', 13),\n",
+ " ('medical', 13),\n",
+ " ('hilton', 12),\n",
+ " ('major', 12),\n",
+ " ('variety', 12),\n",
+ " ('starbucks', 12),\n",
+ " ('relaxing', 12),\n",
+ " ('advantage', 12),\n",
+ " ('central', 12),\n",
+ " ('contemporary', 12),\n",
+ " ('areas', 12),\n",
+ " ('five', 12),\n",
+ " ('any', 12),\n",
+ " ('inspired', 12),\n",
+ " ('staying', 12),\n",
+ " ('meet', 12),\n",
+ " ('but', 12),\n",
+ " ('drive', 12),\n",
+ " ('provide', 12),\n",
+ " ('popular', 12),\n",
+ " ('microwave', 12),\n",
+ " ('so', 12),\n",
+ " ('stop', 12),\n",
+ " ('heated', 12),\n",
+ " ('bathroom', 12),\n",
+ " ('house', 12),\n",
+ " ('laundry', 12),\n",
+ " ('beautiful', 12),\n",
+ " ('seatac', 12),\n",
+ " ('garden', 11),\n",
+ " ('000', 11),\n",
+ " ('unwind', 11),\n",
+ " ('cuisine', 11),\n",
+ " ('step', 11),\n",
+ " ('front', 11),\n",
+ " ('entertainment', 11),\n",
+ " ('book', 11),\n",
+ " ('they', 11),\n",
+ " ('spaces', 11),\n",
+ " ('winning', 11),\n",
+ " ('help', 11),\n",
+ " ('get', 11),\n",
+ " ('what', 11),\n",
+ " ('view', 11),\n",
+ " ('floor', 11),\n",
+ " ('original', 11),\n",
+ " ('about', 11),\n",
+ " ('cruise', 11),\n",
+ " ('social', 11),\n",
+ " ('studio', 11),\n",
+ " ('feel', 11),\n",
+ " ('want', 11),\n",
+ " ('upscale', 11),\n",
+ " ('air', 11),\n",
+ " ('friends', 11),\n",
+ " ('large', 11),\n",
+ " ('10', 11),\n",
+ " ('mall', 11),\n",
+ " ('quiet', 11),\n",
+ " ('less', 11),\n",
+ " ('three', 11),\n",
+ " ('inch', 11),\n",
+ " ('trip', 11),\n",
+ " ('link', 11),\n",
+ " ('check', 11),\n",
+ " ('hospitality', 11),\n",
+ " ('back', 11),\n",
+ " ('cable', 11),\n",
+ " ('was', 11),\n",
+ " ('boeing', 11),\n",
+ " ('visiting', 11),\n",
+ " ('desk', 11),\n",
+ " ('glass', 10),\n",
+ " ('daily', 10),\n",
+ " ('seahawks', 10),\n",
+ " ('renovated', 10),\n",
+ " ('wine', 10),\n",
+ " ('cozy', 10),\n",
+ " ('charm', 10),\n",
+ " ('inviting', 10),\n",
+ " ('block', 10),\n",
+ " ('elliott', 10),\n",
+ " ('then', 10),\n",
+ " ('live', 10),\n",
+ " ('seasonal', 10),\n",
+ " ('bath', 10),\n",
+ " ('aquarium', 10),\n",
+ " ('water', 10),\n",
+ " ('swimming', 10),\n",
+ " ('corporate', 10),\n",
+ " ('microsoft', 10),\n",
+ " ('fully', 10),\n",
+ " ('history', 10),\n",
+ " ('ideal', 10),\n",
+ " ('furnishings', 10),\n",
+ " ('tvs', 10),\n",
+ " ('hours', 10),\n",
+ " ('campus', 10),\n",
+ " ('door', 10),\n",
+ " ('hospital', 10),\n",
+ " ('mason', 10),\n",
+ " ('staff', 9),\n",
+ " ('snacks', 9),\n",
+ " ('provides', 9),\n",
+ " ('wireless', 9),\n",
+ " ('mariners', 9),\n",
+ " ('range', 9),\n",
+ " ('host', 9),\n",
+ " ('love', 9),\n",
+ " ('appointed', 9),\n",
+ " ('spend', 9),\n",
+ " ('during', 9),\n",
+ " ('four', 9),\n",
+ " ('mountains', 9),\n",
+ " ('ride', 9),\n",
+ " ('bedding', 9),\n",
+ " ('fridge', 9),\n",
+ " ('landmark', 9),\n",
+ " ('20', 9),\n",
+ " ('use', 9),\n",
+ " ('vacation', 9),\n",
+ " ('choice', 9),\n",
+ " ('community', 9),\n",
+ " ('businesses', 9),\n",
+ " ('value', 9),\n",
+ " ('bedroom', 9),\n",
+ " ('windows', 9),\n",
+ " ('days', 9),\n",
+ " ('courtyard', 9),\n",
+ " ('catch', 9),\n",
+ " ('extra', 9),\n",
+ " ('red', 9),\n",
+ " ('road', 9),\n",
+ " ('virginia', 9),\n",
+ " ('mansion', 9),\n",
+ " ('eclectic', 8),\n",
+ " ('proximity', 8),\n",
+ " ('activities', 8),\n",
+ " ('natural', 8),\n",
+ " ('sheraton', 8),\n",
+ " ('doors', 8),\n",
+ " ('small', 8),\n",
+ " ('skyline', 8),\n",
+ " ('planning', 8),\n",
+ " ('catering', 8),\n",
+ " ('mind', 8),\n",
+ " ('boutique', 8),\n",
+ " ('warm', 8),\n",
+ " ('rooftop', 8),\n",
+ " ('rich', 8),\n",
+ " ('known', 8),\n",
+ " ('classic', 8),\n",
+ " ('know', 8),\n",
+ " ('delicious', 8),\n",
+ " ('venues', 8),\n",
+ " ('tub', 8),\n",
+ " ('yourself', 8),\n",
+ " ('onsite', 8),\n",
+ " ('pleasure', 8),\n",
+ " ('includes', 8),\n",
+ " ('evening', 8),\n",
+ " ('directly', 8),\n",
+ " ('alike', 8),\n",
+ " ('cancer', 8),\n",
+ " ('traveler', 8),\n",
+ " ('stadium', 8),\n",
+ " ('even', 8),\n",
+ " ('escape', 8),\n",
+ " ('king', 8),\n",
+ " ('refrigerator', 8),\n",
+ " ('premium', 8),\n",
+ " ('budget', 8),\n",
+ " ('facility', 8),\n",
+ " ('renton', 8),\n",
+ " ('alfred', 8),\n",
+ " ('silver', 8),\n",
+ " ('cloud', 8),\n",
+ " ('bars', 7),\n",
+ " ('reception', 7),\n",
+ " ('dinner', 7),\n",
+ " ('exciting', 7),\n",
+ " ('recently', 7),\n",
+ " ('favorite', 7),\n",
+ " ('inside', 7),\n",
+ " ('stunning', 7),\n",
+ " ('throughout', 7),\n",
+ " ('workout', 7),\n",
+ " ('library', 7),\n",
+ " ('gorgeous', 7),\n",
+ " ('wood', 7),\n",
+ " ('seating', 7),\n",
+ " ('among', 7),\n",
+ " ('car', 7),\n",
+ " ('arts', 7),\n",
+ " ('atmosphere', 7),\n",
+ " ('fit', 7),\n",
+ " ('always', 7),\n",
+ " ('ve', 7),\n",
+ " ('locally', 7),\n",
+ " ('floors', 7),\n",
+ " ('mini', 7),\n",
+ " ('stays', 7),\n",
+ " ('maker', 7),\n",
+ " ('old', 7),\n",
+ " ('cultural', 7),\n",
+ " ('sofa', 7),\n",
+ " ('hdtv', 7),\n",
+ " ('game', 7),\n",
+ " ('transportation', 7),\n",
+ " ('easily', 7),\n",
+ " ('open', 7),\n",
+ " ('executive', 7),\n",
+ " ('apartment', 7),\n",
+ " ('kitchens', 7),\n",
+ " ('been', 7),\n",
+ " ('non', 7),\n",
+ " ('bathrooms', 7),\n",
+ " ('table', 7),\n",
+ " ('hostel', 7),\n",
+ " ('roof', 7),\n",
+ " ('ballard', 7),\n",
+ " ('proud', 7),\n",
+ " ('station', 7),\n",
+ " ('services', 7),\n",
+ " ('european', 7),\n",
+ " ('southcenter', 7),\n",
+ " ('people', 7),\n",
+ " ('100', 7),\n",
+ " ('needs', 7),\n",
+ " ('looking', 7),\n",
+ " ('don', 7),\n",
+ " ('15', 7),\n",
+ " ('shoreline', 7),\n",
+ " ('much', 7),\n",
+ " ('broadway', 7),\n",
+ " ('perfectly', 6),\n",
+ " ('companies', 6),\n",
+ " ('google', 6),\n",
+ " ('conference', 6),\n",
+ " ('cocktail', 6),\n",
+ " ('drinks', 6),\n",
+ " ('seven', 6),\n",
+ " ('collection', 6),\n",
+ " ('sleep', 6),\n",
+ " ('savor', 6),\n",
+ " ('happy', 6),\n",
+ " ('public', 6),\n",
+ " ('nordstrom', 6),\n",
+ " ('terminal', 6),\n",
+ " ('spectacular', 6),\n",
+ " ('custom', 6),\n",
+ " ('simple', 6),\n",
+ " ('fireplace', 6),\n",
+ " ('wide', 6),\n",
+ " ('culture', 6),\n",
+ " ('patio', 6),\n",
+ " ('join', 6),\n",
+ " ('setting', 6),\n",
+ " ('wake', 6),\n",
+ " ('cup', 6),\n",
+ " ('hiking', 6),\n",
+ " ('scene', 6),\n",
+ " ('getaway', 6),\n",
+ " ('theater', 6),\n",
+ " ('landmarks', 6),\n",
+ " ('down', 6),\n",
+ " ('lodging', 6),\n",
+ " ('separate', 6),\n",
+ " ('plan', 6),\n",
+ " ('hub', 6),\n",
+ " ('retreat', 6),\n",
+ " ('museums', 6),\n",
+ " ('thriving', 6),\n",
+ " ('size', 6),\n",
+ " ('kitchenettes', 6),\n",
+ " ('dine', 6),\n",
+ " ('hand', 6),\n",
+ " ('team', 6),\n",
+ " ('corner', 6),\n",
+ " ('do', 6),\n",
+ " ('ready', 6),\n",
+ " ('centers', 6),\n",
+ " ('own', 6),\n",
+ " ('experiences', 6),\n",
+ " ('complete', 6),\n",
+ " ('brings', 6),\n",
+ " ('furnished', 6),\n",
+ " ('ideally', 6),\n",
+ " ('kind', 6),\n",
+ " ('creative', 6),\n",
+ " ('discover', 6),\n",
+ " ('television', 6),\n",
+ " ('bus', 6),\n",
+ " ('homewood', 6),\n",
+ " ('nightlife', 6),\n",
+ " ('regency', 6),\n",
+ " ('level', 6),\n",
+ " ('go', 6),\n",
+ " ('flexible', 6),\n",
+ " ('grill', 6),\n",
+ " ('key', 6),\n",
+ " ('mt', 6),\n",
+ " ('rainier', 6),\n",
+ " ('buffet', 6),\n",
+ " ('zoo', 6),\n",
+ " ('matter', 6),\n",
+ " ('headquarters', 6),\n",
+ " ('play', 6),\n",
+ " ('shower', 6),\n",
+ " ('westfield', 6),\n",
+ " ('quick', 6),\n",
+ " ('smoking', 6),\n",
+ " ('channels', 6),\n",
+ " ('rate', 6),\n",
+ " ('tea', 6),\n",
+ " ('enjoyable', 6),\n",
+ " ('research', 6),\n",
+ " ('busy', 6),\n",
+ " ('fred', 6),\n",
+ " ('hutchinson', 6),\n",
+ " ('very', 6),\n",
+ " ('apartments', 6),\n",
+ " ('resort', 6),\n",
+ " ('gates', 5),\n",
+ " ('locals', 5),\n",
+ " ('american', 5),\n",
+ " ('core', 5),\n",
+ " ('diverse', 5),\n",
+ " ('america', 5),\n",
+ " ('expanded', 5),\n",
+ " ('enjoying', 5),\n",
+ " ('beverage', 5),\n",
+ " ('several', 5),\n",
+ " ('crowne', 5),\n",
+ " ('exceptional', 5),\n",
+ " ('cool', 5),\n",
+ " ('drink', 5),\n",
+ " ('better', 5),\n",
+ " ('might', 5),\n",
+ " ('half', 5),\n",
+ " ('must', 5),\n",
+ " ('houses', 5),\n",
+ " ('boutiques', 5),\n",
+ " ('pier', 5),\n",
+ " ('stadiums', 5),\n",
+ " ('mountain', 5),\n",
+ " ('newly', 5),\n",
+ " ('sleek', 5),\n",
+ " ('welcoming', 5),\n",
+ " ('elegant', 5),\n",
+ " ('residence', 5),\n",
+ " ('puts', 5),\n",
+ " ('tour', 5),\n",
+ " ('via', 5),\n",
+ " ('destination', 5),\n",
+ " ('culinary', 5),\n",
+ " ('cocktails', 5),\n",
+ " ('between', 5),\n",
+ " ('monorail', 5),\n",
+ " ('warwick', 5),\n",
+ " ('way', 5),\n",
+ " ('possible', 5),\n",
+ " ('head', 5),\n",
+ " ('craft', 5),\n",
+ " ('put', 5),\n",
+ " ('end', 5),\n",
+ " ('kimpton', 5),\n",
+ " ('enough', 5),\n",
+ " ('who', 5),\n",
+ " ('dedicated', 5),\n",
+ " ('beer', 5),\n",
+ " ('design', 5),\n",
+ " ('could', 5),\n",
+ " ('marriott', 5),\n",
+ " ('westlake', 5),\n",
+ " ('refrigerators', 5),\n",
+ " ('desks', 5),\n",
+ " ('found', 5),\n",
+ " ('ship', 5),\n",
+ " ('terminals', 5),\n",
+ " ('innovative', 5),\n",
+ " ('makes', 5),\n",
+ " ('chic', 5),\n",
+ " ('beauty', 5),\n",
+ " ('system', 5),\n",
+ " ('professional', 5),\n",
+ " ('facebook', 5),\n",
+ " ('sleeping', 5),\n",
+ " ('deck', 5),\n",
+ " ('made', 5),\n",
+ " ('hall', 5),\n",
+ " ('30', 5),\n",
+ " ('connected', 5),\n",
+ " ('customer', 5),\n",
+ " ('panel', 5),\n",
+ " ('televisions', 5),\n",
+ " ('wines', 5),\n",
+ " ('course', 5),\n",
+ " ('too', 5),\n",
+ " ('list', 5),\n",
+ " ('ensure', 5),\n",
+ " ('holiday', 5),\n",
+ " ('either', 5),\n",
+ " ('districts', 5),\n",
+ " ('fremont', 5),\n",
+ " ('choose', 5),\n",
+ " ('soft', 5),\n",
+ " ('lifestyle', 5),\n",
+ " ('adventure', 5),\n",
+ " ('adventures', 5),\n",
+ " ('simply', 5),\n",
+ " ('these', 5),\n",
+ " ('beverages', 5),\n",
+ " ('forget', 5),\n",
+ " ('store', 5),\n",
+ " ('intimate', 5),\n",
+ " ('receptions', 5),\n",
+ " ('details', 5),\n",
+ " ('century', 5),\n",
+ " ('cities', 5),\n",
+ " ('oasis', 5),\n",
+ " ('communal', 5),\n",
+ " ('express', 5),\n",
+ " ('residential', 5),\n",
+ " ('runs', 5),\n",
+ " ('woodland', 5),\n",
+ " ('personal', 5),\n",
+ " ('yet', 5),\n",
+ " ('bellevue', 5),\n",
+ " ('interior', 5),\n",
+ " ('healthy', 5),\n",
+ " ('above', 5),\n",
+ " ('stress', 5),\n",
+ " ('term', 5),\n",
+ " ('accessible', 5),\n",
+ " ('run', 5),\n",
+ " ('lodge', 5),\n",
+ " ('filled', 5),\n",
+ " ('boardroom', 5),\n",
+ " ('green', 5),\n",
+ " ('self', 5),\n",
+ " ('units', 5),\n",
+ " ('bustling', 5),\n",
+ " ('college', 5),\n",
+ " ('northgate', 5),\n",
+ " ('round', 5),\n",
+ " ('hit', 5),\n",
+ " ('won', 5),\n",
+ " ('ten', 5),\n",
+ " ('pride', 5),\n",
+ " ('landing', 5),\n",
+ " ('linens', 5),\n",
+ " ('ourselves', 5),\n",
+ " ('bill', 4),\n",
+ " ('melinda', 4),\n",
+ " ('foundation', 4),\n",
+ " ('visitors', 4),\n",
+ " ('majestic', 4),\n",
+ " ('sq', 4),\n",
+ " ('ft', 4),\n",
+ " ('technology', 4),\n",
+ " ('wedding', 4),\n",
+ " ('success', 4),\n",
+ " ('refresh', 4),\n",
+ " ('latest', 4),\n",
+ " ('tastefully', 4),\n",
+ " ('decorated', 4),\n",
+ " ('productive', 4),\n",
+ " ('lunch', 4),\n",
+ " ('gateway', 4),\n",
+ " ('plaza', 4),\n",
+ " ('touches', 4),\n",
+ " ('set', 4),\n",
+ " ('lights', 4),\n",
+ " ('addition', 4),\n",
+ " ('rest', 4),\n",
+ " ('5th', 4),\n",
+ " ('sounders', 4),\n",
+ " ('curated', 4),\n",
+ " ('paramount', 4),\n",
+ " ('luxurious', 4),\n",
+ " ('comfy', 4),\n",
+ " ('good', 4),\n",
+ " ('why', 4),\n",
+ " ('sightseeing', 4),\n",
+ " ('leading', 4),\n",
+ " ('refreshing', 4),\n",
+ " ('covered', 4),\n",
+ " ('surrounded', 4),\n",
+ " ('trendy', 4),\n",
+ " ('side', 4),\n",
+ " ('let', 4),\n",
+ " ('show', 4),\n",
+ " ('memorable', 4),\n",
+ " ('fare', 4),\n",
+ " ('got', 4),\n",
+ " ('diamond', 4),\n",
+ " ('scenic', 4),\n",
+ " ('never', 4),\n",
+ " ('tell', 4),\n",
+ " ('without', 4),\n",
+ " ('retail', 4),\n",
+ " ('adjacent', 4),\n",
+ " ('harbor', 4),\n",
+ " ('port', 4),\n",
+ " ('sculpture', 4),\n",
+ " ('look', 4),\n",
+ " ('breathtaking', 4),\n",
+ " ('sites', 4),\n",
+ " ('goal', 4),\n",
+ " ('additional', 4),\n",
+ " ('elegance', 4),\n",
+ " ('named', 4),\n",
+ " ('sophisticated', 4),\n",
+ " ('afternoon', 4),\n",
+ " ('metro', 4),\n",
+ " ('500', 4),\n",
+ " ('nestled', 4),\n",
+ " ('health', 4),\n",
+ " ('55', 4),\n",
+ " ('mobile', 4),\n",
+ " ('settle', 4),\n",
+ " ('those', 4),\n",
+ " ('getting', 4),\n",
+ " ('things', 4),\n",
+ " ('science', 4),\n",
+ " ('42', 4),\n",
+ " ('keep', 4),\n",
+ " ('watch', 4),\n",
+ " ('coast', 4),\n",
+ " ('bright', 4),\n",
+ " ('plug', 4),\n",
+ " ('selection', 4),\n",
+ " ('stand', 4),\n",
+ " ('unparalleled', 4),\n",
+ " ('expansive', 4),\n",
+ " ('energy', 4),\n",
+ " ('ours', 4),\n",
+ " ('4th', 4),\n",
+ " ('would', 4),\n",
+ " ('board', 4),\n",
+ " ('active', 4),\n",
+ " ('kick', 4),\n",
+ " ('fuel', 4),\n",
+ " ('western', 4),\n",
+ " ('romantic', 4),\n",
+ " ('clean', 4),\n",
+ " ('same', 4),\n",
+ " ('upgrade', 4),\n",
+ " ('monday', 4),\n",
+ " ('something', 4),\n",
+ " ('casual', 4),\n",
+ " ('marble', 4),\n",
+ " ('12', 4),\n",
+ " ('movie', 4),\n",
+ " ('arena', 4),\n",
+ " ('served', 4),\n",
+ " ('accommodate', 4),\n",
+ " ('week', 4),\n",
+ " ('owned', 4),\n",
+ " ('mediterranean', 4),\n",
+ " ('delivers', 4),\n",
+ " ('hip', 4),\n",
+ " ('leave', 4),\n",
+ " ('products', 4),\n",
+ " ('twin', 4),\n",
+ " ('early', 4),\n",
+ " ('train', 4),\n",
+ " ('exercise', 4),\n",
+ " ('freshly', 4),\n",
+ " ('menu', 4),\n",
+ " ('function', 4),\n",
+ " ('deluxe', 4),\n",
+ " ('redmond', 4),\n",
+ " ('snack', 4),\n",
+ " ('seafood', 4),\n",
+ " ('dip', 4),\n",
+ " ('overnight', 4),\n",
+ " ('greet', 4),\n",
+ " ('relaxed', 4),\n",
+ " ('exploring', 4),\n",
+ " ('shop', 4),\n",
+ " ('affordable', 4),\n",
+ " ('give', 4),\n",
+ " ('rewards', 4),\n",
+ " ('program', 4),\n",
+ " ('making', 4),\n",
+ " ('pillow', 4),\n",
+ " ('authentic', 4),\n",
+ " ('inspiring', 4),\n",
+ " ('how', 4),\n",
+ " ('smoke', 4),\n",
+ " ('405', 4),\n",
+ " ('hampton', 4),\n",
+ " ('kids', 4),\n",
+ " ('everyone', 4),\n",
+ " ('meals', 4),\n",
+ " ('99', 4),\n",
+ " ('11', 4),\n",
+ " ('accommodation', 4),\n",
+ " ('training', 4),\n",
+ " ('beach', 4),\n",
+ " ('points', 4),\n",
+ " ('tours', 4),\n",
+ " ('econo', 4),\n",
+ " ('microwaves', 4),\n",
+ " ('outlets', 4),\n",
+ " ('little', 4),\n",
+ " ('conditioned', 4),\n",
+ " ('americas', 4),\n",
+ " ('money', 4),\n",
+ " ('historical', 4),\n",
+ " ('year', 4),\n",
+ " ('attention', 4),\n",
+ " ('jimmy', 4),\n",
+ " ('national', 4),\n",
+ " ('theodore', 4),\n",
+ " ('warmth', 4),\n",
+ " ('maxwell', 4),\n",
+ " ('creature', 4),\n",
+ " ('definition', 4),\n",
+ " ('bacon', 4),\n",
+ " ('georgetown', 4),\n",
+ " ('grove', 4),\n",
+ " ('watertown', 4),\n",
+ " ('gaslight', 4),\n",
+ " ('oak', 4),\n",
+ " ('southport', 4),\n",
+ " ('allows', 3),\n",
+ " ('kayaking', 3),\n",
+ " ('versatile', 3),\n",
+ " ('guarantee', 3),\n",
+ " ('equipment', 3),\n",
+ " ('chihuly', 3),\n",
+ " ('earn', 3),\n",
+ " ('chef', 3),\n",
+ " ('blend', 3),\n",
+ " ('soon', 3),\n",
+ " ('concierge', 3),\n",
+ " ('fantastic', 3),\n",
+ " ('amid', 3),\n",
+ " ('incredible', 3),\n",
+ " ('haven', 3),\n",
+ " ('prime', 3),\n",
+ " ('crafted', 3),\n",
+ " ('celebrated', 3),\n",
+ " ('unforgettable', 3),\n",
+ " ('reflect', 3),\n",
+ " ('hosts', 3),\n",
+ " ('company', 3),\n",
+ " ('underground', 3),\n",
+ " ('everywhere', 3),\n",
+ " ('landscape', 3),\n",
+ " ('region', 3),\n",
+ " ('bounty', 3),\n",
+ " ('truly', 3),\n",
+ " ('pleasant', 3),\n",
+ " ('breeze', 3),\n",
+ " ('deep', 3),\n",
+ " ('waters', 3),\n",
+ " ('straight', 3),\n",
+ " ('breweries', 3),\n",
+ " ('mount', 3),\n",
+ " ('soak', 3),\n",
+ " ('before', 3),\n",
+ " ('heading', 3),\n",
+ " ('destinations', 3),\n",
+ " ('ever', 3),\n",
+ " ('reading', 3),\n",
+ " ('couldn', 3),\n",
+ " ('alexis', 3),\n",
+ " ('matched', 3),\n",
+ " ('attentive', 3),\n",
+ " ('interesting', 3),\n",
+ " ('giving', 3),\n",
+ " ('edge', 3),\n",
+ " ('indulge', 3),\n",
+ " ('loft', 3),\n",
+ " ('ceilings', 3),\n",
+ " ('them', 3),\n",
+ " ('elements', 3),\n",
+ " ('still', 3),\n",
+ " ('sit', 3),\n",
+ " ('outfitted', 3),\n",
+ " ('balconies', 3),\n",
+ " ('perks', 3),\n",
+ " ('gym', 3),\n",
+ " ('sunset', 3),\n",
+ " ('rock', 3),\n",
+ " ('fireplaces', 3),\n",
+ " ('kitchenette', 3),\n",
+ " ('bistro', 3),\n",
+ " ('fairmont', 3),\n",
+ " ('acclaimed', 3),\n",
+ " ('news', 3),\n",
+ " ('report', 3),\n",
+ " ('2018', 3),\n",
+ " ('shows', 3),\n",
+ " ('outdoors', 3),\n",
+ " ('fans', 3),\n",
+ " ('athletic', 3),\n",
+ " ('begin', 3),\n",
+ " ...]"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True) # sort by count, descending\n",
+ "words_freq"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The most frequent word here, `the`, carries little useful information. How should we refine the counts in the steps that follow?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
+ "    # Return the n most frequent words in the corpus (all words if n is None)\n",
+ "    vec = CountVectorizer().fit(corpus) # learn the vocabulary\n",
+ "    bag_of_words = vec.transform(corpus) # convert the text to a document-term count matrix\n",
+ "    sum_words = bag_of_words.sum(axis=0) # total occurrences of each word\n",
+ "    words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each word with its count\n",
+ "    words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True) # sort by count, descending\n",
+ "    return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[('the', 1258),\n",
+ " ('and', 1062),\n",
+ " ('of', 536),\n",
+ " ('seattle', 533),\n",
+ " ('to', 471),\n",
+ " ('in', 449),\n",
+ " ('our', 359),\n",
+ " ('you', 304),\n",
+ " ('hotel', 295),\n",
+ " ('with', 280),\n",
+ " ('is', 271),\n",
+ " ('at', 231),\n",
+ " ('from', 224),\n",
+ " ('for', 216),\n",
+ " ('your', 186),\n",
+ " ('or', 161),\n",
+ " ('center', 151),\n",
+ " ('are', 136),\n",
+ " ('downtown', 133),\n",
+ " ('on', 129)]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "common_words"
+ ]
+ },
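+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The top-20 list above is dominated by function words such as `the`, `and`, and `of`, which say little about any particular hotel. One common remedy, shown here as a minimal sketch and not necessarily the approach this notebook takes later, is to pass `stop_words='english'` to `CountVectorizer`. The names `toy_corpus` and `top_words` are hypothetical and used only for illustration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "\n",
+ "# Hypothetical toy corpus (an assumption for illustration, not the hotel data)\n",
+ "toy_corpus = [\n",
+ "    'The hotel is in the heart of the city',\n",
+ "    'The inn is near the airport and the lake',\n",
+ "]\n",
+ "\n",
+ "def top_words(corpus, n=3, stop_words=None):\n",
+ "    # Same counting logic as get_top_n_words, with an optional stop-word list\n",
+ "    vec = CountVectorizer(stop_words=stop_words).fit(corpus)\n",
+ "    counts = vec.transform(corpus).sum(axis=0)\n",
+ "    freq = [(word, int(counts[0, idx])) for word, idx in vec.vocabulary_.items()]\n",
+ "    return sorted(freq, key=lambda x: x[1], reverse=True)[:n]\n",
+ "\n",
+ "print(top_words(toy_corpus))                        # raw counts: 'the' ranks first\n",
+ "print(top_words(toy_corpus, stop_words='english'))  # built-in English stop words removed"
+ ]
+ },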
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
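+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>desc</th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>the</td>\n",
+ "      <td>1258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>and</td>\n",
+ "      <td>1062</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>of</td>\n",
+ "      <td>536</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>seattle</td>\n",
+ "      <td>533</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>to</td>\n",
+ "      <td>471</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"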
+ ],
+ "text/plain": [
+ " desc count\n",
+ "0 the 1258\n",
+ "1 and 1062\n",
+ "2 of 536\n",
+ "3 seattle 533\n",
+ "4 to 471"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "df_common_words.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAaUAAAEWCAYAAADGjIh1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAgAElEQVR4nO3deZgdVbX38e/PEEJCJoaoYWzDG8QQJTHNlUjCDYIIqAhXFDAKATWPom9Arkziq9yLURQuIk4YEEEERJlEFCIXSYLMHcgIRKYwyWAYAiEYIKz3j9pNTprTU9Knq86p3+d5+uk6VfvUWbuLnMWu2rVKEYGZmVkRvC3vAMzMzFo5KZmZWWE4KZmZWWE4KZmZWWE4KZmZWWE4KZmZWWE4KZmZWWE4KZkVjKSlkvas0b4/Kulvkl6Q9JSkcyQNqtjeT9J5kl5M24+pRRxm7XFSMiuXIcB3gC2A9wBbAadVbD8ZGAlsC+wOHCdp716O0UrMScmsQCRdCGwD/FHSCknHpfX7SVqcRjizJL2n4j1LJZ0o6R5Jz0v6laSNqu0/Ii6OiOsiYmVEPA+cA+xa0eRQ4JSIeD4i7k3bp9Sou2Zv4aRkViAR8TngUeDjETEwIn4gaXvgEuBoYBjwZ7KktWHFWycDHwG2A7YHvtnFj9wNWAwgaROyEdT8iu3zgR3XvUdm3eOkZFZ8BwF/iojrI+I14HSgP/DBijY/iYjHIuI5YDpwSGc7lfRh4DDgW2nVwPR7eUWz5cAgzHqJk5JZ8W0BPNL6IiLeAB4Dtqxo81jF8iPpPe2StAtwMXBgRPw9rV6Rfg+uaDoYeGndwjbrPicls+JpW7r/H2QTDwCQJGBr4ImKNltXLG+T3lOVpLHA1cAREXHDmx+aXWN6EtipovlOpNN7Zr3BScmseJ4GRlS8/h3wUUl7SOoL/CewCrilos1XJG0laVPgG8Cl1XYsaTRwHfB/I+KPVZr8GvimpE0k7QB8ETh/fTtk1lVOSmbF8z2yxPCCpK9HxBLgs8CPgWXAx8kmQrxa8Z6Lgb8AD6Wf77Sz7/8kmyzxyzS7b4WkypHQt4EHyU4BzgZOi4jrerBvZh2SH/JnVt8kLQW+EBH/m3csZuvLIyUzMysMJyUzMysMn74zM7PC8EjJzMwKY4O8A6hnm2++eTQ1NeUdhplZXZk7d+6yiBhWbZuT0npoamqipaUl7zDMzOqKpEfa2+bTd2ZmVhhOSmZmVhhOSmZmVhhOSmZmVhhOSmZmVhhOSmZmVhhOSmZmVhhOSmZmVhi+eXY9LHxiOU0n/CnXGJae+tFcP9/MrCd5pGRmZoXhpGRmZoVR2qQk6RhJi9LP0ZKaJN0r6RxJiyX9RVL/vOM0MyuTUiYlSeOAw4EPALsAXwQ2AUYCP42IHYEXgE9Wee9USS2SWlavXN6LUZuZNb5SJiVgAnBlRLwcESuAK4CJwMMRMS+1mQs0tX1jRMyIiOaIaO4zYEivBWxmVgZlTUpqZ/2qiuXVeHaimVmvKmtSmgPsL2mApI2BA4Cbco7JzKz0SjkSiIi7JJ0P3JFWnQs83939vHfLIbT4PiEzsx5TyqQEEBFnAGe0WT26YvvpvRuRmZmVNin1hCJUdABXdTCzxtFr15QknSzp6zXYb5Okz/T0fs3MrPc1wkSHJsBJycysAdQ0KUk6SdISSf8LvDutGyPpNkkLJF0paRNJb5c0N23fSVJI2ia9fjDNkjtf0lmSbpH0kKQD08ecCkyUNE/S1yRtJOlXkhZKulvS7mk/f5b0vrR8t6RvpeVTJH1B0iRJsyRdJuk+SRdJam/quJmZ1UDNklKqmnAwMBb4D2DntOnXwPER8T5gIfDtiHgG2EjSYLKbWFvIEs22wDMRsTK9dzjZja8fI0tGACcAN0XEmIj4IfAVgIh4L3AIcIGkjcimgU9Mn/E6sGt6/wTWTAcfCxwNjAJGVLSp7JcrOpiZ
1UgtR0oTyaomrIyIF4GrgY2BoRExO7W5ANgtLd9ClgR2A76bfk9k7fuHroqINyLiHuAd7XzuBOBCgIi4D3gE2D7tZ7e0/U/AQEkDgKaIWJLee0dEPB4RbwDzcEUHM7NeVevZd9GNtjeRJaFtgT8Ax6f3X1PRprLiQnun1tpbfyfQDDwEXA9sTlbzbm47+3dFBzOzXlbLkdIc4ABJ/SUNAj4OvAw8L2liavM5YHZF+88C96eRynPAvsDNnXzOS8CgNp87GUDS9sA2wJKIeBV4DPg0cBtZEvw6ruRgZlYYNRsJpKoJl5KdBnuENV/+hwFnp1NnD5FV6yYilqZ5BXNSu78BW0VEZ5UWFgCvS5oPnA/8LO1/Idm1oykR0ToCugnYIyJWSroJ2Ir1SEqu6GBm1rMU0Z0zbFapubk5Wlpa8g7DzKyuSJobEc3VtvmayXooSkUHcFUHM2sMjXDzrJmZNQgnpURSn7xjMDMru9IkJUlXSZorabGkqWndCkn/Lel2YLykcZJmp3YzJQ3POWwzs1Ip0zWlIyLiOUn9gTslXU52M++iiPiWpL5k09M/ERH/lHQQMB04onInKaFNBegzeFjv9sDMrMGVKSlNk3RAWt4aGEl2g+zlad27yZ6ndH2amt4HeLLtTiJiBjADoN/wkZ66aGbWg0qRlCRNAvYExqd7lGYBGwH/iojVrc2AxRExPp8ozcysLNeUhgDPp4S0A7BLlTZLgGGSxgNI6itpx94M0sys7EoxUgKuA74kaQFZ8rmtbYOIeDU9DuMsSUPI/jZnAovb26krOpiZ9axSJKVUZmifKpsGtmk3jzVVy83MrJeVIinVSpEqOrTlCg9mVo/Kck0JSUMlHZl3HGZm1r7SJCVgKNCtpKRMmf5GZma5qpsvXEmHSlogab6kCyUNk3S5pDvTz66p3cmSzpM0S9JDkqalXZwKbCdpnqTTUttj03sXSPqvtK5J0r2SfgbcRXZPk5mZ9YK6uKaUpmafBOwaEcskbQr8BPhhRPxN0jbATOA96S07ALuTPfxviaSfAycAoyNiTNrnXmQ30P4b2T1KV0vaDXiU7EbawyPiLSMrV3QwM6udukhKwIeAyyJiGUAqF7QnMCpVXwAYnJ5wC/CnNONulaRngHdU2ede6efu9HogWZJ6FHgkIt4ybTx9tis6mJnVSL0kJQFtE8DbyCo0vLJWwyxJrapYtZrq/RTwvYj4RZv3N5E9tt3MzHpZvVxTugH4tKTNANLpu78AX21tIGlMJ/t4iex0XquZwBGSBqb3bynp7T0atZmZdUtdjJQiYrGk6cBsSavJTrlNA36aqjRsAMwBvtTBPp6VdLOkRcC1EXGspPcAt6bR1Qrgs2Qjqy5xRQczs56lCF8WWVfNzc3R0tKSdxhmZnVF0tyIaK62rS5GSkVV5IoOrVzZwczqSb1cUzIzsxJwUuqAKzqYmfWu0n/hSjpG0qL0c7QrOpiZ5afU15QkjQMOBz5Adt/S7cBsXNHBzCwXZR8pTQCujIiXI2IFcAUwkU4qOkREc0Q09xkwpDdjNTNreGVPSmpnvSs6mJnloOxJaQ6wv6QBkjYGDgBuyjkmM7PSKvU1pYi4S9L5wB1p1bnA8119vys6mJn1rFInJYCIOAM4o83q0XnEYmZWdqVPSuujHio6dMTVHsysaMp+TcnMzArESakDkvrkHYOZWZk0TFKSdIqkoypeT5d0lKTTUrWGhZIOStsmSbqmou1PJE1Jy0slfUvS34BP9XY/zMzKrGGSEvBL4DCAVK/uYOBxYAywE7AncJqk4V3Y178iYkJE/LbtBklTJbVIalm9cnnPRW9mZo2TlCJiKfCspLHAXmQPApwAXBIRqyPiabISQjt3YXeXdvA5ruhgZlYjjTb77lxgCvBO4Dyy5FTN66ydkDdqs90VHczMctAwI6XkSmBvstHQTLKKDQdJ6iNpGLAb2Y2yjwCjJPWTNATYI6+AzcxsjYYaKUXEq5JuBF6IiNWS
rgTGA/OBAI6LiKcAJP0OWADcT3aqr9tc0cHMrGcpIvKOocekCQ53AZ+KiPtr/XnNzc3R0tJS648xM2sokuZGRHO1bQ0zUpI0CriG7FEUNU9IUP8VHTriag9mloeGSUoRcQ8worN2kqYBXwbuiojJNQ/MzMy6rGGSUjccCewTEQ931lDSBhHxei/EZGZmlCwpSTqbbDR1dXpkxcT0eiUwNSIWSDoZ2AJoApYBn8klWDOzEmq0KeEdiogvAf8AdidLOndHxPuAbwC/rmg6DvhERLwlIbmig5lZ7ZQqKbUxAbgQICL+CmyW7lkCuDoiXqn2Jld0MDOrnTInJVVZ1zo/3hUdzMxyUOakNAeYDFnVcGBZRLyYa0RmZiVXqokObZwM/ErSArKJDod1dweu6GBm1rNKl5Qioqni5SeqbD+514IxM7O1lC4p9aRGrujQypUdzKw3Ncw1JUnTJN0r6aK8YzEzs3XTSCOlt1RqcEUGM7P60hBJqU2lhm3InhzbBCyTdATwc6CZ7OF+x0TEjZKmAPsDfYDRwP8AGwKfA1YB+0bEc73cFTOzUmuI03dtKjX8kLUrMnwltXkvcAhwgaTWJ82OJisj9G/AdGBlRIwFbgUOrfZZruhgZlY7DZGUqqisyFBZueE+sqfObp+23RgRL0XEP4HlwB/T+oVkI623cEUHM7PaadSkVFmRoVrlhlarKpbfqHj9Bg1yatPMrJ40alKqVFm5YXtgG2BJrhGZmVlVZRgN/Aw4W9JCsokOUyJildTRAKprXNHBzKxnKSI6b2VVNTc3R0tLS95hmJnVFUlzI6K52rYyjJRqpgwVHVq5soOZ9YYyXFPqNknfyDsGM7MyclKqzknJzCwHpT99J+kqYGtgI+BHZJUh+kuaByyOiMl5xmdmVialT0rAERHxnKT+wJ3AvwNfjYgx1RpLmgpMBegzeFjvRWlmVgI+fQfTJM0HbiMbMY3sqLErOpiZ1U6pR0rpMeh7AuMjYqWkWWSn8czMLAddGilJ2ljS29Ly9pL2k9S3tqH1iiHA8ykh7QDskta/1iD9MzOrK126eVbSXGAisAnZaa4WsoradT0JQFI/4CpgS7LSQ8OAk4F9gP2Auzrqo2+eNTPrvp64eVZpNPF54McR8QNJd/dciPmIiFVkCaitWcDxvRuNmZl1OSlJGk9W2PTz3XxvwypTRYeucuUHM1sfXZ19dzRwInBlRCyWNAK4sXZh9S5Jt+Qdg5mZdXG0ExGzgdkVrx8CptUqqN4WER/MOwYzM+skKUn6I9DuTIiI2K/HI8qBpBURMVDScOBSYDDZ3+bLEXFTvtGZmZVHZyOl09Pv/wDeCfwmvT4EWFqjmPL0GWBmREyX1AcY0LaBKzqYmdVOh0kpnbZD0ikRsVvFpj9KmlPTyPJxJ3BeukfpqoiY17ZBRMwAZgD0Gz7SD6MyM+tBXZ3oMCxNbgBA0rvI7ulpKBExB9gNeAK4UNKhOYdkZlYqXZ3W/TVglqSH0usm0imsRiJpW+CJiDhH0sbA+4Ff5xyWmVlpdJqUUnmhF8kKle6QVt+XbjxtNJOAYyW9BqwAOhwpvXfLIbT4vhwzsx7TaVKKiDck/U9EjAfm90JMvS4iBqbfFwAX5ByOmVlpdfX03V8kfRK4IrpSLK8kXNGhfa7sYGbroqsTHY4Bfg+8KulFSS9JerGGcdWMpD9LGpp+jqxYP0nSNXnGZmZWdl1KShExKCLeFhF9I2Jwej241sHVQkTsGxEvAEOBIztrb2ZmvafLT55Nz1A6Pf18rJZBrQ9Jx0malpZ/KOmvaXkPSb+RtFTS5sCpwHaS5kk6Lb19oKTLJN0n6SJJyqkbZmal1NWH/J0KHAXck36OSuuKaA7Zs58AmskSTV9gAlBZMugE4MGIGBMRx6Z1Y8mKz44CRgC7tt25pKmSWiS1rF65vFZ9MDMrpa6OlPYFPhwR50XEecDeaV0RzQXGSRoErAJuJUtOE1k7KVVzR0Q8HhFvAPPI
7sdaS0TMiIjmiGjuM2BIz0ZuZlZyXT59R3YNplVhv40j4jWyunyHA7eQJaLdge2Aezt5e+W9V6vxM6PMzHpVV790vwvcJWkWILJSPCfWKqgeMAf4OnAEsBA4A5gbEVFxmeglYFA+4ZmZWTVdTUofBc4DngceBY6PiKdqFtX6uwk4Cbg1Il6W9C/anLqLiGcl3SxpEXAt0O0bjlzRwcysZ6kr98JK+hDZRIGJZBMA5gFzIuJHtQ2v2Jqbm6OlpSXvMMzM6oqkuRHRXHVbVws0pOcL7Ux2feZLwCsRsUPH72ps/YaPjOGHnZl3GHXBFR7MrFVHSamrU8JvAG4GDgKWADsXJSFJakqn4LrafoqkLbrQ7nxJB65fdGZm1h1dnX23AHgVGA28DxgtqX/NoqqtKUCnScnMzHpfV8sMfS09efYA4FngV8ALtQysm/pIOkfSYkl/kdRf0hhJt0laIOlKSZukkU8zcFGq5NBf0jhJsyXNlTRT0vC8O2NmVlZdPX33VUmXkk1w2J9sJt4+tQysm0YCP42IHcmS5SfJHs53fES8j2xa+Lcj4jKgBZgcEWOA14EfAwdGxDiyfk3v6INc0cHMrHa6OiW8P2vu9Xm9hvGsq4cjYl5ankt2o+zQiJid1l1AVuW8rXeTnZK8Pt2/1Ad4sqMPiogZwAzIJjqsf+hmZtaqS0kpIk7rvFWu2lZiGNpewzYELE4PMDQzs5x1p8xQPVkOPC+ptTDr54DWUVNlJYclwDBJ4wEk9ZW0Y69GamZmb2rk2m6HAWdLGgA8RFYLD+D8tP4VYDxwIHCWpCFkf48zgcVd+QBXdDAz61ldvnnW3soVHczMuq+jm2cbeaRUcwufWE7TCd0umVd6ru5gZu1p1GtKZmZWh5yUzMysMBo2KUk6RdJRFa+nSzpK0mmSFklaKOmgtG2SpGsq2v5E0pQcwjYzK7WGTUrAL8lm4CHpbcDBwOPAGGAnYE/gtO6WFXJFBzOz2mnYpBQRS4FnJY0F9gLuJnsm1CURsToinia7d2nnbu53RkQ0R0RznwGFfSq8mVldavTZd+eSVQV/J1ldu73aafc6ayfojWoblpmZVdOwI6XkSmBvstHQTGAOcJCkPpKGAbsBdwCPAKMk9Us30e6RV8BmZmXW0COliHhV0o3ACxGxWtKVZFUc5gMBHBcRTwFI+h3Zc6PuJzvV1ylXdDAz61kNXdEhTXC4C/hURNzf0/t3RQczs+4rZUUHSaOAa4Ara5GQwBUdepOrQJiVQ8MmpYi4BxiRdxxmZtZ1jT7RYb1IatikbWZWRKVISpKOSVUcFkk6WlKTpEUV278u6eS0PEvSdyXNBo5qb59mZtbzGn4kIGkc2bOUPkD2pNnbWfPAv/YMjYh/b2d/U4GpAH0GD+vBSM3MrAwjpQlkkx1ejogVwBXAxE7ec2l7G1zRwcysdsqQlFRl3VA6ruDwcu3CMTOz9pQhKc0B9pc0QNLGwAHAtcDbJW0mqR/wsVwjNDMzoATXlCLiLknnk5UTAjg3Iu6U9N9k15ceBu5bl327ooOZWc9q6IoOteaKDmZm3VfKig69wRUdGoOrRZgVRxmuKXVI0i15x2BmZpnSJ6WI+GDeMZiZWab0SUnSivR7UqrmcJmk+yRdJKnadHIzM6uR0ielNsYCRwOjyIq57tq2gaSpkloktaxeuby34zMza2hOSmu7IyIej4g3gHlAU9sGruhgZlY7TkprW1WxvBrPTjQz61VOSmZmVhgeCawHV3QwM+tZpU9KETEw/Z4FzKpY/9WcQjIzK63SJ6X14YoOjc2VHsx6X6mvKUkaKunIvOMwM7NMqZMS2XOVnJTMzAqi7KfvTgW2kzQPuD6t2wcI4DsR0e4TaM3MrOeVfaR0AvBgRIwBbgPGADsBewKnSRre9g2u6GBmVjtlT0qVJgCXRMTqiHgamA3s3LaRKzqYmdWOk9IaLr5qZpazsiell4BBaXkOcJCkPpKGAbux5hHqZmbWC0o9
0SEinpV0s6RFwLXAAmA+2USH4yLiqY7e74oOZmY9q9RJCSAiPtNm1bG5BGJmZk5K68MVHawjrghh1n0Nf01J0v6SRlW8niJpi4rXsyQ15xOdmZlVavikBOxP9iTZVlOALao3NTOzPBU6KUnaWNKfJM2XtEjSQZLGSZotaa6kma03uEr6oqQ7U9vLJQ2Q9EFgP7IbYedJOh5oBi5Kr/u3+by9JN0q6S5Jv5c0sPd7bWZWXoVOSsDewD8iYqeIGA1cB/wYODAixgHnAdNT2ysiYueI2Am4F/h8RNwCXA0cGxFjIuL7QAswOb1+pfWDJG0OfBPYMyLen9od0zYgV3QwM6udok90WAicLun7wDXA88Bo4HpJAH2AJ1Pb0ZK+Q1ZkdSAws5uftQvZab6b0743BG5t2ygiZgAzAPoNHxnd/AwzM+tAoZNSRPxd0jhgX+B7ZEVTF0fE+CrNzwf2j4j5kqYAk7r5cQKuj4hD1j1iMzNbH4U+fZdmya2MiN8ApwMfAIZJGp+295W0Y2o+CHhSUl9gcsVuKqs2VHvd6jZgV0n/J+17gKTte7RDZmbWoUKPlID3kk1SeAN4Dfgy8DpwlqQhZPGfCSwG/h9wO/AI2Wm/1sTzW+AcSdOAA8lGVGdLegV4c8QVEf9MI6xLJPVLq78J/L3d4FzRwcysRynCl0XWVXNzc7S0tOQdhplZXZE0NyKq3h9a9JFSobmig3XEFR3Muq/Q15TyImmapHslXZR3LGZmZeKRUnVHAvtExMN5B2JmVialHylJOiZVi1gk6WhJZwMjgKslfS3v+MzMyqTUI6V0D9ThZFPNRTZ777NklSR2j4hlVd4zFZgK0GfwsN4L1sysBMo+UpoAXBkRL0fECuAKYGJHb4iIGRHRHBHNfQYM6ZUgzczKouxJSXkHYGZma5Q9Kc0B9k/VGzYGDgBuyjkmM7PSKvU1pYi4S9L5wB1p1bkRcXcqyNopV3QwM+tZpU5KABFxBnBGm3VN+URjZlZupU9K68MVHcysu1zpo2Nlv6bULkkr8o7BzKxsnJTMzKwwGjopSbpK0lxJi9NNr0haIWm6pPmSbpP0jrT+XZJulXSnpFPyjdzMrJwaOikBR0TEOKAZmCZpM2Bj4LaI2IlsSvgXU9sfAT+PiJ2Bp9rboaSpkloktaxeubzG4ZuZlUujJ6VpkuaTPVV2a2Ak8CpwTdo+F2hKy7sCl6TlC9vboSs6mJnVTsPOvpM0CdgTGB8RKyXNAjYCXos1TzZczdp/Az/x0MwsR408UhoCPJ8S0g7ALp20vxk4OC1PrmlkZmZWVcOOlIDrgC9JWgAsITuF15GjgIslHQVc3pUPcEUHM7OepTVnsqy7mpubo6WlJe8wzMzqiqS5EdFcbVsjj5RqzhUdzKyMalmVopGvKQEgaaikI9PyJEnXdPYeMzPLR8MnJWAocGTeQZiZWefKcPruVGA7SfOA14CXJV0GjCa7T+mzERHp0ehnAAOBZcCUiHgyr6DNzMqoDCOlE4AHI2IMcCwwFjgaGAWMAHaV1Bf4MXBgqgBxHjC92s5c0cHMrHbKMFJq646IeBwgjZ6agBfIRk7Xpwf89QGqjpIiYgYwA6Df8JGeumhm1oPKmJRWVSy3VnQQsDgixucTkpmZQTlO370EDOqkzRJgmKTxAJL6Stqx5pGZmdlaGn6kFBHPSrpZ0iLgFeDpKm1elXQgcJakIWR/lzOBxR3t2xUdzMx6VsMnJYCI+Ew7679asTwP2K3XgjIzs7cow+k7MzOrE05KZmZWGE5KZmZWGE5KZmZWGE5KZmZWGE5KZmZWGE5KZmZWGE5KZmZWGH4c+nqQ9BJZiaJGsDnZIzsagftSPI3SD3BfesK2ETGs2oZSVHSooSXtPWe+3khqcV+Kp1H60ij9APel1nz6zszMCsNJyczMCsNJaf3MyDuAHuS+FFOj9KVR+gHuS015ooOZmRWGR0pmZlYYTkpmZlYYTkrrSNLekpZIekDSCXnH
0xFJW0u6UdK9khZLOiqt31TS9ZLuT783Sesl6azUtwWS3p9vD95KUh9Jd0u6Jr1+l6TbU18ulbRhWt8vvX4gbW/KM+62JA2VdJmk+9LxGV+Px0XS19J/W4skXSJpo3o5JpLOk/RMejp167puHwNJh6X290s6rEB9OS3997VA0pWShlZsOzH1ZYmkj1Ssz+/7LSL8080foA/wIDAC2BCYD4zKO64O4h0OvD8tDwL+DowCfgCckNafAHw/Le8LXAsI2AW4Pe8+VOnTMcDFwDXp9e+Ag9Py2cCX0/KRwNlp+WDg0rxjb9OPC4AvpOUNgaH1dlyALYGHgf4Vx2JKvRwTsidOvx9YVLGuW8cA2BR4KP3eJC1vUpC+7AVskJa/X9GXUem7qx/wrvSd1ifv77fc/4Ouxx9gPDCz4vWJwIl5x9WN+P8AfJisGsXwtG442c3AAL8ADqlo/2a7IvwAWwE3AB8CrklfEMsq/uG9eXyAmcD4tLxBaqe8+5DiGZy+zNVmfV0dl5SUHktfyBukY/KRejomQFObL/JuHQPgEOAXFevXapdnX9psOwC4KC2v9b3Velzy/n7z6bt10/qPsNXjaV3hpVMlY4HbgXdExJMA6ffbU7Oi9+9M4DjgjfR6M+CFiHg9va6M982+pO3LU/siGAH8E/hVOhV5rqSNqbPjEhFPAKcDjwJPkv2N51Kfx6RVd49BIY9NFUeQjfSgoH1xUlo3qrKu8HPrJQ0ELgeOjogXO2paZV0h+ifpY8AzETG3cnWVptGFbXnbgOxUy88jYizwMtmpovYUsi/pessnyE4BbQFsDOxTpWk9HJPOtBd74fsk6STgdeCi1lVVmuXeFyeldfM4sHXF662Af+QUS5dI6kuWkC6KiCvS6qclDU/bhwPPpPVF7t+uwH6SlgK/JTuFdyYwVFJrLcfKeN/sS9o+BHiuNwPuwOPA4xFxe3p9GVmSqrfjsifwcET8MyJeA64APkh9HpNW3T0GRT02QDYJA/gYMDnSOTkK2hcnpXVzJzAyzS7akOxi7dU5x9QuSQJ+CdwbEWdUbLoaaJ0ldBjZtabW9YemmUa7AMtbT2XkLSJOjIitIqKJ7O/+14iYDNwIHJiate1Lax8PTO0L8X+wEQgCwhYAAAJ1SURBVPEU8Jikd6dVewD3UH/H5VFgF0kD0n9rrf2ou2NSobvHYCawl6RN0shxr7Qud5L2Bo4H9ouIlRWbrgYOTrMh3wWMBO4g7++3PC7ENcIP2Sycv5PNUjkp73g6iXUC2fB7ATAv/exLdh7/BuD+9HvT1F7AT1PfFgLNefehnX5NYs3suxFk/6AeAH4P9EvrN0qvH0jbR+Qdd5s+jAFa0rG5imzmVt0dF+C/gPuARcCFZDO66uKYAJeQXQt7jWyU8Pl1OQZk12seSD+HF6gvD5BdI2r9t392RfuTUl+WAPtUrM/t+81lhszMrDB8+s7MzArDScnMzArDScnMzArDScnMzArDScnMzArDScms5CQdLWlA3nGYgZ88a1Z6qTpGc0QsyzsWM4+UzOqApEPT83DmS7pQ0raSbkjrbpC0TWp3vqQDK963Iv2eJGmW1jy76aJUlWAaWb26GyXdmE/vzNbYoPMmZpYnSTuS3Xm/a0Qsk7Qp2XOYfh0RF0g6AjgL2L+TXY0FdiSrY3Zz2t9Zko4BdvdIyYrAIyWz4vsQcFlr0oiI58ieeXNx2n4hWSmpztwREY9HxBtk5WaaahCr2XpxUjIrPtH5owNat79O+nediqNuWNFmVcXyanymxArIScms+G4APi1pM4B0+u4WsurNAJOBv6XlpcC4tPwJoG8X9v8SMKingjVbH/4/JbOCi4jFkqYDsyWtBu4GpgHnSTqW7Om1h6fm5wB/kHQHWTJ7uQsfMQO4VtKTEbF7z/fArOs8JdzMzArDp+/MzKwwnJTMzKwwnJTMzKwwnJTMzKwwnJTMzKwwnJTMzKwwnJTMzKww/j/5MpZU5GEvRAAAAABJ
RU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As the chart shows, most of the top 20 words are uninformative filler words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
+ " # Return the n most frequent words in the corpus, with stop-word filtering\n",
+ " vec = CountVectorizer(stop_words='english').fit(corpus) # use the built-in English stop-word list so common filler words are dropped automatically\n",
+ " bag_of_words = vec.transform(corpus)\n",
+ " sum_words = bag_of_words.sum(axis=0)\n",
+ " words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each word with its total count\n",
+ " words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True)\n",
+ " return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
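+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>desc</th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>seattle</td>\n",
+ "      <td>533</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>hotel</td>\n",
+ "      <td>295</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>center</td>\n",
+ "      <td>151</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>downtown</td>\n",
+ "      <td>133</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>free</td>\n",
+ "      <td>123</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"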
+ ],
+ "text/plain": [
+ " desc count\n",
+ "0 seattle 533\n",
+ "1 hotel 295\n",
+ "2 center 151\n",
+ "3 downtown 133\n",
+ "4 free 123"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "df_common_words.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
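+ "This time the top 20 is much clearer: the most frequent words are seattle, hotel, center, and so on. The text is still split one word at a time, though, and words can mean something different once they are combined into phrases. For example, a hotel near the convenience store at the airport is not only near a convenience store; it also has to be near the airport."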
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
+ " # Return the n most frequent two-word phrases (bigrams) in the corpus, with stop-word filtering\n",
+ " vec = CountVectorizer(stop_words='english',ngram_range=(2,2)).fit(corpus) # ngram_range=(2,2) counts only sequences of two consecutive words\n",
+ " bag_of_words = vec.transform(corpus)\n",
+ " sum_words = bag_of_words.sum(axis=0)\n",
+ " words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each phrase with its total count\n",
+ " words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True)\n",
+ " return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
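+ "Now every term is a two-word phrase. The top phrase, Pike Place, is the site of a well-known public market in Seattle, and key terms such as wifi also appear."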
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
"nbformat": 4,
"nbformat_minor": 2
}
diff --git a/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/酒店推荐.ipynb b/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/酒店推荐.ipynb
index 64d22dd..4c5ff63 100644
--- a/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/酒店推荐.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/基于相似度的酒店推荐系统/酒店推荐.ipynb
@@ -12,23 +12,9 @@
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 26,
"metadata": {},
"outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "D:\\Anaconda3\\lib\\importlib\\_bootstrap.py:219: RuntimeWarning:\n",
- "\n",
- "numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject\n",
- "\n",
- "D:\\Anaconda3\\lib\\importlib\\_bootstrap.py:219: RuntimeWarning:\n",
- "\n",
- "numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject\n",
- "\n"
- ]
- },
{
"data": {
"text/html": [
@@ -64,6 +50,7 @@
"import re\n",
"import random\n",
"import cufflinks # pip install cufflinks\n",
+ "import matplotlib.pyplot as plt\n",
"from plotly.offline import iplot\n",
"cufflinks.go_offline()"
]
@@ -163,7 +150,7 @@
}
],
"source": [
- "df = pd.read_csv(\"data/Seattle_Hotels.csv\", encoding=\"latin-1\")\n",
+ "df = pd.read_csv(\"data/Seattle_Hotels.csv\", encoding=\"latin-1\") # Seattle hotel dataset\n",
"df.head()"
]
},
@@ -2348,6 +2335,383 @@
"这里重复最多的the我们并不是重要的信息词,后面我们需要进行怎样的优化呢"
]
},
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
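+ " # Return the n most frequent words in the corpus\n",
+ " vec = CountVectorizer().fit(corpus) # fit the vectorizer to learn the vocabulary\n",
+ " bag_of_words = vec.transform(corpus) # convert the text to a document-term count matrix\n",
+ " sum_words = bag_of_words.sum(axis=0) # total occurrences of each word across all documents\n",
+ " words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each word with its total count\n",
+ " words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True) # sort by count, descending\n",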
+ " return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[('the', 1258),\n",
+ " ('and', 1062),\n",
+ " ('of', 536),\n",
+ " ('seattle', 533),\n",
+ " ('to', 471),\n",
+ " ('in', 449),\n",
+ " ('our', 359),\n",
+ " ('you', 304),\n",
+ " ('hotel', 295),\n",
+ " ('with', 280),\n",
+ " ('is', 271),\n",
+ " ('at', 231),\n",
+ " ('from', 224),\n",
+ " ('for', 216),\n",
+ " ('your', 186),\n",
+ " ('or', 161),\n",
+ " ('center', 151),\n",
+ " ('are', 136),\n",
+ " ('downtown', 133),\n",
+ " ('on', 129)]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "common_words"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
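+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>desc</th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>the</td>\n",
+ "      <td>1258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>and</td>\n",
+ "      <td>1062</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>of</td>\n",
+ "      <td>536</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>seattle</td>\n",
+ "      <td>533</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>to</td>\n",
+ "      <td>471</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"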
+ ],
+ "text/plain": [
+ " desc count\n",
+ "0 the 1258\n",
+ "1 and 1062\n",
+ "2 of 536\n",
+ "3 seattle 533\n",
+ "4 to 471"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "df_common_words.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As the chart shows, most of the top 20 words are uninformative filler words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
+ " # Return the n most frequent words in the corpus, with stop-word filtering\n",
+ " vec = CountVectorizer(stop_words='english').fit(corpus) # use the built-in English stop-word list so common filler words are dropped automatically\n",
+ " bag_of_words = vec.transform(corpus)\n",
+ " sum_words = bag_of_words.sum(axis=0)\n",
+ " words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each word with its total count\n",
+ " words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True)\n",
+ " return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
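+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>desc</th>\n",
+ "      <th>count</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>seattle</td>\n",
+ "      <td>533</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>hotel</td>\n",
+ "      <td>295</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>center</td>\n",
+ "      <td>151</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>downtown</td>\n",
+ "      <td>133</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>free</td>\n",
+ "      <td>123</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"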
+ ],
+ "text/plain": [
+ " desc count\n",
+ "0 seattle 533\n",
+ "1 hotel 295\n",
+ "2 center 151\n",
+ "3 downtown 133\n",
+ "4 free 123"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "df_common_words.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
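+ "This time the top 20 is much clearer: the most frequent words are seattle, hotel, center, and so on. The text is still split one word at a time, though, and words can mean something different once they are combined into phrases. For example, a hotel near the convenience store at the airport is not only near a convenience store; it also has to be near the airport."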
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_n_words(corpus, n=None):\n",
+ " # Return the n most frequent two-word phrases (bigrams) in the corpus, with stop-word filtering\n",
+ " vec = CountVectorizer(stop_words='english',ngram_range=(2,2)).fit(corpus) # ngram_range=(2,2) counts only sequences of two consecutive words\n",
+ " bag_of_words = vec.transform(corpus)\n",
+ " sum_words = bag_of_words.sum(axis=0)\n",
+ " words_freq = [(word, sum_words[0,idx]) for word,idx in vec.vocabulary_.items()] # pair each phrase with its total count\n",
+ " words_freq = sorted(words_freq, key=lambda x:x[1],reverse=True)\n",
+ " return words_freq[:n]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Text(0.5, 1.0, 'top 20')"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "common_words = get_top_n_words(df['desc'], 20)\n",
+ "df_common_words = pd.DataFrame(common_words, columns=['desc', 'count'])\n",
+ "plt.barh(df_common_words['desc'], df_common_words['count'])\n",
+ "plt.xlabel('count')\n",
+ "plt.ylabel('words')\n",
+ "plt.title('top 20')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
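+ "Now every term is a two-word phrase. The top phrase, Pike Place, is the site of a well-known public market in Seattle, and key terms such as wifi also appear."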
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,