Add .lower()

pull/2/head
benjas 4 years ago
parent 5f1497ce6b
commit 7e5207979b

@ -12,7 +12,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 71,
"metadata": {},
"outputs": [
{
@ -57,7 +57,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@ -144,7 +144,7 @@
"4 Situated amid incredible shopping and iconic a... "
]
},
"execution_count": 4,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@ -163,7 +163,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@ -172,7 +172,7 @@
"(152, 3)"
]
},
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@ -183,7 +183,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@ -192,7 +192,7 @@
"\"Located on the southern tip of Lake Union, the Hilton Garden Inn Seattle Downtown hotel is perfectly located for business and leisure. \\nThe neighborhood is home to numerous major international companies including Amazon, Google and the Bill & Melinda Gates Foundation. A wealth of eclectic restaurants and bars make this area of Seattle one of the most sought out by locals and visitors. Our proximity to Lake Union allows visitors to take in some of the Pacific Northwest's majestic scenery and enjoy outdoor activities like kayaking and sailing. over 2,000 sq. ft. of versatile space and a complimentary business center. State-of-the-art A/V technology and our helpful staff will guarantee your conference, cocktail reception or wedding is a success. Refresh in the sparkling saltwater pool, or energize with the latest equipment in the 24-hour fitness center. Tastefully decorated and flooded with natural light, our guest rooms and suites offer everything you need to relax and stay productive. Unwind in the bar, and enjoy American cuisine for breakfast, lunch and dinner in our restaurant. The 24-hour Pavilion Pantry? stocks a variety of snacks, drinks and sundries.\""
]
},
"execution_count": 6,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@ -211,7 +211,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@ -221,7 +221,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@ -236,7 +236,7 @@
" [0, 0, 0, ..., 1, 0, 0]], dtype=int64)"
]
},
"execution_count": 8,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@ -247,7 +247,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@ -256,7 +256,7 @@
"(152, 3200)"
]
},
"execution_count": 9,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@ -267,7 +267,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@ -276,7 +276,7 @@
"matrix([[ 1, 11, 11, ..., 2, 6, 2]], dtype=int64)"
]
},
"execution_count": 10,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@ -288,7 +288,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 9,
"metadata": {},
"outputs": [
{
@ -1297,7 +1297,7 @@
" ...]"
]
},
"execution_count": 13,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@ -1309,7 +1309,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 10,
"metadata": {},
"outputs": [
{
@ -2318,7 +2318,7 @@
" ...]"
]
},
"execution_count": 14,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@ -2337,7 +2337,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@ -2353,7 +2353,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@ -2381,7 +2381,7 @@
" ('on', 129)]"
]
},
"execution_count": 16,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@ -2393,7 +2393,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 13,
"metadata": {},
"outputs": [
{
@ -2460,7 +2460,7 @@
"4 to 471"
]
},
"execution_count": 17,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@ -2472,7 +2472,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 14,
"metadata": {
"scrolled": false
},
@ -2483,7 +2483,7 @@
"Text(0.5, 1.0, 'top 20')"
]
},
"execution_count": 28,
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
@ -2516,7 +2516,7 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@ -2532,7 +2532,7 @@
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 16,
"metadata": {},
"outputs": [
{
@ -2599,7 +2599,7 @@
"4 free 123"
]
},
"execution_count": 30,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@ -2612,7 +2612,7 @@
},
{
"cell_type": "code",
"execution_count": 31,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@ -2621,7 +2621,7 @@
"Text(0.5, 1.0, 'top 20')"
]
},
"execution_count": 31,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
},
@ -2654,7 +2654,7 @@
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
@ -2670,7 +2670,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@ -2679,7 +2679,7 @@
"Text(0.5, 1.0, 'top 20')"
]
},
"execution_count": 33,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
@ -2712,6 +2712,335 @@
"Now all the words are joined together. The first term, Pike Place, is a plaza in Seattle, followed by keywords such as wifi."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text cleaning\n",
"Some statistics on the descriptions"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"df['word_count'] = df['desc'].apply(lambda x: len(str(x).split())) # number of words in each description"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>address</th>\n",
" <th>desc</th>\n",
" <th>word_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Hilton Garden Seattle Downtown</td>\n",
" <td>1821 Boren Avenue, Seattle Washington 98101 USA</td>\n",
" <td>Located on the southern tip of Lake Union, the...</td>\n",
" <td>184</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Sheraton Grand Seattle</td>\n",
" <td>1400 6th Avenue, Seattle, Washington 98101 USA</td>\n",
" <td>Located in the city's vibrant core, the Sherat...</td>\n",
" <td>152</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Crowne Plaza Seattle Downtown</td>\n",
" <td>1113 6th Ave, Seattle, WA 98101</td>\n",
" <td>Located in the heart of downtown Seattle, the ...</td>\n",
" <td>147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Kimpton Hotel Monaco Seattle</td>\n",
" <td>1101 4th Ave, Seattle, WA98101</td>\n",
" <td>What?s near our hotel downtown Seattle locatio...</td>\n",
" <td>150</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>The Westin Seattle</td>\n",
" <td>1900 5th Avenue, Seattle, Washington 98101 USA</td>\n",
" <td>Situated amid incredible shopping and iconic a...</td>\n",
" <td>151</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"0 Hilton Garden Seattle Downtown \n",
"1 Sheraton Grand Seattle \n",
"2 Crowne Plaza Seattle Downtown \n",
"3 Kimpton Hotel Monaco Seattle \n",
"4 The Westin Seattle \n",
"\n",
" address \\\n",
"0 1821 Boren Avenue, Seattle Washington 98101 USA \n",
"1 1400 6th Avenue, Seattle, Washington 98101 USA \n",
"2 1113 6th Ave, Seattle, WA 98101 \n",
"3 1101 4th Ave, Seattle, WA98101 \n",
"4 1900 5th Avenue, Seattle, Washington 98101 USA \n",
"\n",
" desc word_count \n",
"0 Located on the southern tip of Lake Union, the... 184 \n",
"1 Located in the city's vibrant core, the Sherat... 152 \n",
"2 Located in the heart of downtown Seattle, the ... 147 \n",
"3 What?s near our hotel downtown Seattle locatio... 150 \n",
"4 Situated amid incredible shopping and iconic a... 151 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAANR0lEQVR4nO3db4xld13H8ffHLv8K1QKdktoyThubBkKgJRNorVEsYBbaUB/0AQ1g1Zp5IloMCW5DIvFZjQbQaNCNrZDYFGOB0LRRuikQYoLF3bLAlm1pwRXWVrak/DFoLNWvD+ZsMw67M7P3npnZ79z3K7m59/zub+75/s7c+cyZc8/5TaoKSVI/P7HdBUiSJmOAS1JTBrgkNWWAS1JTBrgkNbVrK1d2zjnn1MLCwlauUpLaO3DgwHeqam51+5YG+MLCAvv379/KVUpSe0n+9UTtHkKRpKYMcElqygCXpKYMcElqygCXpKYMcElqygCXpKYMcElqygCXpKa29EpM9bCw556TPnfklqu3sBJJa3EPXJKaMsAlqSkDXJKaMsAlqSkDXJKaMsAlqSkDXJKaMsAlqSkDXJKaMsAlqSkDXJKaWjfAk9yW5FiSQyva/ijJQ0m+nOQTSc7e3DIlSattZA/8w8DuVW37gFdU1SuBrwE3j1yXJGkd6wZ4VX0OeHJV271V9fSw+E/ABZtQmyRpDWMcA/8N4O9HeB1J0imYKsCTvBd4Grh9jT5LSfYn2f/EE09MszpJ0goTB3iSG4BrgLdVVZ2sX1XtrarFqlqcm5ubdHWSpFUm+o88SXYDvwf8YlX957glSZI2YiOnEd4BfB64JMnRJDcCfwacBexLcjDJX2xynZKkVdbdA6+q60/QfOsm1CJJOgVeiSlJTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTRngktSUAS5JTa0b4EluS3IsyaEVbS9Ksi/JI8P9Cze3TEnSahvZA/8wsHtV2x7gvqq6GLhvWJYkbaF1A7yqPgc8uar5WuAjw+OPAL8ycl2SpHVMegz8JVX1OMBwf+54JUmSNmLXZq8gyRKwBDA/P7/Zq9MJLOy554TtR265eosrkTSmSffAv53kPIDh/tjJOlbV3qparKrFubm5CVcnSVpt0gC/C7hheHwD8MlxypEkbdRGTiO8A/g8cEmSo0luBG4B3pjkEeCNw7IkaQutewy8qq4/yVOvH7kWSdIp8EpMSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpjZ9OllNbrOngT3Z60/yNZs9Na1T4ko/zj1wSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpgxwSWrKAJekpqYK8CS/m+TBJIeS3JHkuWMVJkla28QBnuR84HeAxap6BXAG8NaxCpMkrW3aQyi7gOcl2QWcCTw2fUmSpI2YeD7wqvq3JH8MfBP4L+Deqrp3db8kS8ASwPz8/KSr0w7j/N7S9KY5hPJC4FrgQuCngecnefvqflW1t6oWq2pxbm5u8kolSf/PNIdQ3gD8S1U9UVU/Aj4O/Nw4ZUmS1jNNgH8TuDzJmUkCvB44PE5ZkqT1TBzgVXU/cCfwAPCV4bX2jlSXJGkdU/1T46p6H/C+kWqRJJ0Cr8SUpKYMcElqygCXpKYMcElqygCXpKYMcElqygCXpKYMcElqygCXpKYMcElqaqpL6Ts62TzUcPK5qLvMXb3W2CTtPO6BS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNTVVgCc5O8mdSR5KcjjJFWMVJkla27SzEf4J8A9VdV2SZwNnjlCTJG
kDJg7wJD8J/ALwawBV9RTw1DhlSZLWM80e+EXAE8BfJ3kVcAC4qap+uLJTkiVgCWB+fn6K1Z2a03Fu7C7zio/pVL8Pp+P3TTpdTXMMfBfwauBDVXUZ8ENgz+pOVbW3qharanFubm6K1UmSVpomwI8CR6vq/mH5TpYDXZK0BSYO8Kr6d+BbSS4Zml4PfHWUqiRJ65r2LJTfBm4fzkD5BvDr05ckSdqIqQK8qg4CiyPVIkk6BV6JKUlNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNGeCS1JQBLklNTR3gSc5I8sUkd49RkCRpY8bYA78JODzC60iSTsFUAZ7kAuBq4K/GKUeStFG7pvz6DwLvAc46WYckS8ASwPz8/JSr25kW9tyzqf13spNtiyO3XL2prz/mOqRJTbwHnuQa4FhVHVirX1XtrarFqlqcm5ubdHWSpFWmOYRyJfCWJEeAjwJXJfmbUaqSJK1r4gCvqpur6oKqWgDeCny6qt4+WmWSpDV5HrgkNTXth5gAVNVngc+O8VqSpI1xD1ySmjLAJakpA1ySmjLAJakpA1ySmjLAJakpA1ySmjLAJakpA1ySmjLAJampUS6l3ymcl1tSJ+6BS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNWWAS1JTBrgkNTVxgCd5aZLPJDmc5MEkN41ZmCRpbdPMRvg08O6qeiDJWcCBJPuq6qsj1SZJWsPEe+BV9XhVPTA8/g/gMHD+WIVJktY2ynzgSRaAy4D7T/DcErAEMD8/P8bqdBrqMjf6yeo8csvVW1yJNL2pP8RM8gLgY8C7quoHq5+vqr1VtVhVi3Nzc9OuTpI0mCrAkzyL5fC+vao+Pk5JkqSNmOYslAC3Aoer6v3jlSRJ2ohp9sCvBN4BXJXk4HB780h1SZLWMfGHmFX1j0BGrEWSdAq8ElOSmjLAJakpA1ySmjLAJakpA1ySmjLAJakpA1ySmjLAJakpA1ySmjLAJampUeYD3wpd5pvW6eFU3y9jvr9Odc7xsdbdaU7z021e9rHqWet7uRljcw9ckpoywCWpKQNckpoywCWpKQNckpoywCWpKQNckpoywCWpKQNckpoywCWpKQNckpoywCWpqakCPMnuJA8neTTJnrGKkiStb+IAT3IG8OfAm4CXA9cneflYhUmS1jbNHvhrgEer6htV9RTwUeDaccqSJK0nVTXZFybXAbur6jeH5XcAr62qd67qtwQsDYuXAA+veqlzgO9MVMTOMMvjn+Wxg+Of5fGf6th/pqrmVjdO8w8dcoK2H/ttUFV7gb0nfZFkf1UtTlFHa7M8/lkeOzj+WR7/WGOf5hDKUeClK5YvAB6brhxJ0kZNE+D/DFyc5MIkzwbeCtw1TlmSpPVMfAilqp5O8k7gU8AZwG1V9eAEL3XSwyszYpbHP8tjB8c/y+MfZewTf4gpSdpeXokpSU0Z4JLU1LYF+Cxchp/ktiTHkhxa0faiJPuSPDLcv3BoT5I/HbbHl5O8evsqH0eSlyb5TJLDSR5MctPQvuO3QZLnJvlCki8NY/+Dof3CJPcPY//b4QQAkjxnWH50eH5hO+sfS5Izknwxyd3D8syMP8mRJF9JcjDJ/qFt1Pf+tgT4DF2G/2Fg96q2PcB9VXUxcN+wDMvb4uLhtgR8aItq3ExPA++uqpcBlwO/NXyfZ2Eb/DdwVVW9CrgU2J3kcuAPgQ8MY/8ucOPQ/0bgu1X1s8AHhn47wU3A4RXLszb+X6qqS1ec8z3ue7+qtvwGXAF8asXyzcDN21HLFox1ATi0Yvlh4Lzh8XnAw8PjvwSuP1G/nXIDPgm8cda2AXAm8ADwWpavvts1tD/zc8Dy2VxXDI93Df2y3bVPOe4LhpC6Crib5Yv/Zmn8R4
BzVrWN+t7frkMo5wPfWrF8dGibBS+pqscBhvtzh/YdvU2GP4kvA+5nRrbBcPjgIHAM2Ad8HfheVT09dFk5vmfGPjz/feDFW1vx6D4IvAf432H5xczW+Au4N8mBYUoRGPm9P82l9NPY0GX4M2bHbpMkLwA+Bryrqn6QnGioy11P0NZ2G1TV/wCXJjkb+ATwshN1G+531NiTXAMcq6oDSV53vPkEXXfk+AdXVtVjSc4F9iV5aI2+E41/u/bAZ/ky/G8nOQ9guD82tO/IbZLkWSyH9+1V9fGheaa2QVV9D/gsy58DnJ3k+I7TyvE9M/bh+Z8CntzaSkd1JfCWJEdYnqn0Kpb3yGdl/FTVY8P9MZZ/gb+Gkd/72xXgs3wZ/l3ADcPjG1g+Lny8/VeHT6MvB75//E+trrK8q30rcLiq3r/iqR2/DZLMDXveJHke8AaWP8z7DHDd0G312I9vk+uAT9dwMLSjqrq5qi6oqgWWf74/XVVvY0bGn+T5Sc46/hj4ZeAQY7/3t/EA/5uBr7F8XPC92/2BwyaN8Q7gceBHLP+GvZHl43r3AY8M9y8a+oblM3O+DnwFWNzu+kcY/8+z/Gfgl4GDw+3Ns7ANgFcCXxzGfgj4/aH9IuALwKPA3wHPGdqfOyw/Ojx/0XaPYcRt8Trg7lka/zDOLw23B49n3NjvfS+ll6SmvBJTkpoywCWpKQNckpoywCWpKQNckpoywCWpKQNckpr6PxsSaXwiGw21AAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.hist(df['word_count'], bins=50)\n",
"plt.show() # most descriptions are under 250 words, so none is very long"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"# Filter out the words we do not need to keep (stop words)\n",
"from nltk.corpus import stopwords\n",
"\n",
"set_stopwords = set(stopwords.words('english'))"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [],
"source": [
"def clean_txt(text):\n",
"    sub_replace = re.compile('[^0-9a-z]') # match anything that is not a digit or a lowercase letter\n",
"    text = sub_replace.sub(' ', text)\n",
"    return text"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>address</th>\n",
" <th>desc</th>\n",
" <th>word_count</th>\n",
" <th>desc_clean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Hilton Garden Seattle Downtown</td>\n",
" <td>1821 Boren Avenue, Seattle Washington 98101 USA</td>\n",
" <td>Located on the southern tip of Lake Union, the...</td>\n",
" <td>184</td>\n",
" <td>located southern tip lake union hilton garden...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Sheraton Grand Seattle</td>\n",
" <td>1400 6th Avenue, Seattle, Washington 98101 USA</td>\n",
" <td>Located in the city's vibrant core, the Sherat...</td>\n",
" <td>152</td>\n",
" <td>located city vibrant core sheraton grand seat...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Crowne Plaza Seattle Downtown</td>\n",
" <td>1113 6th Ave, Seattle, WA 98101</td>\n",
" <td>Located in the heart of downtown Seattle, the ...</td>\n",
" <td>147</td>\n",
" <td>located heart downtown seattle award winning ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Kimpton Hotel Monaco Seattle</td>\n",
" <td>1101 4th Ave, Seattle, WA98101</td>\n",
" <td>What?s near our hotel downtown Seattle locatio...</td>\n",
" <td>150</td>\n",
" <td>near hotel downtown seattle location better ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>The Westin Seattle</td>\n",
" <td>1900 5th Avenue, Seattle, Washington 98101 USA</td>\n",
" <td>Situated amid incredible shopping and iconic a...</td>\n",
" <td>151</td>\n",
" <td>situated amid incredible shopping iconic attra...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"0 Hilton Garden Seattle Downtown \n",
"1 Sheraton Grand Seattle \n",
"2 Crowne Plaza Seattle Downtown \n",
"3 Kimpton Hotel Monaco Seattle \n",
"4 The Westin Seattle \n",
"\n",
" address \\\n",
"0 1821 Boren Avenue, Seattle Washington 98101 USA \n",
"1 1400 6th Avenue, Seattle, Washington 98101 USA \n",
"2 1113 6th Ave, Seattle, WA 98101 \n",
"3 1101 4th Ave, Seattle, WA98101 \n",
"4 1900 5th Avenue, Seattle, Washington 98101 USA \n",
"\n",
" desc word_count \\\n",
"0 Located on the southern tip of Lake Union, the... 184 \n",
"1 Located in the city's vibrant core, the Sherat... 152 \n",
"2 Located in the heart of downtown Seattle, the ... 147 \n",
"3 What?s near our hotel downtown Seattle locatio... 150 \n",
"4 Situated amid incredible shopping and iconic a... 151 \n",
"\n",
" desc_clean \n",
"0 located southern tip lake union hilton garden... \n",
"1 located city vibrant core sheraton grand seat... \n",
"2 located heart downtown seattle award winning ... \n",
"3 near hotel downtown seattle location better ... \n",
"4 situated amid incredible shopping iconic attra... "
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['desc_clean'] = df['desc'].str.lower() # convert everything to lowercase\n",
"df['desc_clean'] = df['desc_clean'].apply(clean_txt)\n",
"df['desc_clean'] = df['desc_clean'].str.split(' ').apply(lambda x: ' '.join(k for k in x if k not in set_stopwords))\n",
"# df['desc_clean'] = df['desc_clean'].apply(clean_txt)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Located on the southern tip of Lake Union, the Hilton Garden Inn Seattle Downtown hotel is perfectly located for business and leisure. \\nThe neighborhood is home to numerous major international companies including Amazon, Google and the Bill & Melinda Gates Foundation. A wealth of eclectic restaurants and bars make this area of Seattle one of the most sought out by locals and visitors. Our proximity to Lake Union allows visitors to take in some of the Pacific Northwest's majestic scenery and enjoy outdoor activities like kayaking and sailing. over 2,000 sq. ft. of versatile space and a complimentary business center. State-of-the-art A/V technology and our helpful staff will guarantee your conference, cocktail reception or wedding is a success. Refresh in the sparkling saltwater pool, or energize with the latest equipment in the 24-hour fitness center. Tastefully decorated and flooded with natural light, our guest rooms and suites offer everything you need to relax and stay productive. Unwind in the bar, and enjoy American cuisine for breakfast, lunch and dinner in our restaurant. The 24-hour Pavilion Pantry? stocks a variety of snacks, drinks and sundries.\""
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['desc'][0]"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'located southern tip lake union hilton garden inn seattle downtown hotel perfectly located business leisure neighborhood home numerous major international companies including amazon google bill melinda gates foundation wealth eclectic restaurants bars make area seattle one sought locals visitors proximity lake union allows visitors take pacific northwest majestic scenery enjoy outdoor activities like kayaking sailing 2 000 sq ft versatile space complimentary business center state art v technology helpful staff guarantee conference cocktail reception wedding success refresh sparkling saltwater pool energize latest equipment 24 hour fitness center tastefully decorated flooded natural light guest rooms suites offer everything need relax stay productive unwind bar enjoy american cuisine breakfast lunch dinner restaurant 24 hour pavilion pantry stocks variety snacks drinks sundries '"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['desc_clean'][0]"
]
},
{
"cell_type": "code",
"execution_count": null,

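A note on why the `.lower()` call in this commit matters: the pattern `[^0-9a-z]` in `clean_txt` only keeps digits and lowercase letters, so any uppercase letter that reaches it is replaced by a space. The sketch below reproduces the cleaning steps on a plain string to show the difference. It is a minimal illustration and not the notebook's code: `STOPWORDS` is a small hypothetical stand-in for nltk's `stopwords.words('english')`, the `lowercase` flag is added only for the comparison, and it uses `split()` rather than the notebook's `split(' ')`, which also drops the empty tokens that otherwise leave repeated spaces in `desc_clean`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# builds its set from nltk's stopwords.words('english').
STOPWORDS = {"the", "of", "on", "is", "for", "and", "a", "to", "in", "our"}

# Same pattern as the notebook's clean_txt: anything that is not a
# digit or a lowercase ASCII letter.
NON_ALNUM = re.compile(r"[^0-9a-z]")


def clean_txt(text, lowercase=True):
    """Lowercase (optionally), strip non-alphanumerics, drop stop words."""
    if lowercase:
        text = text.lower()
    text = NON_ALNUM.sub(" ", text)
    # split() with no argument discards empty tokens, so consecutive
    # spaces left by the substitution do not survive the join.
    return " ".join(w for w in text.split() if w not in STOPWORDS)


sample = "Located on the southern tip of Lake Union, the Hilton Garden Inn"
print(clean_txt(sample))
print(clean_txt(sample, lowercase=False))
```

With lowercasing, the sample keeps whole words; without it, every capital letter is stripped and words such as "Lake" are truncated to "ake", which would distort the token counts used later in the notebook.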