{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Analyzing COVID-19 Papers\r\n",
"\r\n",
"In this challenge, we will continue with the topic of COVID pandemic, and focus on processing scientific papers on the subject. There is [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) with more than 7000 (at the time of writing) papers on COVID, available with metadata and abstracts (and for about half of them there is also full text provided).\r\n",
"\r\n",
"A full example of analyzing this dataset using [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=academic-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss simplified version of this analysis."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 145,
"source": [
"import pandas as pd\r\n",
"import numpy as np\r\n",
"import matplotlib.pyplot as plt"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Getting the Data\r\n",
"\r\n",
"First, we need get the metadata for CORD papers that we will be working with.\r\n",
"\r\n",
"**NOTE**: We do not provide a copy of the dataset as part of this repository. You may first need to download the [`metadata.csv`](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) file from [this dataset on Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Registration with Kaggle may be required. You may also download the dataset without registration [from here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), but it will include all full texts in addition to metadata file.\r\n",
"\r\n",
"We will try to get the data directly from online source, however, if it fails, you need to download the data as described above. Also, it makes sense to download the data if you plan to experiment with it further, to save on waiting time.\r\n",
"\r\n",
"> **NOTE** that dataset is quite large, around 1 Gb in size, and the following line of code can take a long time to complete! (~5 mins)"
],
"metadata": {}
},
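{
"cell_type": "markdown",
"source": [
"The fallback described above can be sketched as a small helper that tries each source in turn. This is only a minimal sketch under stated assumptions: `load_first_available` is a hypothetical name that is not part of the original notebook, and the sketch relies on `pd.read_csv` inferring the compression from a `.zip` extension.\r\n",
"\r\n",
"```python\r\n",
"import pandas as pd\r\n",
"\r\n",
"\r\n",
"def load_first_available(sources):\r\n",
"    # Hypothetical helper: return a DataFrame read from the first\r\n",
"    # source (URL or local path) that can be opened.\r\n",
"    last_error = None\r\n",
"    for src in sources:\r\n",
"        try:\r\n",
"            return pd.read_csv(src)\r\n",
"        except OSError as err:\r\n",
"            # Missing local files and failed downloads both raise\r\n",
"            # subclasses of OSError, so move on to the next source.\r\n",
"            last_error = err\r\n",
"    raise FileNotFoundError(f'none of the sources could be read: {sources}') from last_error\r\n",
"```\r\n",
"\r\n",
"Passing the online URL used in the next cell followed by `'metadata.csv'` would try the online copy first and then the local file."
],
"metadata": {}
},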
{
"cell_type": "code",
"execution_count": 146,
"source": [
"df = pd.read_csv(\"https://datascience4beginners.blob.core.windows.net/cord/metadata.csv.zip\",compression='zip')\r\n",
"# df = pd.read_csv(\"metadata.csv\")\r\n",
"df.head()"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"C:\\winapp\\Miniconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3441: DtypeWarning:\n",
"\n",
"Columns (1,4,5,6,13,14,15,16) have mixed types.Specify dtype option on import or set low_memory=False.\n",
"\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cord_uid</th>\n",
" <th>sha</th>\n",
" <th>source_x</th>\n",
" <th>title</th>\n",
" <th>doi</th>\n",
" <th>pmcid</th>\n",
" <th>pubmed_id</th>\n",
" <th>license</th>\n",
" <th>abstract</th>\n",
" <th>publish_time</th>\n",
" <th>authors</th>\n",
" <th>journal</th>\n",
" <th>mag_id</th>\n",
" <th>who_covidence_id</th>\n",
" <th>arxiv_id</th>\n",
" <th>pdf_json_files</th>\n",
" <th>pmc_json_files</th>\n",
" <th>url</th>\n",
" <th>s2_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ug7v899j</td>\n",
" <td>d1aafb70c066a2068b02786f8929fd9c900897fb</td>\n",
" <td>PMC</td>\n",
" <td>Clinical features of culture-proven Mycoplasma...</td>\n",
" <td>10.1186/1471-2334-1-6</td>\n",
" <td>PMC35282</td>\n",
" <td>11472636</td>\n",
" <td>no-cc</td>\n",
" <td>OBJECTIVE: This retrospective chart review des...</td>\n",
" <td>2001-07-04</td>\n",
" <td>Madani, Tariq A; Al-Ghamdi, Aisha A</td>\n",
" <td>BMC Infect Dis</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>document_parses/pdf_json/d1aafb70c066a2068b027...</td>\n",
" <td>document_parses/pmc_json/PMC35282.xml.json</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>02tnwd4m</td>\n",
" <td>6b0567729c2143a66d737eb0a2f63f2dce2e5a7d</td>\n",
" <td>PMC</td>\n",
" <td>Nitric oxide: a pro-inflammatory mediator in l...</td>\n",
" <td>10.1186/rr14</td>\n",
" <td>PMC59543</td>\n",
" <td>11667967</td>\n",
" <td>no-cc</td>\n",
" <td>Inflammatory diseases of the respiratory tract...</td>\n",
" <td>2000-08-15</td>\n",
" <td>Vliet, Albert van der; Eiserich, Jason P; Cros...</td>\n",
" <td>Respir Res</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>document_parses/pdf_json/6b0567729c2143a66d737...</td>\n",
" <td>document_parses/pmc_json/PMC59543.xml.json</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ejv2xln0</td>\n",
" <td>06ced00a5fc04215949aa72528f2eeaae1d58927</td>\n",
" <td>PMC</td>\n",
" <td>Surfactant protein-D and pulmonary host defense</td>\n",
" <td>10.1186/rr19</td>\n",
" <td>PMC59549</td>\n",
" <td>11667972</td>\n",
" <td>no-cc</td>\n",
" <td>Surfactant protein-D (SP-D) participates in th...</td>\n",
" <td>2000-08-25</td>\n",
" <td>Crouch, Erika C</td>\n",
" <td>Respir Res</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>document_parses/pdf_json/06ced00a5fc04215949aa...</td>\n",
" <td>document_parses/pmc_json/PMC59549.xml.json</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2b73a28n</td>\n",
" <td>348055649b6b8cf2b9a376498df9bf41f7123605</td>\n",
" <td>PMC</td>\n",
" <td>Role of endothelin-1 in lung disease</td>\n",
" <td>10.1186/rr44</td>\n",
" <td>PMC59574</td>\n",
" <td>11686871</td>\n",
" <td>no-cc</td>\n",
" <td>Endothelin-1 (ET-1) is a 21 amino acid peptide...</td>\n",
" <td>2001-02-22</td>\n",
" <td>Fagan, Karen A; McMurtry, Ivan F; Rodman, David M</td>\n",
" <td>Respir Res</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>document_parses/pdf_json/348055649b6b8cf2b9a37...</td>\n",
" <td>document_parses/pmc_json/PMC59574.xml.json</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9785vg6d</td>\n",
" <td>5f48792a5fa08bed9f56016f4981ae2ca6031b32</td>\n",
" <td>PMC</td>\n",
" <td>Gene expression in epithelial cells in respons...</td>\n",
" <td>10.1186/rr61</td>\n",
" <td>PMC59580</td>\n",
" <td>11686888</td>\n",
" <td>no-cc</td>\n",
" <td>Respiratory syncytial virus (RSV) and pneumoni...</td>\n",
" <td>2001-05-11</td>\n",
" <td>Domachowske, Joseph B; Bonville, Cynthia A; Ro...</td>\n",
" <td>Respir Res</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>document_parses/pdf_json/5f48792a5fa08bed9f560...</td>\n",
" <td>document_parses/pmc_json/PMC59580.xml.json</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cord_uid sha source_x \\\n",
"0 ug7v899j d1aafb70c066a2068b02786f8929fd9c900897fb PMC \n",
"1 02tnwd4m 6b0567729c2143a66d737eb0a2f63f2dce2e5a7d PMC \n",
"2 ejv2xln0 06ced00a5fc04215949aa72528f2eeaae1d58927 PMC \n",
"3 2b73a28n 348055649b6b8cf2b9a376498df9bf41f7123605 PMC \n",
"4 9785vg6d 5f48792a5fa08bed9f56016f4981ae2ca6031b32 PMC \n",
"\n",
" title doi \\\n",
"0 Clinical features of culture-proven Mycoplasma... 10.1186/1471-2334-1-6 \n",
"1 Nitric oxide: a pro-inflammatory mediator in l... 10.1186/rr14 \n",
"2 Surfactant protein-D and pulmonary host defense 10.1186/rr19 \n",
"3 Role of endothelin-1 in lung disease 10.1186/rr44 \n",
"4 Gene expression in epithelial cells in respons... 10.1186/rr61 \n",
"\n",
" pmcid pubmed_id license \\\n",
"0 PMC35282 11472636 no-cc \n",
"1 PMC59543 11667967 no-cc \n",
"2 PMC59549 11667972 no-cc \n",
"3 PMC59574 11686871 no-cc \n",
"4 PMC59580 11686888 no-cc \n",
"\n",
" abstract publish_time \\\n",
"0 OBJECTIVE: This retrospective chart review des... 2001-07-04 \n",
"1 Inflammatory diseases of the respiratory tract... 2000-08-15 \n",
"2 Surfactant protein-D (SP-D) participates in th... 2000-08-25 \n",
"3 Endothelin-1 (ET-1) is a 21 amino acid peptide... 2001-02-22 \n",
"4 Respiratory syncytial virus (RSV) and pneumoni... 2001-05-11 \n",
"\n",
" authors journal mag_id \\\n",
"0 Madani, Tariq A; Al-Ghamdi, Aisha A BMC Infect Dis NaN \n",
"1 Vliet, Albert van der; Eiserich, Jason P; Cros... Respir Res NaN \n",
"2 Crouch, Erika C Respir Res NaN \n",
"3 Fagan, Karen A; McMurtry, Ivan F; Rodman, David M Respir Res NaN \n",
"4 Domachowske, Joseph B; Bonville, Cynthia A; Ro... Respir Res NaN \n",
"\n",
" who_covidence_id arxiv_id \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" pdf_json_files \\\n",
"0 document_parses/pdf_json/d1aafb70c066a2068b027... \n",
"1 document_parses/pdf_json/6b0567729c2143a66d737... \n",
"2 document_parses/pdf_json/06ced00a5fc04215949aa... \n",
"3 document_parses/pdf_json/348055649b6b8cf2b9a37... \n",
"4 document_parses/pdf_json/5f48792a5fa08bed9f560... \n",
"\n",
" pmc_json_files \\\n",
"0 document_parses/pmc_json/PMC35282.xml.json \n",
"1 document_parses/pmc_json/PMC59543.xml.json \n",
"2 document_parses/pmc_json/PMC59549.xml.json \n",
"3 document_parses/pmc_json/PMC59574.xml.json \n",
"4 document_parses/pmc_json/PMC59580.xml.json \n",
"\n",
" url s2_id \n",
"0 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3... NaN \n",
"1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN \n",
"2 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN \n",
"3 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN \n",
"4 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN "
]
},
"metadata": {},
"execution_count": 146
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We will now convert publication date column to `datetime`, and plot the histogram to see the range of publication dates."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 147,
"source": [
"df['publish_time'] = pd.to_datetime(df['publish_time'])\r\n",
"df['publish_time'].hist()\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAkIAAAGdCAYAAAD+JxxnAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA+yElEQVR4nO3dfVRUh53/8Q9FGIHKBENgHMVIH5ZqINkWW0TTxVQBc0Sb427tKck0dF3WrkaXoCetsWeLbsXURbQL23TrumqDHtquIfVESwdNIuHwIGHhVNTVpPGxYSRNRvABhwne3x853F9GfMLoqNz36xzPydz7mfvw7dB8cu8dCTEMwxAAAIAFfeZOHwAAAMCdQhECAACWRRECAACWRRECAACWRRECAACWRRECAACWRRECAACWRRECAACWNexOH8Dd7tKlS3rvvfc0YsQIhYSE3OnDAQAAN8AwDJ09e1ZOp1Of+czVr/tQhK7jvffeU0JCwp0+DAAAcBNOnjypMWPGXHU9Reg6RowYIenjQUZHR9/ho7n1/H6/3G63srKyFBYWdqcPZ0hj1sHBnIOHWQcHc7453d3dSkhIMP89fjUUoevovx0WHR09ZItQZGSkoqOj+QG7zZh1cDDn4GHWwcGcP53rPdbCw9IAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyKEIAAMCyht3pAwAAALfGuB/uvNOHMGjHXph5R/fPFSEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZFCEAAGBZgypC48aNU0hIyIA/CxculCQZhqGioiI5nU5FRERo6tSpOnDgQMA2fD6fFi1apNjYWEVFRWn27Nk6depUQMbr9crlcslut8tut8vlcunMmTMBmRMnTmjWrFmKiopSbGysFi9erN7e3oDM/v37lZGRoYiICI0ePVorV66UYRiDOWUAADCEDaoINTc3q6Ojw/xTU1MjSfrWt74lSVqzZo1KS0tVXl6u5uZmORwOZWZm6uzZs+Y2CgoKVFVVpcrKStXV1encuXPKyclRX1+fmcnNzVVbW5uqq6tVXV2ttrY2uVwuc31fX59mzpyp8+fPq66uTpWVldq+fbuWLFliZrq7u5WZmSmn06nm5maVlZWppKREpaWlNzcpAAAw5AwbTPiBBx4IeP3CCy/o85//vDIyMmQYhtavX6/ly5drzpw5kqQtW7YoPj5e27Zt0/z589XV1aWNGzfqpZde0vTp0yVJFRUVSkhI0O7du5Wdna1Dhw6purpajY2NSktLkyRt2LBB6enpOnz4sJKSkuR2u3Xw4EGdPHlSTqdTkrR27Vrl5eVp1apVio6O1tatW3Xx4kVt3rxZNptNycnJOnLkiEpLS1VYWKiQkJBPPTwAAHBvG1QR+qTe3l5VVFSYpeLdd9+Vx+NRVlaWmbHZbMrIyFB9fb3mz5+vlpYW+f3+gIzT6VRycrLq6+uVnZ2thoYG2e12swRJ0qRJk2S321VfX6+kpCQ1NDQoOTnZLEGSlJ2dLZ/Pp5aWFj322GNqaGhQRkaGbDZbQGbZsmU6duyYEhMTr3hePp9PPp/PfN3d3S1J8vv98vv9Nzuuu1b/OQ3Fc7vbMOvgYM7Bw6yDYzBztoXee49/3K7Pz41u96aL0CuvvKIzZ84oLy9PkuTxeCRJ8fHxAbn4+HgdP37czISHhysmJmZApv/9Ho9HcXFxA/YXFxcXkLl8PzExMQoPDw/IjBs3bsB++tddrQitXr1aK1asGLDc7XYrMjLyiu8ZCvpvc+L2Y9bBwZyDh1kHx43Mec3XgnAgt9iuXbtuy3YvXLhwQ7mbLkIbN27U44
8/HnBVRtKAW06GYVz3NtTlmSvlb0Wm/0Hpax3PsmXLVFhYaL7u7u5WQkKCsrKyFB0dfc3zuBf5/X7V1NQoMzNTYWFhd/pwhjRmHRzMOXiYdXAMZs7JRX8I0lHdOu1F2bdlu/13dK7nporQ8ePHtXv3br388svmMofDIenjqy2jRo0yl3d2dppXYhwOh3p7e+X1egOuCnV2dmry5Mlm5vTp0wP2+f777wdsp6mpKWC91+uV3+8PyPRfHfrkfqSBV60+yWazBdxO6xcWFjakf9CH+vndTZh1cDDn4GHWwXEjc/b13XvPv96uz86Nbvem/h6hTZs2KS4uTjNnzjSXJSYmyuFwBFy66+3t1d69e82Sk5qaqrCwsIBMR0eH2tvbzUx6erq6urq0b98+M9PU1KSurq6ATHt7uzo6OsyM2+2WzWZTamqqmamtrQ34Sr3b7ZbT6RxwywwAAFjToIvQpUuXtGnTJj399NMaNuz/X1AKCQlRQUGBiouLVVVVpfb2duXl5SkyMlK5ubmSJLvdrnnz5mnJkiXas2ePWltb9dRTTyklJcX8Ftn48eM1Y8YM5efnq7GxUY2NjcrPz1dOTo6SkpIkSVlZWZowYYJcLpdaW1u1Z88eLV26VPn5+ebtq9zcXNlsNuXl5am9vV1VVVUqLi7mG2MAAMA06Ftju3fv1okTJ/T3f//3A9Y999xz6unp0YIFC+T1epWWlia3260RI0aYmXXr1mnYsGGaO3euenp6NG3aNG3evFmhoaFmZuvWrVq8eLH57bLZs2ervLzcXB8aGqqdO3dqwYIFmjJliiIiIpSbm6uSkhIzY7fbVVNTo4ULF2rixImKiYlRYWFhwPM/AADA2gZdhLKysq76tzOHhISoqKhIRUVFV33/8OHDVVZWprKysqtmRo4cqYqKimsex9ixY/Xqq69eM5OSkqLa2tprZgAAgHXxu8YAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlUYQAAIBlDboI/fnPf9ZTTz2l+++/X5GRkfrrv/5rtbS0mOsNw1BRUZGcTqciIiI0depUHThwIGAbPp9PixYtUmxsrKKiojR79mydOnUqIOP1euVyuWS322W32+VyuXTmzJmAzIkTJzRr1ixFRUUpNjZWixcvVm9vb0Bm//79ysjIUEREhEaPHq2VK1fKMIzBnjYAABiCBlWEvF6vpkyZorCwMP3+97/XwYMHtXbtWt13331mZs2aNSotLVV5ebmam5vlcDiUmZmps2fPmpmCggJVVVWpsrJSdXV1OnfunHJyctTX12dmcnNz1dbWpurqalVXV6utrU0ul8tc39fXp5kzZ+r8+fOqq6tTZWWltm/friVLlpiZ7u5uZWZmyul0qrm5WWVlZSopKVFpaenNzAoAAAwxwwYT/ulPf6qEhARt2rTJXDZu3Djznw3D0Pr167V8+XLNmTNHkrRlyxbFx8dr27Ztmj9/vrq6urRx40a99NJLmj59uiSpoqJCCQkJ2r17t7Kzs3Xo0CFVV1ersbFRaWlpkqQNGzYoPT1dhw8fVlJSktxutw4ePKiTJ0/K6XRKktauXau8vDytWrVK0dHR2rp1qy5evKjNmzfLZrMpOTlZR44cUWlpqQoLCxUSEvKphgcAAO5tgypCO3bsUHZ2tr71rW9p7969Gj16tBYsWKD8/HxJ0tGjR+XxeJSVlWW+x2azKSMjQ/X19Zo/f75aWlrk9/sDMk6nU8nJyaqvr1d2drYaGhpkt9vNEi
RJkyZNkt1uV319vZKSktTQ0KDk5GSzBElSdna2fD6fWlpa9Nhjj6mhoUEZGRmy2WwBmWXLlunYsWNKTEwccI4+n08
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<svg height=\"297.190125pt\" version=\"1.1\" viewBox=\"0 0 416.695 297.190125\" width=\"416.695pt\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n <metadata>\r\n <rdf:RDF xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r\n <cc:Work>\r\n <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\r\n <dc:date>2021-08-30T17:44:57.208677</dc:date>\r\n <dc:format>image/svg+xml</dc:format>\r\n <dc:creator>\r\n <cc:Agent>\r\n <dc:title>Matplotlib v3.4.2, https://matplotlib.org/</dc:title>\r\n </cc:Agent>\r\n </dc:creator>\r\n </cc:Work>\r\n </rdf:RDF>\r\n </metadata>\r\n <defs>\r\n <style type=\"text/css\">*{stroke-linecap:butt;stroke-linejoin:round;}</style>\r\n </defs>\r\n <g id=\"figure_1\">\r\n <g id=\"patch_1\">\r\n <path d=\"M 0 297.190125 \r\nL 416.695 297.190125 \r\nL 416.695 0 \r\nL 0 0 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"axes_1\">\r\n <g id=\"patch_2\">\r\n <path d=\"M 52.375 273.312 \r\nL 409.495 273.312 \r\nL 409.495 7.2 \r\nL 52.375 7.2 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"patch_3\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 68.607727 273.312 \r\nL 101.073182 273.312 \r\nL 101.073182 273.31166 \r\nL 68.607727 273.31166 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_4\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 101.073182 273.312 \r\nL 133.538636 273.312 \r\nL 133.538636 273.310302 \r\nL 101.073182 273.310302 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_5\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 133.538636 273.312 \r\nL 166.004091 273.312 \r\nL 166.004091 273.309284 \r\nL 133.538636 273.309284 \r\nz\r\n\" 
style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_6\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 166.004091 273.312 \r\nL 198.469545 273.312 \r\nL 198.469545 273.306907 \r\nL 166.004091 273.306907 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_7\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 198.469545 273.312 \r\nL 230.935 273.312 \r\nL 230.935 273.307586 \r\nL 198.469545 273.307586 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_8\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 230.935 273.312 \r\nL 263.400455 273.312 \r\nL 263.400455 273.306567 \r\nL 230.935 273.306567 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_9\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 263.400455 273.312 \r\nL 295.865909 273.312 \r\nL 295.865909 273.28857 \r\nL 263.400455 273.28857 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_10\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 295.865909 273.312 \r\nL 328.331364 273.312 \r\nL 328.331364 272.90385 \r\nL 295.865909 272.90385 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_11\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 328.331364 273.312 \r\nL 360.796818 273.312 \r\nL 360.796818 269.774475 \r\nL 328.331364 269.774475 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_12\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 360.796818 273.312 \r\nL 393.262273 273.312 \r\nL 393.262273 19.872 \r\nL 360.796818 19.872 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"matplotlib.axis_1\">\r\n <g id=\"xtick_1\">\r\n <g id=\"line2d_1\">\r\n <path clip-path=\"url(#p39d176d7b8)\" d=\"M 93.039439 273.312 \r\nL 93.039439 7.2 \r\n\" style=\"fill:none;stroke:#b0b0b0;stroke-linecap:square;stroke-width:0.8;\"/>\r\n </g>\r\n <g id=\"line2d_2\">\r\n <defs>\r\n <path d=\"M 0 0 \r\nL 0 3.5 \r\n\" id=\"m8aecb0884d\" style=\"stroke:#000000;stroke-width:0.8;\"/>\r\n </defs>\r
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Interestingly, there are coronavirus-related papers that date back to 1880!"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Structured Data Extraction\r\n",
"\r\n",
"Let's see what kind of information we can easily extract from abstracts. One thing we might be interested in is to see which treatment strategies exist, and how they evolved over time. To begin with, we can manually compile the list of possible medications used to treat COVID, and also the list of diagnoses. We then go over them and search corresponding terms in the abstracts of papers."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 148,
"source": [
"medications = [\r\n",
" 'hydroxychloroquine', 'chloroquine', 'tocilizumab', 'remdesivir', 'azithromycin', \r\n",
" 'lopinavir', 'ritonavir', 'dexamethasone', 'heparin', 'favipiravir', 'methylprednisolone']\r\n",
"diagnosis = [\r\n",
" 'covid','sars','pneumonia','infection','diabetes','coronavirus','death'\r\n",
"]\r\n",
"\r\n",
"for m in medications:\r\n",
" print(f\" + Processing medication: {m}\")\r\n",
" df[m] = df['abstract'].apply(lambda x: str(x).lower().count(' '+m))\r\n",
" \r\n",
"for m in diagnosis:\r\n",
" print(f\" + Processing diagnosis: {m}\")\r\n",
" df[m] = df['abstract'].apply(lambda x: str(x).lower().count(' '+m))"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" + Processing medication: hydroxychloroquine\n",
" + Processing medication: chloroquine\n",
" + Processing medication: tocilizumab\n",
" + Processing medication: remdesivir\n",
" + Processing medication: azithromycin\n",
" + Processing medication: lopinavir\n",
" + Processing medication: ritonavir\n",
" + Processing medication: dexamethasone\n",
" + Processing medication: heparin\n",
" + Processing medication: favipiravir\n",
" + Processing medication: methylprednisolone\n",
" + Processing diagnosis: covid\n",
" + Processing diagnosis: sars\n",
" + Processing diagnosis: pneumonia\n",
" + Processing diagnosis: infection\n",
" + Processing diagnosis: diabetes\n",
" + Processing diagnosis: coronavirus\n",
" + Processing diagnosis: death\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We have added a bunch of columns to our dataframe that contain number of times a given medicine/diagnosis is present in the abstract. \r\n",
"\r\n",
"> **Note** that we add space to the beginning of the word when looking for a substring. If we do not do that, we might get wrong results, because *chloroquine* would also be found inside substring *hydroxychloroquine*. Also, we force conversion of abstacts column to `str` to get rid of an error - try removing `str` and see what happens.\r\n",
"\r\n",
"To make working with data easier, we can extract the sub-frame with only medication counts, and compute the accumulated number of occurrences. This gives is the most popular medication: "
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 149,
"source": [
"dfm = df[medications]\r\n",
"dfm = dfm.sum().reset_index().rename(columns={ 'index' : 'Name', 0 : 'Count'})\r\n",
"dfm.sort_values('Count',ascending=False)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>hydroxychloroquine</td>\n",
" <td>9806</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>remdesivir</td>\n",
" <td>7861</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>tocilizumab</td>\n",
" <td>6118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>chloroquine</td>\n",
" <td>4578</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>heparin</td>\n",
" <td>4161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>lopinavir</td>\n",
" <td>3811</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>azithromycin</td>\n",
" <td>3585</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>dexamethasone</td>\n",
" <td>3340</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>favipiravir</td>\n",
" <td>2439</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>methylprednisolone</td>\n",
" <td>1600</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>ritonavir</td>\n",
" <td>948</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Count\n",
"0 hydroxychloroquine 9806\n",
"3 remdesivir 7861\n",
"2 tocilizumab 6118\n",
"1 chloroquine 4578\n",
"8 heparin 4161\n",
"5 lopinavir 3811\n",
"4 azithromycin 3585\n",
"7 dexamethasone 3340\n",
"9 favipiravir 2439\n",
"10 methylprednisolone 1600\n",
"6 ritonavir 948"
]
},
"metadata": {},
"execution_count": 149
}
],
"metadata": {}
},
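{
"cell_type": "markdown",
"source": [
"The counts in this table depend on how terms are matched inside each abstract. The sketch below uses a made-up sentence (not taken from the dataset) to compare plain substring counting, the leading-space approach used in this notebook, and a regular-expression word boundary. The word-boundary variant is an alternative shown for comparison only and is not what the notebook does. The last lines illustrate why the abstract is wrapped in `str`: a missing abstract is a floating-point NaN, which has no `lower` method.\r\n",
"\r\n",
"```python\r\n",
"import re\r\n",
"\r\n",
"# Illustrative sentence, not a real abstract.\r\n",
"abstract = 'Hydroxychloroquine and chloroquine were compared; (chloroquine) was cheaper.'\r\n",
"text = abstract.lower()\r\n",
"\r\n",
"# Plain substring counting also matches inside 'hydroxychloroquine'.\r\n",
"naive = text.count('chloroquine')\r\n",
"\r\n",
"# Leading-space approach: skips 'hydroxychloroquine', but also misses\r\n",
"# terms that start the text or follow punctuation such as '('.\r\n",
"spaced = text.count(' chloroquine')\r\n",
"\r\n",
"# Alternative for comparison: a regular-expression word boundary.\r\n",
"bounded = len(re.findall(r'\\bchloroquine\\b', text))\r\n",
"\r\n",
"print(naive, spaced, bounded)  # 3 1 2\r\n",
"\r\n",
"# A missing abstract is NaN (a float), so calling .lower() on it fails.\r\n",
"missing = float('nan')\r\n",
"try:\r\n",
"    missing.lower()\r\n",
"except AttributeError as err:\r\n",
"    print('without str():', err)\r\n",
"\r\n",
"# Converting to str first makes the count well defined (here: 0).\r\n",
"missing_count = str(missing).lower().count(' chloroquine')\r\n",
"print(missing_count)\r\n",
"```\r\n",
"\r\n",
"The leading-space approach is simple and fast, at the cost of undercounting terms that follow punctuation or start the abstract."
],
"metadata": {}
},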
{
"cell_type": "code",
"execution_count": 150,
"source": [
"dfm.set_index('Name').plot(kind='bar')\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjoAAAIsCAYAAAD7xwNGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAABqFklEQVR4nO3dd1hU18I18DX0ItKUpqhgR8B+FY29Rg1RE0tQFGxJNCCWWG5ijzW2RGNLolijxmhiohc1WCJ2UUQUsQMWJCqCilL394cf8zoOICbDnDPH9Xueea6c2cCCS4bFOfvsrRJCCBAREREpkJHUAYiIiIhKC4sOERERKRaLDhERESkWiw4REREpFosOERERKRaLDhERESkWiw4REREplonUAaSUn5+PO3fuwMbGBiqVSuo4REREVAJCCDx+/Bhubm4wMir+nM1bXXTu3LkDd3d3qWMQERHRP5CcnIyKFSsWO+atLjo2NjYAXnyjypYtK3EaIiIiKomMjAy4u7urf48X560uOgWXq8qWLcuiQ0REZGBKMu2Ek5GJiIhIsVh0iIiISLFYdIiIiEix3uo5OkRERP9UXl4ecnJypI6hSKampjA2NtbJx3rjovPXX3/h66+/RnR0NO7evYsdO3age/fu6ueFEJg2bRpWrVqFtLQ0NGnSBN999x3q1KmjHpOVlYWxY8fip59+wrNnz9CuXTssW7ZM4xaxtLQ0hIaGYufOnQAAf39/LFmyBHZ2duoxSUlJGDFiBPbv3w9LS0sEBARg/vz5MDMz+wffCiIiotcTQiAlJQWPHj2SOoqi2dnZwcXF5V+vc/fGRefp06eoW7cugoOD8cEHH2g9P2/ePCxcuBDh4eGoUaMGvvrqK3To0AEJCQnq28DCwsLw+++/Y/PmzXB0dMSYMWPQrVs3REdHqxtcQEAAbt26hYiICADAsGHDEBgYiN9//x3AiybdtWtXlC9fHlFRUXjw4AEGDhwIIQSWLFnyj78hRERExSkoOU5OTrCysuKCszomhEBmZiZSU1MBAK6urv/6A/5jAMSOHTvUb+fn5wsXFxcxZ84c9bHnz58LW1tbsWLFCiGEEI8ePRKmpqZi8+bN6jG3b98WRkZGIiIiQgghxMWLFwUAcfz4cfWYY8eOCQDi0qVLQgghdu/eLYyMjMTt27fVY3766Sdhbm4u0tPTS5Q/PT1dACjxeCIiervl5uaKixcvivv370sdRfHu378vLl68KHJzc7Wee5Pf3zqdjHzjxg2kpKSgY8eO6mPm5uZo1aoVjh49CgCIjo5GTk6Oxhg3Nzd4e3urxxw7dgy2trZo0qSJekzTpk1ha2urMcbb2xtubm7qMZ06dUJWVhaio6MLzZeVlYWMjAyNBxERUUkVzMmxsrKSOInyFXyP/+08KJ0WnZSUFACAs7OzxnFnZ2f1cykpKTAzM4O9vX2xY5ycnLQ+vpOTk8aYVz+Pvb09zMzM1GNeNXv2bNja2qof3P6BiIj+CV6uKn26+h6Xyu3lr4YTQrw28KtjChv/T8a8bOLEiUhPT1c/kpOTi81EREREhk2nRcfFxQUAtM6opKamqs++uLi4IDs7G2lpacWOuXfvntbH//vvvzXGvPp50tLSkJOTo3Wmp4C5ubl6uwdu+0BERKR8Ol1Hx8PDAy4uLti3bx/q168PAMjOzsahQ4cwd+5cAEDDhg1hamqKffv2oXfv3gCAu3fvIi4uDvPmzQMA+Pn5IT09HSdPnsR//vMfAMCJEyeQnp6OZs2aqcfMnDkTd+/eVc/I3rt3L8zNzdGwYUNdfllERETFqjJhl14/3805XfX6+QzZG5/RefLkCWJiYhATEwPgxQTkmJgYJCUlQaVSISwsDLNmzcKOHTsQFxeHoKAgWFlZISAgAABga2uLwYMHY8yYMYiMjMTZs2fRv39/+Pj4oH379gCA2rVro3Pnzhg6dCiOHz+O48ePY+jQoejWrRtq1qwJAOjYsSO8vL
wQGBiIs2fPIjIyEmPHjsXQoUN5poaIiKgQKSkpCAkJgaenJ8zNzeHu7o733nsPkZGRes2hUqnw66+/6uVzvfEZndOnT6NNmzbqt0ePHg0AGDhwIMLDwzFu3Dg8e/YMw4cPVy8YuHfvXo2t1BctWgQTExP07t1bvWBgeHi4xiqIGzduRGhoqPruLH9/fyxdulT9vLGxMXbt2oXhw4ejefPmGgsGEhERkaabN2+iefPmsLOzw7x58+Dr64ucnBzs2bMHI0aMwKVLl6SOWCpUQgghdQipZGRkwNbWFunp6TwLREREr/X8+XPcuHEDHh4esLCwUB83hEtXXbp0QWxsLBISEmBtba3x3KNHj2BnZ4ekpCSEhIQgMjISRkZG6Ny5M5YsWaKe+xoUFIRHjx5pnI0JCwtDTEwMDh48CABo3bo1fH19YWFhgR9++AFmZmb45JNPMHXqVABAlSpVkJiYqH7/ypUr4+bNm1p5i/peA2/2+5t7Xb2B0vxB5vVWIiIqLQ8fPkRERARmzpypVXKAF9stCCHQvXt3WFtb49ChQ8jNzcXw4cPRp08fdYkpqbVr12L06NE4ceIEjh07hqCgIDRv3hwdOnTAqVOn4OTkhDVr1qBz584629OqKCw6RERECnf16lUIIVCrVq0ix/z555+IjY3FjRs31OvMrV+/HnXq1MGpU6fQuHHjEn8+X19fTJkyBQBQvXp1LF26FJGRkejQoQPKly8P4P/2siptpbKODhEREclHwSyV4ta0i4+Ph7u7u8Ziul5eXrCzs0N8fPwbfT5fX1+Nt11dXdV7V+kbiw4REZHCVa9eHSqVqtjCUtSCuy8fNzIywqtTewvbosHU1FTjbZVKhfz8/H8S/V9j0SEiIlI4BwcHdOrUCd999x2ePn2q9fyjR4/g5eWFpKQkjV0DLl68iPT0dNSuXRsAUL58edy9e1fjfQuWm3kTpqamyMvLe+P3+ydYdIiIiN4Cy5YtQ15eHv7zn//gl19+wZUrVxAfH49vv/0Wfn5+aN++PXx9fdGvXz+cOXMGJ0+exIABA9CqVSs0atQIANC2bVucPn0a69atw5UrVzBlyhTExcW9cZYqVaogMjISKSkpWjsl6BonIxMREf1LhnDnrIeHB86cOYOZM2dizJgxuHv3LsqXL4+GDRti+fLl6kX8QkJC0LJlS43bywt06tQJkyZNwrhx4/D8+XMMGjQIAwYMwPnz598oy4IFCzB69Gh8//33qFChQqG3l+sK19F5g3V0eHs5EdHbrbi1XUi3dLWODi9dERERkWKx6BAREZFisegQERGRYrHoEBERvaG3eHqr3ujqe8yiQ0REVEIFC+FlZmZKnET5Cr7Hry4++KZ4ezkREVEJGRsbw87OTr2dgZWVVbHbKtCbE0IgMzMTqampsLOz+9ebfrLoEBERvYGCjSil2rvpbaGrTT9ZdIiIiN6ASqWCq6srnJycCt3nif49U1PTf30mpwCLDhER0T9gbGyss1/GVHo4GZmIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFItFh4iIiBSLRYeIiIgUi0WHiIiIFEvnRSc3NxdffvklPDw8YGlpCU9PT0yfPh35+fnqMUIITJ06FW5ubrC0tETr1q1x4cIFjY+TlZWFkJAQlCtXDt
bW1vD398etW7c0xqSlpSEwMBC2trawtbVFYGAgHj16pOsviYiIiAyUzovO3LlzsWLFCixduhTx8fGYN28evv76ayx
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<svg height=\"400.60575pt\" version=\"1.1\" viewBox=\"0 0 410.3325 400.60575\" width=\"410.3325pt\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n <metadata>\r\n <rdf:RDF xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r\n <cc:Work>\r\n <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\r\n <dc:date>2021-08-30T17:46:30.556876</dc:date>\r\n <dc:format>image/svg+xml</dc:format>\r\n <dc:creator>\r\n <cc:Agent>\r\n <dc:title>Matplotlib v3.4.2, https://matplotlib.org/</dc:title>\r\n </cc:Agent>\r\n </dc:creator>\r\n </cc:Work>\r\n </rdf:RDF>\r\n </metadata>\r\n <defs>\r\n <style type=\"text/css\">*{stroke-linecap:butt;stroke-linejoin:round;}</style>\r\n </defs>\r\n <g id=\"figure_1\">\r\n <g id=\"patch_1\">\r\n <path d=\"M 0 400.60575 \r\nL 410.3325 400.60575 \r\nL 410.3325 -0 \r\nL 0 -0 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"axes_1\">\r\n <g id=\"patch_2\">\r\n <path d=\"M 46.0125 273.312 \r\nL 403.1325 273.312 \r\nL 403.1325 7.2 \r\nL 46.0125 7.2 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"patch_3\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 54.128864 273.312 \r\nL 70.361591 273.312 \r\nL 70.361591 19.872 \r\nL 54.128864 19.872 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_4\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 86.594318 273.312 \r\nL 102.827045 273.312 \r\nL 102.827045 154.991755 \r\nL 86.594318 154.991755 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_5\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 119.059773 273.312 \r\nL 135.2925 273.312 \r\nL 135.2925 115.189838 \r\nL 119.059773 115.189838 \r\nz\r\n\" 
style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_6\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 151.525227 273.312 \r\nL 167.757955 273.312 \r\nL 167.757955 70.141305 \r\nL 151.525227 70.141305 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_7\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 183.990682 273.312 \r\nL 200.223409 273.312 \r\nL 200.223409 180.656238 \r\nL 183.990682 180.656238 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_8\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 216.456136 273.312 \r\nL 232.688864 273.312 \r\nL 232.688864 174.815178 \r\nL 216.456136 174.815178 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_9\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 248.921591 273.312 \r\nL 265.154318 273.312 \r\nL 265.154318 248.81056 \r\nL 248.921591 248.81056 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_10\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 281.387045 273.312 \r\nL 297.619773 273.312 \r\nL 297.619773 186.988361 \r\nL 281.387045 186.988361 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_11\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 313.8525 273.312 \r\nL 330.085227 273.312 \r\nL 330.085227 165.769287 \r\nL 313.8525 165.769287 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_12\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 346.317955 273.312 \r\nL 362.550682 273.312 \r\nL 362.550682 210.275068 \r\nL 346.317955 210.275068 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"patch_13\">\r\n <path clip-path=\"url(#p1c81dc0e23)\" d=\"M 378.783409 273.312 \r\nL 395.016136 273.312 \r\nL 395.016136 231.959359 \r\nL 378.783409 231.959359 \r\nz\r\n\" style=\"fill:#1f77b4;\"/>\r\n </g>\r\n <g id=\"matplotlib.axis_1\">\r\n <g id=\"xtick_1\">\r\n <g id=\"line2d_1\">\r\n <defs>\r\n <path d=\"M 0 0 \r\nL 0 3.5 \r\n\" id=\"m2263dfa67e\" style=\"stroke:#000000;stroke-width:0.8;\"/>\
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Looking for Trends in Treatment Strategy\r\n",
"\r\n",
"In the example above we have `sum`ed all values, but we can also do the same on a monthly basis:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 151,
"source": [
"dfm = df[['publish_time']+medications].set_index('publish_time')\r\n",
"dfm = dfm[(dfm.index>=\"2020-01-01\") & (dfm.index<=\"2021-07-31\")]\r\n",
"dfmt = dfm.groupby([dfm.index.year,dfm.index.month]).sum()\r\n",
"dfmt"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>hydroxychloroquine</th>\n",
" <th>chloroquine</th>\n",
" <th>tocilizumab</th>\n",
" <th>remdesivir</th>\n",
" <th>azithromycin</th>\n",
" <th>lopinavir</th>\n",
" <th>ritonavir</th>\n",
" <th>dexamethasone</th>\n",
" <th>heparin</th>\n",
" <th>favipiravir</th>\n",
" <th>methylprednisolone</th>\n",
" </tr>\n",
" <tr>\n",
" <th>publish_time</th>\n",
" <th>publish_time</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"12\" valign=\"top\">2020</th>\n",
" <th>1</th>\n",
" <td>3672</td>\n",
" <td>1773</td>\n",
" <td>1779</td>\n",
" <td>2134</td>\n",
" <td>1173</td>\n",
" <td>1430</td>\n",
" <td>370</td>\n",
" <td>561</td>\n",
" <td>984</td>\n",
" <td>666</td>\n",
" <td>331</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>18</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>12</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>45</td>\n",
" <td>72</td>\n",
" <td>5</td>\n",
" <td>27</td>\n",
" <td>12</td>\n",
" <td>52</td>\n",
" <td>16</td>\n",
" <td>3</td>\n",
" <td>21</td>\n",
" <td>11</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>188</td>\n",
" <td>238</td>\n",
" <td>50</td>\n",
" <td>124</td>\n",
" <td>68</td>\n",
" <td>113</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>77</td>\n",
" <td>48</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>459</td>\n",
" <td>191</td>\n",
" <td>158</td>\n",
" <td>209</td>\n",
" <td>132</td>\n",
" <td>135</td>\n",
" <td>41</td>\n",
" <td>12</td>\n",
" <td>92</td>\n",
" <td>48</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>381</td>\n",
" <td>149</td>\n",
" <td>243</td>\n",
" <td>186</td>\n",
" <td>110</td>\n",
" <td>132</td>\n",
" <td>18</td>\n",
" <td>48</td>\n",
" <td>84</td>\n",
" <td>30</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>381</td>\n",
" <td>178</td>\n",
" <td>202</td>\n",
" <td>165</td>\n",
" <td>108</td>\n",
" <td>138</td>\n",
" <td>29</td>\n",
" <td>58</td>\n",
" <td>117</td>\n",
" <td>56</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>307</td>\n",
" <td>115</td>\n",
" <td>172</td>\n",
" <td>165</td>\n",
" <td>145</td>\n",
" <td>91</td>\n",
" <td>24</td>\n",
" <td>56</td>\n",
" <td>95</td>\n",
" <td>45</td>\n",
" <td>35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>319</td>\n",
" <td>123</td>\n",
" <td>185</td>\n",
" <td>190</td>\n",
" <td>91</td>\n",
" <td>98</td>\n",
" <td>28</td>\n",
" <td>90</td>\n",
" <td>111</td>\n",
" <td>46</td>\n",
" <td>26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>319</td>\n",
" <td>96</td>\n",
" <td>212</td>\n",
" <td>227</td>\n",
" <td>72</td>\n",
" <td>127</td>\n",
" <td>39</td>\n",
" <td>97</td>\n",
" <td>117</td>\n",
" <td>81</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>272</td>\n",
" <td>66</td>\n",
" <td>170</td>\n",
" <td>197</td>\n",
" <td>79</td>\n",
" <td>104</td>\n",
" <td>27</td>\n",
" <td>77</td>\n",
" <td>124</td>\n",
" <td>77</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>255</td>\n",
" <td>102</td>\n",
" <td>229</td>\n",
" <td>271</td>\n",
" <td>98</td>\n",
" <td>76</td>\n",
" <td>31</td>\n",
" <td>76</td>\n",
" <td>87</td>\n",
" <td>56</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"7\" valign=\"top\">2021</th>\n",
" <th>1</th>\n",
" <td>2191</td>\n",
" <td>780</td>\n",
" <td>1787</td>\n",
" <td>2523</td>\n",
" <td>892</td>\n",
" <td>841</td>\n",
" <td>198</td>\n",
" <td>1208</td>\n",
" <td>1096</td>\n",
" <td>805</td>\n",
" <td>474</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>163</td>\n",
" <td>66</td>\n",
" <td>184</td>\n",
" <td>173</td>\n",
" <td>85</td>\n",
" <td>76</td>\n",
" <td>9</td>\n",
" <td>86</td>\n",
" <td>61</td>\n",
" <td>52</td>\n",
" <td>63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>172</td>\n",
" <td>85</td>\n",
" <td>190</td>\n",
" <td>295</td>\n",
" <td>87</td>\n",
" <td>100</td>\n",
" <td>17</td>\n",
" <td>150</td>\n",
" <td>82</td>\n",
" <td>85</td>\n",
" <td>36</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>198</td>\n",
" <td>70</td>\n",
" <td>125</td>\n",
" <td>161</td>\n",
" <td>83</td>\n",
" <td>60</td>\n",
" <td>13</td>\n",
" <td>130</td>\n",
" <td>144</td>\n",
" <td>60</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>141</td>\n",
" <td>55</td>\n",
" <td>138</td>\n",
" <td>179</td>\n",
" <td>69</td>\n",
" <td>55</td>\n",
" <td>21</td>\n",
" <td>108</td>\n",
" <td>141</td>\n",
" <td>106</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>144</td>\n",
" <td>29</td>\n",
" <td>138</td>\n",
" <td>182</td>\n",
" <td>75</td>\n",
" <td>41</td>\n",
" <td>12</td>\n",
" <td>128</td>\n",
" <td>116</td>\n",
" <td>66</td>\n",
" <td>42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>112</td>\n",
" <td>49</td>\n",
" <td>96</td>\n",
" <td>270</td>\n",
" <td>64</td>\n",
" <td>59</td>\n",
" <td>5</td>\n",
" <td>169</td>\n",
" <td>106</td>\n",
" <td>44</td>\n",
" <td>50</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hydroxychloroquine chloroquine tocilizumab \\\n",
"publish_time publish_time \n",
"2020 1 3672 1773 1779 \n",
" 2 0 19 0 \n",
" 3 45 72 5 \n",
" 4 188 238 50 \n",
" 5 459 191 158 \n",
" 6 381 149 243 \n",
" 7 381 178 202 \n",
" 8 307 115 172 \n",
" 9 319 123 185 \n",
" 10 319 96 212 \n",
" 11 272 66 170 \n",
" 12 255 102 229 \n",
"2021 1 2191 780 1787 \n",
" 2 163 66 184 \n",
" 3 172 85 190 \n",
" 4 198 70 125 \n",
" 5 141 55 138 \n",
" 6 144 29 138 \n",
" 7 112 49 96 \n",
"\n",
" remdesivir azithromycin lopinavir ritonavir \\\n",
"publish_time publish_time \n",
"2020 1 2134 1173 1430 370 \n",
" 2 3 3 18 11 \n",
" 3 27 12 52 16 \n",
" 4 124 68 113 13 \n",
" 5 209 132 135 41 \n",
" 6 186 110 132 18 \n",
" 7 165 108 138 29 \n",
" 8 165 145 91 24 \n",
" 9 190 91 98 28 \n",
" 10 227 72 127 39 \n",
" 11 197 79 104 27 \n",
" 12 271 98 76 31 \n",
"2021 1 2523 892 841 198 \n",
" 2 173 85 76 9 \n",
" 3 295 87 100 17 \n",
" 4 161 83 60 13 \n",
" 5 179 69 55 21 \n",
" 6 182 75 41 12 \n",
" 7 270 64 59 5 \n",
"\n",
" dexamethasone heparin favipiravir \\\n",
"publish_time publish_time \n",
"2020 1 561 984 666 \n",
" 2 1 3 12 \n",
" 3 3 21 11 \n",
" 4 14 77 48 \n",
" 5 12 92 48 \n",
" 6 48 84 30 \n",
" 7 58 117 56 \n",
" 8 56 95 45 \n",
" 9 90 111 46 \n",
" 10 97 117 81 \n",
" 11 77 124 77 \n",
" 12 76 87 56 \n",
"2021 1 1208 1096 805 \n",
" 2 86 61 52 \n",
" 3 150 82 85 \n",
" 4 130 144 60 \n",
" 5 108 141 106 \n",
" 6 128 116 66 \n",
" 7 169 106 44 \n",
"\n",
" methylprednisolone \n",
"publish_time publish_time \n",
"2020 1 331 \n",
" 2 19 \n",
" 3 14 \n",
" 4 14 \n",
" 5 21 \n",
" 6 29 \n",
" 7 27 \n",
" 8 35 \n",
" 9 26 \n",
" 10 37 \n",
" 11 44 \n",
" 12 59 \n",
"2021 1 474 \n",
" 2 63 \n",
" 3 36 \n",
" 4 37 \n",
" 5 44 \n",
" 6 42 \n",
" 7 50 "
]
},
"metadata": {},
"execution_count": 151
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"This gives us a good picture of treatment strategies. Let's visualize it!"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 152,
"source": [
"dfmt.plot()\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjEAAAGxCAYAAACTN+exAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAAD/aElEQVR4nOzdd3hUVfrA8e+dlkmdNNIkJKGDFCkqAVaigqDSRLGAaBRhLaCsAnaN2BUL4orKKrCIsusK/NRlg4CA0ouiCJGaGJCEQEghddr9/TGZm0waqQT0/TzPPCYz5957bogzb855z3sUVVVVhBBCCCEuMLqW7oAQQgghRENIECOEEEKIC5IEMUIIIYS4IEkQI4QQQogLkgQxQgghhLggSRAjhBBCiAuSBDFCCCGEuCBJECOEEEKIC5KhpTvQXJxOJ8ePH8ff3x9FUVq6O0IIIYSoA1VVOXPmDFFRUeh0tY+1/GGDmOPHjxMdHd3S3RBCCCFEAxw9epTWrVvX2uYPG8T4+/sDrh9CQEBAC/dGCCGEEHWRn59PdHS09jlemz9sEOOeQgoICJAgRgghhLjA1CUVRBJ7hRBCCHFBkiBGCCGEEBckCWKEEEIIcUH6w+bECCHEhcrhcGCz2Vq6G0I0C6PRiF6vb5JzSRAjhBDnCVVVyczMJDc3t6W7IkSzCgwMJCIiotF13CSIEUKI84Q7gAkLC8PHx0cKdYo/HFVVKSoqIisrC4DIyMhGnU+CGCGEOA84HA4tgAkJCWnp7gjRbLy9vQHIysoiLCysUVNLktgrhBDnAXcOjI+PTwv3RIjm5/49b2zulwQxQghxHpEpJPFn0FS/5xLECCGEaJSEhASmTZt23p+zqaWlpaEoCrt3766xzcKFCwkMDDxnfWqsC62/EsQIIYQQAoBbbrmFAwcOtHQ36kwSe4UQQlxwrFYrJpOppbtxTthsNoxG4zm5lre3t5Z4eyGQkZh6yiuysfVINjvTTrd0V4QQ4rzhdDqZOXMmwcHBREREkJSUBMDdd9/N8OHDPdra7XYiIiL4+OOPASgsLOSOO+7Az8+PyMhI3njjjSrnj42N5YUXXiAxMRGLxcKkSZMA+OKLL7j44ovx8vIiNjbW49hZs2YRFRVFdna29tzIkSO54oorcDqddeqb0+nk1VdfpX379nh5edGmTRtefPFFj2OOHDnClVdeiY+PDz179mTLli21/qzmzZtHu3btMJlMdOrUicWLF3u8rigK77//PqNGjcLX15cXXnihTscdPHiQK664ArPZTNeuXVm9ejWKorBixQoA1q9fj6IoHnWIdu/ejaIopKWlAVWnk5KSkrjkkktYvHgxsbGxWCwWbr31Vs6cOaO1UVWV1157jbZt2+Lt7U3Pnj35z3/+U+vPoMmof1B5eXkqoObl5TXpeVfvzVRjHv1aHTH3+yY9rxDiz624uFjdt2+fWlxc3NJdqbdBgwapAQEBalJSknrgwAF10aJFqqIo6jfffKNu2rRJ1ev16vHjx7X2//d//6f6+vqqZ86cUVVVVe+77z61devW6jfffKP+/PPP6vDhw1U/Pz/1oYce0o6JiYlRAwIC1Ndff109ePCgevDgQXXnzp2qTqdTZ82ape7fv19dsGCB6u3trS5YsEBVVVW12+1qfHy8Onr0aFVVVXXevHmqxWJR09LSVFVV69S3mTNnqkFBQerChQvVQ4cOqd9//706f/58VVVVNTU1VQXUzp07q19//bW6f/9+9aabblJjYmJUm82mqqqqLliwQLVYLNr5ly1bphqNRvXvf/+7un//fvWNN95Q9Xq9+u2332ptADUsLEz96KOP1MOHD6tpaWlnPc7hcKjdunVTExIS1B9//FHdsGGD2qtXLxVQly9frqqqqq5bt04F1JycHO1aP/74owqoqamp1fb32WefVf38/NQxY8aoe/bsUb/77js1IiJCfeKJJ7Q2TzzxhNq5c2c1OTlZPXz4sLpgwQ
LVy8tLXb9+fY2/M7X9vtfn81uCmHramZatxjz6tTrw1bVNel4hxJ9bdW/qTqdTLSy1tcjD6XTWue+DBg1SBw4c6PHcpZdeqj766KOqqqpq165d1VdffVV7bfTo0WpiYqKqqqp65swZ1WQyqUuXLtVez87OVr29vasEMe5gxG3cuHHqkCFDPJ6bMWOG2rVrV+37w4cPq/7+/uqjjz6q+vj4qJ988olH+9r6lp+fr3p5eWlBS2XuIOYf//iH9tzevXtVQE1JSVFVtWpQ0L9/f3XSpEke5xk7dqx63XXXad8D6rRp0zzanO24VatWqXq9Xj169Kj2+v/+978mCWJ8fHzU/Px87bkZM2aol19+uaqqqlpQUKCazWZ18+bNHn2bOHGietttt1X5mbk1VRAjOTH1FOjjmoPNLZR9TYQQzavY5qDrM6ta5Nr7Zg3Fx1T3j4gePXp4fB8ZGalVZb3nnnv48MMPmTlzJllZWfz3v/9l7dq1ABw+fBir1Up8fLx2bHBwMJ06dapyjb59+3p8n5KSwqhRozyeGzBgAG+//TYOhwO9Xk/btm2ZPXs2f/3rX7nlllsYP368R/va+paSkkJpaSlXX311ne/dXYE2KyuLzp07V2mbkpLC5MmTq/R5zpw5Z73X2o5LSUmhTZs2tG7dWnu94s+0MWJjY/H399e+r/hvu2/fPkpKShgyZIjHMVarlV69ejXJ9WsjQUw9BZUFMWdK7dgcTox6SSsSQojKiaeKouB0OgG44447eOyxx9iyZQtbtmwhNjaWv/zlL4Arn6KufH19Pb5XVbVKvZHqzvfdd9+h1+tJS0vDbrdjMJR/9NXWt7omuFa8d3d/3Pdener6XPm5yvd6tuOqu+/K7XU6XZW2dSk2V9u/rfu///3vf7nooos82nl5eZ313I0lQUw9WbyNKAqoKuQW2Wjl3/z/SEKIPydvo559s4a22LWbSkhICKNHj2bBggVs2bKFu+66S3utffv2GI1Gtm7dSps2bQDIycnhwIEDDBo0qNbzdu3alY0bN3o8t3nzZjp27KiVsv/Xv/7FsmXLWL9+PbfccgvPP/88zz33XJ361qFDB7y9vVm7di333HNPo38OAF26dGHjxo3ccccdHn3u0qVLo47r2rUr6enpHD9+nKioKIAqCcatWrUCICMjg6CgIIBaa9zURdeuXfHy8iI9Pf2s/17NQYKYetLrFALMRvKKbeQWWSWIEUI0G0VR6jWlcz675557GD58OA6HgzvvvFN73s/Pj4kTJzJjxgxCQkIIDw/nySef1EYNavPII49w6aWX8vzzz3PLLbewZcsW3n33Xd577z0Ajh07xn333cerr77KwIEDWbhwIddffz3XXnst/fr1O2vfzGYzjz76KDNnzsRkMjFgwABOnjzJ3r17mThxYoN+DjNmzODmm2+md+/eXH311Xz11VcsW7aMNWvWNOq4wYMH06lTJ+644w7eeOMN8vPzefLJJz3O0b59e6Kjo0lKSuKFF17g4MGD1a4Eqw9/f3+mT5/O3/72N5xOJwMHDiQ/P5/Nmzfj5+fn8fNsDn+M/zvOsWBfE3nFNnKKJC9GCCHqYvDgwURGRnLxxRdrIwVur7/+OgUFBYwcORJ/f38eeeQR8vLyznrO3r178+9//5tnnnmG559/nsjISGbNmkViYiKqqpKYmMhll13GlClTABgyZAhTpkzh9ttvZ/fu3fj5+Z21b08//TQGg4FnnnmG48ePExkZyb333tvgn8Po0aOZM2cOr7/+Og8++CBxcXEsWLCAhISERh2n0+lYvnw5EydO5LLLLiM2NpZ33nmHYcOGaecwGo189tln3HffffTs2ZNLL72UF154gbFjxzb4fgCef/55wsLCePnllzly5AiBgYH07t2bJ554olHnrQtFrc+E5AUkPz8fi8VCXl4eAQEBTXruG97bxI/pubx/ex+GdYto0nMLIf6cSkpKSE1NJS4uDrPZ3NLdaXJFRUVERUXx8c
cfM2bMmJbujofzuW+NpSgKy5cvZ/To0S3dFQ+1/b7X5/NbRmIawJ3cm1tkbeGeCCHE+c3pdJKZmckbb7yBxWJh5Mi
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<svg height=\"311.146375pt\" version=\"1.1\" viewBox=\"0 0 403.97 311.146375\" width=\"403.97pt\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n <metadata>\r\n <rdf:RDF xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r\n <cc:Work>\r\n <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\r\n <dc:date>2021-08-30T17:46:31.892876</dc:date>\r\n <dc:format>image/svg+xml</dc:format>\r\n <dc:creator>\r\n <cc:Agent>\r\n <dc:title>Matplotlib v3.4.2, https://matplotlib.org/</dc:title>\r\n </cc:Agent>\r\n </dc:creator>\r\n </cc:Work>\r\n </rdf:RDF>\r\n </metadata>\r\n <defs>\r\n <style type=\"text/css\">*{stroke-linecap:butt;stroke-linejoin:round;}</style>\r\n </defs>\r\n <g id=\"figure_1\">\r\n <g id=\"patch_1\">\r\n <path d=\"M 0 311.146375 \r\nL 403.97 311.146375 \r\nL 403.97 0 \r\nL 0 0 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"axes_1\">\r\n <g id=\"patch_2\">\r\n <path d=\"M 39.65 273.312 \r\nL 396.77 273.312 \r\nL 396.77 7.2 \r\nL 39.65 7.2 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"matplotlib.axis_1\">\r\n <g id=\"xtick_1\">\r\n <g id=\"line2d_1\">\r\n <defs>\r\n <path d=\"M 0 0 \r\nL 0 3.5 \r\n\" id=\"m355705f450\" style=\"stroke:#000000;stroke-width:0.8;\"/>\r\n </defs>\r\n <g>\r\n <use style=\"stroke:#000000;stroke-width:0.8;\" x=\"55.882727\" xlink:href=\"#m355705f450\" y=\"273.312\"/>\r\n </g>\r\n </g>\r\n <g id=\"text_1\">\r\n <!-- (2020, 1) -->\r\n <g transform=\"translate(32.89679 287.910438)scale(0.1 -0.1)\">\r\n <defs>\r\n <path d=\"M 1984 4856 \r\nQ 1566 4138 1362 3434 \r\nQ 1159 2731 1159 2009 \r\nQ 1159 1288 1364 580 \r\nQ 1569 -128 1984 -844 \r\nL 1484 -844 \r\nQ 1016 -109 783 600 
\r\nQ 550 1309 550 2009 \r\nQ 550 2706 781 3412 \r\nQ 1013 4119 1484 4856 \r\nL 1984 4856 \r\nz\r\n\" id=\"DejaVuSans-28\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 1228 531 \r\nL 3431 531 \r\nL 3431 0 \r\nL 469 0 \r\nL 469 531 \r\nQ 828 903 1448 1529 \r\nQ 2069 2156 2228 2338 \r\nQ 2531 2678 2651 2914 \r\nQ 2772 3150 2772 3378 \r\nQ 2772 3750 2511 3984 \r\nQ 2250 4219 1831 4219 \r\nQ 1534 4219 1204 4116 \r\nQ 875 4013 500 3803 \r\nL 500 4441 \r\nQ 881 4594 1212 4672 \r\nQ 1544 4750 1819 4750 \r\nQ 2544 4750 2975 4387 \r\nQ 3406 4025 3406 3419 \r\nQ 3406 3131 3298 2873 \r\nQ 3191 2616 2906 2266 \r\nQ 2828 2175 2409 1742 \r\nQ 1991 1309 1228 531 \r\nz\r\n\" id=\"DejaVuSans-32\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 2034 4250 \r\nQ 1547 4250 1301 3770 \r\nQ 1056 3291 1056 2328 \r\nQ 1056 1369 1301 889 \r\nQ 1547 409 2034 409 \r\nQ 2525 409 2770 889 \r\nQ 3016 1369 3016 2328 \r\nQ 3016 3291 2770 3770 \r\nQ 2525 4250 2034 4250 \r\nz\r\nM 2034 4750 \r\nQ 2819 4750 3233 4129 \r\nQ 3647 3509 3647 2328 \r\nQ 3647 1150 3233 529 \r\nQ 2819 -91 2034 -91 \r\nQ 1250 -91 836 529 \r\nQ 422 1150 422 2328 \r\nQ 422 3509 836 4129 \r\nQ 1250 4750 2034 4750 \r\nz\r\n\" id=\"DejaVuSans-30\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 750 794 \r\nL 1409 794 \r\nL 1409 256 \r\nL 897 -744 \r\nL 494 -744 \r\nL 750 256 \r\nL 750 794 \r\nz\r\n\" id=\"DejaVuSans-2c\" transform=\"scale(0.015625)\"/>\r\n <path id=\"DejaVuSans-20\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 794 531 \r\nL 1825 531 \r\nL 1825 4091 \r\nL 703 3866 \r\nL 703 4441 \r\nL 1819 4666 \r\nL 2450 4666 \r\nL 2450 531 \r\nL 3481 531 \r\nL 3481 0 \r\nL 794 0 \r\nL 794 531 \r\nz\r\n\" id=\"DejaVuSans-31\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 513 4856 \r\nL 1013 4856 \r\nQ 1481 4119 1714 3412 \r\nQ 1947 2706 1947 2009 \r\nQ 1947 1309 1714 600 \r\nQ 1481 -109 1013 -844 \r\nL 513 -844 \r\nQ 928 -128 1133 580 \r\nQ 1338 1288 133
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"An interesting observation is that we have huge spikes at two locations: January, 2020 and January, 2021. It is caused by the fact that some papers do not have a clearly specified data of publication, and they are specified as January of the respecive year.\r\n",
"\r\n",
"To make more sense of the data, let's visualize just a few medicines. We will also \"erase\" data for January, and fill it in by some medium value, in order to make nicer plot:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 153,
"source": [
"meds = ['hydroxychloroquine','tocilizumab','favipiravir']\r\n",
"dfmt.loc[(2020,1)] = np.nan\r\n",
"dfmt.loc[(2021,1)] = np.nan\r\n",
"dfmt.fillna(method='pad',inplace=True)\r\n",
"fig, ax = plt.subplots(1,len(meds),figsize=(10,3))\r\n",
"for i,m in enumerate(meds):\r\n",
" dfmt[m].plot(ax=ax[i])\r\n",
" ax[i].set_title(m)\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAz8AAAE6CAYAAAA4B+zZAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAACjkUlEQVR4nOzde1zT9f4H8NfG2NhgjDsDQUAFb3jBa6InKRXzlmX9tMzSso7lpTza6WSdis4pOdlJLT3ZqUxNs9tJS7NU1CQNzUuigDdUkOsYIIz7xrbv74/t+4VxH+wK7+fjsUfx3Xfb54vsu32+n/fn9eExDMOAEEIIIYQQQro5vr0bQAghhBBCCCG2QJ0fQgghhBBCSI9AnR9CCCGEEEJIj0CdH0IIIYQQQkiPQJ0fQgghhBBCSI9AnR9CCCGEEEJIj0CdH0IIIYQQQkiPQJ0fQgghhBBCSI9AnR9CCCGEEEJIj0Cdny5KSEgAj8dDSUmJ1V4jOzsbPB4P27dvt9pr2MKiRYvg4eHRoX15PB4SEhKs2yAbYv9OCHFUKSkpSEhIQHl5udVeo6X3QVxcHOLi4ky2dZf3vy0+HwhxBl9//TUGDx4MsVgMHo+H1NRUiz7/9u3bwePxkJ2d3anH2/Occ/z4cfB4PBw/ftwur98TCezdAEJ6gqeffhr33XefvZtBSKtSUlLw5ptvYtGiRfDy8rLKa3T0fXDq1CmEhIRYpQ2EENsqLi7G448/jvvuuw8ffvghRCIRoqKiLPoaM2bMwKlTpxAUFNSpx9vznDNixAicOnUKgwYNssvr90TU+elGampqIJFI7N0Mh1NfXw8ejweBwH5/7iEhIfRljvR4HX0f3HXXXTZoDSHEFq5fv476+nosWLAAEydOtMpr+Pv7w9/fv9OP78g5p7a2Fm5ubhav4vD09OzQ69N3PMuhsjcLKSoqwqOPPgqZTIbAwEA89dRTUKlUAIBJkyZhwIABYBjG5DEMw6Bfv36YMWMGt62goABz586FVCqFTCbDvHnzoFAomr0eW0KWlpaG+Ph4SKVSTJo0CQBw584dLF26FL169YJQKESfPn3w6quvQq1WAwDq6uoQExODfv36cW0EAIVCAblcjri4OOh0OuzcuRM8Hg+nTp1q9vr/+Mc/4OrqioKCAm7bwYMHMWnSJMhkMkgkEgwcOBCJiYnNHnvjxg1Mnz4dHh4eCA0NxerVq7m2tSU9PR2zZ8+Gt7c33NzcMHz4cOzYscNkH3b4eOfOnVi9ejV69eoFkUiEGzduAAA+++wzDBs2DG5ubvDx8cGDDz6IK1euNHut7du3o3///hCJRBg4cCA+//xzLFq0COHh4c1eq+lQdUtlii2V+4SHh2PmzJk4ePAgRowYAbFYjAEDBuCzzz5r1h6FQoElS5YgJCQEQqEQERERePPNN6HVatv9vRHSnoSEBPz1r38FAERERIDH43F/23q9HuvWrcOAAQMgEokQEBCAJ554Anl5ec2ep71zQEfLP5uWoISHh3Ntanpj339N359tvSaPx8Py5cuxbds29O/fH2KxGKNGjcLp06fBMAzeffddREREwMPDA/feey93/mAlJSVh9uzZCAkJgZubG/r164clS5a0Wt6Wm5uLOXPmwNPTEzKZDAsWLEBxcXG7vwdCnN2iRYswYcIEAMC8efPA4/EQFxeHc+fO4ZFHHkF4eDjEYjHCw8Px6KOP4vbt29xjL168CB6Ph61btzZ73p9//hk8Hg/79u0D0HLZW1xcHKKjo3HixAncddddEIvF6NWrF1577TXodDqT52t6zmGf7/Dhw3jqqafg7+8PiUQCtVqNGzdu4Mknn0RkZCQkEgl69eqFWbNmIS0tjXt8cXExhEIhXnvttWZtv3r1Kng8Hj744AMALX+XaOs7Huk66vxYyEMPPYSoqCh89913ePnll7F792785S9/AQC88MILuHbtGo4ePWrymJ9//hk3b97EsmXLABiuKk
yePBmHDx9GYmIivv32W8jlcsybN6/F19RoNLj//vtx77334ocffsCbb76Juro63HPPPfj888+xatUqHDhwAAsWLMC6deswZ84cAICbmxu++eYbKJVKPPXUUwAAvV6Pxx57DAzD4Msvv4SLiwvmzZsHuVyO//znPyavq9Vq8d///hcPPvgggoODAQBbt27F9OnTodfr8dFHH2H//v14/vnnm31Bqq+vx/33349Jkybhhx9+wFNPPYUNGzbgnXfeafP3e+3aNcTGxiIjIwMffPAB9uzZg0GDBmHRokVYt25ds/3XrFmDnJwcri0BAQFITEzE4sWLMXjwYOzZswfvv/8+Ll26hHHjxiEzM5N77Pbt2/Hkk09i4MCB+O677/D3v/8d//znP3Hs2LE229gZFy9exOrVq/GXv/wFP/zwA4YOHYrFixfj119/5fZRKBQYM2YMDh06hNdffx0///wzFi9ejMTERDzzzDMWbxPpeZ5++mmsWLECALBnzx6cOnUKp06dwogRI/Dcc8/hb3/7G6ZMmYJ9+/bhn//8Jw4ePIjY2FiTL/sdPQd0xt69e7k2nTp1Cr/99huGDBkCd3d39O7du1PP+eOPP+LTTz/Fv/71L3z55ZeorKzEjBkzsHr1avz222/YvHkzPv74Y1y+fBkPPfSQycWrmzdvYty4cdiyZQsOHz6M119/Hb///jsmTJiA+vr6Zq/14IMPol+/fvjf//6HhIQEfP/995g6dWqL+xLSnbz22mvcd4i1a9fi1KlT+PDDD5GdnY3+/ftj48aNOHToEN555x0UFhZi9OjR3Hll2LBhiImJwbZt25o97/bt2xEQEIDp06e3+foKhQKPPPIIHnvsMfzwww94+OGH8dZbb+GFF17oUPufeuopuLq6YufOnfjf//7HXfT19fXFv/71Lxw8eBD/+c9/IBAIMHbsWFy7dg2AYSRq5syZ2LFjB/R6vclzbtu2DUKhEI899libr93SdzxiIQzpkjfeeIMBwKxbt85k+9KlSxk3NzdGr9czOp2O6dOnDzN79myTfaZNm8b07duX0ev1DMMwzJYtWxgAzA8//GCy3zPPPMMAYLZt28ZtW7hwIQOA+eyzz0z2/eijjxgAzDfffGOy/Z133mEAMIcPH+a2ff311wwAZuPGjczrr7/O8Pl8k/vZ4xMKhUxRUVGzxyUnJzMMwzCVlZWMp6cnM2HCBO5YWsK2uWnbpk+fzvTv399kGwDmjTfe4H5+5JFHGJFIxOTk5JjsN23aNEYikTDl5eUMwzDML7/8wgBg7r77bpP9ysrKGLFYzEyfPt1ke05ODiMSiZj58+czDMMwOp2OCQ4OZkaMGGFyLNnZ2YyrqysTFhbGbWNf65dffjF5zqysrGb/XuzfSWNhYWGMm5sbc/v2bW5bbW0t4+PjwyxZsoTbtmTJEsbDw8NkP4ZhmH//+98MACYjI4MhpKveffddBgCTlZXFbbty5QoDgFm6dKnJvr///jsDgHnllVcYhun4OaCl98HEiROZiRMnmmxr+v5vavny5YxAIGB++uknbtvChQtN3p9tvSYARi6XM1VVVdy277//ngHADB8+3OQYNm7cyABgLl261GJb9Ho9U19fz9y+fbvZ+Zt97b/85S8mj/niiy8YAMyuXbtaPUZCugv2s/Lbb79tdR+tVstUVVUx7u7uzPvvv89t/+CDDxgAzLVr17htd+7cYUQiEbN69Wpu27Zt25qdvyZOnNjqdyo+n2/ymdr0nMM+3xNPPNHu8Wm1Wkaj0TCRkZEm7/V9+/Y1+96l1WqZ4OBg5qGHHuK2tfRdorXveMQyaOTHQu6//36Tn4cOHYq6ujoolUrw+XwsX74cP/74I3JycgAYrhwePHgQS5cu5UoyfvnlF0il0mbPNX/+/FZf96GHHjL5+dixY3B3d8fDDz9ssn3RokUAYDL6NHfuXDz33HP461//irfeeguvvPIKpkyZYvK45557DgDwySefcNs2b96MIU
OG4O677wZgmChdUVFhciyt4fF4mDVrlsm2oUOHmgx1t+TYsWOYNGkSQkNDmx1XTU1Ns9K8pr+XU6dOoba2lvs9sEJ
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<svg height=\"226.4725pt\" version=\"1.1\" viewBox=\"0 0 598.4875 226.4725\" width=\"598.4875pt\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n <metadata>\r\n <rdf:RDF xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r\n <cc:Work>\r\n <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\r\n <dc:date>2021-08-30T17:46:32.621331</dc:date>\r\n <dc:format>image/svg+xml</dc:format>\r\n <dc:creator>\r\n <cc:Agent>\r\n <dc:title>Matplotlib v3.4.2, https://matplotlib.org/</dc:title>\r\n </cc:Agent>\r\n </dc:creator>\r\n </cc:Work>\r\n </rdf:RDF>\r\n </metadata>\r\n <defs>\r\n <style type=\"text/css\">*{stroke-linecap:butt;stroke-linejoin:round;}</style>\r\n </defs>\r\n <g id=\"figure_1\">\r\n <g id=\"patch_1\">\r\n <path d=\"M 0 226.4725 \r\nL 598.4875 226.4725 \r\nL 598.4875 -0 \r\nL 0 -0 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"axes_1\">\r\n <g id=\"patch_2\">\r\n <path d=\"M 33.2875 188.638125 \r\nL 197.405147 188.638125 \r\nL 197.405147 22.318125 \r\nL 33.2875 22.318125 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"matplotlib.axis_1\">\r\n <g id=\"xtick_1\">\r\n <g id=\"line2d_1\">\r\n <defs>\r\n <path d=\"M 0 0 \r\nL 0 3.5 \r\n\" id=\"m051b8bbadd\" style=\"stroke:#000000;stroke-width:0.8;\"/>\r\n </defs>\r\n <g>\r\n <use style=\"stroke:#000000;stroke-width:0.8;\" x=\"75.852772\" xlink:href=\"#m051b8bbadd\" y=\"188.638125\"/>\r\n </g>\r\n </g>\r\n <g id=\"text_1\">\r\n <!-- (2020, 6) -->\r\n <g transform=\"translate(52.866835 203.236563)scale(0.1 -0.1)\">\r\n <defs>\r\n <path d=\"M 1984 4856 \r\nQ 1566 4138 1362 3434 \r\nQ 1159 2731 1159 2009 \r\nQ 1159 1288 1364 580 \r\nQ 1569 -128 1984 -844 \r\nL 
1484 -844 \r\nQ 1016 -109 783 600 \r\nQ 550 1309 550 2009 \r\nQ 550 2706 781 3412 \r\nQ 1013 4119 1484 4856 \r\nL 1984 4856 \r\nz\r\n\" id=\"DejaVuSans-28\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 1228 531 \r\nL 3431 531 \r\nL 3431 0 \r\nL 469 0 \r\nL 469 531 \r\nQ 828 903 1448 1529 \r\nQ 2069 2156 2228 2338 \r\nQ 2531 2678 2651 2914 \r\nQ 2772 3150 2772 3378 \r\nQ 2772 3750 2511 3984 \r\nQ 2250 4219 1831 4219 \r\nQ 1534 4219 1204 4116 \r\nQ 875 4013 500 3803 \r\nL 500 4441 \r\nQ 881 4594 1212 4672 \r\nQ 1544 4750 1819 4750 \r\nQ 2544 4750 2975 4387 \r\nQ 3406 4025 3406 3419 \r\nQ 3406 3131 3298 2873 \r\nQ 3191 2616 2906 2266 \r\nQ 2828 2175 2409 1742 \r\nQ 1991 1309 1228 531 \r\nz\r\n\" id=\"DejaVuSans-32\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 2034 4250 \r\nQ 1547 4250 1301 3770 \r\nQ 1056 3291 1056 2328 \r\nQ 1056 1369 1301 889 \r\nQ 1547 409 2034 409 \r\nQ 2525 409 2770 889 \r\nQ 3016 1369 3016 2328 \r\nQ 3016 3291 2770 3770 \r\nQ 2525 4250 2034 4250 \r\nz\r\nM 2034 4750 \r\nQ 2819 4750 3233 4129 \r\nQ 3647 3509 3647 2328 \r\nQ 3647 1150 3233 529 \r\nQ 2819 -91 2034 -91 \r\nQ 1250 -91 836 529 \r\nQ 422 1150 422 2328 \r\nQ 422 3509 836 4129 \r\nQ 1250 4750 2034 4750 \r\nz\r\n\" id=\"DejaVuSans-30\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 750 794 \r\nL 1409 794 \r\nL 1409 256 \r\nL 897 -744 \r\nL 494 -744 \r\nL 750 256 \r\nL 750 794 \r\nz\r\n\" id=\"DejaVuSans-2c\" transform=\"scale(0.015625)\"/>\r\n <path id=\"DejaVuSans-20\" transform=\"scale(0.015625)\"/>\r\n <path d=\"M 2113 2584 \r\nQ 1688 2584 1439 2293 \r\nQ 1191 2003 1191 1497 \r\nQ 1191 994 1439 701 \r\nQ 1688 409 2113 409 \r\nQ 2538 409 2786 701 \r\nQ 3034 994 3034 1497 \r\nQ 3034 2003 2786 2293 \r\nQ 2538 2584 2113 2584 \r\nz\r\nM 3366 4563 \r\nL 3366 3988 \r\nQ 3128 4100 2886 4159 \r\nQ 2644 4219 2406 4219 \r\nQ 1781 4219 1451 3797 \r\nQ 1122 3375 1075 2522 \r\nQ 1259 2794 1537 2939 \r\nQ 1816 3084 2150 3084 \r\n
"text/plain": [
"<Figure size 1000x300 with 3 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"Observe how popularity of hydroxychloroquine was on the rise in the first few months, and then started to decline, while number of mentions of favipiravir shows stable rise. Another good way to visualize relative popularity is to use **stack plot** (or **area plot** in Pandas terminology):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 154,
"source": [
"dfmt.plot.area()\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjEAAAGxCAYAAACTN+exAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOydd3gc1dm375nZolWvVnGTO+4NjI1pBmMwGHAIAUIgdEjyQcJLJ5RAKMFJCCEh1BCKgdB7MWDAGLANbsK9S5bVe1lJuzvlfH/MVnXZqs7cvnRZ2jlz5sy2+c1TJSGEwMLCwsLCwsJigCH39QIsLCwsLCwsLA4GS8RYWFhYWFhYDEgsEWNhYWFhYWExILFEjIWFhYWFhcWAxBIxFhYWFhYWFgMSS8RYWFhYWFhYDEgsEWNhYWFhYWExILFEjIWFhYWFhcWAxNbXC+gpDMOgqKiIuLg4JEnq6+VYWFhYWFhYdAIhBPX19WRlZSHL7dtaDlsRU1RUxNChQ/t6GRYWFhYWFhYHwYEDBxgyZEi7Yw5bERMXFweYT0J8fHwfr8bCwsLCwsKiM9TV1TF06NDgdbw9DlsRE3AhxcfHWyLGwsLCwsJigNGZUBArsNfCwsLCwsJiQGKJGAsLCwsLC4sBiSViLCwsLCwsLAYkloixsLCwsLCwGJBYIsbCwsLCwsJiQGKJGAsLCwsLC4sBiSViLCwsLCwsLAYkloixsLCwsLCwGJBYIsbCwsLCwsJiQGKJGAsLCwsLC4sBiSViLCwsLCwsLAYkloixsLCwsLCwGJBYIsbCwsLCwsJiQGKJGIsBz6o9FazZV9nXy7CwsLCw6GVsfb0AC4tDobrBxyXP/YCqC644dgR3nD4eWe64fbuFhYWFxcDHssRYDGg2F9ai6gKAZ7/N5RfPfk+DV+vjVVlYWFhY9AaWiLEY0Gwpqo34e/XeShY8spKC6sY+WpGFhYWFRW9hiRiLAc3WoroWjxXWNHHK31ayNteKk7GwsLA4nLFEjMWAZmuhaYnRE+wRjzepOuc9vYaX1uzvi2VZWFhYWPQCloixGLDUe1TyKk23kTojBXV0HCJsuxBw57tbuPWtTWi60TeLtLCwsLDoMSwRYzFg2eZ3JYkoBRwK+qh4tImJiGbJSa+tPcC5T66mtkntg1VaWFhYWPQUloixGLAE4mGMODs0aSAE+pAY1GnJiGZp1jkHapj/t6/ZV+7ui6VaWFhYWPQAloixGLAEMpNEvB250gu6AENgDHLhOzIFYY98e5fXeznt0W/4emdZXyy3QwqqGzlQZWVVWVhYWHQWS8RYDFgC7iQj3h/Ua5NBAJqBSHLim5VquprC8GkGlzy3lidX7EEIQX9g/f5qrlm6juOWfMXJD3/NrtL6vl6ShYWFxYDAEjEWAxKPqrOr1HQNGfEOhNP/VlYk80c1ELF2vEenYcS2LEz90LKdXPffjfi0vgn4NQzB8m2l/OzJVfz0iVV8urUUAfh0g/97LadP1mRhYWEx0LDaDlgMSHaW1GMIgbDLSJqBiHeENkqS+c726RCl4JuVhmNjJXK1L2KODzcVs6fMzctXHk1KrLNX1u3VdN7bWMRTK/eyt7whtGQIZlZtLarjx4Iapg5J7JU1WVhYWAxULEuMxYAkEA9jxNuRGjRwRrqNkCRwKODVwS7jm5mKPiiqxTw7Suo5+W9fs6OkZdG87qS2SeWJFXs5bslX3PLWpqCACYQfN3ds3fLmph5dj4WFhcXhgCViLAYkgcwkEW8HtR2XkNMvZBQJdVoy2pDoFkNqGlUW/eNbPtlc3O3rLK5t4sGPt3PMQ1+wZNkOyuq9gKmxoKV4CbCzpJ51eVXdvh4LCwuLwwlLxFgMSAKVeoNBve0REDKShDYxCXVUXAvxoBmCX7+8gb9+urNbAn53ltRz4+s/ctySr3h65T4avDoQZnnpxCFufcuyxlhYWFi0hxUTYzHgUHWD7cVmBo+IcxDfVI4hNO
qlhLZ3cipmjIxDQR8dD04F2/YapGZi4rGv9rC1qJYnLppJlF1pfa42EELwfW4VT329l692lrc+po3HjEFRaNmxSIbAvr4SScDe8gbW7K1k9qiULq3DwsLC4n8FyxJjMeDYW+7GpxsIRQLJYEnybfydX3O8+LJ9E4dDMV1PQqAPDRTFaznsq53lnPb3lZTUejq1Ht0QfLy5mMX/+o4Lnl7TpoBpjpBAz3ThmzsIdXoKIsmJkRKFPjQmOOa2ty1rTGsIIaxWEhYWFpaIsRh4bC0MxcPEeuuIk9xE4eUa/sX/4++4RDsF4+xys6J4qQi71GJYXmUj8//2NRvzq9ucyqPqLF2zn5MeXsFvXt7AjwWmi6vlbJEIGbQh0fiOTUedkoyINeN65EozXkYbERdsnZBX2cjKXZ0TRf8reDWdy59fy7Q/fs7O4p4NyLawsOjfWCLGYsARzEyKs5MhmcG41SShI3MM3/IANzFS7Gl7ghZF8dJaFMUDcHs1fvrEKt5YdyDi8eoGH48u383ch77krne3sN/fhLKtTKMAQpHQhsfiPS4DbWISItoGXh0lrx6p0oOR5DAFVpSCPiRkjfn9O5s79bz8L6Abghte+5Gvdpbj9mpc8eK6flO00MLCovfpsohZuXIlZ555JllZWUiSxLvvvtvm2GuuuQZJkvj73/8e8bjX6+W6664jNTWVmJgYzjrrLAoKCiLGVFdXc/HFF5OQkEBCQgIXX3wxNTU1XV2uxWFIwBJjxNsZbTfFyiamomKjjjjSKeUP/J7TxXtIog2XQ2tF8WJahogZAm5+cxP3vL+F/MpG7nl/K8c89CWPLN9FZYNZd6ZD8WKT0EbG4T0+He2IBIhSoElDya1HqvWhD49FZESDLAU/kdqI2OB8BdVNLN9WehDP1OGFEIJ73t/KR2FZZAXVTby29kA7e1lYWBzOdFnENDQ0MHXqVB577LF2x7377rt8//33ZGVltdh2/fXX88477/Dqq6/y7bff4na7WbRoEbquB8dceOGF5OTksGzZMpYtW0ZOTg4XX3xxV5drcZhhGIKtxYGeSQ6m29ebvyMRhY846iknDRs6v+BFbuZB4kVt65NJEtikUFG8o9MwEh2tDn1+1X6O/8tXPL8qjyZVj9jWpnhxyKhj4vGekIE2Jh4cClKDhrKvHqlRQ8+ORQxyhfKtA2vSBbhs6GHp4He9t6VzT9BhzD++2MPSNfuBSJfdHz/chmrFx1hY/E/SZRGzcOFC7r//fs4555w2xxQWFnLttdfy8ssvY7dHpsDW1tby7LPP8vDDDzN//nymT5/OSy+9xObNm1m+fDkA27dvZ9myZfz73/9mzpw5zJkzh2eeeYYPP/yQnTt3dnXJFocR+6saafDqZkCuAkfYzPdDDGYLAglIo5waElCxM5WNPMgNTBQ/tj5h86J4R7ZeFK8riCgFdXwC3uMz0EfGgU1GqldR9tUhVAN9ZBwiJSpSvITj/1TqI0Kp4MW1nggLxP8aL63ZzyPLdwHm0xYuHBt9Ovd9uK1vFmZhYdGndHtMjGEYXHzxxdx8881MnDixxfb169ejqioLFiwIPpaVlcWkSZNYtWoVAKtXryYhIYGjjz46OGb27NkkJCQEx1j8b7I10Lk61k60px5FMqglgcEURoxLpBYQVJNIEjXcxn2cL15CEVrrE3eiKF5HGNE21EmJeI9LRx8WC4qEVONDya1HAPrIeGjD0hOBJIEhENE29KzQOu59f2uX13Q48PHmYu5617RESbSegPbSmv1UuL29uzALC4s+p9tFzJIlS7DZbPz2t79tdXtJSQkOh4OkpKSIx9PT0ykpKQmOGTRoUIt9Bw0aFBzTHK/XS11dXcSPxeHHlrB4mHTJfC/sJ5sMWr4v7GgkUUM5qcgIzuId7uIuUkVZ65N3oiheaxhxdnxTk/AdOwh9cAzIElKlBzm3HmGX0UfEQVwnivKF4zfS6CNDayir9/LuxsI2dzkcWbWngt+9uhFBZH
+p5hgCfvvfjb24MgsLi/5At4qY9evX8+ijj/L8888jtWUqbwMhRMQ+re3ffEw4f/rTn4JBwAkJCQwdOrRri7cYEAQ
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<svg height=\"311.146375pt\" version=\"1.1\" viewBox=\"0 0 403.97 311.146375\" width=\"403.97pt\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n <metadata>\r\n <rdf:RDF xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r\n <cc:Work>\r\n <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\r\n <dc:date>2021-08-30T17:46:33.182329</dc:date>\r\n <dc:format>image/svg+xml</dc:format>\r\n <dc:creator>\r\n <cc:Agent>\r\n <dc:title>Matplotlib v3.4.2, https://matplotlib.org/</dc:title>\r\n </cc:Agent>\r\n </dc:creator>\r\n </cc:Work>\r\n </rdf:RDF>\r\n </metadata>\r\n <defs>\r\n <style type=\"text/css\">*{stroke-linecap:butt;stroke-linejoin:round;}</style>\r\n </defs>\r\n <g id=\"figure_1\">\r\n <g id=\"patch_1\">\r\n <path d=\"M 0 311.146375 \r\nL 403.97 311.146375 \r\nL 403.97 0 \r\nL 0 0 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"axes_1\">\r\n <g id=\"patch_2\">\r\n <path d=\"M 39.65 273.312 \r\nL 396.77 273.312 \r\nL 396.77 7.2 \r\nL 39.65 7.2 \r\nz\r\n\" style=\"fill:#ffffff;\"/>\r\n </g>\r\n <g id=\"PolyCollection_1\">\r\n <defs>\r\n <path d=\"M 55.882727 -37.834375 \r\nL 55.882727 -37.834375 \r\nL 73.919091 -37.834375 \r\nL 91.955455 -37.834375 \r\nL 109.991818 -37.834375 \r\nL 128.028182 -37.834375 \r\nL 146.064545 -37.834375 \r\nL 164.100909 -37.834375 \r\nL 182.137273 -37.834375 \r\nL 200.173636 -37.834375 \r\nL 218.21 -37.834375 \r\nL 236.246364 -37.834375 \r\nL 254.282727 -37.834375 \r\nL 272.319091 -37.834375 \r\nL 290.355455 -37.834375 \r\nL 308.391818 -37.834375 \r\nL 326.428182 -37.834375 \r\nL 344.464545 -37.834375 \r\nL 362.500909 -37.834375 \r\nL 380.537273 -37.834375 \r\nL 380.537273 -56.78316 \r\nL 
380.537273 -56.78316 \r\nL 362.500909 -62.197099 \r\nL 344.464545 -61.689542 \r\nL 326.428182 -71.33312 \r\nL 308.391818 -66.934295 \r\nL 290.355455 -65.411625 \r\nL 272.319091 -80.976698 \r\nL 254.282727 -80.976698 \r\nL 236.246364 -83.852853 \r\nL 218.21 -91.804575 \r\nL 200.173636 -91.804575 \r\nL 182.137273 -89.774348 \r\nL 164.100909 -102.294081 \r\nL 146.064545 -102.294081 \r\nL 128.028182 -115.490557 \r\nL 109.991818 -69.641264 \r\nL 91.955455 -45.447726 \r\nL 73.919091 -37.834375 \r\nL 55.882727 -37.834375 \r\nz\r\n\" id=\"m431d8e7a78\" style=\"stroke:#1f77b4;\"/>\r\n </defs>\r\n <g clip-path=\"url(#p16276f1e15)\">\r\n <use style=\"fill:#1f77b4;stroke:#1f77b4;\" x=\"0\" xlink:href=\"#m431d8e7a78\" y=\"311.146375\"/>\r\n </g>\r\n </g>\r\n <g id=\"PolyCollection_2\">\r\n <defs>\r\n <path d=\"M 55.882727 -37.834375 \r\nL 55.882727 -37.834375 \r\nL 73.919091 -37.834375 \r\nL 91.955455 -45.447726 \r\nL 109.991818 -69.641264 \r\nL 128.028182 -115.490557 \r\nL 146.064545 -102.294081 \r\nL 164.100909 -102.294081 \r\nL 182.137273 -89.774348 \r\nL 200.173636 -91.804575 \r\nL 218.21 -91.804575 \r\nL 236.246364 -83.852853 \r\nL 254.282727 -80.976698 \r\nL 272.319091 -80.976698 \r\nL 290.355455 -65.411625 \r\nL 308.391818 -66.934295 \r\nL 326.428182 -71.33312 \r\nL 344.464545 -61.689542 \r\nL 362.500909 -62.197099 \r\nL 380.537273 -56.78316 \r\nL 380.537273 -65.073254 \r\nL 380.537273 -65.073254 \r\nL 362.500909 -67.10348 \r\nL 344.464545 -70.994749 \r\nL 326.428182 -83.176111 \r\nL 308.391818 -81.315069 \r\nL 290.355455 -76.577873 \r\nL 272.319091 -98.233627 \r\nL 254.282727 -98.233627 \r\nL 236.246364 -95.019101 \r\nL 218.21 -108.046391 \r\nL 200.173636 -112.614402 \r\nL 182.137273 -109.23069 \r\nL 164.100909 -132.409115 \r\nL 146.064545 -127.502733 \r\nL 128.028182 -147.805003 \r\nL 109.991818 -109.907432 \r\nL 91.955455 -57.629088 \r\nL 73.919091 -41.048901 \r\nL 55.882727 -37.834375 \r\nz\r\n\" id=\"mbacaa32bb5\" style=\"stroke:#ff7f0e;\"/>\r\n </defs>\r\n <g 
clip-path=
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
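{
"cell_type": "markdown",
"source": [
"We can go one step further and compute relative popularity, that is, each column's share of its row total (a fraction between 0 and 1; multiply by 100 to get percent):"
],
"metadata": {}
},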
{
"cell_type": "code",
"execution_count": 155,
"source": [
"# Divide every row by its own sum so that each row adds up to 1\r\n",
"dfmtp = dfmt.apply(lambda x: x/x.sum(), axis=1)\r\n",
"dfmtp.plot.area()\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
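The normalization in the cell above divides every row by its own sum, so each row adds up to 1. A minimal sketch of the same arithmetic in plain Python is given below; the row labels, term names and counts are invented for illustration and do not come from the dataset, and `row_shares` is a hypothetical helper name.

```python
# Hypothetical mention counts per row; the numbers are invented for illustration.
counts = {
    "2020-03": {"remdesivir": 30, "chloroquine": 10},
    "2020-04": {"remdesivir": 20, "chloroquine": 60},
}

def row_shares(table):
    """Divide every value in a row by the row total, mirroring x / x.sum() with axis=1."""
    shares = {}
    for row_label, row in table.items():
        total = sum(row.values())
        # Guard against all-zero rows, which would otherwise divide by zero.
        shares[row_label] = {k: (v / total if total else 0.0) for k, v in row.items()}
    return shares

shares = row_shares(counts)
print(shares["2020-03"]["remdesivir"])  # 30 / 40 = 0.75
```

Note that the pandas version has no such guard: a row whose sum is zero would produce `NaN` values there.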
{
"cell_type": "markdown",
"source": [
"\r\n",
"## Computing Medicine-Diagnosis Correspondence\r\n",
"\r\n",
"One of the most interesting relationships we can look for is how different diagnoses are treated with different medicines. To visualize it, we need to compute a **co-occurrence frequency map**, which shows how many times two terms are mentioned in the same paper (here, in the same abstract).\r\n",
"\r\n",
"Such a map is essentially a 2D matrix, which is best represented by a **NumPy array**. We will compute this map by walking through all abstracts and marking the entities that occur in each one:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 156,
"source": [
"# m[j,i] = number of abstracts that mention both medication j and diagnosis i\r\n",
"m = np.zeros((len(medications),len(diagnosis)))\r\n",
"for a in df['abstract']:\r\n",
"    x = str(a).lower()\r\n",
"    for i,d in enumerate(diagnosis):\r\n",
"        if ' '+d in x:\r\n",
"            for j,me in enumerate(medications):\r\n",
"                if ' '+me in x:\r\n",
"                    m[j,i] += 1"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 157,
"source": [
"m"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[4788., 2264., 741., 2109., 348., 2730., 975.],\n",
" [2111., 1238., 231., 998., 79., 1394., 364.],\n",
" [2186., 821., 691., 1063., 185., 1136., 573.],\n",
" [3210., 2191., 522., 1538., 160., 2191., 622.],\n",
" [1803., 773., 406., 880., 133., 909., 410.],\n",
" [1982., 1102., 379., 885., 113., 1366., 370.],\n",
" [ 504., 356., 83., 259., 23., 354., 106.],\n",
" [1419., 640., 345., 742., 108., 760., 314.],\n",
" [1537., 678., 330., 782., 93., 826., 301.],\n",
" [ 967., 634., 201., 431., 44., 656., 136.],\n",
" [ 660., 336., 293., 385., 53., 452., 148.]])"
]
},
"metadata": {},
"execution_count": 157
}
],
"metadata": {}
},
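The counting loop above can be tried on a few made-up abstracts to see how each cell of the matrix is filled. The sketch below is a plain-Python stand-in for the NumPy version; the sample texts, the short term lists and the function name `cooccurrence` are assumptions for illustration. It keeps the original matching rule (a space followed by the term), which means that a term at the very start of an abstract is not counted.

```python
def cooccurrence(abstracts, rows, cols):
    """Count abstracts that mention both a row term and a column term."""
    m = [[0] * len(cols) for _ in rows]
    for a in abstracts:
        x = str(a).lower()
        # Same substring rule as in the notebook: the term must follow a space.
        row_hits = [i for i, r in enumerate(rows) if ' ' + r in x]
        col_hits = [j for j, c in enumerate(cols) if ' ' + c in x]
        for i in row_hits:
            for j in col_hits:
                m[i][j] += 1
    return m

sample_abstracts = [
    "We treated covid patients with remdesivir.",
    "Trial of remdesivir and heparin in covid pneumonia.",
    "Heparin use in pneumonia.",  # 'Heparin' starts the text, so ' heparin' does not match
]
meds = ["remdesivir", "heparin"]
diags = ["covid", "pneumonia"]
print(cooccurrence(sample_abstracts, meds, diags))  # [[2, 1], [1, 1]]
```

Because the rule is a substring test, it also matches longer words that merely begin with the term; matching on word boundaries would be stricter, at the cost of a slightly more involved check.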
{
"cell_type": "markdown",
"source": [
"One of the ways to visualize this matrix is to draw a **heatmap**:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 158,
"source": [
"plt.imshow(m,interpolation='nearest',cmap='hot')\r\n",
"ax = plt.gca()\r\n",
"ax.set_yticks(range(len(medications))) \r\n",
"ax.set_yticklabels(medications)\r\n",
"ax.set_xticks(range(len(diagnosis)))\r\n",
"ax.set_xticklabels(diagnosis,rotation=90)\r\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"However, an even better visualization can be made with a so-called **Sankey** diagram! `matplotlib` offers only a basic `matplotlib.sankey` module, which is not well suited to this kind of two-column diagram, so we will use [Plotly](https://plotly.com/python/) as described [in this tutorial](https://plotly.com/python/sankey-diagram/).\r\n",
"\r\n",
"To make a Plotly Sankey diagram, we need to build the following lists:\r\n",
"* List `all_nodes` of all nodes in the graph, which includes both medications and diagnoses\r\n",
"* Lists of source and target indices, which show which nodes go on the left side and which on the right side of the diagram\r\n",
"* List of all links, each link consisting of:\r\n",
"   - Source index in the `all_nodes` array\r\n",
"   - Target index \r\n",
"   - Value indicating the strength of the link. This is exactly the value from our co-occurrence matrix.\r\n",
"   - Optionally, the color of the link. We will add an option to highlight some of the terms for clarity\r\n",
"\r\n",
"The generic code for drawing a Sankey diagram is structured as a separate `sankey` function, which takes two lists (source and target categories) and a co-occurrence matrix. It also lets us specify a threshold and omit all links that are weaker than that threshold, which makes the diagram a little less complex. "
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 160,
"source": [
"import plotly.graph_objects as go\r\n",
"\r\n",
"def sankey(cat1, cat2, m, threshold=0, h1=(), h2=()):\r\n",
"    all_nodes = cat1 + cat2\r\n",
"    source_indices = list(range(len(cat1)))\r\n",
"    target_indices = list(range(len(cat1),len(cat1)+len(cat2)))\r\n",
"\r\n",
"    s, t, v, c = [], [], [], []\r\n",
"    for i in range(len(cat1)):\r\n",
"        for j in range(len(cat2)):\r\n",
"            if m[i,j]>threshold:\r\n",
"                s.append(i)\r\n",
"                t.append(len(cat1)+j)\r\n",
"                v.append(m[i,j])\r\n",
"                c.append('pink' if i in h1 or j in h2 else 'lightgray')\r\n",
"\r\n",
"    fig = go.Figure(data=[go.Sankey(\r\n",
"        # Define nodes\r\n",
"        node = dict(\r\n",
"            pad = 40,\r\n",
"            thickness = 40,\r\n",
"            line = dict(color = \"black\", width = 1.0),\r\n",
"            label = all_nodes),\r\n",
"\r\n",
"        # Add links\r\n",
"        link = dict(\r\n",
"            source = s,\r\n",
"            target = t,\r\n",
"            value = v,\r\n",
"            color = c\r\n",
"        ))])\r\n",
"    fig.show()\r\n",
"\r\n",
"sankey(medications,diagnosis,m,500,h2=[0])"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"link": {
"color": [
"pink",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"pink",
"pink",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"lightgray",
"pink",
"lightgray",
"lightgray",
"pink"
],
"source": [
0,
0,
0,
0,
0,
0,
1,
1,
1,
1,
2,
2,
2,
2,
2,
2,
3,
3,
3,
3,
3,
3,
4,
4,
4,
4,
5,
5,
5,
5,
6,
7,
7,
7,
7,
8,
8,
8,
8,
9,
9,
9,
10
],
"target": [
11,
12,
13,
14,
16,
17,
11,
12,
14,
16,
11,
12,
13,
14,
16,
17,
11,
12,
13,
14,
16,
17,
11,
12,
14,
16,
11,
12,
14,
16,
11,
11,
12,
14,
16,
11,
12,
14,
16,
11,
12,
16,
11
],
"value": [
4788,
2264,
741,
2109,
2730,
975,
2111,
1238,
998,
1394,
2186,
821,
691,
1063,
1136,
573,
3210,
2191,
522,
1538,
2191,
622,
1803,
773,
880,
909,
1982,
1102,
885,
1366,
504,
1419,
640,
742,
760,
1537,
678,
782,
826,
967,
634,
656,
660
]
},
"node": {
"label": [
"hydroxychloroquine",
"chloroquine",
"tocilizumab",
"remdesivir",
"azithromycin",
"lopinavir",
"ritonavir",
"dexamethasone",
"heparin",
"favipiravir",
"methylprednisolone",
"covid",
"sars",
"pneumonia",
"infection",
"diabetes",
"coronavirus",
"death"
],
"line": {
"color": "black",
"width": 1
},
"pad": 40,
"thickness": 40
},
"type": "sankey"
}
],
"layout": {
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"heatmapgl": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmapgl"
}
],
"histogram": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
}
}
}
},
"metadata": {}
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Conclusion\r\n",
"\r\n",
"You have seen that fairly simple methods can extract information from unstructured data sources such as text. In this example, we started from an existing list of medications, but it would be much more powerful to use natural language processing (NLP) techniques to perform entity extraction from the text. [This blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/) describes how to use cloud services for entity extraction. Another option is to use a Python NLP library such as [NLTK](https://www.nltk.org/); an approach for extracting information from text with NLTK is described [in the NLTK book](https://www.nltk.org/book/ch07.html)."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Challenge\r\n",
"\r\n",
"Continue to research the COVID paper data along the following lines:\r\n",
"\r\n",
"1. Build a co-occurrence matrix of different medications, and see which medications often occur together (i.e., are mentioned in the same abstract). You can adapt the code that builds the co-occurrence matrix for medications and diagnoses.\r\n",
"1. Visualize this matrix using a heatmap.\r\n",
"1. As a stretch goal, you may want to visualize the co-occurrence of medications using a [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may help you draw a chord diagram.\r\n",
"1. As another stretch goal, try to extract dosages of different medications (such as **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build a dataframe that shows the different dosages for each medication. **Note**: consider numeric values that are in close textual vicinity of the medication name."
],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python",
"version": "3.8.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.8 64-bit (conda)"
},
"interpreter": {
"hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}