Merge pull request #684 from microsoft/copilot/fix-e446e3a1-6b4c-4310-87d5-641ed6823a37

Add real-world data quality checks to data cleaning lesson
pull/688/head
Lee Stott 2 months ago committed by GitHub
commit 57c2be2a87

@@ -3687,6 +3687,522 @@
"source": [
"> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Real-World Data Quality Checks\n",
"\n",
"> **Learning goal:** By the end of this section, you should be comfortable detecting and correcting common real-world data quality issues including inconsistent categorical values, abnormal numeric values (outliers), and duplicate entities with variations.\n",
"\n",
"While missing values and exact duplicates are common issues, real-world datasets often contain more subtle problems:\n",
"\n",
"1. **Inconsistent categorical values**: The same category spelled differently (e.g., \"USA\", \"U.S.A\", \"United States\")\n",
"2. **Abnormal numeric values**: Extreme outliers that indicate data entry errors (e.g., age = 999)\n",
"3. **Near-duplicate rows**: Records that represent the same entity with slight variations\n",
"\n",
"Let's explore techniques to detect and handle these issues."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a Sample \"Dirty\" Dataset\n",
"\n",
"First, let's create a sample dataset that contains the types of issues we commonly encounter in real-world data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create a sample dataset with quality issues\n",
"dirty_data = pd.DataFrame({\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],\n",
" 'name': ['John Smith', 'Jane Doe', 'John Smith', 'Bob Johnson', \n",
" 'Alice Williams', 'Charlie Brown', 'John Smith', 'Eva Martinez',\n",
" 'Bob Johnson', 'Diana Prince', 'Frank Castle', 'Alice Williams'],\n",
" 'age': [25, 32, 25, 45, 28, 199, 25, 31, 45, 27, -5, 28],\n",
" 'country': ['USA', 'UK', 'U.S.A', 'Canada', 'USA', 'United Kingdom',\n",
" 'United States', 'Mexico', 'canada', 'USA', 'UK', 'usa'],\n",
" 'purchase_amount': [100.50, 250.00, 105.00, 320.00, 180.00, 90.00,\n",
" 102.00, 275.00, 325.00, 195.00, 410.00, 185.00]\n",
"})\n",
"\n",
"print(\"Sample 'Dirty' Dataset:\")\n",
"print(dirty_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Detecting Inconsistent Categorical Values\n",
"\n",
"Notice the `country` column has multiple representations for the same countries. Let's identify these inconsistencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check unique values in the country column\n",
"print(\"Unique country values:\")\n",
"print(dirty_data['country'].unique())\n",
"print(f\"\\nTotal unique values: {dirty_data['country'].nunique()}\")\n",
"\n",
"# Count occurrences of each variation\n",
"print(\"\\nValue counts:\")\n",
"print(dirty_data['country'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Standardizing Categorical Values\n",
"\n",
"We can create a mapping to standardize these values. A simple approach is to convert to lowercase and create a mapping dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a standardization mapping\n",
"country_mapping = {\n",
" 'usa': 'USA',\n",
" 'u.s.a': 'USA',\n",
" 'united states': 'USA',\n",
" 'uk': 'UK',\n",
" 'united kingdom': 'UK',\n",
" 'canada': 'Canada',\n",
" 'mexico': 'Mexico'\n",
"}\n",
"\n",
"# Standardize the country column\n",
"dirty_data['country_clean'] = dirty_data['country'].str.lower().map(country_mapping)\n",
"\n",
"print(\"Before standardization:\")\n",
"print(dirty_data['country'].value_counts())\n",
"print(\"\\nAfter standardization:\")\n",
"print(dirty_data[['country_clean']].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Alternative: Using Fuzzy Matching**\n",
"\n",
"For more complex cases, we can use fuzzy string matching with the `rapidfuzz` library to automatically detect similar strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from rapidfuzz import process, fuzz\n",
"except ImportError:\n",
" print(\"rapidfuzz is not installed. Please install it with 'pip install rapidfuzz' to use fuzzy matching.\")\n",
" process = None\n",
" fuzz = None\n",
"\n",
"# Get unique countries\n",
"unique_countries = dirty_data['country'].unique()\n",
"\n",
"# For each country, find similar matches\n",
"if process is not None and fuzz is not None:\n",
" print(\"Finding similar country names (similarity > 70%):\")\n",
" for country in unique_countries:\n",
" matches = process.extract(country, unique_countries, scorer=fuzz.ratio, limit=3)\n",
" # Filter matches with similarity > 70 and not identical\n",
" similar = [m for m in matches if m[1] > 70 and m[0] != country]\n",
" if similar:\n",
" print(f\"\\n'{country}' is similar to:\")\n",
" for match, score, _ in similar:\n",
" print(f\" - '{match}' (similarity: {score}%)\")\n",
"else:\n",
" print(\"Skipping fuzzy matching because rapidfuzz is not available.\")"
]
},
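{
"cell_type": "markdown",
"metadata": {},
"source": [
"For spelling variants, the matches found above can be turned into a cleanup mapping automatically. The sketch below (assuming `rapidfuzz` is available, as in the previous cell) matches each raw value against a hand-picked list of canonical names and keeps the raw value whenever the best score falls below a threshold; the canonical list and the 70% cutoff are illustrative choices, not fixed rules. Note the limitation: fuzzy matching catches spelling variants like \"U.S.A\", but not synonyms like \"United States\", which still require an explicit mapping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    from rapidfuzz import process, fuzz\n",
"\n",
"    # Canonical names to map variants onto (an illustrative choice)\n",
"    canonical_countries = ['USA', 'UK', 'Canada', 'Mexico']\n",
"\n",
"    def build_fuzzy_mapping(values, canonical, threshold=70):\n",
"        \"\"\"Map each raw value to its best-matching canonical name.\n",
"\n",
"        Values whose best score falls below the threshold are kept\n",
"        as-is and will need a manual mapping.\n",
"        \"\"\"\n",
"        mapping = {}\n",
"        for value in values:\n",
"            match, score, _ = process.extractOne(\n",
"                value, canonical, scorer=fuzz.ratio, processor=str.lower\n",
"            )\n",
"            mapping[value] = match if score >= threshold else value\n",
"        return mapping\n",
"\n",
"    fuzzy_mapping = build_fuzzy_mapping(dirty_data['country'].unique(), canonical_countries)\n",
"    print(\"Auto-generated mapping (unchanged values need manual review):\")\n",
"    for raw, clean in fuzzy_mapping.items():\n",
"        print(f\"  '{raw}' -> '{clean}'\")\n",
"except ImportError:\n",
"    print(\"rapidfuzz is not installed. Skipping automatic mapping.\")"
]
},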
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Detecting Abnormal Numeric Values (Outliers)\n",
"\n",
"Looking at the `age` column, we have some suspicious values like 199 and -5. Let's use statistical methods to detect these outliers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display basic statistics\n",
"print(\"Age column statistics:\")\n",
"print(dirty_data['age'].describe())\n",
"\n",
"# Identify impossible values using domain knowledge\n",
"print(\"\\nRows with impossible age values (< 0 or > 120):\")\n",
"impossible_ages = dirty_data[(dirty_data['age'] < 0) | (dirty_data['age'] > 120)]\n",
"print(impossible_ages[['customer_id', 'name', 'age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using IQR (Interquartile Range) Method\n",
"\n",
"The IQR method is a robust statistical technique for outlier detection that is less sensitive to extreme values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate IQR for age (excluding impossible values)\n",
"valid_ages = dirty_data[(dirty_data['age'] >= 0) & (dirty_data['age'] <= 120)]['age']\n",
"\n",
"Q1 = valid_ages.quantile(0.25)\n",
"Q3 = valid_ages.quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Define outlier bounds\n",
"lower_bound = Q1 - 1.5 * IQR\n",
"upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"print(f\"IQR-based outlier bounds for age: [{lower_bound:.2f}, {upper_bound:.2f}]\")\n",
"\n",
"# Identify outliers\n",
"age_outliers = dirty_data[(dirty_data['age'] < lower_bound) | (dirty_data['age'] > upper_bound)]\n",
"print(f\"\\nRows with age outliers:\")\n",
"print(age_outliers[['customer_id', 'name', 'age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using Z-Score Method\n",
"\n",
"The Z-score method identifies outliers based on standard deviations from the mean:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from scipy import stats\n",
"except ImportError:\n",
" print(\"scipy is required for Z-score calculation. Please install it with 'pip install scipy' and rerun this cell.\")\n",
"else:\n",
" # Calculate Z-scores for age, handling NaN values\n",
" age_nonan = dirty_data['age'].dropna()\n",
" zscores = np.abs(stats.zscore(age_nonan))\n",
" dirty_data['age_zscore'] = np.nan\n",
" dirty_data.loc[age_nonan.index, 'age_zscore'] = zscores\n",
"\n",
" # Typically, Z-score > 3 indicates an outlier\n",
" print(\"Rows with age Z-score > 3:\")\n",
" zscore_outliers = dirty_data[dirty_data['age_zscore'] > 3]\n",
" print(zscore_outliers[['customer_id', 'name', 'age', 'age_zscore']])\n",
"\n",
" # Clean up the temporary column\n",
" dirty_data = dirty_data.drop('age_zscore', axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Handling Outliers\n",
"\n",
"Once detected, outliers can be handled in several ways:\n",
"1. **Remove**: Drop rows with outliers (if they're errors)\n",
"2. **Cap**: Replace with boundary values\n",
"3. **Replace with NaN**: Treat as missing data and use imputation techniques\n",
"4. **Keep**: If they're legitimate extreme values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a cleaned version by replacing impossible ages with NaN\n",
"dirty_data['age_clean'] = dirty_data['age'].apply(\n",
" lambda x: np.nan if (x < 0 or x > 120) else x\n",
")\n",
"\n",
"print(\"Age column before and after cleaning:\")\n",
"print(dirty_data[['customer_id', 'name', 'age', 'age_clean']])"
]
},
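{
"cell_type": "markdown",
"metadata": {},
"source": [
"And as a minimal sketch of the capping strategy (option 2 above), we can clip ages to the IQR bounds computed earlier. This assumes the IQR cell above has been run, so `lower_bound` and `upper_bound` are defined:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cap (winsorize) ages at the IQR bounds instead of removing or nulling them\n",
"dirty_data['age_capped'] = dirty_data['age'].clip(lower=lower_bound, upper=upper_bound)\n",
"\n",
"print(\"Ages with outliers capped at the IQR bounds:\")\n",
"print(dirty_data[['customer_id', 'name', 'age', 'age_capped']])\n",
"\n",
"# Drop the illustration column so later cells see the original schema\n",
"dirty_data = dirty_data.drop('age_capped', axis=1)"
]
},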
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Detecting Near-Duplicate Rows\n",
"\n",
"Notice that our dataset has multiple entries for \"John Smith\" with slightly different values. Let's identify potential duplicates based on name similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First, let's look at exact name matches (ignoring extra whitespace)\n",
"dirty_data['name_normalized'] = dirty_data['name'].str.strip().str.lower()\n",
"\n",
"print(\"Checking for duplicate names:\")\n",
"duplicate_names = dirty_data[dirty_data.duplicated(['name_normalized'], keep=False)]\n",
"print(duplicate_names.sort_values('name_normalized')[['customer_id', 'name', 'age', 'country']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Finding Near-Duplicates with Fuzzy Matching\n",
"\n",
"For more sophisticated duplicate detection, we can use fuzzy matching to find similar names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from rapidfuzz import process, fuzz\n",
"\n",
" # Function to find potential duplicates\n",
" def find_near_duplicates(df, column, threshold=90):\n",
" \"\"\"\n",
" Find near-duplicate entries in a column using fuzzy matching.\n",
" \n",
" Parameters:\n",
" - df: DataFrame\n",
" - column: Column name to check for duplicates\n",
" - threshold: Similarity threshold (0-100)\n",
" \n",
" Returns: List of potential duplicate groups\n",
" \"\"\"\n",
" values = df[column].unique()\n",
" duplicate_groups = []\n",
" checked = set()\n",
" \n",
" for value in values:\n",
" if value in checked:\n",
" continue\n",
" \n",
" # Find similar values\n",
" matches = process.extract(value, values, scorer=fuzz.ratio, limit=len(values))\n",
" similar = [m[0] for m in matches if m[1] >= threshold]\n",
" \n",
" if len(similar) > 1:\n",
" duplicate_groups.append(similar)\n",
" checked.update(similar)\n",
" \n",
" return duplicate_groups\n",
"\n",
" # Find near-duplicate names\n",
" duplicate_groups = find_near_duplicates(dirty_data, 'name', threshold=90)\n",
"\n",
" print(\"Potential duplicate groups:\")\n",
" for i, group in enumerate(duplicate_groups, 1):\n",
" print(f\"\\nGroup {i}:\")\n",
" for name in group:\n",
" matching_rows = dirty_data[dirty_data['name'] == name]\n",
" print(f\" '{name}': {len(matching_rows)} occurrence(s)\")\n",
" for _, row in matching_rows.iterrows():\n",
" print(f\" - Customer {row['customer_id']}: age={row['age']}, country={row['country']}\")\n",
"except ImportError:\n",
" print(\"rapidfuzz is not installed. Skipping fuzzy matching for near-duplicates.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Handling Duplicates\n",
"\n",
"Once identified, you need to decide how to handle duplicates:\n",
"1. **Keep the first occurrence**: Use `drop_duplicates(keep='first')`\n",
"2. **Keep the last occurrence**: Use `drop_duplicates(keep='last')`\n",
"3. **Aggregate information**: Combine information from duplicate rows\n",
"4. **Manual review**: Flag for human review"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: Remove duplicates based on normalized name, keeping first occurrence\n",
"cleaned_data = dirty_data.drop_duplicates(subset=['name_normalized'], keep='first')\n",
"\n",
"print(f\"Original dataset: {len(dirty_data)} rows\")\n",
"print(f\"After removing name duplicates: {len(cleaned_data)} rows\")\n",
"print(f\"Removed: {len(dirty_data) - len(cleaned_data)} duplicate rows\")\n",
"\n",
"print(\"\\nCleaned dataset:\")\n",
"print(cleaned_data[['customer_id', 'name', 'age', 'country_clean']])"
]
},
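{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aggregation strategy (option 3 above) keeps one row per entity but combines values from the duplicates instead of discarding them. Below is a minimal sketch using `groupby` with named aggregations; it assumes the earlier cells created the `name_normalized` and `age_clean` columns, and which aggregation makes sense for each column (first, mean, sum, ...) depends on your data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Aggregate duplicates instead of dropping them:\n",
"# one row per normalized name, combining values across the duplicates\n",
"aggregated = dirty_data.groupby('name_normalized', as_index=False).agg(\n",
"    customer_id=('customer_id', 'first'),       # keep the first ID seen\n",
"    name=('name', 'first'),\n",
"    age=('age_clean', 'mean'),                  # average the cleaned ages\n",
"    purchase_amount=('purchase_amount', 'sum')  # total spend across rows\n",
")\n",
"\n",
"print(f\"Aggregated dataset: {len(aggregated)} rows (from {len(dirty_data)})\")\n",
"print(aggregated)"
]
},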
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary: Complete Data Cleaning Pipeline\n",
"\n",
"Let's put it all together into a comprehensive cleaning pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_dataset(df):\n",
" \"\"\"\n",
" Comprehensive data cleaning function.\n",
" \"\"\"\n",
" # Create a copy to avoid modifying the original\n",
" cleaned = df.copy()\n",
" \n",
" # 1. Standardize categorical values (country)\n",
" country_mapping = {\n",
" 'usa': 'USA', 'u.s.a': 'USA', 'united states': 'USA',\n",
" 'uk': 'UK', 'united kingdom': 'UK',\n",
" 'canada': 'Canada', 'mexico': 'Mexico'\n",
" }\n",
" cleaned['country'] = cleaned['country'].str.lower().map(country_mapping)\n",
" \n",
" # 2. Clean abnormal age values\n",
" cleaned['age'] = cleaned['age'].apply(\n",
" lambda x: np.nan if (x < 0 or x > 120) else x\n",
" )\n",
" \n",
" # 3. Remove near-duplicate names (normalize whitespace)\n",
" cleaned['name'] = cleaned['name'].str.strip()\n",
" cleaned = cleaned.drop_duplicates(subset=['name'], keep='first')\n",
" \n",
" return cleaned\n",
"\n",
"# Apply the cleaning pipeline\n",
"final_cleaned_data = clean_dataset(dirty_data)\n",
"\n",
"print(\"Before cleaning:\")\n",
"print(f\" Rows: {len(dirty_data)}\")\n",
"print(f\" Unique countries: {dirty_data['country'].nunique()}\")\n",
"print(f\" Invalid ages: {((dirty_data['age'] < 0) | (dirty_data['age'] > 120)).sum()}\")\n",
"\n",
"print(\"\\nAfter cleaning:\")\n",
"print(f\" Rows: {len(final_cleaned_data)}\")\n",
"print(f\" Unique countries: {final_cleaned_data['country'].nunique()}\")\n",
"print(f\" Invalid ages: {((final_cleaned_data['age'] < 0) | (final_cleaned_data['age'] > 120)).sum()}\")\n",
"\n",
"print(\"\\nCleaned dataset:\")\n",
"print(final_cleaned_data[['customer_id', 'name', 'age', 'country', 'purchase_amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🎯 Challenge Exercise\n",
"\n",
"Now it's your turn! Below is a new row of data with multiple quality issues. Can you:\n",
"\n",
"1. Identify all the issues in this row\n",
"2. Write code to clean each issue\n",
"3. Add the cleaned row to the dataset\n",
"\n",
"Here's the problematic data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# New problematic row\n",
"new_row = pd.DataFrame({\n",
" 'customer_id': [13],\n",
" 'name': [' Diana Prince '], # Extra whitespace\n",
" 'age': [250], # Impossible age\n",
" 'country': ['U.S.A.'], # Inconsistent format\n",
" 'purchase_amount': [150.00]\n",
"})\n",
"\n",
"print(\"New row to clean:\")\n",
"print(new_row)\n",
"\n",
"# TODO: Your code here to clean this row\n",
"# Hints:\n",
"# 1. Strip whitespace from the name\n",
"# 2. Check if the name is a duplicate (Diana Prince already exists)\n",
"# 3. Handle the impossible age value\n",
"# 4. Standardize the country name\n",
"\n",
"# Example solution (uncomment and modify as needed):\n",
"# new_row_cleaned = new_row.copy()\n",
"# new_row_cleaned['name'] = new_row_cleaned['name'].str.strip()\n",
"# new_row_cleaned['age'] = np.nan # Invalid age\n",
"# new_row_cleaned['country'] = 'USA' # Standardized\n",
"# print(\"\\nCleaned row:\")\n",
"# print(new_row_cleaned)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Key Takeaways\n",
"\n",
"1. **Inconsistent categories** are common in real-world data. Always check unique values and standardize them using mappings or fuzzy matching.\n",
"\n",
"2. **Outliers** can significantly affect your analysis. Use domain knowledge combined with statistical methods (IQR, Z-score) to detect them.\n",
"\n",
"3. **Near-duplicates** are harder to detect than exact duplicates. Consider using fuzzy matching and normalizing data (lowercasing, stripping whitespace) to identify them.\n",
"\n",
"4. **Data cleaning is iterative**. You may need to apply multiple techniques and review the results before finalizing your cleaned dataset.\n",
"\n",
"5. **Document your decisions**. Keep track of what cleaning steps you applied and why, as this is important for reproducibility and transparency.\n",
"\n",
"> **Best Practice:** Always keep a copy of your original \"dirty\" data. Never overwrite your source data files - create cleaned versions with clear naming conventions like `data_cleaned.csv`."
]
}
],
"metadata": {
@@ -3715,4 +4231,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}