Merge pull request #684 from microsoft/copilot/fix-e446e3a1-6b4c-4310-87d5-641ed6823a37

Add real-world data quality checks to data cleaning lesson
pull/688/head
Lee Stott 2 months ago committed by GitHub
commit 57c2be2a87

@@ -3687,6 +3687,522 @@
"source": [
"> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Real-World Data Quality Checks\n",
"\n",
"> **Learning goal:** By the end of this section, you should be comfortable detecting and correcting common real-world data quality issues including inconsistent categorical values, abnormal numeric values (outliers), and duplicate entities with variations.\n",
"\n",
"While missing values and exact duplicates are common issues, real-world datasets often contain more subtle problems:\n",
"\n",
"1. **Inconsistent categorical values**: The same category spelled differently (e.g., \"USA\", \"U.S.A\", \"United States\")\n",
"2. **Abnormal numeric values**: Extreme outliers that indicate data entry errors (e.g., age = 999)\n",
"3. **Near-duplicate rows**: Records that represent the same entity with slight variations\n",
"\n",
"Let's explore techniques to detect and handle these issues."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a Sample \"Dirty\" Dataset\n",
"\n",
"First, let's create a sample dataset that contains the types of issues we commonly encounter in real-world data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create a sample dataset with quality issues\n",
"dirty_data = pd.DataFrame({\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],\n",
" 'name': ['John Smith', 'Jane Doe', 'John Smith', 'Bob Johnson', \n",
" 'Alice Williams', 'Charlie Brown', 'John Smith', 'Eva Martinez',\n",
" 'Bob Johnson', 'Diana Prince', 'Frank Castle', 'Alice Williams'],\n",
" 'age': [25, 32, 25, 45, 28, 199, 25, 31, 45, 27, -5, 28],\n",
" 'country': ['USA', 'UK', 'U.S.A', 'Canada', 'USA', 'United Kingdom',\n",
" 'United States', 'Mexico', 'canada', 'USA', 'UK', 'usa'],\n",
" 'purchase_amount': [100.50, 250.00, 105.00, 320.00, 180.00, 90.00,\n",
" 102.00, 275.00, 325.00, 195.00, 410.00, 185.00]\n",
"})\n",
"\n",
"print(\"Sample 'Dirty' Dataset:\")\n",
"print(dirty_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Detecting Inconsistent Categorical Values\n",
"\n",
"Notice the `country` column has multiple representations for the same countries. Let's identify these inconsistencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check unique values in the country column\n",
"print(\"Unique country values:\")\n",
"print(dirty_data['country'].unique())\n",
"print(f\"\\nTotal unique values: {dirty_data['country'].nunique()}\")\n",
"\n",
"# Count occurrences of each variation\n",
"print(\"\\nValue counts:\")\n",
"print(dirty_data['country'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Standardizing Categorical Values\n",
"\n",
"We can create a mapping to standardize these values. A simple approach is to convert to lowercase and create a mapping dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a standardization mapping\n",
"country_mapping = {\n",
" 'usa': 'USA',\n",
" 'u.s.a': 'USA',\n",
" 'united states': 'USA',\n",
" 'uk': 'UK',\n",
" 'united kingdom': 'UK',\n",
" 'canada': 'Canada',\n",
" 'mexico': 'Mexico'\n",
"}\n",
"\n",
"# Standardize the country column\n",
"dirty_data['country_clean'] = dirty_data['country'].str.lower().map(country_mapping)\n",
"\n",
"print(\"Before standardization:\")\n",
"print(dirty_data['country'].value_counts())\n",
"print(\"\\nAfter standardization:\")\n",
"print(dirty_data[['country_clean']].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Alternative: Using Fuzzy Matching**\n",
"\n",
"For more complex cases, we can use fuzzy string matching with the `rapidfuzz` library to automatically detect similar strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from rapidfuzz import process, fuzz\n",
"except ImportError:\n",
" print(\"rapidfuzz is not installed. Please install it with 'pip install rapidfuzz' to use fuzzy matching.\")\n",
" process = None\n",
" fuzz = None\n",
"\n",
"# Get unique countries\n",
"unique_countries = dirty_data['country'].unique()\n",
"\n",
"# For each country, find similar matches\n",
"if process is not None and fuzz is not None:\n",
" print(\"Finding similar country names (similarity > 70%):\")\n",
" for country in unique_countries:\n",
" matches = process.extract(country, unique_countries, scorer=fuzz.ratio, limit=3)\n",
" # Filter matches with similarity > 70 and not identical\n",
" similar = [m for m in matches if m[1] > 70 and m[0] != country]\n",
" if similar:\n",
" print(f\"\\n'{country}' is similar to:\")\n",
" for match, score, _ in similar:\n",
" print(f\" - '{match}' (similarity: {score}%)\")\n",
"else:\n",
" print(\"Skipping fuzzy matching because rapidfuzz is not available.\")"
]
},
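{
"cell_type": "markdown",
"metadata": {},
"source": [
"For spelling variants, the matches found above can be turned into a cleanup mapping automatically. The sketch below (assuming `rapidfuzz` is available, as in the previous cell) matches each raw value against a hand-picked list of canonical names and keeps the raw value whenever the best score falls below a threshold; the canonical list and the 70% cutoff are illustrative choices, not fixed rules. Note the limitation: fuzzy matching catches spelling variants like \"U.S.A\", but not synonyms like \"United States\", which still require an explicit mapping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    from rapidfuzz import process, fuzz\n",
"\n",
"    # Canonical names to map variants onto (an illustrative choice)\n",
"    canonical_countries = ['USA', 'UK', 'Canada', 'Mexico']\n",
"\n",
"    def build_fuzzy_mapping(values, canonical, threshold=70):\n",
"        \"\"\"Map each raw value to its best-matching canonical name.\n",
"\n",
"        Values whose best score falls below the threshold are kept\n",
"        as-is and will need a manual mapping.\n",
"        \"\"\"\n",
"        mapping = {}\n",
"        for value in values:\n",
"            match, score, _ = process.extractOne(\n",
"                value, canonical, scorer=fuzz.ratio, processor=str.lower\n",
"            )\n",
"            mapping[value] = match if score >= threshold else value\n",
"        return mapping\n",
"\n",
"    fuzzy_mapping = build_fuzzy_mapping(dirty_data['country'].unique(), canonical_countries)\n",
"    print(\"Auto-generated mapping (unchanged values need manual review):\")\n",
"    for raw, clean in fuzzy_mapping.items():\n",
"        print(f\"  '{raw}' -> '{clean}'\")\n",
"except ImportError:\n",
"    print(\"rapidfuzz is not installed. Skipping automatic mapping.\")"
]
},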
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Detecting Abnormal Numeric Values (Outliers)\n",
"\n",
"Looking at the `age` column, we have some suspicious values like 199 and -5. Let's use statistical methods to detect these outliers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display basic statistics\n",
"print(\"Age column statistics:\")\n",
"print(dirty_data['age'].describe())\n",
"\n",
"# Identify impossible values using domain knowledge\n",
"print(\"\\nRows with impossible age values (< 0 or > 120):\")\n",
"impossible_ages = dirty_data[(dirty_data['age'] < 0) | (dirty_data['age'] > 120)]\n",
"print(impossible_ages[['customer_id', 'name', 'age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using IQR (Interquartile Range) Method\n",
"\n",
"The IQR method is a robust statistical technique for outlier detection that is less sensitive to extreme values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate IQR for age (excluding impossible values)\n",
"valid_ages = dirty_data[(dirty_data['age'] >= 0) & (dirty_data['age'] <= 120)]['age']\n",
"\n",
"Q1 = valid_ages.quantile(0.25)\n",
"Q3 = valid_ages.quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Define outlier bounds\n",
"lower_bound = Q1 - 1.5 * IQR\n",
"upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"print(f\"IQR-based outlier bounds for age: [{lower_bound:.2f}, {upper_bound:.2f}]\")\n",
"\n",
"# Identify outliers\n",
"age_outliers = dirty_data[(dirty_data['age'] < lower_bound) | (dirty_data['age'] > upper_bound)]\n",
"print(f\"\\nRows with age outliers:\")\n",
"print(age_outliers[['customer_id', 'name', 'age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using Z-Score Method\n",
"\n",
"The Z-score method identifies outliers based on standard deviations from the mean:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from scipy import stats\n",
"except ImportError:\n",
" print(\"scipy is required for Z-score calculation. Please install it with 'pip install scipy' and rerun this cell.\")\n",
"else:\n",
" # Calculate Z-scores for age, handling NaN values\n",
" age_nonan = dirty_data['age'].dropna()\n",
" zscores = np.abs(stats.zscore(age_nonan))\n",
" dirty_data['age_zscore'] = np.nan\n",
" dirty_data.loc[age_nonan.index, 'age_zscore'] = zscores\n",
"\n",
" # Typically, Z-score > 3 indicates an outlier\n",
" print(\"Rows with age Z-score > 3:\")\n",
" zscore_outliers = dirty_data[dirty_data['age_zscore'] > 3]\n",
" print(zscore_outliers[['customer_id', 'name', 'age', 'age_zscore']])\n",
"\n",
" # Clean up the temporary column\n",
" dirty_data = dirty_data.drop('age_zscore', axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Handling Outliers\n",
"\n",
"Once detected, outliers can be handled in several ways:\n",
"1. **Remove**: Drop rows with outliers (if they're errors)\n",
"2. **Cap**: Replace with boundary values\n",
"3. **Replace with NaN**: Treat as missing data and use imputation techniques\n",
"4. **Keep**: If they're legitimate extreme values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a cleaned version by replacing impossible ages with NaN\n",
"dirty_data['age_clean'] = dirty_data['age'].apply(\n",
" lambda x: np.nan if (x < 0 or x > 120) else x\n",
")\n",
"\n",
"print(\"Age column before and after cleaning:\")\n",
"print(dirty_data[['customer_id', 'name', 'age', 'age_clean']])"
]
},
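{
"cell_type": "markdown",
"metadata": {},
"source": [
"And as a minimal sketch of the capping strategy (option 2 above), we can clip ages to the IQR bounds computed earlier. This assumes the IQR cell above has been run, so `lower_bound` and `upper_bound` are defined:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cap (winsorize) ages at the IQR bounds instead of removing or nulling them\n",
"dirty_data['age_capped'] = dirty_data['age'].clip(lower=lower_bound, upper=upper_bound)\n",
"\n",
"print(\"Ages with outliers capped at the IQR bounds:\")\n",
"print(dirty_data[['customer_id', 'name', 'age', 'age_capped']])\n",
"\n",
"# Drop the illustration column so later cells see the original schema\n",
"dirty_data = dirty_data.drop('age_capped', axis=1)"
]
},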
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Detecting Near-Duplicate Rows\n",
"\n",
"Notice that our dataset has multiple entries for \"John Smith\" with slightly different values. Let's identify potential duplicates based on name similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First, let's look at exact name matches (ignoring extra whitespace)\n",
"dirty_data['name_normalized'] = dirty_data['name'].str.strip().str.lower()\n",
"\n",
"print(\"Checking for duplicate names:\")\n",
"duplicate_names = dirty_data[dirty_data.duplicated(['name_normalized'], keep=False)]\n",
"print(duplicate_names.sort_values('name_normalized')[['customer_id', 'name', 'age', 'country']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Finding Near-Duplicates with Fuzzy Matching\n",
"\n",
"For more sophisticated duplicate detection, we can use fuzzy matching to find similar names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" from rapidfuzz import process, fuzz\n",
"\n",
" # Function to find potential duplicates\n",
" def find_near_duplicates(df, column, threshold=90):\n",
" \"\"\"\n",
" Find near-duplicate entries in a column using fuzzy matching.\n",
" \n",
" Parameters:\n",
" - df: DataFrame\n",
" - column: Column name to check for duplicates\n",
" - threshold: Similarity threshold (0-100)\n",
" \n",
" Returns: List of potential duplicate groups\n",
" \"\"\"\n",
" values = df[column].unique()\n",
" duplicate_groups = []\n",
" checked = set()\n",
" \n",
" for value in values:\n",
" if value in checked:\n",
" continue\n",
" \n",
" # Find similar values\n",
" matches = process.extract(value, values, scorer=fuzz.ratio, limit=len(values))\n",
" similar = [m[0] for m in matches if m[1] >= threshold]\n",
" \n",
" if len(similar) > 1:\n",
" duplicate_groups.append(similar)\n",
" checked.update(similar)\n",
" \n",
" return duplicate_groups\n",
"\n",
" # Find near-duplicate names\n",
" duplicate_groups = find_near_duplicates(dirty_data, 'name', threshold=90)\n",
"\n",
" print(\"Potential duplicate groups:\")\n",
" for i, group in enumerate(duplicate_groups, 1):\n",
" print(f\"\\nGroup {i}:\")\n",
" for name in group:\n",
" matching_rows = dirty_data[dirty_data['name'] == name]\n",
" print(f\" '{name}': {len(matching_rows)} occurrence(s)\")\n",
" for _, row in matching_rows.iterrows():\n",
" print(f\" - Customer {row['customer_id']}: age={row['age']}, country={row['country']}\")\n",
"except ImportError:\n",
" print(\"rapidfuzz is not installed. Skipping fuzzy matching for near-duplicates.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Handling Duplicates\n",
"\n",
"Once identified, you need to decide how to handle duplicates:\n",
"1. **Keep the first occurrence**: Use `drop_duplicates(keep='first')`\n",
"2. **Keep the last occurrence**: Use `drop_duplicates(keep='last')`\n",
"3. **Aggregate information**: Combine information from duplicate rows\n",
"4. **Manual review**: Flag for human review"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: Remove duplicates based on normalized name, keeping first occurrence\n",
"cleaned_data = dirty_data.drop_duplicates(subset=['name_normalized'], keep='first')\n",
"\n",
"print(f\"Original dataset: {len(dirty_data)} rows\")\n",
"print(f\"After removing name duplicates: {len(cleaned_data)} rows\")\n",
"print(f\"Removed: {len(dirty_data) - len(cleaned_data)} duplicate rows\")\n",
"\n",
"print(\"\\nCleaned dataset:\")\n",
"print(cleaned_data[['customer_id', 'name', 'age', 'country_clean']])"
]
},
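{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aggregation strategy (option 3 above) keeps one row per entity but combines values from the duplicates instead of discarding them. Below is a minimal sketch using `groupby` with named aggregations; it assumes the earlier cells created the `name_normalized` and `age_clean` columns, and which aggregation makes sense for each column (first, mean, sum, ...) depends on your data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Aggregate duplicates instead of dropping them:\n",
"# one row per normalized name, combining values across the duplicates\n",
"aggregated = dirty_data.groupby('name_normalized', as_index=False).agg(\n",
"    customer_id=('customer_id', 'first'),       # keep the first ID seen\n",
"    name=('name', 'first'),\n",
"    age=('age_clean', 'mean'),                  # average the cleaned ages\n",
"    purchase_amount=('purchase_amount', 'sum')  # total spend across rows\n",
")\n",
"\n",
"print(f\"Aggregated dataset: {len(aggregated)} rows (from {len(dirty_data)})\")\n",
"print(aggregated)"
]
},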
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary: Complete Data Cleaning Pipeline\n",
"\n",
"Let's put it all together into a comprehensive cleaning pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_dataset(df):\n",
" \"\"\"\n",
" Comprehensive data cleaning function.\n",
" \"\"\"\n",
" # Create a copy to avoid modifying the original\n",
" cleaned = df.copy()\n",
" \n",
" # 1. Standardize categorical values (country)\n",
" country_mapping = {\n",
" 'usa': 'USA', 'u.s.a': 'USA', 'united states': 'USA',\n",
" 'uk': 'UK', 'united kingdom': 'UK',\n",
" 'canada': 'Canada', 'mexico': 'Mexico'\n",
" }\n",
" cleaned['country'] = cleaned['country'].str.lower().map(country_mapping)\n",
" \n",
" # 2. Clean abnormal age values\n",
" cleaned['age'] = cleaned['age'].apply(\n",
" lambda x: np.nan if (x < 0 or x > 120) else x\n",
" )\n",
" \n",
" # 3. Remove near-duplicate names (normalize whitespace)\n",
" cleaned['name'] = cleaned['name'].str.strip()\n",
" cleaned = cleaned.drop_duplicates(subset=['name'], keep='first')\n",
" \n",
" return cleaned\n",
"\n",
"# Apply the cleaning pipeline\n",
"final_cleaned_data = clean_dataset(dirty_data)\n",
"\n",
"print(\"Before cleaning:\")\n",
"print(f\" Rows: {len(dirty_data)}\")\n",
"print(f\" Unique countries: {dirty_data['country'].nunique()}\")\n",
"print(f\" Invalid ages: {((dirty_data['age'] < 0) | (dirty_data['age'] > 120)).sum()}\")\n",
"\n",
"print(\"\\nAfter cleaning:\")\n",
"print(f\" Rows: {len(final_cleaned_data)}\")\n",
"print(f\" Unique countries: {final_cleaned_data['country'].nunique()}\")\n",
"print(f\" Invalid ages: {((final_cleaned_data['age'] < 0) | (final_cleaned_data['age'] > 120)).sum()}\")\n",
"\n",
"print(\"\\nCleaned dataset:\")\n",
"print(final_cleaned_data[['customer_id', 'name', 'age', 'country', 'purchase_amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🎯 Challenge Exercise\n",
"\n",
"Now it's your turn! Below is a new row of data with multiple quality issues. Can you:\n",
"\n",
"1. Identify all the issues in this row\n",
"2. Write code to clean each issue\n",
"3. Add the cleaned row to the dataset\n",
"\n",
"Here's the problematic data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# New problematic row\n",
"new_row = pd.DataFrame({\n",
" 'customer_id': [13],\n",
" 'name': [' Diana Prince '], # Extra whitespace\n",
" 'age': [250], # Impossible age\n",
" 'country': ['U.S.A.'], # Inconsistent format\n",
" 'purchase_amount': [150.00]\n",
"})\n",
"\n",
"print(\"New row to clean:\")\n",
"print(new_row)\n",
"\n",
"# TODO: Your code here to clean this row\n",
"# Hints:\n",
"# 1. Strip whitespace from the name\n",
"# 2. Check if the name is a duplicate (Diana Prince already exists)\n",
"# 3. Handle the impossible age value\n",
"# 4. Standardize the country name\n",
"\n",
"# Example solution (uncomment and modify as needed):\n",
"# new_row_cleaned = new_row.copy()\n",
"# new_row_cleaned['name'] = new_row_cleaned['name'].str.strip()\n",
"# new_row_cleaned['age'] = np.nan # Invalid age\n",
"# new_row_cleaned['country'] = 'USA' # Standardized\n",
"# print(\"\\nCleaned row:\")\n",
"# print(new_row_cleaned)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Key Takeaways\n",
"\n",
"1. **Inconsistent categories** are common in real-world data. Always check unique values and standardize them using mappings or fuzzy matching.\n",
"\n",
"2. **Outliers** can significantly affect your analysis. Use domain knowledge combined with statistical methods (IQR, Z-score) to detect them.\n",
"\n",
"3. **Near-duplicates** are harder to detect than exact duplicates. Consider using fuzzy matching and normalizing data (lowercasing, stripping whitespace) to identify them.\n",
"\n",
"4. **Data cleaning is iterative**. You may need to apply multiple techniques and review the results before finalizing your cleaned dataset.\n",
"\n",
"5. **Document your decisions**. Keep track of what cleaning steps you applied and why, as this is important for reproducibility and transparency.\n",
"\n",
"> **Best Practice:** Always keep a copy of your original \"dirty\" data. Never overwrite your source data files - create cleaned versions with clear naming conventions like `data_cleaned.csv`."
]
}
],
"metadata": {
@@ -3715,4 +4231,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}