Back to Python Development

Data Cleaning Pipeline

data-sciencepythonpandasautomationpreprocessing
4.6 (182)267📄 MIT🕒 2026-03-04Source ↗

Install this skill

npx skills add aj-geddes/useful-ai-prompts

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

What this skill does

  • Imputation strategies for missing values using mean, median, or KNN
  • Detection and handling of outliers via IQR bounds or clipping
  • Automated deduplication based on single or multi-column subsets
  • Standardization of data types and text formatting rules
  • Integrated validation reports for monitoring data quality metrics

When to use it

  • Preparing raw CSV or database exports for machine learning models
  • Standardizing mismatched data formats across multiple source files
  • Cleaning large datasets where manual filtering is error-prone
  • Establishing a gatekeeper for automated data ingestion workflows

When not to use it

  • Processing real-time streaming data with low-latency requirements
  • Handling unstructured data like high-resolution images or raw audio
  • Projects requiring heavy-duty distributed processing like Spark

How to invoke it

Example prompts that trigger this skill:

  • Create a data cleaning pipeline that imputes missing values and removes duplicates.
  • Help me build a Python data pipeline to clean my customer transaction dataset.
  • Write a script to detect and cap outliers in my numerical data columns.
  • Add a validation step to my data preprocessing workflow to ensure no nulls exist.
  • Structure my messy CSV cleaning process into a modular Python class.

Example workflow

  1. Load the raw CSV into a pandas DataFrame.
  2. Apply specific imputation strategies for missing numerical and categorical features.
  3. Filter out duplicate rows and handle outliers using IQR calculations.
  4. Normalize and scale numerical features for consistent range.
  5. Run final validation checks to confirm data integrity before saving output.

Prerequisites

  • Python 3.x
  • pandas
  • numpy
  • scikit-learn

Pitfalls & limitations

  • !Automated imputation can introduce bias if the missing data is not missing at random.
  • !Aggressive outlier removal can lead to loss of valuable edge-case information.
  • !Hardcoded pipelines may fail if input schema changes unexpectedly.

FAQ

Should I use mean or median for imputation?
Use the median if your data contains significant outliers, as it is more robust to skewed distributions. Mean imputation is generally only suitable for normally distributed data.
Can this pipeline handle categorical data?
Yes, the pipeline supports categorical imputation using mode-filling and text cleaning techniques to ensure consistency.
Does this require extensive memory for large files?
Pandas loads data into memory, so for extremely large datasets, consider chunking your data or using a tool like Polars or Dask instead.

How it compares

Unlike writing ad-hoc scripts, this pipeline provides a modular class-based architecture that makes it easier to track, audit, and extend cleaning steps in a repeatable fashion.

Source & trust

267 stars📄 MIT🕒 Updated 2026-03-04🛡 no risky patterns found

From the source: “# Data Cleaning Pipeline ## Overview Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues. ## When to Use - Preparing raw datasets for analysis or modelin…”

View the full SKILL.md source

# Data Cleaning Pipeline

## Overview

Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues.

## When to Use

- Preparing raw datasets for analysis or modeling
- Handling missing values and data quality issues
- Removing duplicates and standardizing formats
- Detecting and treating outliers
- Building automated data preprocessing workflows
- Ensuring data integrity and consistency

## Core Components

- **Missing Value Handling**: Imputation and removal strategies
- **Outlier Detection & Treatment**: Identifying and handling anomalies
- **Data Type Standardization**: Ensuring correct data types
- **Duplicate Removal**: Identifying and removing duplicates
- **Normalization & Scaling**: Standardizing value ranges
- **Text Cleaning**: Handling text data

## Cleaning Strategies

- **Deletion**: Removing rows or columns
- **Imputation**: Filling with mean, median, or predictive models
- **Transformation**: Converting between formats
- **Validation**: Ensuring data integrity rules

## Implementation with Python

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer

# Load raw data
df = pd.read_csv('raw_data.csv')

# Step 1: Identify and handle missing values
print("Missing values:\n", df.isnull().sum())

# Strategy 1: Delete rows with critical missing values
df = df.dropna(subset=['customer_id', 'transaction_date'])

# Strategy 2: Impute numerical columns with median
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

# Strategy 3: Use KNN imputation for related features
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])

# Strategy 4: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Step 2: Handle duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Duplicate on specific columns
df = df.drop_duplicates(subset=['customer_id', 'transaction_date'])

# Step 3: Outlier detection and handling
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['amount'] >= lower_bound) & (df['amount'] <= upper_bound)]

# Alternative: Cap outliers
df['amount'] = df['amount'].clip(lower=lower_bound, upper=upper_bound)

# Step 4: Data type standardization
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df['customer_id'] = df['customer_id'].astype('int64')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Step 5: Text cleaning
df['name'] = df['name'].str.strip().str.lower()
df['name'] = df['name'].str.replace('[^a-z0-9\s]', '', regex=True)

# Step 6: Normalization and scaling
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

# MinMax scaling for bounded range [0, 1]
minmax_scaler = MinMaxScaler()
df[['score']] = minmax_scaler.fit_transform(df[['score']])

# Step 7: Create data quality report
def create_quality_report(df_original, df_cleaned):
    report = {
        'Original rows': len(df_original),
        'Cleaned rows': len(df_cleaned),
        'Rows removed': len(df_original) - len(df_cleaned),
        'Removal percentage': ((len(df_original) - len(df_cleaned)) / len(df_original) * 100),
        'Original missing': df_original.isnull().sum().sum(),
        'Cleaned missing': df_cleaned.isnull().sum().sum(),
    }
    return pd.DataFrame(report, index=[0])

quality = create_quality_report(df, df)
print(quality)

# Step 8: Validation checks
assert df['age'].isnull().sum() == 0, "Age has missing values"
assert df['transaction_date'].dtype == 'datetime64[ns]', "Date not datetime"
assert (df['amount'] >= 0).all(), "Negative amounts detected"

print("Data cleaning pipeline completed successfully!")
```

## Pipeline Architecture

```python
class DataCleaningPipeline:
    def __init__(self):
        self.cleaner_steps = []

    def add_step(self, func, description):
        self.cleaner_steps.append((func, description))
        return self

    def execute(self, df):
        for func, desc in self.cleaner_steps:
            print(f"Executing: {desc}")
            df = func(df)
        return df

# Usage
pipeline = DataCleaningPipeline()
pipeline.add_step(
    lambda df: df.dropna(subset=['customer_id']),
    "Remove rows with missing customer_id"
).add_step(
    lambda df: df.drop_duplicates(),
    "Remove duplicate rows"
).add_step(
    lambda df: df[(df['amount'] > 0) & (df['amount'] < 100000)],
    "Filter invalid amount ranges"
)

df_clean = pipeline.execute(df)
```

## Advanced Cleaning Techniques

```python
# Step 9: Feature-specific cleaning
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)  # Remove non-digits

# Step 10: Datetime handling
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')
df['days_since_creation'] = (pd.Timestamp.now() - df['created_date']).dt.days

# Step 11: Categorical standardization
df['status'] = df['status'].str.lower().str.strip()
df['status'] = df['status'].replace({
    'active': 'active',
    'inactive': 'inactive',
    'pending': 'pending',
})

# Step 12: Numeric constraint checking
df['age'] = df['age'].where((df['age'] >= 0) & (df['age'] <= 150), np.nan)
df['percentage'] = df['percentage'].where((df['percentage'] >= 0) & (df['percentage'] <= 100), np.nan)

# Step 13: Create data quality score
quality_score = {
    'Missing %': (df.isnull().sum() / len(df) * 100).mean(),
    'Duplicates %': (df.duplicated().sum() / len(df) * 100),
    'Complete Features': (df.notna().sum() / len(df)).mean() * 100,
}

# Step 14: Generate cleaning report
cleaning_report = f"""
DATA CLEANING REPORT
====================
Rows removed: {len(df) - len(df_clean)}
Columns: {len(df_clean.columns)}
Remaining rows: {len(df_clean)}
Completeness: {(df_clean.notna().sum().sum() / (len(df_clean) * len(df_clean.columns)) * 100):.1f}%
"""
print(cleaning_report)
```

## Key Decisions

- How to handle missing values (delete vs impute)?
- Which outliers are legitimate business cases?
- What are acceptable value ranges?
- Which duplicates are true duplicates?
- How to standardize categorical values?

## Validation Steps

- Check for data type consistency
- Verify value ranges are reasonable
- Confirm no unintended data loss
- Document all transformations applied
- Create audit trail of changes

## Deliverables

- Cleaned dataset with quality metrics
- Data cleaning log documenting all steps
- Validation report confirming data integrity
- Before/after comparison statistics
- Cleaning code and pipeline documentation

Quoted from aj-geddes/useful-ai-prompts for reference — see the original for the authoritative, latest version.

📄 Full skill instructions — original source: aj-geddes/useful-ai-prompts
The Data Cleaning Pipeline provides a modular framework for automating the sanitization of raw datasets. It focuses on systematic handling of common data quality hurdles, including missing values, outliers, duplicate records, and inconsistent formatting. By providing a structured approach to preprocessing, this skill allows data engineers and analysts to transform messy input into production-ready formats for modeling or statistical reporting. The implementation uses Python and common data science libraries to ensure reproducibility. Rather than relying on manual adjustments, it treats data preparation as a repeatable code-based process. This is particularly effective for teams managing large volumes of incoming data where manual intervention is unfeasible. The pipeline also includes validation functions to verify that data integrity standards are met before the final dataset reaches a downstream analysis environment or machine learning model.

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/data-cleaning-pipeline/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md
  • Cursor: ~/.cursor/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md

🚀 Install with CLI:
npx skills add aj-geddes/useful-ai-prompts

Read the Master Guide: Mastering Agent Skills

Recommended Rules

View more rules

Recommended Workflows

View more workflows

Recommended MCP Servers

View more MCP servers

Take It Further

Maximize your productivity with these powerful resources

📋

Define Your Standards

Set up coding standards to ensure this workflow produces consistent, high-quality results.

Browse Rules Library
📖

Master Workflows

Learn how to create custom workflows, use Turbo Mode, and build your automation library.

Complete Guide

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best used in the "Rules for AI" section. This specific unit helps the agent avoid python development issues, leading to cleaner, more efficient code.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under Python Development and is published by aj-geddes, maintained in aj-geddes/useful-ai-prompts.

← Browse All Agent Skills
Sponsored AI assistant. Recommendations may be paid.