Data Cleaning Pipeline

Name: Data Cleaning Pipeline
Rating: 4.6 (182 reviews)
Author: aj-geddes

data-sciencepythonpandasautomationpreprocessing

★ 4.6 (182)⭐ 267📄 MIT🕒 2026-03-04Source ↗

Install this skill

npx skills add aj-geddes/useful-ai-prompts

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

What this skill does

•Imputation strategies for missing values using mean, median, or KNN
•Detection and handling of outliers via IQR bounds or clipping
•Automated deduplication based on single or multi-column subsets
•Standardization of data types and text formatting rules
•Integrated validation reports for monitoring data quality metrics

When to use it

✓Preparing raw CSV or database exports for machine learning models
✓Standardizing mismatched data formats across multiple source files
✓Cleaning large datasets where manual filtering is error-prone
✓Establishing a gatekeeper for automated data ingestion workflows

When not to use it

✕Processing real-time streaming data with low-latency requirements
✕Handling unstructured data like high-resolution images or raw audio
✕Projects requiring heavy-duty distributed processing like Spark

How to invoke it

Example prompts that trigger this skill:

“Create a data cleaning pipeline that imputes missing values and removes duplicates.”
“Help me build a Python data pipeline to clean my customer transaction dataset.”
“Write a script to detect and cap outliers in my numerical data columns.”
“Add a validation step to my data preprocessing workflow to ensure no nulls exist.”
“Structure my messy CSV cleaning process into a modular Python class.”

Example workflow

Load the raw CSV into a pandas DataFrame.
Apply specific imputation strategies for missing numerical and categorical features.
Filter out duplicate rows and handle outliers using IQR calculations.
Normalize and scale numerical features for consistent range.
Run final validation checks to confirm data integrity before saving output.

Prerequisites

–Python 3.x
–pandas
–numpy
–scikit-learn

Pitfalls & limitations

!Automated imputation can introduce bias if the missing data is not missing at random.
!Aggressive outlier removal can lead to loss of valuable edge-case information.
!Hardcoded pipelines may fail if input schema changes unexpectedly.

FAQ

Should I use mean or median for imputation?

Use the median if your data contains significant outliers, as it is more robust to skewed distributions. Mean imputation is generally only suitable for normally distributed data.

Can this pipeline handle categorical data?

Yes, the pipeline supports categorical imputation using mode-filling and text cleaning techniques to ensure consistency.

Does this require extensive memory for large files?

Pandas loads data into memory, so for extremely large datasets, consider chunking your data or using a tool like Polars or Dask instead.

How it compares

Unlike writing ad-hoc scripts, this pipeline provides a modular class-based architecture that makes it easier to track, audit, and extend cleaning steps in a repeatable fashion.

Source & trust

⭐ 267 stars📄 MIT🕒 Updated 2026-03-04🛡 no risky patterns found

View original skill on GitHub →

From the source: “# Data Cleaning Pipeline ## Overview Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues. ## When to Use - Preparing raw datasets for analysis or modelin…”

View the full SKILL.md source


# Data Cleaning Pipeline

## Overview

Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues.

## When to Use

- Preparing raw datasets for analysis or modeling
- Handling missing values and data quality issues
- Removing duplicates and standardizing formats
- Detecting and treating outliers
- Building automated data preprocessing workflows
- Ensuring data integrity and consistency

## Core Components

- **Missing Value Handling**: Imputation and removal strategies
- **Outlier Detection & Treatment**: Identifying and handling anomalies
- **Data Type Standardization**: Ensuring correct data types
- **Duplicate Removal**: Identifying and removing duplicates
- **Normalization & Scaling**: Standardizing value ranges
- **Text Cleaning**: Handling text data

## Cleaning Strategies

- **Deletion**: Removing rows or columns
- **Imputation**: Filling with mean, median, or predictive models
- **Transformation**: Converting between formats
- **Validation**: Ensuring data integrity rules

## Implementation with Python

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer

# Load raw data
df = pd.read_csv('raw_data.csv')

# Step 1: Identify and handle missing values
print("Missing values:\n", df.isnull().sum())

# Strategy 1: Delete rows with critical missing values
df = df.dropna(subset=['customer_id', 'transaction_date'])

# Strategy 2: Impute numerical columns with median
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

# Strategy 3: Use KNN imputation for related features
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])

# Strategy 4: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Step 2: Handle duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Duplicate on specific columns
df = df.drop_duplicates(subset=['customer_id', 'transaction_date'])

# Step 3: Outlier detection and handling
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['amount'] >= lower_bound) & (df['amount'] <= upper_bound)]

# Alternative: Cap outliers
df['amount'] = df['amount'].clip(lower=lower_bound, upper=upper_bound)

# Step 4: Data type standardization
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df['customer_id'] = df['customer_id'].astype('int64')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Step 5: Text cleaning
df['name'] = df['name'].str.strip().str.lower()
df['name'] = df['name'].str.replace('[^a-z0-9\s]', '', regex=True)

# Step 6: Normalization and scaling
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

# MinMax scaling for bounded range [0, 1]
minmax_scaler = MinMaxScaler()
df[['score']] = minmax_scaler.fit_transform(df[['score']])

# Step 7: Create data quality report
def create_quality_report(df_original, df_cleaned):
    report = {
        'Original rows': len(df_original),
        'Cleaned rows': len(df_cleaned),
        'Rows removed': len(df_original) - len(df_cleaned),
        'Removal percentage': ((len(df_original) - len(df_cleaned)) / len(df_original) * 100),
        'Original missing': df_original.isnull().sum().sum(),
        'Cleaned missing': df_cleaned.isnull().sum().sum(),
    }
    return pd.DataFrame(report, index=[0])

quality = create_quality_report(df, df)
print(quality)

# Step 8: Validation checks
assert df['age'].isnull().sum() == 0, "Age has missing values"
assert df['transaction_date'].dtype == 'datetime64[ns]', "Date not datetime"
assert (df['amount'] >= 0).all(), "Negative amounts detected"

print("Data cleaning pipeline completed successfully!")
```

## Pipeline Architecture

```python
class DataCleaningPipeline:
    def __init__(self):
        self.cleaner_steps = []

    def add_step(self, func, description):
        self.cleaner_steps.append((func, description))
        return self

    def execute(self, df):
        for func, desc in self.cleaner_steps:
            print(f"Executing: {desc}")
            df = func(df)
        return df

# Usage
pipeline = DataCleaningPipeline()
pipeline.add_step(
    lambda df: df.dropna(subset=['customer_id']),
    "Remove rows with missing customer_id"
).add_step(
    lambda df: df.drop_duplicates(),
    "Remove duplicate rows"
).add_step(
    lambda df: df[(df['amount'] > 0) & (df['amount'] < 100000)],
    "Filter invalid amount ranges"
)

df_clean = pipeline.execute(df)
```

## Advanced Cleaning Techniques

```python
# Step 9: Feature-specific cleaning
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)  # Remove non-digits

# Step 10: Datetime handling
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')
df['days_since_creation'] = (pd.Timestamp.now() - df['created_date']).dt.days

# Step 11: Categorical standardization
df['status'] = df['status'].str.lower().str.strip()
df['status'] = df['status'].replace({
    'active': 'active',
    'inactive': 'inactive',
    'pending': 'pending',
})

# Step 12: Numeric constraint checking
df['age'] = df['age'].where((df['age'] >= 0) & (df['age'] <= 150), np.nan)
df['percentage'] = df['percentage'].where((df['percentage'] >= 0) & (df['percentage'] <= 100), np.nan)

# Step 13: Create data quality score
quality_score = {
    'Missing %': (df.isnull().sum() / len(df) * 100).mean(),
    'Duplicates %': (df.duplicated().sum() / len(df) * 100),
    'Complete Features': (df.notna().sum() / len(df)).mean() * 100,
}

# Step 14: Generate cleaning report
cleaning_report = f"""
DATA CLEANING REPORT
====================
Rows removed: {len(df) - len(df_clean)}
Columns: {len(df_clean.columns)}
Remaining rows: {len(df_clean)}
Completeness: {(df_clean.notna().sum().sum() / (len(df_clean) * len(df_clean.columns)) * 100):.1f}%
"""
print(cleaning_report)
```

## Key Decisions

- How to handle missing values (delete vs impute)?
- Which outliers are legitimate business cases?
- What are acceptable value ranges?
- Which duplicates are true duplicates?
- How to standardize categorical values?

## Validation Steps

- Check for data type consistency
- Verify value ranges are reasonable
- Confirm no unintended data loss
- Document all transformations applied
- Create audit trail of changes

## Deliverables

- Cleaned dataset with quality metrics
- Data cleaning log documenting all steps
- Validation report confirming data integrity
- Before/after comparison statistics
- Cleaning code and pipeline documentation

Quoted from aj-geddes/useful-ai-prompts for reference — see the original for the authoritative, latest version.

📄 Full skill instructions — original source: aj-geddes/useful-ai-prompts

The Data Cleaning Pipeline provides a modular framework for automating the sanitization of raw datasets. It focuses on systematic handling of common data quality hurdles, including missing values, outliers, duplicate records, and inconsistent formatting. By providing a structured approach to preprocessing, this skill allows data engineers and analysts to transform messy input into production-ready formats for modeling or statistical reporting. The implementation uses Python and common data science libraries to ensure reproducibility. Rather than relying on manual adjustments, it treats data preparation as a repeatable code-based process. This is particularly effective for teams managing large volumes of incoming data where manual intervention is unfeasible. The pipeline also includes validation functions to verify that data integrity standards are met before the final dataset reaches a downstream analysis environment or machine learning model.

By aj-geddes

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Click "Download" above
In your project, create the directory: .agent/skills/data-cleaning-pipeline/
Save the file as SKILL.md
The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md
Cursor: ~/.cursor/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md

🚀 Install with CLI:
npx skills add aj-geddes/useful-ai-prompts

Read the Master Guide: Mastering Agent Skills →

Recommended Rules

View more rules →

Recommended Workflows

View more workflows →

Automatic commit message generator

GitAIAutomation

--- description: Automatic commit message generator and fast AI-powered commit for all current changes --- // turbo-all This workflow automatically ...

Setup Husky Git Hooks

GitAutomationQuality

--- description: Automate code quality checks with pre-commit and pre-push hooks --- 1. **Install Husky**: - Install husky and lint-staged. // ...

Setup Vercel Cron Jobs

VercelCronAutomation

--- description: Create and test scheduled tasks in Next.js --- 1. **Create Cron Config**: - Add `crons` to `vercel.json`. ```json { "c...

Recommended MCP Servers

View more MCP servers →

HOPX

Official

Execute Python, JavaScript, Bash, and Go code in isolated cloud containers with sub-150ms startup times. Pre-installed data science libraries (pandas, numpy, matplotlib) for AI-powered data analysis and code testing.

Sugar

Official

Autonomous AI development platform for Claude Code with task management, specialized agents, and workflow automation. Full MCP server bridges Claude with Python CLI for rich task context and autonomous execution.

AltTester®

Official

Use AltTester® capabilities to connect and test your Unity or Unreal game. Write game test automation faster and smarter, using [AltTester](https://alttester.com) and the AltTester® MCP server.

Take It Further

Maximize your productivity with these powerful resources

📋

Define Your Standards

Set up coding standards to ensure this workflow produces consistent, high-quality results.

Browse Rules Library

📖

Master Workflows

Learn how to create custom workflows, use Turbo Mode, and build your automation library.

Complete Guide

Data Cleaning Pipeline

Install this skill

What this skill does

When to use it

When not to use it

How to invoke it

Example workflow

Prerequisites

Pitfalls & limitations

FAQ

How it compares

Source & trust

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Option B: Global Installation (All Agents)

Recommended Rules

Python Automation & Scripting Expert

Python DevOps & Infrastructure Automation

Python Web Scraping Expert

Recommended Workflows

Automatic commit message generator

Setup Husky Git Hooks

Setup Vercel Cron Jobs

Recommended MCP Servers

HOPX

Sugar

AltTester®

Take It Further

Define Your Standards

Master Workflows

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

For Cursor & Windsurf

Source & attribution