Install this skill
npx skills add aj-geddes/useful-ai-promptsWorks across Claude Code, Cursor, Codex, Copilot & Antigravity
What this skill does
- •Imputation strategies for missing values using mean, median, or KNN
- •Detection and handling of outliers via IQR bounds or clipping
- •Automated deduplication based on single or multi-column subsets
- •Standardization of data types and text formatting rules
- •Integrated validation reports for monitoring data quality metrics
When to use it
- ✓Preparing raw CSV or database exports for machine learning models
- ✓Standardizing mismatched data formats across multiple source files
- ✓Cleaning large datasets where manual filtering is error-prone
- ✓Establishing a gatekeeper for automated data ingestion workflows
When not to use it
- ✕Processing real-time streaming data with low-latency requirements
- ✕Handling unstructured data like high-resolution images or raw audio
- ✕Projects requiring heavy-duty distributed processing like Spark
How to invoke it
Example prompts that trigger this skill:
- “Create a data cleaning pipeline that imputes missing values and removes duplicates.”
- “Help me build a Python data pipeline to clean my customer transaction dataset.”
- “Write a script to detect and cap outliers in my numerical data columns.”
- “Add a validation step to my data preprocessing workflow to ensure no nulls exist.”
- “Structure my messy CSV cleaning process into a modular Python class.”
Example workflow
- Load the raw CSV into a pandas DataFrame.
- Apply specific imputation strategies for missing numerical and categorical features.
- Filter out duplicate rows and handle outliers using IQR calculations.
- Normalize and scale numerical features for consistent range.
- Run final validation checks to confirm data integrity before saving output.
Prerequisites
- –Python 3.x
- –pandas
- –numpy
- –scikit-learn
Pitfalls & limitations
- !Automated imputation can introduce bias if the missing data is not missing at random.
- !Aggressive outlier removal can lead to loss of valuable edge-case information.
- !Hardcoded pipelines may fail if input schema changes unexpectedly.
FAQ
How it compares
Unlike writing ad-hoc scripts, this pipeline provides a modular class-based architecture that makes it easier to track, audit, and extend cleaning steps in a repeatable fashion.
Source & trust
From the source: “# Data Cleaning Pipeline ## Overview Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues. ## When to Use - Preparing raw datasets for analysis or modelin…”
View the full SKILL.md source
# Data Cleaning Pipeline
## Overview
Data cleaning pipelines transform raw, messy data into clean, standardized formats suitable for analysis and modeling through systematic handling of missing values, outliers, and data quality issues.
## When to Use
- Preparing raw datasets for analysis or modeling
- Handling missing values and data quality issues
- Removing duplicates and standardizing formats
- Detecting and treating outliers
- Building automated data preprocessing workflows
- Ensuring data integrity and consistency
## Core Components
- **Missing Value Handling**: Imputation and removal strategies
- **Outlier Detection & Treatment**: Identifying and handling anomalies
- **Data Type Standardization**: Ensuring correct data types
- **Duplicate Removal**: Identifying and removing duplicates
- **Normalization & Scaling**: Standardizing value ranges
- **Text Cleaning**: Handling text data
## Cleaning Strategies
- **Deletion**: Removing rows or columns
- **Imputation**: Filling with mean, median, or predictive models
- **Transformation**: Converting between formats
- **Validation**: Ensuring data integrity rules
## Implementation with Python
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
# Load raw data
df = pd.read_csv('raw_data.csv')
# Step 1: Identify and handle missing values
print("Missing values:\n", df.isnull().sum())
# Strategy 1: Delete rows with critical missing values
df = df.dropna(subset=['customer_id', 'transaction_date'])
# Strategy 2: Impute numerical columns with median
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])
# Strategy 3: Use KNN imputation for related features
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
# Strategy 4: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Step 2: Handle duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates()
# Duplicate on specific columns
df = df.drop_duplicates(subset=['customer_id', 'transaction_date'])
# Step 3: Outlier detection and handling
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df = df[(df['amount'] >= lower_bound) & (df['amount'] <= upper_bound)]
# Alternative: Cap outliers
df['amount'] = df['amount'].clip(lower=lower_bound, upper=upper_bound)
# Step 4: Data type standardization
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df['customer_id'] = df['customer_id'].astype('int64')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# Step 5: Text cleaning
df['name'] = df['name'].str.strip().str.lower()
df['name'] = df['name'].str.replace('[^a-z0-9\s]', '', regex=True)
# Step 6: Normalization and scaling
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# MinMax scaling for bounded range [0, 1]
minmax_scaler = MinMaxScaler()
df[['score']] = minmax_scaler.fit_transform(df[['score']])
# Step 7: Create data quality report
def create_quality_report(df_original, df_cleaned):
report = {
'Original rows': len(df_original),
'Cleaned rows': len(df_cleaned),
'Rows removed': len(df_original) - len(df_cleaned),
'Removal percentage': ((len(df_original) - len(df_cleaned)) / len(df_original) * 100),
'Original missing': df_original.isnull().sum().sum(),
'Cleaned missing': df_cleaned.isnull().sum().sum(),
}
return pd.DataFrame(report, index=[0])
quality = create_quality_report(df, df)
print(quality)
# Step 8: Validation checks
assert df['age'].isnull().sum() == 0, "Age has missing values"
assert df['transaction_date'].dtype == 'datetime64[ns]', "Date not datetime"
assert (df['amount'] >= 0).all(), "Negative amounts detected"
print("Data cleaning pipeline completed successfully!")
```
## Pipeline Architecture
```python
class DataCleaningPipeline:
def __init__(self):
self.cleaner_steps = []
def add_step(self, func, description):
self.cleaner_steps.append((func, description))
return self
def execute(self, df):
for func, desc in self.cleaner_steps:
print(f"Executing: {desc}")
df = func(df)
return df
# Usage
pipeline = DataCleaningPipeline()
pipeline.add_step(
lambda df: df.dropna(subset=['customer_id']),
"Remove rows with missing customer_id"
).add_step(
lambda df: df.drop_duplicates(),
"Remove duplicate rows"
).add_step(
lambda df: df[(df['amount'] > 0) & (df['amount'] < 100000)],
"Filter invalid amount ranges"
)
df_clean = pipeline.execute(df)
```
## Advanced Cleaning Techniques
```python
# Step 9: Feature-specific cleaning
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True) # Remove non-digits
# Step 10: Datetime handling
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')
df['days_since_creation'] = (pd.Timestamp.now() - df['created_date']).dt.days
# Step 11: Categorical standardization
df['status'] = df['status'].str.lower().str.strip()
df['status'] = df['status'].replace({
'active': 'active',
'inactive': 'inactive',
'pending': 'pending',
})
# Step 12: Numeric constraint checking
df['age'] = df['age'].where((df['age'] >= 0) & (df['age'] <= 150), np.nan)
df['percentage'] = df['percentage'].where((df['percentage'] >= 0) & (df['percentage'] <= 100), np.nan)
# Step 13: Create data quality score
quality_score = {
'Missing %': (df.isnull().sum() / len(df) * 100).mean(),
'Duplicates %': (df.duplicated().sum() / len(df) * 100),
'Complete Features': (df.notna().sum() / len(df)).mean() * 100,
}
# Step 14: Generate cleaning report
cleaning_report = f"""
DATA CLEANING REPORT
====================
Rows removed: {len(df) - len(df_clean)}
Columns: {len(df_clean.columns)}
Remaining rows: {len(df_clean)}
Completeness: {(df_clean.notna().sum().sum() / (len(df_clean) * len(df_clean.columns)) * 100):.1f}%
"""
print(cleaning_report)
```
## Key Decisions
- How to handle missing values (delete vs impute)?
- Which outliers are legitimate business cases?
- What are acceptable value ranges?
- Which duplicates are true duplicates?
- How to standardize categorical values?
## Validation Steps
- Check for data type consistency
- Verify value ranges are reasonable
- Confirm no unintended data loss
- Document all transformations applied
- Create audit trail of changes
## Deliverables
- Cleaned dataset with quality metrics
- Data cleaning log documenting all steps
- Validation report confirming data integrity
- Before/after comparison statistics
- Cleaning code and pipeline documentation
Quoted from aj-geddes/useful-ai-prompts for reference — see the original for the authoritative, latest version.
📄 Full skill instructions — original source: aj-geddes/useful-ai-prompts
How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory:
.agent/skills/data-cleaning-pipeline/ - Save the file as
SKILL.md - The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code:
~/.claude/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md - Cursor:
~/.cursor/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md - Antigravity:
~/.gemini/antigravity/skills/aj-geddes/useful-ai-prompts/data-cleaning-pipeline/SKILL.md
🚀 Install with CLI:npx skills add aj-geddes/useful-ai-prompts

