PDF Processing

Tags: pdf, python, ocr, data-extraction, automation
🕒 2025-11-24

Install this skill

npx skills add Ming-Kai-LC/self-learn

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

Claude Code often struggles with PDFs exceeding 10-15MB or 50 pages, leading to session crashes or context window exhaustion. This skill provides a set of Python-based utilities to pre-process, chunk, and extract data from large documents before sending them to the agent. By breaking massive files into smaller, manageable segments or isolating specific text and table data, you maintain session stability and avoid costly token bloat. It provides direct wrappers for common tasks like OCR, text extraction, and automated splitting based on file size or page count. Developers working with long-form technical documentation, massive data reports, or scanned archives will find this essential for keeping their AI workflows operational and performant when handling documents that exceed standard reading limits.
By Ming-Kai-LC

What this skill does

  • Splitting large PDF files into smaller chunks by page count or file size
  • High-speed text extraction for parsing document content without full rendering
  • Precise table extraction using coordinates for data-heavy reports
  • Automated scanning detection for OCR integration
  • File size validation scripts to check compatibility before agent processing

When to use it

  • When a PDF file is over 15MB or contains more than 50 pages
  • When you need to extract specific table data into a machine-readable format
  • When Claude Code repeatedly crashes or loses context while reading a long document
  • When dealing with scanned images that require character recognition

When not to use it

  • When a document is small enough for direct ingestion (under 5MB)
  • When the PDF is encrypted and requires specific proprietary decryption keys

How to invoke it

Example prompts that trigger this skill:

  • Check if this PDF is too large for Claude Code to read directly.
  • Split this 200-page document into 20-page segments.
  • Extract all tables from this report and save them as CSV files.
  • Convert this scanned PDF into extractable text using OCR.
  • Show me the file size and verify if it is safe to process.

Example workflow

  1. Run a size check script to verify the file exceeds the agent's safe limit.
  2. Execute a splitting script to partition the PDF into 25-page chunks.
  3. Use PyMuPDF to extract text from each chunk sequentially.
  4. Parse complex data tables using pdfplumber.
  5. Feed the extracted data strings into the context window for final analysis.

Prerequisites

  • Python 3.x
  • Poppler (for pdf2image)
  • Tesseract (for OCR)

Pitfalls & limitations

  • Automated splitting can break tables across file boundaries.
  • OCR performance is dependent on the quality of the source image.
  • Handling massive files can temporarily consume high local memory.

FAQ

Why does my session crash when I read large PDFs?
Claude Code has strict token limits; large files (often anything over roughly 10-15MB) can exhaust the context window or crash the session.
Is PyMuPDF better than pypdf?
PyMuPDF is significantly faster for extraction tasks, while pypdf is better for simple file manipulations like merging or rotating pages.
Can I extract tables from images?
You must first run the PDF through an OCR process using Tesseract before the extraction tool can identify text or tables within the scanned images.
What is the optimal chunk size?
We recommend sticking to 20-30 pages per chunk to stay reliably under common memory thresholds.

How it compares

While a generic prompt might attempt to read the whole file, this skill automates the partitioning and extraction process, preventing common context crashes.

Source & trust

🕒 Updated 2025-11-24 · 🛡 runs-shell, network, reads-credentials

View the full SKILL.md source

# PDF Processing for Claude Code

Provides comprehensive techniques and utilities for processing PDF files in Claude Code, especially large files that exceed direct reading capabilities.

## Overview

Claude Code can read PDF files directly using the Read tool, but has critical limitations:

- **Official limits**: 32MB max file size, 100 pages max
- **Real-world limits**: Much lower (10-15MB, 30-50 pages)
- **Known issue**: Claude Code crashes with large PDFs, causing session termination and context loss
- **Token cost**: 1,500-3,000 tokens per page for text + additional for images

This skill provides workarounds, utilities, and best practices for handling PDFs of any size.
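
As a rough illustration of the token cost listed above, the per-page range can be turned into a quick estimate before deciding how to read a file. This is a minimal sketch: `estimate_tokens` is a hypothetical helper (not part of the skill), it uses only the 1,500-3,000 tokens-per-page figure quoted above, and it ignores the additional cost of images.

```python
def estimate_tokens(page_count, low_per_page=1500, high_per_page=3000):
    """Return a (low, high) token estimate for reading a PDF's text directly.

    The per-page defaults come from the range quoted in the overview;
    image-heavy pages will cost more than this estimate.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * low_per_page, page_count * high_per_page


low, high = estimate_tokens(50)
print(f"50 pages: roughly {low:,} to {high:,} tokens")
# prints "50 pages: roughly 75,000 to 150,000 tokens"
```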

## Quick Start

### Check if PDF is Too Large for Direct Reading

```python
import os

def is_pdf_too_large(filepath, max_mb=10):
    """Check if PDF exceeds safe processing size."""
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    return size_mb > max_mb

# Use before attempting to read
if is_pdf_too_large("document.pdf"):
    print("PDF too large - use chunking strategies")
else:
    # Safe to read directly with Claude Code
    pass
```

### Extract Text from PDF

```python
import fitz  # PyMuPDF - fastest option

def extract_text_fast(pdf_path):
    """Extract all text from PDF quickly."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Usage
text = extract_text_fast("document.pdf")
```

### Split Large PDF into Chunks

```python
import os

from pypdf import PdfReader, PdfWriter

def chunk_pdf(input_path, pages_per_chunk=25, output_dir="chunks"):
    """Split PDF into smaller files."""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)

    os.makedirs(output_dir, exist_ok=True)

    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        end = min(i + pages_per_chunk, total_pages)

        for page_num in range(i, end):
            writer.add_page(reader.pages[page_num])

        output_file = f"{output_dir}/chunk_{i//pages_per_chunk:03d}_pages_{i+1}-{end}.pdf"
        with open(output_file, "wb") as output:
            writer.write(output)

        print(f"Created {output_file}")

# Usage
chunk_pdf("large_document.pdf", pages_per_chunk=30)
```

### Extract Tables from PDF

```python
import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from PDF with high accuracy."""
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for table_num, table in enumerate(page_tables, 1):
                tables.append({
                    'page': page_num,
                    'table_num': table_num,
                    'data': table
                })

    return tables

# Usage
tables = extract_tables("report.pdf")
for t in tables:
    print(f"Page {t['page']}, Table {t['table_num']}")
    print(t['data'])
```

## Python Libraries

### pypdf (formerly PyPDF2)
- **Best for**: Basic PDF operations (split, merge, rotate)
- **Speed**: Slower than alternatives
- **Install**: `pip install pypdf`

### PyMuPDF (fitz)
- **Best for**: Fast text extraction, general-purpose processing
- **Speed**: 10-20x faster than pypdf
- **Install**: `pip install PyMuPDF`

### pdfplumber
- **Best for**: Table extraction, precise text with coordinates
- **Speed**: Moderate (0.10s per page)
- **Install**: `pip install pdfplumber`

### pdf2image
- **Best for**: Converting PDF pages to images
- **Requires**: Poppler (system dependency)
- **Install**: `pip install pdf2image`

### pytesseract
- **Best for**: OCR on scanned PDFs
- **Requires**: Tesseract (system dependency)
- **Install**: `pip install pytesseract`

## Chunking Strategies

### 1. Page-Based Splitting
Split PDF into fixed page batches.

**When to use**: Document structure is irrelevant; you need simple, predictable chunks

**Optimal size**: 20-30 pages per chunk (stays under 10MB typically)

```python
# See Quick Start "Split Large PDF into Chunks"
chunk_pdf("document.pdf", pages_per_chunk=25)
```

### 2. Size-Based Splitting
Monitor file size and split when threshold is reached.

**When to use**: Avoiding crashes is critical; page count is unreliable indicator

```python
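from io import BytesIO
from pypdf import PdfReader, PdfWriter

def chunk_by_size(pdf_path, max_mb=8):
    """Split PDF into chunks, closing each one once it reaches max_mb."""
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    chunk_num = 0
    pages_in_chunk = 0

    def save(chunk_writer, num):
        with open(f"chunk_{num:03d}.pdf", "wb") as f:
            chunk_writer.write(f)

    for page in reader.pages:
        writer.add_page(page)
        pages_in_chunk += 1

        # Measure the chunk by serializing it to memory
        # (this re-serializes on every page, so it is slow for long PDFs)
        buffer = BytesIO()
        writer.write(buffer)
        size_mb = buffer.tell() / (1024 * 1024)

        if size_mb >= max_mb:
            save(writer, chunk_num)
            chunk_num += 1
            writer = PdfWriter()  # Start new chunk
            pages_in_chunk = 0

    # Save remaining pages that never reached the threshold
    if pages_in_chunk > 0:
        save(writer, chunk_num)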
```
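
The grouping logic behind size-based splitting can be previewed without writing any files. The sketch below is an illustration only: `group_pages_by_size` is a hypothetical helper, and it assumes you already have an approximate size for each page. Summing per-page sizes only approximates the final file size, because pages in a real PDF can share fonts and images.

```python
def group_pages_by_size(page_sizes_mb, max_mb=8.0):
    """Greedily group consecutive pages so each group's total stays within max_mb.

    A single page larger than max_mb gets a group of its own.
    Returns a list of (first_page, last_page) tuples, 1-indexed and inclusive.
    """
    groups = []
    start = 0       # index of the first page in the current group
    running = 0.0   # accumulated size of the current group
    for index, size in enumerate(page_sizes_mb):
        # Close the current group if adding this page would exceed the budget
        if index > start and running + size > max_mb:
            groups.append((start + 1, index))
            start = index
            running = 0.0
        running += size
    if page_sizes_mb:
        groups.append((start + 1, len(page_sizes_mb)))
    return groups


print(group_pages_by_size([3.0, 3.0, 3.0, 9.0, 1.0], max_mb=8.0))
# prints [(1, 2), (3, 3), (4, 4), (5, 5)]
```

Unlike the loop above, which closes a chunk after it crosses the threshold, this variant closes the group before adding the page that would cross it, so only a single oversized page can exceed the budget.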

### 3. Overlapping Chunks
Include overlap between chunks to maintain context.

**When to use**: Content spans pages; losing context between chunks is problematic

**Optimal overlap**: 1-2 pages (or 10-20% of chunk size)

```python
from pypdf import PdfReader, PdfWriter

def chunk_with_overlap(pdf_path, pages_per_chunk=25, overlap=2):
    """Split PDF with overlapping pages for context preservation."""
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than pages_per_chunk")

    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    chunk_num = 0
    start = 0

    while start < total_pages:
        writer = PdfWriter()
        end = min(start + pages_per_chunk, total_pages)

        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])

        output = f"chunk_{chunk_num:03d}_pages_{start+1}-{end}.pdf"
        with open(output, "wb") as f:
            writer.write(f)

        if end == total_pages:
            break  # Last chunk written; stepping back by overlap would loop forever

        chunk_num += 1
        start = end - overlap  # Move forward with overlap
```
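
To check which pages each overlapping chunk will contain before splitting a real file, the range arithmetic can be run on its own. This is a small sketch with a hypothetical helper name, `overlapping_ranges`; it mirrors the stepping rule `start = end - overlap` and stops once the last page has been covered.

```python
def overlapping_ranges(total_pages, pages_per_chunk=25, overlap=2):
    """Return 1-indexed, inclusive (first, last) page ranges with overlap."""
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than pages_per_chunk")

    ranges = []
    start = 0
    while start < total_pages:
        end = min(start + pages_per_chunk, total_pages)
        ranges.append((start + 1, end))
        if end == total_pages:
            break  # final chunk reached; stop instead of stepping back again
        start = end - overlap
    return ranges


print(overlapping_ranges(60, pages_per_chunk=25, overlap=2))
# prints [(1, 25), (24, 48), (47, 60)]
```

With `overlap=0` the same function yields plain page-based ranges, so it can also preview the first strategy.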

### 4. Text Extraction First
Extract text, then chunk the text instead of PDF.

**When to use**: You only need text content, not layout/images

**Advantage**: Much smaller, faster to process, no crashes

```python
def extract_and_chunk_text(pdf_path, chars_per_chunk=10000):
    """Extract text and split into manageable chunks."""
    import fitz

    doc = fitz.open(pdf_path)
    full_text = ""

    for page in doc:
        full_text += f"\n\n--- Page {page.number + 1} ---\n\n"
        full_text += page.get_text()

    doc.close()

    # Split text into chunks
    chunks = []
    for i in range(0, len(full_text), chars_per_chunk):
        chunks.append(full_text[i:i + chars_per_chunk])

    return chunks

# Usage
text_chunks = extract_and_chunk_text("large.pdf")
for i, chunk in enumerate(text_chunks):
    with open(f"text_chunk_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
```
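
Fixed character offsets can cut a chunk in the middle of a word or a page. One alternative, sketched here under the assumption that you keep the text of each page as a separate string, is to pack whole pages into each chunk. `chunk_pages` is a hypothetical helper, not one of the skill's scripts.

```python
def chunk_pages(page_texts, chars_per_chunk=10000):
    """Group whole pages into chunks of at most chars_per_chunk characters.

    A page longer than the budget becomes its own (oversized) chunk,
    so no page is ever split across two chunks.
    """
    chunks = []
    current = []      # page blocks collected for the chunk being built
    current_len = 0
    for number, text in enumerate(page_texts, 1):
        block = f"--- Page {number} ---\n{text}\n"
        # Start a new chunk if this page would push the current one over budget
        if current and current_len + len(block) > chars_per_chunk:
            chunks.append("".join(current))
            current = []
            current_len = 0
        current.append(block)
        current_len += len(block)
    if current:
        chunks.append("".join(current))
    return chunks


pages = ["alpha " * 10, "beta " * 10, "gamma " * 10]
for i, chunk in enumerate(chunk_pages(pages, chars_per_chunk=150)):
    print(i, len(chunk))
```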

## Handling Different PDF Types

### Text-Based PDFs (Native Text)
PDFs created digitally with searchable text.

**Detection**:
```python
import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()  # First page

if len(text.strip()) > 50:
    print("Text-based PDF")
else:
    print("Likely scanned PDF")
```

**Best approach**: Direct text extraction with PyMuPDF or pdfplumber

### Scanned PDFs (Images of Text)
PDFs created by scanning physical documents.

**Requires**: OCR (Optical Character Recognition)

**Approach**:
```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Extract text from scanned PDF using OCR."""
    # Convert to images
    images = convert_from_path(pdf_path, dpi=300)

    # OCR each page
    text = ""
    for i, image in enumerate(images, 1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n\n--- Page {i} ---\n\n{page_text}"

    return text
```

**Performance note**: OCR is much slower than direct text extraction

### Mixed PDFs
Some pages have text, others are scanned.

**Approach**: Detect page-by-page and use appropriate method

```python
def extract_mixed_pdf(pdf_path):
    """Handle PDFs with both text and scanned pages."""
    import fitz
    from pdf2image import convert_from_path
    import pytesseract

    doc = fitz.open(pdf_path)
    full_text = ""

    for page_num, page in enumerate(doc):
        text = page.get_text()

        if len(text.strip()) > 50:
            # Has text - use direct extraction
            full_text += f"\n\n--- Page {page_num + 1} (text) ---\n\n{text}"
        else:
            # Likely scanned - use OCR
            images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
            ocr_text = pytesseract.image_to_string(images[0])
            full_text += f"\n\n--- Page {page_num + 1} (OCR) ---\n\n{ocr_text}"

    doc.close()
    return full_text
```
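
The per-page decision above can also be separated from the extraction itself, which makes it easy to see how many pages will need OCR before paying for it. This sketch assumes you have already collected the directly extracted text of each page into a list; `plan_extraction` and `summarize_plan` are hypothetical names, and the 50-character threshold is the same heuristic used in the snippets above.

```python
def plan_extraction(page_texts, min_chars=50):
    """Label each page "text" if its extracted text looks usable, else "ocr"."""
    return ["text" if len(text.strip()) > min_chars else "ocr" for text in page_texts]


def summarize_plan(plan):
    """Count text pages and list the 1-indexed page numbers that need OCR."""
    ocr_pages = [number for number, label in enumerate(plan, 1) if label == "ocr"]
    return {"text_pages": len(plan) - len(ocr_pages), "ocr_pages": ocr_pages}


sample = ["Introduction " * 10, "   ", "Results " * 20, ""]
plan = plan_extraction(sample)
print(plan)
print(summarize_plan(plan))
```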

## Helper Scripts

This skill includes pre-built scripts in the `scripts/` directory:

- **chunk_pdf.py**: Flexible PDF chunking with multiple strategies
- **extract_text.py**: Unified text extraction (handles text-based and OCR)
- **extract_tables.py**: Advanced table extraction with formatting
- **process_large_pdf.py**: Orchestrate complete large PDF processing workflow

### Using Helper Scripts

```bash
# Chunk a large PDF
python .claude/skills/pdf-processing/scripts/chunk_pdf.py large_doc.pdf --pages 30 --overlap 2

# Extract all text
python .claude/skills/pdf-processing/scripts/extract_text.py document.pdf --output text.txt

# Extract tables to CSV
python .claude/skills/pdf-processing/scripts/extract_tables.py report.pdf --output tables/

# Process large PDF end-to-end
python .claude/skills/pdf-processing/scripts/process_large_pdf.py huge_doc.pdf --strategy chunk --output processed/
```

## Error Handling

### Preventing Crashes

**Key principle**: File size alone is not a reliable signal. Check both size and page count before reading.

```python
import os

def safe_pdf_read(pdf_path, max_pages=30, max_mb=10):
    """Safely check if PDF can be read directly."""
    import fitz

    # Check file size
    size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    if size_mb > max_mb:
        return False, f"File too large: {size_mb:.1f}MB (max: {max_mb}MB)"

    # Check page count
    try:
        doc = fitz.open(pdf_path)
        page_count = len(doc)
        doc.close()

        if page_count > max_pages:
            return False, f"Too many pages: {page_count} (max: {max_pages})"

        return True, f"Safe to read: {size_mb:.1f}MB, {page_count} pages"

    except Exception as e:
        return False, f"Error checking PDF: {e}"

# Usage
safe, message = safe_pdf_read("document.pdf")
print(message)

if safe:
    # Use Claude Code Read tool
    pass
else:
    # Use chunking strategies
    pass
```

### Handling Corrupted PDFs

```python
def is_pdf_valid(pdf_path):
    """Check if PDF is valid and readable."""
    try:
        import fitz
        doc = fitz.open(pdf_path)
        _ = len(doc)  # Force reading
        doc.close()
        return True, "PDF is valid"
    except Exception as e:
        return False, f"PDF is corrupted or invalid: {e}"
```
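
Before opening a file with a PDF library at all, a cheap standard-library check can rule out files that are not PDFs (for example, an HTML error page saved with a `.pdf` extension). This is a sketch with a hypothetical helper, `has_pdf_header`; it only looks for the `%PDF-` marker near the start of the file and does not prove the rest of the file is intact.

```python
import os
import tempfile


def has_pdf_header(path, scan_bytes=1024):
    """Return True if the "%PDF-" marker appears within the first scan_bytes bytes."""
    with open(path, "rb") as f:
        head = f.read(scan_bytes)
    return b"%PDF-" in head


# Demo with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"<html>not a pdf</html>")
    print(has_pdf_header(good), has_pdf_header(bad))
    # prints "True False"
```

A file that passes this check can still be truncated or corrupted, so it complements `is_pdf_valid()` rather than replacing it.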

### Graceful Degradation

```python
def extract_with_fallback(pdf_path):
    """Try multiple extraction methods, falling back if needed."""

    # Try 1: PyMuPDF (fastest)
    try:
        import fitz
        doc = fitz.open(pdf_path)
        text = "\n".join(page.get_text() for page in doc)
        doc.close()
        if text.strip():
            return text, "pymupdf"
    except Exception as e:
        print(f"PyMuPDF failed: {e}")

    # Try 2: pdfplumber (more reliable)
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        if text.strip():
            return text, "pdfplumber"
    except Exception as e:
        print(f"pdfplumber failed: {e}")

    # Try 3: OCR (last resort)
    try:
        from pdf2image import convert_from_path
        import pytesseract
        images = convert_from_path(pdf_path, dpi=300)
        text = "\n\n".join(pytesseract.image_to_string(img) for img in images)
        return text, "ocr"
    except Exception as e:
        print(f"OCR failed: {e}")

    return None, "all_methods_failed"
```

## Best Practices

1. **Always check file size and page count before reading**: Use `safe_pdf_read()` to avoid crashes
2. **Prefer text extraction over direct reading**: Extract text first, then process text files
3. **Use overlapping chunks for context**: 1-2 pages overlap prevents information loss
4. **Choose the right tool**: PyMuPDF for speed, pdfplumber for tables, OCR for scans
5. **Monitor progress**: For large PDFs, log progress to recover from interruptions
6. **Save intermediate results**: Don't lose progress if processing fails partway through
7. **Test with small chunks first**: Validate approach on 1-2 chunks before processing entire document
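
Practices 5 and 6 can be combined in a small checkpoint file that records which chunks are finished, so an interrupted run resumes where it stopped. The following is a minimal sketch using only the standard library; `load_done`, `process_with_checkpoint`, and the JSON checkpoint format are assumptions made for illustration, not part of the skill's scripts.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of items already recorded as processed."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def process_with_checkpoint(items, handler, checkpoint_path):
    """Run handler on each item not yet done, saving progress after each one."""
    done = load_done(checkpoint_path)
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        handler(item)
        done.add(item)
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
    return done


# Demo: the second run only handles the chunk added since the first run
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "progress.json")
    process_with_checkpoint(["chunk_000.pdf", "chunk_001.pdf"], print, checkpoint)
    process_with_checkpoint(["chunk_000.pdf", "chunk_001.pdf", "chunk_002.pdf"], print, checkpoint)
```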

## Common Workflows

### Workflow 1: Analyze Large Report

```python
# 1. Check if direct read is safe
safe, msg = safe_pdf_read("report.pdf")

if not safe:
    # 2. Extract text instead
    text = extract_text_fast("report.pdf")

    # 3. Save to file for Claude to read
    with open("report_text.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # 4. Process text file (much safer)
    # Claude can now read report_text.txt without crashes
```

### Workflow 2: Extract Data from Multi-Page Invoice

```python
# 1. Extract tables from all pages
tables = extract_tables("invoice_100pages.pdf")

# 2. Convert to structured format
import csv

for t in tables:
    filename = f"invoice_page{t['page']}_table{t['table_num']}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(t['data'])
```
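
Extracted tables often contain empty cells and line breaks inside cells, which makes the resulting CSV awkward to read. A small cleanup step before writing can help. This sketch assumes each table is a list of rows in which empty cells may be `None`; `clean_table` is a hypothetical helper.

```python
def clean_table(table):
    """Replace None cells with "" and collapse runs of whitespace inside each cell."""
    return [
        [" ".join(str(cell).split()) if cell is not None else "" for cell in row]
        for row in table
    ]


raw = [["Item\nname", None], [" 3 ", "4.50"]]
print(clean_table(raw))
# prints [['Item name', ''], ['3', '4.50']]
```

In the workflow above, this would be applied as `writer.writerows(clean_table(t['data']))`.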

### Workflow 3: Process Scanned Document Archive

```python
# 1. Check if scanned
import fitz
doc = fitz.open("archive.pdf")
is_scanned = len(doc[0].get_text().strip()) < 50
doc.close()

if is_scanned:
    # 2. Use OCR
    text = ocr_pdf("archive.pdf")

    # 3. Save extracted text
    with open("archive_ocr.txt", "w", encoding="utf-8") as f:
        f.write(text)
```

## Troubleshooting

### Issue: "Claude Code crashed when reading PDF"
**Solution**: File was too large. Use chunking or text extraction first.

### Issue: "Extracted text is gibberish"
**Solution**: PDF might be scanned, or its fonts may lack a usable text mapping. Use OCR (`ocr_pdf()` function).

### Issue: "Table extraction is inaccurate"
**Solution**: Use pdfplumber with custom table detection settings (see `reference.md`).

### Issue: "OCR is too slow"
**Solution**: Reduce DPI (try 150-200 instead of 300), or process only needed pages.
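
To see why lowering DPI helps, compare the number of pixels each rendered page contains. The sketch below assumes a US Letter page (8.5 x 11 inches); `page_pixels` is a hypothetical helper. OCR time does not scale exactly with pixel count, so treat the ratios as a rough guide only.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Return the pixel count of one page rendered at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


for dpi in (300, 200, 150):
    ratio = page_pixels(dpi) / page_pixels(300)
    print(f"{dpi} DPI: {page_pixels(dpi):,} pixels ({ratio:.0%} of 300 DPI)")
```

Halving the DPI from 300 to 150 cuts the pixel count to a quarter, at the cost of less detail for small or faint text.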

### Issue: "Out of memory when processing large PDF"
**Solution**: Process page-by-page instead of loading entire document. See `process_large_pdf.py`.

## Next Steps

- For advanced techniques and detailed API references, see [reference.md](reference.md)
- For troubleshooting specific library issues, see library documentation
- For custom workflows, combine techniques from Quick Start and Common Workflows sections

## Installation

Required dependencies:

```bash
pip install pypdf PyMuPDF pdfplumber pdf2image pytesseract
```

System dependencies:
- **Poppler** (for pdf2image): [Installation guide](https://pdf2image.readthedocs.io/en/latest/installation.html)
- **Tesseract** (for OCR): [Installation guide](https://github.com/tesseract-ocr/tesseract)

Quoted from Ming-Kai-LC/self-learn for reference — see the original for the authoritative, latest version.

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/pdf-processing/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/Ming-Kai-LC/self-learn/pdf-processing/SKILL.md
  • Cursor: ~/.cursor/skills/Ming-Kai-LC/self-learn/pdf-processing/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/Ming-Kai-LC/self-learn/pdf-processing/SKILL.md

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best used in the "Rules for AI" section. This skill helps the agent avoid document-handling problems, such as crashes on large PDFs.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under Documents and is published by Ming-Kai-LC, maintained in Ming-Kai-LC/self-learn.
