
Crawl4AI

scraping · automation · llm-ops · markdown · python
★ 4.8 (138) · ⭐ 5 · 📄 MIT · 🕒 2026-03-01

Install this skill

npx skills add basher83/agent-auditor

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

What this skill does

  • Headless browser rendering for JavaScript-dependent content
  • Automated conversion of web pages into structured markdown
  • Configurable content filtering using BM25 and pruning strategies
  • Automatic extraction of metadata, images, and internal link structures
  • Session persistence across multi-page crawling operations

When to use it

  • When you need to scrape data from sites that require JavaScript execution to load content
  • When you are preparing web content for ingestion into a RAG pipeline or LLM context
  • When you need to isolate specific components like article bodies while stripping navbars and footers
  • When you need to automate multi-page data collection into a standardized machine-readable format

When not to use it

  • For simple static pages where a basic HTTP request and regex would suffice
  • When accessing sites with aggressive CAPTCHA or blocking measures that require specialized proxy rotation

How to invoke it

Example prompts that trigger this skill:

  • “Scrape the main content of https://example.com and return it as clean markdown.”
  • “Crawl this documentation URL and extract all links to internal sub-pages.”
  • “Extract all product names and prices from this store page using the built-in schema generator.”
  • “Run a crawl on this site, removing the sidebar and footer elements from the output.”
  • “Fetch the page content and take a screenshot for visual debugging.”

Example workflow

  1. Run crawl4ai-doctor to verify your local browser dependencies.
  2. Define a BrowserConfig object to set viewport dimensions and headless mode.
  3. Use the AsyncWebCrawler to target a specific URL with your defined configuration.
  4. Apply a ContentFilter strategy to prune irrelevant noise like scripts and styles.
  5. Access the result.markdown attribute to retrieve the formatted text.
  6. Save the extracted data into your local vector database or documentation file.

Prerequisites

  • –Python 3.8+
  • –Playwright or browser drivers installed via crawl4ai-setup

Pitfalls & limitations

  • High resource consumption when crawling many pages simultaneously due to browser overhead
  • Risk of being blocked if you crawl too aggressively without setting artificial delays
  • Large DOM structures can lead to memory spikes during the conversion process

FAQ

How is this different from BeautifulSoup?
BeautifulSoup only parses static HTML. Crawl4AI renders the page in a headless browser, allowing it to extract content generated by JavaScript that BeautifulSoup would miss.
Can I use this for pagination?
Yes, you can manage sessions and loop through URL lists or dynamic navigation by keeping the crawler instance active throughout your process.
Is the output reliable for LLMs?
The generated markdown is specifically optimized to remove non-essential noise, which preserves token space and improves the accuracy of LLM-based extraction.

How it compares

While manual scraping requires building your own browser driver management and parsing logic, Crawl4AI provides a consolidated, pre-configured framework that handles the browser lifecycle and markdown formatting automatically.

Source & trust

⭐ 5 stars · 📄 MIT · 🕒 Updated 2026-03-01 · 🛡 runs-shell, network, reads-credentials


View the full SKILL.md source

# Crawl4AI

## Overview

This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.

## Quick Start

### Installation Check

```bash
# Verify installation
crawl4ai-doctor

# If issues, run setup
crawl4ai-setup
```

### Basic First Crawl

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # First 500 chars

asyncio.run(main())
```

### Using Provided Scripts

```bash
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com

# Batch processing
python scripts/batch_crawler.py urls.txt

# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
```

## Core Crawling Fundamentals

### 1. Basic Crawling

Understanding the core components for any crawl:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
    headless=True,  # Run without GUI
    viewport_width=1920,
    viewport_height=1080,
    user_agent="custom-agent"  # Optional custom user agent
)

# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,  # 30 seconds timeout
    screenshot=True,  # Take screenshot
    remove_overlay_elements=True  # Remove popups/overlays
)

async def main():
    # Execute crawl with arun()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawler_config
        )

        # CrawlResult contains everything
        print(f"Success: {result.success}")
        print(f"HTML length: {len(result.html)}")
        print(f"Markdown length: {len(result.markdown)}")
        # result.links is a dict keyed by "internal" and "external"
        link_count = sum(len(group) for group in result.links.values())
        print(f"Links found: {link_count}")

asyncio.run(main())
```

### 2. Configuration Deep Dive

**BrowserConfig** - Controls the browser instance:

- `headless`: Run with/without GUI
- `viewport_width/height`: Browser dimensions
- `user_agent`: Custom user agent string
- `cookies`: Pre-set cookies
- `headers`: Custom HTTP headers

**CrawlerRunConfig** - Controls each crawl:

- `page_timeout`: Maximum page load/JS execution time (ms)
- `wait_for`: CSS selector or JS condition to wait for (optional)
- `cache_mode`: Control caching behavior
- `js_code`: Execute custom JavaScript
- `screenshot`: Capture page screenshot
- `session_id`: Persist session across crawls

### 3. Content Processing

Basic content operations available in every crawl:

```python
result = await crawler.arun(url)

# Access extracted content
markdown = result.markdown  # Clean markdown
html = result.html  # Raw HTML
cleaned_html = result.cleaned_html  # Cleaned HTML (still markup, not plain text)

# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]

# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
```

## Markdown Generation (Primary Use Case)

### 1. Basic Markdown Extraction

Crawl4AI excels at generating clean, well-formatted markdown:

```python
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # High-quality markdown ready for LLMs
    with open("documentation.md", "w", encoding="utf-8") as f:
        f.write(result.markdown)
```
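Markdown saved this way usually has to be split into chunks before it goes into a RAG pipeline or vector store. The sketch below is one minimal, standard-library way to do that: it splits on markdown headings and caps chunk size. The names `split_markdown_by_heading` and `max_chars` are hypothetical helpers introduced for illustration and are not part of Crawl4AI.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_markdown_by_heading(markdown: str, max_chars: int = 2000) -> List[Dict[str, str]]:
    """Split markdown into one chunk per heading section, capped at max_chars."""
    chunks: List[Dict[str, str]] = []
    title = "(preamble)"  # label for text that appears before the first heading
    buffer: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buffer).strip()
        # Slice long sections so that no chunk exceeds max_chars
        for start in range(0, len(text), max_chars):
            chunks.append({"title": title, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with "#" inside code fences are comments, not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Calling `split_markdown_by_heading(result.markdown)` would then yield a list of `{"title", "text"}` dicts ready for embedding. Slicing by character count is a simplification; a token-based cap may suit a specific model better.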

### 2. Fit Markdown (Content Filtering)

Use content filters to get only relevant content:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")

# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)

md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown)  # Filtered markdown
print(result.markdown.raw_markdown)  # Original markdown
```
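To build intuition for what a BM25 threshold does, the following sketch scores text chunks against a query with a plain Okapi BM25 formula and keeps those at or above a cutoff. It is a self-contained illustration and not Crawl4AI's implementation; the tokenizer, the `k1` and `b` defaults, and the helper names are assumptions made here.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 relevance score per chunk for the given query."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / max(n_docs, 1)
    # Document frequency: in how many chunks each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation penalises long chunks that mention a term only in passing
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(query: str, chunks: List[str], threshold: float = 1.0) -> List[str]:
    """Keep only the chunks whose BM25 score reaches the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(query, chunks)) if s >= threshold]
```

Raising the threshold keeps fewer, more query-specific chunks; lowering it keeps more context at the cost of more noise.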

### 3. Markdown Customization

Control markdown generation with options:

```python
config = CrawlerRunConfig(
    # Exclude elements from markdown
    excluded_tags=["nav", "footer", "aside"],

    # Focus on specific CSS selector
    css_selector=".main-content",

    # Clean up formatting
    remove_forms=True,
    remove_overlay_elements=True,

    # Control link handling
    exclude_external_links=True,
    exclude_internal_links=False
)

# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "image_alt_text": True
    }
)
```

## Data Extraction

### 1. Schema-Based Extraction (Most Efficient)

For repetitive patterns, generate schema once and reuse:

```bash
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
```

### 2. Manual CSS/JSON Extraction

When you know the structure:

```python
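from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {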
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```
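Once a crawl runs with this strategy, the extracted records typically come back as a JSON string (assumed here to be available as `result.extracted_content`). The sketch below shows one way to parse and validate that string against the field names in the schema; `parse_records` and the sample payload are hypothetical and exist only for illustration.

```python
import json
from typing import Any, Dict, List, Optional

# Mirrors the field names declared in the schema above
REQUIRED_FIELDS = ("title", "date", "content")


def parse_records(extracted_content: Optional[str]) -> List[Dict[str, str]]:
    """Parse a JSON array of records and drop entries with missing or empty fields."""
    if not extracted_content:
        return []
    data: Any = json.loads(extracted_content)
    if not isinstance(data, list):
        return []

    records = []
    for item in data:
        if not isinstance(item, dict):
            continue
        cleaned = {field: str(item.get(field) or "").strip() for field in REQUIRED_FIELDS}
        # Keep the record only when every required field has a non-empty value
        if all(cleaned.values()):
            records.append(cleaned)
    return records


# Hypothetical payload shaped like the schema's output
sample = json.dumps([
    {"title": "First post", "date": "2026-01-05", "content": " Hello "},
    {"title": "Broken post", "date": None, "content": "No date"},
])
print(parse_records(sample))
```

Validating at this boundary keeps incomplete rows, which often signal a selector that no longer matches the page, out of downstream storage.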

### 3. LLM-Based Extraction

For complex or irregular content:

```python
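from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Note: some Crawl4AI releases may expect the provider to be passed through an
# LLM config object instead; check the SDK reference for your installed version.
extraction_strategy = LLMExtractionStrategy(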
    provider="openai/gpt-4o-mini",
    instruction="Extract key financial metrics and quarterly trends"
)
```

## Advanced Patterns

### 1. Deep Crawling

Discover and crawl links from a page:

```python
# Basic link discovery
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)

    # Extract and process discovered links
    internal_links = result.links.get("internal", [])
    external_links = result.links.get("external", [])

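    # Crawl discovered internal links
    for link in internal_links:
        # Link entries may be dicts with an "href" key rather than plain strings
        href = link.get("href", "") if isinstance(link, dict) else link
        if "/blog/" in href and "/tag/" not in href:  # Filter links
            sub_result = await crawler.arun(href)
            # Process sub-page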

    # For advanced deep crawling, consider using URL seeding patterns
    # or custom crawl strategies (see complete-sdk-reference.md)
```
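When following discovered links, it helps to normalise and deduplicate them first so the same page is not crawled twice. The sketch below uses only `urllib.parse`; it assumes link entries are either plain strings or dicts carrying an `href` key, and `select_links` with its `include`/`exclude` parameters is a hypothetical helper, not a Crawl4AI function.

```python
from typing import Any, Iterable, List, Optional, Set
from urllib.parse import urldefrag, urljoin, urlparse


def select_links(
    base_url: str,
    links: Iterable[Any],
    include: str = "/blog/",
    exclude: str = "/tag/",
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return absolute, same-host, deduplicated URLs that match the path filters."""
    seen = set() if seen is None else seen
    base_host = urlparse(base_url).netloc
    selected = []
    for link in links:
        href = link.get("href", "") if isinstance(link, dict) else str(link)
        if not href:
            continue
        # Resolve relative paths and drop "#fragment" so variants of one page collapse
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc != base_host:
            continue  # skip links that leave the site
        if include not in absolute or exclude in absolute:
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        selected.append(absolute)
    return selected
```

Passing the same `seen` set across pages keeps a multi-level crawl from revisiting URLs it has already queued.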

### 2. Batch & Multi-URL Processing

Efficiently crawl multiple URLs:

```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

async with AsyncWebCrawler() as crawler:
    # Concurrent crawling with arun_many()
    results = await crawler.arun_many(
        urls=urls,
        config=crawler_config,
        max_concurrent=5  # Control concurrency
    )

    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.markdown)} chars")
```
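The idea behind capping concurrency can be sketched with a plain `asyncio.Semaphore`, independent of any crawler. In the example below, `crawl_bounded` and `fake_fetch` are hypothetical stand-ins; in practice the `fetch` callable would wrap a real crawl call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def crawl_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> List[str]:
    """Run fetch(url) for every URL with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url: str) -> str:
        async with semaphore:  # waits here while max_concurrent fetches are running
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl call; sleeps briefly to simulate I/O."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    pages = asyncio.run(crawl_bounded(["https://site1.com", "https://site2.com"], fake_fetch))
    print(pages)
```

Because each headless browser page is expensive, a small limit is usually a safer starting point than an unbounded `gather`.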

### 3. Session & Authentication

Handle login-required content:

```python
# First crawl - establish session and login
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
    document.querySelector('#username').value = 'myuser';
    document.querySelector('#password').value = 'mypass';
    document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"  # Wait for post-login element
)

await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
```
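Hardcoding credentials in `js_code`, as in the example above, is fine for a demo but risky in a shared repository. One possible alternative, sketched here, reads them from environment variables and escapes them with `json.dumps` so quotes or backslashes in a password cannot break the injected script. The variable names `CRAWL_USERNAME` and `CRAWL_PASSWORD` and both helper functions are assumptions for illustration.

```python
import json
import os
from typing import Mapping


def build_login_js(username: str, password: str) -> str:
    """Build the login script with credentials embedded as escaped string literals."""
    # json.dumps output is a valid JavaScript string literal, quotes included
    return (
        f"document.querySelector('#username').value = {json.dumps(username)};\n"
        f"document.querySelector('#password').value = {json.dumps(password)};\n"
        "document.querySelector('#submit').click();"
    )


def login_js_from_env(environ: Mapping[str, str] = os.environ) -> str:
    """Read credentials from the environment and fail with a clear message if unset."""
    try:
        return build_login_js(environ["CRAWL_USERNAME"], environ["CRAWL_PASSWORD"])
    except KeyError as missing:
        raise RuntimeError(f"Set {missing.args[0]} before running the login crawl") from None
```

The returned string can be passed as `js_code` in the login configuration in place of the inline script.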

### 4. Dynamic Content Handling

For JavaScript-heavy sites:

```python
config = CrawlerRunConfig(
    # Wait for dynamic content
    wait_for="css:.ajax-content",

    # Execute JavaScript
    js_code="""
    // Scroll to load content
    window.scrollTo(0, document.body.scrollHeight);

    // Click load more button
    document.querySelector('.load-more')?.click();
    """,

    # Note: For virtual scrolling (Twitter/Instagram-style),
    # use virtual_scroll_config parameter (see docs)

    # Extended timeout for slow loading
    page_timeout=60000
)
```

### 5. Anti-Detection & Proxies

Avoid bot detection:

```python
# Proxy configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.server:8080",
        "username": "user",
        "password": "pass"
    }
)

# For stealth/undetected browsing, consider:
# - Rotating user agents via user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests

# Rate limiting
import asyncio
for url in urls:
    result = await crawler.arun(url)
    await asyncio.sleep(2)  # Delay between requests
```
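A fixed delay can be extended into retry with exponential backoff and jitter, which spaces out repeated attempts after a failure instead of hammering the site. The sketch below is generic `asyncio` code; `with_backoff` and its parameters are hypothetical, and the `operation` callable stands in for whatever crawl call is being retried.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_backoff(
    operation: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await operation(), retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped at max_delay
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random duration between 0 and the ceiling
            await sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

The `sleep` parameter is injectable so the behaviour can be checked without waiting in real time.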

## Common Use Cases

### Documentation to Markdown

```python
# Convert a documentation page to clean markdown
# (loop over discovered internal links to cover a whole site)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # Save as markdown for LLM consumption
    with open("docs.md", "w", encoding="utf-8") as f:
        f.write(result.markdown)
```

### E-commerce Product Monitoring

```python
import json

# Generate the schema once for product pages,
# then monitor prices/availability without LLM costs
with open("product_schema.json", encoding="utf-8") as f:
    schema = json.load(f)
products = await crawler.arun_many(product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema)))
```

### News Aggregation

```python
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)

# Extract articles with Fit Markdown
for result in results:
    if result.success:
        # Get only relevant content (populated when a content filter is configured)
        article = result.markdown.fit_markdown
```

### Research & Data Collection

```python
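# Academic paper collection with focused extraction.
# Query-focused filtering reuses the BM25 filter from "Fit Markdown (Content Filtering)".
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="machine learning transformers")
    )
)
# To cap output size, truncate result.markdown.fit_markdown after the crawl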
```

## Resources

### scripts/

- **extraction_pipeline.py** - Three extraction approaches with schema generation
- **basic_crawler.py** - Simple markdown extraction with screenshots
- **batch_crawler.py** - Multi-URL concurrent processing

### references/

- **complete-sdk-reference.md** - Complete SDK documentation (23K words) with all parameters, methods, and advanced features

### Example Code Repository

The Crawl4AI repository includes extensive examples in `docs/examples/`:

#### Core Examples

- **quickstart.py** - Comprehensive starter with all basic patterns:
  - Simple crawling, JavaScript execution, CSS selectors
  - Content filtering, link analysis, media handling
  - LLM extraction, CSS extraction, dynamic content
  - Browser comparison, SSL certificates

#### Specialized Examples

- **amazon_product_extraction_*.py** - Three approaches for e-commerce scraping
- **extraction_strategies_examples.py** - All extraction strategies demonstrated
- **deepcrawl_example.py** - Advanced deep crawling patterns
- **crypto_analysis_example.py** - Complex data extraction with analysis
- **parallel_execution_example.py** - High-performance concurrent crawling
- **session_management_example.py** - Authentication and session handling
- **markdown_generation_example.py** - Advanced markdown customization
- **hooks_example.py** - Custom hooks for crawl lifecycle events
- **proxy_rotation_example.py** - Proxy management and rotation
- **router_example.py** - Request routing and URL patterns

#### Advanced Patterns

- **adaptive_crawling/** - Intelligent crawling strategies
- **c4a_script/** - C4A script examples
- **docker_*.py** - Docker deployment patterns

To explore examples:

```python
# The examples are located in your Crawl4AI installation:
# Look in: docs/examples/ directory

# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more

# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py

# Run any example directly:
# python docs/examples/quickstart.py
```

## Best Practices

1. **Start with basic crawling** - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
2. **Use markdown generation** for documentation and content - Crawl4AI excels at clean markdown extraction
3. **Try schema generation first** for structured data - 10-100x more efficient than LLM extraction
4. **Enable caching during development** - `cache_mode=CacheMode.ENABLED` to avoid repeated requests
5. **Set appropriate timeouts** - 30s for normal sites, 60s+ for JavaScript-heavy sites
6. **Respect rate limits** - Use delays and `max_concurrent` parameter
7. **Reuse sessions** for authenticated content instead of re-logging

## Troubleshooting

**JavaScript not loading:**

```python
config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for specific element
    page_timeout=60000  # Increase timeout
)
```

**Bot detection issues:**

```python
import asyncio
import random

browser_config = BrowserConfig(
    headless=False,  # Sometimes visible browsing helps
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
# Add delays between requests
await asyncio.sleep(random.uniform(2, 5))
```

**Content extraction problems:**

```python
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {sum(len(group) for group in result.links.values())}")

# Try different wait strategies
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('.content') !== null"
)
```

**Session/auth issues:**

```python
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
# CrawlResult may not expose a `cookies` attribute in every release
print(f"Cookies: {getattr(result, 'cookies', None)}")
```

For more details on any topic, refer to `references/complete-sdk-reference.md` which contains comprehensive documentation of all features, parameters, and advanced usage patterns.

Quoted from basher83/agent-auditor for reference; see the original for the authoritative, latest version.

Full skill instructions (original source: basher83/agent-auditor)
Crawl4AI is a browser-based crawling framework built to convert complex, JavaScript-heavy web pages into clean, LLM-ready markdown. It handles browser automation so developers can extract structured data without manually managing session persistence or DOM parsing. Because pages are rendered in a headless browser, it captures content as a visitor's browser would display it, including content generated by JavaScript; sites with aggressive CAPTCHA or blocking measures may still require extra tooling such as proxy rotation. It suits developers building data pipelines, training datasets, or internal research agents who need high-quality text extraction from the web. Unlike basic request libraries, Crawl4AI manages page rendering, overlay removal, and content filtering within a single interface, which reduces the post-processing needed to strip headers, footers, and other noise from raw HTML.

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/crawl4ai/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/basher83/agent-auditor/crawl4ai/SKILL.md
  • Cursor: ~/.cursor/skills/basher83/agent-auditor/crawl4ai/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/basher83/agent-auditor/crawl4ai/SKILL.md

Install with CLI:
npx skills add basher83/agent-auditor


How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best used in the "Rules for AI" section. Loaded there, this unit gives the agent Crawl4AI-specific guidance, which should lead to cleaner, more efficient crawling code.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under AI Tools & Agents and is published by basher83, maintained in basher83/agent-auditor.
