baoyu-url-to-markdown

Name: baoyu-url-to-markdown
Author: Jim Liu

markdownweb scrapingChrome CDPURL conversioncontent extractiondocumentationAI agentautomation

⭐ 21.7k🕒 2026-06-13Source ↗

Install this skill

npx skills add jimliu/baoyu-skills

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The baoyu-url-to-markdown skill provides a specialized utility for extracting web content into formatted Markdown using Chrome CDP for full JavaScript rendering. Instead of traditional static scrapers, this tool handles dynamic pages that rely on client-side execution to display information. It features two operating modes: an automatic capture that processes pages upon loading, and a manual wait mode that pauses the script to allow for user interactions like logging in or dismissing popups. Files are automatically organized into a domain-based folder structure with smart naming conventions and timestamp-based collision handling. By maintaining consistent metadata headers and stripping unnecessary elements, the skill produces clean documentation archives directly from live URLs, suitable for technical research or local knowledge base population.

When to Use This Skill

•Archiving technical documentation from sites requiring login
•Converting complex web articles into readable offline Markdown
•Capturing dynamic web pages that fail with standard curl or wget
•Creating personal knowledge base entries from live research URLs

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

“convert this webpage to markdown
“scrape this url and save it to my docs
“capture this article as a markdown file
“fetch the content from this link for my notes
“save this dynamic webpage into a markdown file

Pro Tips

💡For complex pages or those requiring authentication, always use the `--wait` flag to manually signal when the page is ready for capture, ensuring all dynamic elements are loaded.
💡Leverage the `-o` option to specify an output file path, which is crucial when integrating this skill into automated workflows or scripting environments.
💡Remember that Chrome CDP renders the page fully, so the markdown output will reflect the page's appearance after all scripts have executed, providing a more accurate representation than simple HTML parsers.

What this skill does

•Executes full JavaScript rendering via Chrome CDP
•Supports interactive wait-mode for authenticated sessions
•Generates standardized frontmatter metadata for every page
•Organizes outputs into domain-specific directory structures
•Implements automatic conflict resolution with timestamp suffixes

When not to use it

✕Scraping websites with aggressive anti-bot or CAPTCHA challenges
✕Extracting data from internal web applications that require specific SSO/MFA workflows

Example workflow

Identify target URL and determine if authentication is required
Execute the main script via CLI using the appropriate mode
If in wait-mode, perform manual login or navigation in the browser
Confirm page readiness to finalize the document capture
Verify the output in the automatically generated domain directory

Prerequisites

–Google Chrome or Chromium installed on the system
–Bun runtime environment
–Appropriate file system permissions for the output directory

Pitfalls & limitations

!May struggle with sites that detect automated Chrome sessions
!Default timeout settings might be insufficient for slow-loading heavy pages
!Requires manual intervention for pages with complex multi-step login flows

FAQ

How do I handle pages that require a login?

Use the --wait flag. This allows you to log into the page manually in the browser before signaling the script to perform the final capture.

Where are my saved files stored?

Files are stored in a folder structure organized by domain, specifically url-to-markdown/<domain>/.

What happens if a filename already exists?

The script automatically appends a timestamp to the filename to prevent overwriting existing data.

Can I customize the chrome path?

Yes, set the URL_CHROME_PATH environment variable to point to your specific browser executable.

How it compares

Unlike standard text-to-markdown prompts which rely on the LLM's cached knowledge or simple HTML parsing, this tool uses actual browser rendering to ensure the captured DOM matches what a user sees on screen.

Source & trust

⭐ 22k stars🕒 Updated 2026-06-13

View original skill on GitHub →

📄 Full skill instructions — original source: jimliu/baoyu-skills

# URL to Markdown

Fetches any URL via Chrome CDP and converts HTML to clean markdown.

## Script Directory

**Important**: All scripts are located in the scripts/ subdirectory of this skill.

**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as SKILL_DIR
2. Script path = ${SKILL_DIR}/scripts/<script-name>.ts
3. Replace all ${SKILL_DIR} in this document with the actual path

**Script Reference**:
| Script | Purpose |
|--------|---------|
| scripts/main.ts | CLI entry point for URL fetching |

## Features

- Chrome CDP for full JavaScript rendering
- Two capture modes: auto or wait-for-user
- Clean markdown output with metadata
- Handles login-required pages via wait mode

## Usage

# Auto mode (default) - capture when page loads
npx -y bun ${SKILL_DIR}/scripts/main.ts <url>

# Wait mode - wait for user signal before capture
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> --wait

# Save to specific file
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> -o output.md

## Options

| Option | Description |
|--------|-------------|
| <url> | URL to fetch |
| -o <path> | Output file path (default: auto-generated) |
| --wait | Wait for user signal before capturing |
| --timeout <ms> | Page load timeout (default: 30000) |

## Capture Modes

### Auto Mode (default)

Page loads → waits for network idle → captures immediately.

Best for:
- Public pages
- Static content
- No login required

### Wait Mode (--wait)

Page opens → user can interact (login, scroll, etc.) → user signals ready → captures.

Best for:
- Login-required pages
- Dynamic content needing interaction
- Pages with lazy loading

**Agent workflow for wait mode**:
1. Run script with --wait flag
2. Script outputs: Page opened. Press Enter when ready to capture...
3. Use AskUserQuestion to ask user if page is ready
4. When user confirms, send newline to stdin to trigger capture

## Output Format

---
url: https://example.com/page
title: "Page Title"
description: "Meta description if available"
author: "Author if available"
published: "2024-01-01"
captured_at: "2024-01-15T10:30:00Z"
---

# Page Title

Converted markdown content...

## Mode Selection Guide

When user requests URL capture, help select appropriate mode:

**Suggest Auto Mode when**:
- URL is public (no login wall visible)
- Content appears static
- User doesn't mention login requirements

**Suggest Wait Mode when**:
- User mentions needing to log in
- Site known to require authentication
- User wants to scroll/interact before capture
- Content is behind paywall

**Ask user when unclear**:

The page may require login or interaction before capturing.

Which mode should I use?
1. Auto - Capture immediately when loaded
2. Wait - Wait for you to interact first

## Output Directory

Each capture creates a file organized by domain:

url-to-markdown/
└── <domain>/
    └── <slug>.md

**Path Components**:
- <domain>: Site domain (e.g., example.com, github.com)
- <slug>: Generated from page title or URL path (kebab-case)

**Slug Generation**:
1. Extract from page title (preferred) or URL path
2. Convert to kebab-case, 2-6 words
3. Example: "Getting Started with React" → getting-started-with-react

**Conflict Resolution**:
If url-to-markdown/<domain>/<slug>.md already exists:
- Append timestamp: <slug>-YYYYMMDD-HHMMSS.md
- Example: getting-started.md exists → getting-started-20260118-143052.md

## Error Handling

| Error | Resolution |
|-------|------------|
| Chrome not found | Install Chrome or set URL_CHROME_PATH env |
| Page timeout | Increase --timeout value |
| Capture failed | Try wait mode for complex pages |
| Empty content | Page may need JS rendering time |

## Environment Variables

| Variable | Description |
|----------|-------------|
| URL_CHROME_PATH | Custom Chrome executable path |
| URL_DATA_DIR | Custom data directory |
| URL_CHROME_PROFILE_DIR | Custom Chrome profile directory |

## Extension Support

Custom configurations via EXTEND.md.

**Check paths** (priority order):
1. .baoyu-skills/baoyu-url-to-markdown/EXTEND.md (project)
2. ~/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md (user)

If found, load before workflow. Extension content overrides defaults.

By Jim Liu

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Click "Download" above
In your project, create the directory: .agent/skills/baoyu-url-to-markdown/
Save the file as SKILL.md
The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/jimliu/baoyu-skills/baoyu-url-to-markdown/SKILL.md
Cursor: ~/.cursor/skills/jimliu/baoyu-skills/baoyu-url-to-markdown/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/jimliu/baoyu-skills/baoyu-url-to-markdown/SKILL.md

🚀 Install with CLI:
npx skills add jimliu/baoyu-skills

Read the Master Guide: Mastering Agent Skills →

Recommended Rules

View more rules →

Recommended Workflows

View more workflows →

Automatic commit message generator

GitAIAutomation

--- description: Automatic commit message generator and fast AI-powered commit for all current changes --- // turbo-all This workflow automatically ...

Setup Husky Git Hooks

GitAutomationQuality

--- description: Automate code quality checks with pre-commit and pre-push hooks --- 1. **Install Husky**: - Install husky and lint-staged. // ...

Create GitHub PR Template

GitGitHubTeam

--- description: Standardize pull request descriptions for better code reviews --- 1. **Create Template Directory**: - GitHub looks for templates ...

Recommended MCP Servers

View more MCP servers →

Clix MCP Server

Official

Clix MCP Server that enables AI agents to provide real-time, trusted Clix documentation and SDK code examples for seamless integrations.

Detailer

Official

Instantly generate rich, AI-powered documentation for your GitHub repositories. Designed for AI agents to gain deep project context before taking action.