Back to Security & Vulnerability Analysis

sarif-parsing

Name: sarif-parsing
Author: Trail of Bits

SARIFsecuritystatic analysisvulnerability managementCI/CDcode securityscan resultsanalysis

⭐ 5.7k📄 CC-BY-SA-4.0🕒 2026-06-15Source ↗

Install this skill

npx skills add trailofbits/skills

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

SARIF (Static Analysis Results Interchange Format) parsing provides a standardized approach to interpreting machine-readable security scan logs. By interacting with the SARIF 2.1.0 schema, you can programmatically extract, normalize, and manipulate vulnerability data from diverse security tooling. This skill focuses on the logical traversal of SARIF structures, specifically the runs, results, and artifacts arrays, to facilitate efficient data extraction. It enables the creation of custom dashboards, automated filtering pipelines, and deduplication logic based on stable fingerprinting rather than volatile file paths. By mastering the manipulation of these JSON-based outputs, you transition from viewing raw diagnostic logs to building actionable security intelligence that maintains consistency across varying CI/CD environments and distinct scanning methodologies.

When to Use This Skill

•Aggregating scan results from multiple tools into a centralized security dashboard
•Suppressing known false positives across multiple development branches
•Creating custom CI/CD gates that fail builds based on specific rule violations
•Converting static analysis reports for ingestion into third-party vulnerability management systems

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

“Parse this SARIF file and list all errors
“Extract all file paths and line numbers from these SARIF results
“Help me filter these scan findings by rule ID
“Convert this SARIF output into a summary report
“How can I compare two SARIF files for regression detection?

Pro Tips

💡Always validate SARIF files against the official SARIF 2.1.0 JSON schema to ensure correctness and prevent parsing errors.
💡When aggregating results from multiple tools, leverage common properties like 'ruleId' and 'fingerprints' to accurately deduplicate findings.
💡Consider scripting common SARIF processing tasks, such as extracting specific vulnerability types or converting to other formats, to automate your security workflows.

What this skill does

•Programmatic extraction of specific finding metadata from nested SARIF arrays
•Normalization of heterogeneous data from multiple security scanners
•Identification of unique vulnerabilities via fingerprint analysis for deduplication
•Conversion of security findings into custom formats like CSV, HTML, or Jira issues
•Automated filtering of findings based on severity level, rule ID, or location

When not to use it

✕Running the actual static analysis scans or linting source code
✕Attempting to remediate identified security vulnerabilities
✕Manually editing large-scale SARIF files without programmatic validation

Example workflow

Load the raw SARIF file into a Python environment using a dedicated loader library
Iterate through the 'results' array to normalize data across different scanner runs
Apply filtering logic to remove findings flagged as false positives
Aggregate remaining data by rule ID and severity to prioritize remediation tasks
Format the cleaned output into a standard report or push to a ticketing API

Prerequisites

–Access to a generated SARIF 2.1.0 formatted log file
–Basic familiarity with JSON schema structures
–Python or jq environment for data processing

Pitfalls & limitations

!Relying on file paths for deduplication instead of using stable fingerprints
!Ignoring the 'partialFingerprints' field which is critical for cross-run tracking
!Assuming all security tools populate optional SARIF fields consistently

FAQ

Why should I use SARIF parsing instead of reading raw tool logs?

SARIF is a structured, standardized schema, whereas raw logs vary wildly between tools. Parsing SARIF ensures your automation logic works regardless of which scanner produced the output.

Can SARIF files be used to track vulnerabilities over time?

Yes, by utilizing the fingerprinting mechanism defined in the schema, you can identify recurring issues across multiple scan runs even if file paths change.

Is every SARIF file compliant with the 2.1.0 standard?

Most modern tools target 2.1.0, but some legacy or proprietary tools may output variations. Always validate your input against the OASIS standard if processing fails.

How it compares

While manual parsing or generic regex-based extraction is error-prone due to deep nesting, this skill applies targeted library-based traversal that accounts for schema-specific relationships like physicalLocations and rule definitions.

Source & trust

⭐ 5.7k stars📄 CC-BY-SA-4.0🕒 Updated 2026-06-15

View original skill on GitHub →

📄 Full skill instructions — original source: trailofbits/skills

# SARIF Parsing Best Practices

You are a SARIF parsing expert. Your role is to help users effectively read, analyze, and process SARIF files from static analysis tools.

## When to Use

Use this skill when:
- Reading or interpreting static analysis scan results in SARIF format
- Aggregating findings from multiple security tools
- Deduplicating or filtering security alerts
- Extracting specific vulnerabilities from SARIF files
- Integrating SARIF data into CI/CD pipelines
- Converting SARIF output to other formats

## When NOT to Use

Do NOT use this skill for:
- Running static analysis scans (use CodeQL or Semgrep skills instead)
- Writing CodeQL or Semgrep rules (use their respective skills)
- Analyzing source code directly (SARIF is for processing existing scan results)
- Triaging findings without SARIF input (use variant-analysis or audit skills)

## SARIF Structure Overview

SARIF 2.1.0 is the current OASIS standard. Every SARIF file has this hierarchical structure:

sarifLog
├── version: "2.1.0"
├── $schema: (optional, enables IDE validation)
└── runs[] (array of analysis runs)
    ├── tool
    │   ├── driver
    │   │   ├── name (required)
    │   │   ├── version
    │   │   └── rules[] (rule definitions)
    │   └── extensions[] (plugins)
    ├── results[] (findings)
    │   ├── ruleId
    │   ├── level (error/warning/note)
    │   ├── message.text
    │   ├── locations[]
    │   │   └── physicalLocation
    │   │       ├── artifactLocation.uri
    │   │       └── region (startLine, startColumn, etc.)
    │   ├── fingerprints{}
    │   └── partialFingerprints{}
    └── artifacts[] (scanned files metadata)

### Why Fingerprinting Matters

Without stable fingerprints, you can't track findings across runs:

- **Baseline comparison**: "Is this a new finding or did we see it before?"
- **Regression detection**: "Did this PR introduce new vulnerabilities?"
- **Suppression**: "Ignore this known false positive in future runs"

Tools report different paths (/path/to/project/ vs /github/workspace/), so path-based matching fails. Fingerprints hash the *content* (code snippet, rule ID, relative location) to create stable identifiers regardless of environment.

## Tool Selection Guide

| Use Case | Tool | Installation |
|----------|------|--------------|
| Quick CLI queries | jq | brew install jq / apt install jq |
| Python scripting (simple) | pysarif | pip install pysarif |
| Python scripting (advanced) | sarif-tools | pip install sarif-tools |
| .NET applications | SARIF SDK | NuGet package |
| JavaScript/Node.js | sarif-js | npm package |
| Go applications | garif | go get github.com/chavacava/garif |
| Validation | SARIF Validator | sarifweb.azurewebsites.net |

## Strategy 1: Quick Analysis with jq

For rapid exploration and one-off queries:

# Pretty print the file
jq '.' results.sarif

# Count total findings
jq '[.runs[].results[]] | length' results.sarif

# List all rule IDs triggered
jq '[.runs[].results[].ruleId] | unique' results.sarif

# Extract errors only
jq '.runs[].results[] | select(.level == "error")' results.sarif

# Get findings with file locations
jq '.runs[].results[] | {
  rule: .ruleId,
  message: .message.text,
  file: .locations[0].physicalLocation.artifactLocation.uri,
  line: .locations[0].physicalLocation.region.startLine
}' results.sarif

# Filter by severity and get count per rule
jq '[.runs[].results[] | select(.level == "error")] | group_by(.ruleId) | map({rule: .[0].ruleId, count: length})' results.sarif

# Extract findings for a specific file
jq --arg file "src/auth.py" '.runs[].results[] | select(.locations[].physicalLocation.artifactLocation.uri | contains($file))' results.sarif

## Strategy 2: Python with pysarif

For programmatic access with full object model:

from pysarif import load_from_file, save_to_file

# Load SARIF file
sarif = load_from_file("results.sarif")

# Iterate through runs and results
for run in sarif.runs:
    tool_name = run.tool.driver.name
    print(f"Tool: {tool_name}")

    for result in run.results:
        print(f"  [{result.level}] {result.rule_id}: {result.message.text}")

        if result.locations:
            loc = result.locations[0].physical_location
            if loc and loc.artifact_location:
                print(f"    File: {loc.artifact_location.uri}")
                if loc.region:
                    print(f"    Line: {loc.region.start_line}")

# Save modified SARIF
save_to_file(sarif, "modified.sarif")

## Strategy 3: Python with sarif-tools

For aggregation, reporting, and CI/CD integration:

from sarif import loader

# Load single file
sarif_data = loader.load_sarif_file("results.sarif")

# Or load multiple files
sarif_set = loader.load_sarif_files(["tool1.sarif", "tool2.sarif"])

# Get summary report
report = sarif_data.get_report()

# Get histogram by severity
errors = report.get_issue_type_histogram_for_severity("error")
warnings = report.get_issue_type_histogram_for_severity("warning")

# Filter results
high_severity = [r for r in sarif_data.get_results()
                 if r.get("level") == "error"]

**sarif-tools CLI commands:**

# Summary of findings
sarif summary results.sarif

# List all results with details
sarif ls results.sarif

# Get results by severity
sarif ls --level error results.sarif

# Diff two SARIF files (find new/fixed issues)
sarif diff baseline.sarif current.sarif

# Convert to other formats
sarif csv results.sarif > results.csv
sarif html results.sarif > report.html

## Strategy 4: Aggregating Multiple SARIF Files

When combining results from multiple tools:

import json
from pathlib import Path

def aggregate_sarif_files(sarif_paths: list[str]) -> dict:
    """Combine multiple SARIF files into one."""
    aggregated = {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": []
    }

    for path in sarif_paths:
        with open(path) as f:
            sarif = json.load(f)
            aggregated["runs"].extend(sarif.get("runs", []))

    return aggregated

def deduplicate_results(sarif: dict) -> dict:
    """Remove duplicate findings based on fingerprints."""
    seen_fingerprints = set()

    for run in sarif["runs"]:
        unique_results = []
        for result in run.get("results", []):
            # Use partialFingerprints or create key from location
            fp = None
            if result.get("partialFingerprints"):
                fp = tuple(sorted(result["partialFingerprints"].items()))
            elif result.get("fingerprints"):
                fp = tuple(sorted(result["fingerprints"].items()))
            else:
                # Fallback: create fingerprint from rule + location
                loc = result.get("locations", [{}])[0]
                phys = loc.get("physicalLocation", {})
                fp = (
                    result.get("ruleId"),
                    phys.get("artifactLocation", {}).get("uri"),
                    phys.get("region", {}).get("startLine")
                )

            if fp not in seen_fingerprints:
                seen_fingerprints.add(fp)
                unique_results.append(result)

        run["results"] = unique_results

    return sarif

## Strategy 5: Extracting Actionable Data

import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    rule_id: str
    level: str
    message: str
    file_path: Optional[str]
    start_line: Optional[int]
    end_line: Optional[int]
    fingerprint: Optional[str]

def extract_findings(sarif_path: str) -> list[Finding]:
    """Extract structured findings from SARIF file."""
    with open(sarif_path) as f:
        sarif = json.load(f)

    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = result.get("locations", [{}])[0]
            phys = loc.get("physicalLocation", {})
            region = phys.get("region", {})

            findings.append(Finding(
                rule_id=result.get("ruleId", "unknown"),
                level=result.get("level", "warning"),
                message=result.get("message", {}).get("text", ""),
                file_path=phys.get("artifactLocation", {}).get("uri"),
                start_line=region.get("startLine"),
                end_line=region.get("endLine"),
                fingerprint=next(iter(result.get("partialFingerprints", {}).values()), None)
            ))

    return findings

# Filter and prioritize
def prioritize_findings(findings: list[Finding]) -> list[Finding]:
    """Sort findings by severity."""
    severity_order = {"error": 0, "warning": 1, "note": 2, "none": 3}
    return sorted(findings, key=lambda f: severity_order.get(f.level, 99))

## Common Pitfalls and Solutions

### 1. Path Normalization Issues

Different tools report paths differently (absolute, relative, URI-encoded):

from urllib.parse import unquote
from pathlib import Path

def normalize_path(uri: str, base_path: str = "") -> str:
    """Normalize SARIF artifact URI to consistent path."""
    # Remove file:// prefix if present
    if uri.startswith("file://"):
        uri = uri[7:]

    # URL decode
    uri = unquote(uri)

    # Handle relative paths
    if not Path(uri).is_absolute() and base_path:
        uri = str(Path(base_path) / uri)

    # Normalize separators
    return str(Path(uri))

### 2. Fingerprint Mismatch Across Runs

Fingerprints may not match if:
- File paths differ between environments
- Tool versions changed fingerprinting algorithm
- Code was reformatted (changing line numbers)

**Solution:** Use multiple fingerprint strategies:

def compute_stable_fingerprint(result: dict, file_content: str = None) -> str:
    """Compute environment-independent fingerprint."""
    import hashlib

    components = [
        result.get("ruleId", ""),
        result.get("message", {}).get("text", "")[:100],  # First 100 chars
    ]

    # Add code snippet if available
    if file_content and result.get("locations"):
        region = result["locations"][0].get("physicalLocation", {}).get("region", {})
        if region.get("startLine"):
            lines = file_content.split("\n")
            line_idx = region["startLine"] - 1
            if 0 <= line_idx < len(lines):
                # Normalize whitespace
                components.append(lines[line_idx].strip())

    return hashlib.sha256("".join(components).encode()).hexdigest()[:16]

### 3. Missing or Incomplete Data

SARIF allows many optional fields. Always use defensive access:

def safe_get_location(result: dict) -> tuple[str, int]:
    """Safely extract file and line from result."""
    try:
        loc = result.get("locations", [{}])[0]
        phys = loc.get("physicalLocation", {})
        file_path = phys.get("artifactLocation", {}).get("uri", "unknown")
        line = phys.get("region", {}).get("startLine", 0)
        return file_path, line
    except (IndexError, KeyError, TypeError):
        return "unknown", 0

### 4. Large File Performance

For very large SARIF files (100MB+):

import ijson  # pip install ijson

def stream_results(sarif_path: str):
    """Stream results without loading entire file."""
    with open(sarif_path, "rb") as f:
        # Stream through results arrays
        for result in ijson.items(f, "runs.item.results.item"):
            yield result

### 5. Schema Validation

Validate before processing to catch malformed files:

# Using ajv-cli
npm install -g ajv-cli
ajv validate -s sarif-schema-2.1.0.json -d results.sarif

# Using Python jsonschema
pip install jsonschema

from jsonschema import validate, ValidationError
import json

def validate_sarif(sarif_path: str, schema_path: str) -> bool:
    """Validate SARIF file against schema."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    with open(schema_path) as f:
        schema = json.load(f)

    try:
        validate(sarif, schema)
        return True
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        return False

## CI/CD Integration Patterns

### GitHub Actions

- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif

- name: Check for high severity
  run: |
    HIGH_COUNT=$(jq '[.runs[].results[] | select(.level == "error")] | length' results.sarif)
    if [ "$HIGH_COUNT" -gt 0 ]; then
      echo "Found $HIGH_COUNT high severity issues"
      exit 1
    fi

### Fail on New Issues

from sarif import loader

def check_for_regressions(baseline: str, current: str) -> int:
    """Return count of new issues not in baseline."""
    baseline_data = loader.load_sarif_file(baseline)
    current_data = loader.load_sarif_file(current)

    baseline_fps = {get_fingerprint(r) for r in baseline_data.get_results()}
    new_issues = [r for r in current_data.get_results()
                  if get_fingerprint(r) not in baseline_fps]

    return len(new_issues)

## Key Principles

1. **Validate first**: Check SARIF structure before processing
2. **Handle optionals**: Many fields are optional; use defensive access
3. **Normalize paths**: Tools report paths differently; normalize early
4. **Fingerprint wisely**: Combine multiple strategies for stable deduplication
5. **Stream large files**: Use ijson or similar for 100MB+ files
6. **Aggregate thoughtfully**: Preserve tool metadata when combining files

## Skill Resources

For ready-to-use query templates, see [{baseDir}/resources/jq-queries.md]({baseDir}/resources/jq-queries.md):
- 40+ jq queries for common SARIF operations
- Severity filtering, rule extraction, aggregation patterns

For Python utilities, see [{baseDir}/resources/sarif_helpers.py]({baseDir}/resources/sarif_helpers.py):
- normalize_path() - Handle tool-specific path formats
- compute_fingerprint() - Stable fingerprinting ignoring paths
- deduplicate_results() - Remove duplicates across runs

## Reference Links

- [OASIS SARIF 2.1.0 Specification](https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html)
- [Microsoft SARIF Tutorials](https://github.com/microsoft/sarif-tutorials)
- [SARIF SDK (.NET)](https://github.com/microsoft/sarif-sdk)
- [sarif-tools (Python)](https://github.com/microsoft/sarif-tools)
- [pysarif (Python)](https://github.com/Kjeld-P/pysarif)
- [GitHub SARIF Support](https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning)
- [SARIF Validator](https://sarifweb.azurewebsites.net/)

By Trail of Bits

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Click "Download" above
In your project, create the directory: .agent/skills/sarif-parsing/
Save the file as SKILL.md
The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/trailofbits/skills/sarif-parsing/SKILL.md
Cursor: ~/.cursor/skills/trailofbits/skills/sarif-parsing/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/trailofbits/skills/sarif-parsing/SKILL.md

🚀 Install with CLI:
npx skills add trailofbits/skills

Read the Master Guide: Mastering Agent Skills →

Related Skill Units

Recommended Rules

View more rules →

🔒 Security Audit Agent - Vulnerability Detection

Agentic AISecurityVulnerability

You are an expert security audit agent specialized in identifying vulnerabilities and security risks. Apply systematic reasoning following OWASP guide...

🚀 DevOps & CI/CD Agent - Pipeline Expert

Agentic AIDevOpsCI/CD

You are an expert DevOps and CI/CD agent specialized in designing and implementing robust deployment pipelines and infrastructure. Apply systematic re...

Python Security Best Practices

PythonSecurityCryptography

You are an expert in Python security and secure coding practices. Key Principles: - Never trust user input - Use principle of least privilege - Keep ...

Recommended Workflows

View more workflows →

Security Hardening Checklist

SecurityHeadersCSP

--- description: Essential security headers, CSP, and rate limiting --- 1. **Security Headers (`next.config.js`)**: - Add these headers to prevent...

Supabase Row Level Security (RLS)

SupabaseDatabaseSecurity

--- description: Define secure database policies to protect user data --- 1. **Enable RLS**: - Always enable RLS on every table you create. ```...

Implement Rate Limiting

SecurityRate LimitingAPI

--- description: Protect APIs with rate limits --- 1. **Install Upstash**: // turbo - Run `npm install @upstash/ratelimit @upstash/redis` 2. *...

Recommended MCP Servers

View more MCP servers →

CrowdStrike Falcon

Official

Connects AI agents with the CrowdStrike Falcon platform for intelligent security analysis, providing programmatic access to detections, incidents, behaviors, threat intelligence, hosts, vulnerabilities, and identity protection capabilities.

PDFActionInspector

Official

A Model Context Protocol server for extracting and analyzing JavaScript Actions from PDF files. Provides comprehensive security analysis to detect malicious PDF behaviors, hidden scripts, and potential security threats through AI-assisted risk assessment.

ADR Analysis

Community

AI-powered Architectural Decision Records (ADR) analysis server that provides architectural insights, technology stack detection, security checks, and TDD workflow enhancement for software development projects.

Take It Further

Maximize your productivity with these powerful resources

📋

Define Your Standards

Set up coding standards to ensure this workflow produces consistent, high-quality results.

Browse Rules Library

📖

Master Workflows

Learn how to create custom workflows, use Turbo Mode, and build your automation library.

Complete Guide