RAG Implementation
Install this skill
npx skills add wshobson/agents
Works across Claude Code, Cursor, Codex, Copilot & Antigravity
RAG implementation bridges the gap between static LLM training data and dynamic external information. Text is converted into vector representations and stored in a vector index such as Pinecone or Chroma, so the model can fetch pertinent data before generating an output. The technique combines semantic similarity search with generative reasoning, which lets the model cite specific documents instead of relying only on what it learned during training. Successful execution requires managing embedding quality, choosing appropriate retrieval strategies such as hybrid search or reranking, and orchestrating the flow through tools like LangGraph. This architecture suits developers maintaining private documentation, live knowledge bases, or technical support systems that must provide accurate, traceable, and up-to-date responses.
When to Use This Skill
- Building technical support bots that cite specific manual pages
- Enabling internal research tools to search through legal or medical PDFs
- Creating documentation assistants for complex software codebases
- Maintaining a chat interface that responds based on real-time company data
How to Invoke This Skill
Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:
- “Search our documents for the answer to this question”
- “Use my uploaded technical manuals to write a response”
- “Retrieve relevant context from the database to answer this”
- “Ground your response using these provided knowledge sources”
- “Build a RAG pipeline that checks our knowledge base”
Pro Tips
- Choose your vector database based on project scale and required features: Pinecone for managed scalability, Chroma for lightweight local development.
- Experiment with different embedding models, such as `voyage-3-large`, to find the right balance between speed and accuracy for your domain and data.
- Develop robust document chunking strategies so that the snippets passed to the LLM are relevant and carry enough surrounding context.
What this skill does
- Integrating proprietary document repositories into LLM context windows
- Combining vector-based semantic search with keyword-based BM25 retrieval
- Optimizing result relevance through cross-encoder reranking
- Orchestrating multi-step retrieval and generation flows using LangGraph
- Generating multiple query variations to improve recall
When not to use it
- When the source data is extremely small and fits entirely into a single prompt
- When responses must be deterministic and based on simple conditional logic
- When the cost of maintaining a vector index outweighs the utility of the search
Example workflow
- Split large text documents into manageable chunks
- Generate vector embeddings for each chunk using an API
- Store embeddings in a vector database index
- Perform a similarity search based on the incoming user question
- Rerank the top results to ensure highest quality context
- Pass retrieved fragments to the LLM to synthesize an answer
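The steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding API, the in-memory list stands in for a vector database, the reranking step is skipped, and the final LLM call is left out. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 12, overlap: int = 3) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Chunk every document and store (chunk, vector) pairs in memory."""
    return [(chunk, embed(chunk)) for doc in docs for chunk in chunk_text(doc)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Chroma is a lightweight vector store suited to local development.",
    "BM25 is a sparse retrieval method based on keyword matching.",
]
question = "Which vector store suits local development?"
index = build_index(docs)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in the final step.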
Prerequisites
- API access to an LLM provider
- A designated vector database instance
- Source text or document files to index
Pitfalls & limitations
- Poor retrieval due to insufficient chunking strategies
- Hallucinations caused by ambiguous or low-relevance retrieved documents
- High latency resulting from inefficient multi-step retrieval chains
- Embedding drift where index data becomes stale compared to source files
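One way to catch a stale index is to record a content hash for each document at indexing time and compare it against the current source files. The sketch below assumes you keep such a mapping yourself; `find_stale` and the document IDs are hypothetical names, not part of any vector database API.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(sources: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source text against hashes recorded at indexing time."""
    changed = [
        doc_id for doc_id, text in sources.items()
        if doc_id in indexed_hashes and indexed_hashes[doc_id] != content_hash(text)
    ]
    added = [doc_id for doc_id in sources if doc_id not in indexed_hashes]
    removed = [doc_id for doc_id in indexed_hashes if doc_id not in sources]
    return {"changed": sorted(changed), "added": sorted(added), "removed": sorted(removed)}


# Hashes stored when the index was last built (illustrative data)
indexed = {"guide.md": content_hash("v1 text"), "old.md": content_hash("legacy")}
# Current state of the source files
current = {"guide.md": "v2 text", "new.md": "fresh"}
report = find_stale(current, indexed)
print(report)
```

Documents listed under `changed` or `added` need re-embedding, and those under `removed` should be deleted from the index.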
FAQ
How it compares
Unlike a generic prompt that relies on model training data, RAG instructs the model to ground its response in specific, retrieved evidence, which reduces hallucinations when the retrieved context is relevant.
📄 Full skill instructions — original source: wshobson/agents
Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
## When to Use This Skill
- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
## Core Components
### 1. Vector Databases
**Purpose**: Store and retrieve document embeddings efficiently
**Options:**
- **Pinecone**: Managed, scalable, serverless
- **Weaviate**: Open-source, hybrid search, GraphQL
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use, local development
- **Qdrant**: Fast, filtered search, Rust-based
- **pgvector**: PostgreSQL extension, SQL integration
### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search
**Models (2026):**
| Model | Dimensions | Best For |
|-------|------------|----------|
| **voyage-3-large** | 1024 | Claude apps (Anthropic recommended) |
| **voyage-code-3** | 1024 | Code search |
| **text-embedding-3-large** | 3072 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | Open source, local deployment |
| **multilingual-e5-large** | 1024 | Multi-language support |
### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse with weighted fusion
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents for better retrieval
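To make "weighted fusion" concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion over two ranked lists. It is an illustration only: the function name `weighted_rrf`, the document IDs, and the smoothing constant `c = 60` (a commonly used default) are assumptions, not a description of any particular library's internals.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes weight / (c + rank) per document."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)


sparse = ["doc_a", "doc_b", "doc_c"]  # e.g. BM25 order
dense = ["doc_b", "doc_d", "doc_a"]   # e.g. embedding-similarity order
fused = weighted_rrf([sparse, dense], weights=[0.3, 0.7])
print(fused)
```

Because fusion works on ranks rather than raw scores, the sparse and dense retrievers do not need comparable score scales.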
### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results
**Methods:**
- **Cross-Encoders**: BERT-based reranking (ms-marco-MiniLM)
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance
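As an illustration of how MMR trades relevance against diversity, the following sketch selects documents greedily, penalizing candidates that resemble ones already chosen. The vectors and the helper names `cosine` and `mmr` are hypothetical; `lambda_mult` follows the convention used later in this document (0 = max diversity, 1 = max relevance).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: list[float], doc_vecs: dict[str, list[float]],
        k: int, lambda_mult: float = 0.5) -> list[str]:
    """Greedy Maximal Marginal Relevance selection of k document IDs."""
    selected: list[str] = []
    remaining = dict(doc_vecs)
    while remaining and len(selected) < k:
        def score(doc_id: str) -> float:
            relevance = cosine(query_vec, remaining[doc_id])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(remaining[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


query = [1.0, 0.0]
docs = {
    "a": [1.0, 0.1],    # highly relevant
    "b": [1.0, 0.12],   # near-duplicate of "a"
    "c": [0.6, -0.8],   # less relevant but different
}
diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct document, while `lambda_mult=1.0` reduces to plain relevance ranking.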
## Quick Start with LangGraph
```python
import asyncio
from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_pinecone import PineconeVectorStore
from langchain_voyageai import VoyageAIEmbeddings
from langgraph.graph import StateGraph, START, END

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.
Context:
{context}
Question: {question}
Answer:"""
)

async def retrieve(state: RAGState) -> dict:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}  # partial state update

async def generate(state: RAGState) -> dict:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}  # partial state update

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
rag_chain = builder.compile()

# Use (top-level await is not valid in a script, so wrap the call)
async def main() -> None:
    result = await rag_chain.ainvoke({"question": "What are the main features?"})
    print(result["answer"])

asyncio.run(main())
```
## Advanced RAG Patterns
### Pattern 1: Hybrid Search with RRF
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# `documents` is the list of Document objects being indexed
# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
### Pattern 2: Multi-Query Retrieval
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
# (run inside an async function)
results = await multi_query_retriever.ainvoke("What is the main topic?")
```
### Pattern 3: Contextual Compression
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
# (run inside an async function)
compressed_docs = await compression_retriever.ainvoke("specific query")
```
### Pattern 4: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```
### Pattern 5: HyDE (Hypothetical Document Embeddings)
```python
from typing import TypedDict

from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langgraph.graph import StateGraph, START, END

# Reuses `llm`, `retriever`, and `generate` from the Quick Start

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:
Question: {question}
Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> dict:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> dict:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
hyde_rag = builder.compile()
```
## Document Chunking Strategies
### Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)
chunks = splitter.split_documents(documents)
```
### Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```
### Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```
### Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```
## Vector Store Configurations
### Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```
### Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```
### Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```
### pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```
## Retrieval Optimization
### 1. Metadata Filtering
```python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
# (`determine_category` is a placeholder for your own classification logic)
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval (filter syntax depends on the vector store)
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
### 2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)
```
### 3. Reranking with Cross-Encoder
```python
from langchain_core.documents import Document
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)
    # Rerank (predict() is synchronous and scores each query-document pair)
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
### 4. Cohere Rerank
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```
## Prompt Engineering for RAG
### Contextual Prompt with Citations
```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.
If you cannot answer based on the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty
Answer (with citations):"""
)
```
### Structured Output for RAG
```python
from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
```
## Evaluation Metrics
```python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float  # Retrieved relevant / total relevant
    answer_relevance: float  # Answer addresses question
    faithfulness: float  # Answer grounded in context
    context_relevance: float  # Context relevant to question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}
    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})
        # Retrieval metrics (assumes each document carries an "id" in its metadata)
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])
        hits = len(retrieved_ids & relevant_ids)
        # Guard against empty sets to avoid division by zero
        precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)
        # Use LLM-as-judge for quality metrics
        # (`evaluate_answer_quality` is a placeholder for your own judge function)
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])
    # Average each metric; an empty test set yields 0.0
    return {k: sum(v) / len(v) if v else 0.0 for k, v in metrics.items()}
```
## Resources
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangGraph RAG Examples](https://langchain-ai.github.io/langgraph/tutorials/rag/)
- [Pinecone Best Practices](https://docs.pinecone.io/guides/get-started/overview)
- [Voyage AI Embeddings](https://docs.voyageai.com/)
- [RAG Evaluation Guide](https://docs.ragas.io/)
## Best Practices
1. **Chunk Size**: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best recall
5. **Reranking**: Use cross-encoder reranking for precision-critical applications
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics and latency in production
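As a small illustration of the metadata and citation practices above, the sketch below numbers each retrieved chunk together with its source and page, then maps the `[n]` markers in a generated answer back to source names. The chunk dictionaries, field names, and both helper functions are hypothetical, and the answer string stands in for real LLM output.

```python
import re


def format_context(chunks: list[dict]) -> str:
    """Number each chunk and append its source metadata for the prompt."""
    return "\n\n".join(
        f"[{i}] {c['text']} (source: {c['source']}, page {c['page']})"
        for i, c in enumerate(chunks, start=1)
    )


def cited_sources(answer: str, chunks: list[dict]) -> list[str]:
    """Map [n] markers in an answer back to source names, ignoring out-of-range numbers."""
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return [chunks[n - 1]["source"] for n in numbers if 1 <= n <= len(chunks)]


chunks = [
    {"text": "Hold the reset button for ten seconds.", "source": "manual.pdf", "page": 3},
    {"text": "Firmware updates are released quarterly.", "source": "faq.md", "page": 1},
]
context = format_context(chunks)
# Stand-in for a model response; [3] does not correspond to any chunk
answer = "Hold the reset button for ten seconds [1]. See also [3]."
sources = cited_sources(answer, chunks)
print(sources)
```

Dropping citation numbers that match no retrieved chunk is one simple verification step; a stricter system might flag such answers for review instead.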
## Common Issues
- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed, check chunking
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step
- **Context Too Long**: Use compression or parent document retriever
How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory: `.agent/skills/rag-implementation/`
- Save the file as `SKILL.md`
- The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code: `~/.claude/skills/wshobson/agents/rag-implementation/SKILL.md`
- Cursor: `~/.cursor/skills/wshobson/agents/rag-implementation/SKILL.md`
- Antigravity: `~/.gemini/antigravity/skills/wshobson/agents/rag-implementation/SKILL.md`