ctx.py - Prompt Enhancer CLI
CLI that retrieves code context and rewrites input into context-aware prompts using the local LLM decoder. Works with questions and commands/instructions.
Basic Usage
# Questions: Enhanced with specific details and aspects
scripts/ctx.py "What is ReFRAG?"
# Commands: Enhanced with concrete targets and implementation details
scripts/ctx.py "Refactor ctx.py"
# Via Make target
make ctx Q="Explain the caching logic to me in detail"
# Filter by language/path or adjust tokens
make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
Detail Mode
Include compact code snippets in retrieved context for richer rewrites (trades speed for quality):
# Enable detail mode (adds short snippets)
scripts/ctx.py "Explain the caching logic" --detail
# Detail mode with commands
scripts/ctx.py "Add error handling to ctx.py" --detail
# Adjust snippet size (default is 1 line when --detail is used)
make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
Notes:
- Default: header-only (fastest).
- --detail adds short code snippets to each result.
- Detail mode is optimized for speed: it clamps to at most 4 results and 1 result per file (see the sketch below).
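A minimal sketch of that clamp, assuming search hits arrive as dicts with a file key (illustrative only, not ctx.py's actual code):

```python
def clamp_detail_results(results, max_results=4):
    """Keep at most one hit per file and at most four hits overall."""
    seen_files = set()
    clamped = []
    for hit in results:
        if hit["file"] in seen_files:
            continue                      # one result per file
        seen_files.add(hit["file"])
        clamped.append(hit)
        if len(clamped) == max_results:   # clamp to max 4 results
            break
    return clamped
```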
Unicorn Mode
Use --unicorn for the highest-quality prompt enhancement, using a staged two- to three-pass approach:
# Unicorn mode with commands
scripts/ctx.py "refactor ctx.py" --unicorn
# Unicorn mode with questions
scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn
# Works with all filters
scripts/ctx.py "add error handling" --unicorn --language python
How it works (sketched below):
- Pass 1 (Draft): Retrieves rich code snippets (8 lines of context) to understand the codebase
- Pass 2 (Refine): Retrieves even richer snippets (12 lines) to ground the prompt with concrete code
- Pass 3 (Polish): Optional cleanup pass if output appears generic or incomplete
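The staged flow can be pictured roughly as follows; retrieve, rewrite, and needs_polish are illustrative stand-ins for ctx.py's internals, not its real function names:

```python
from typing import Optional

def retrieve(query: str, context_lines: int) -> str:
    """Stand-in for retrieval against the indexer."""
    raise NotImplementedError

def rewrite(query: str, context: str, previous: Optional[str] = None, polish: bool = False) -> str:
    """Stand-in for the local LLM decoder call."""
    raise NotImplementedError

def needs_polish(prompt: str) -> bool:
    """Stand-in for the quality heuristic (see Automatic Quality Assurance below)."""
    return len(prompt) < 180

def unicorn_rewrite(query: str) -> str:
    draft_context = retrieve(query, context_lines=8)       # Pass 1 (Draft): rich snippets
    draft = rewrite(query, draft_context)
    refined_context = retrieve(query, context_lines=12)    # Pass 2 (Refine): richer snippets
    prompt = rewrite(query, refined_context, previous=draft)
    if needs_polish(prompt):                                # Pass 3 (Polish): only when needed
        prompt = rewrite(query, refined_context, previous=prompt, polish=True)
    return prompt
```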
Key features:
- Code-grounded: References actual code behaviors and patterns
- No hallucinations: Only uses real code from your indexed repository
- Multi-paragraph output: Produces detailed, comprehensive prompts
- Works with both questions and commands
When to use:
- Normal mode: Quick, everyday prompts (fastest)
- --detail: Richer context without multi-pass overhead (balanced)
- --unicorn: When you need the absolute best prompt quality
Advanced Features
Streaming Output (Default)
All modes stream tokens as they arrive for instant feedback:
scripts/ctx.py "refactor ctx.py" --unicorn
To disable streaming, set "streaming": false in ~/.ctx_config.json
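For intuition, consuming the stream might look roughly like this; the /completion payload and SSE-style data: chunks are assumptions based on a llama.cpp-style server at the default endpoint listed under Configuration:

```python
import json
import requests

def stream_completion(prompt: str, url: str = "http://localhost:8080/completion") -> None:
    """Print tokens as they arrive from a llama.cpp-style /completion endpoint."""
    with requests.post(url, json={"prompt": prompt, "stream": True}, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip blank keep-alive lines
            chunk = json.loads(line[len(b"data: "):])
            print(chunk.get("content", ""), end="", flush=True)
            if chunk.get("stop"):
                break
```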
Memory Blending
Automatically falls back to context_search with memories when repo search returns no hits:
# If no code matches, ctx.py will search design docs and ADRs
scripts/ctx.py "What is our authentication strategy?"
Adaptive Context Sizing
Automatically adjusts limit and context_lines based on query characteristics (illustrated below):
- Short/vague queries → More context for richer grounding
- Queries with file/function names → Lighter settings for speed
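A rough sketch of such a heuristic, with made-up thresholds (ctx.py's real values may differ):

```python
import re

def adaptive_settings(query: str) -> dict:
    """Illustrative heuristic only; not ctx.py's actual thresholds."""
    names_code = bool(re.search(r"\w+\.\w{1,4}\b|\w+\(\)", query))  # file or function names
    if names_code:
        return {"limit": 3, "context_lines": 1}   # targeted query: lighter and faster
    if len(query.split()) <= 4:
        return {"limit": 6, "context_lines": 4}   # short/vague query: more grounding
    return {"limit": 4, "context_lines": 2}       # everything else: middle ground
```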
Automatic Quality Assurance
The _needs_polish() heuristic (sketched below) triggers a third polish pass when:
- Output is too short (< 180 chars)
- Contains generic/vague language
- Missing concrete code references
- Lacks proper paragraph structure
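A sketch mirroring those documented checks; the phrase list and regexes are illustrative, not the heuristic's actual contents:

```python
import re

GENERIC_PHRASES = ("in general", "as appropriate", "various aspects", "best practices")  # illustrative

def needs_polish(text: str) -> bool:
    """Return True when the rewritten prompt likely needs a polish pass."""
    if len(text) < 180:                                          # too short
        return True
    if any(p in text.lower() for p in GENERIC_PHRASES):          # generic/vague language
        return True
    if not re.search(r"\w+\.(py|md|sh|json)\b|\w+\(\)", text):   # no concrete code references
        return True
    if "\n\n" not in text:                                       # no paragraph structure
        return True
    return False
```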
Personalized Templates
Create ~/.ctx_config.json to customize behavior:
{
"always_include_tests": true,
"prefer_bullet_commands": false,
"extra_instructions": "Always consider error handling and edge cases",
"streaming": true
}
Available preferences:
- always_include_tests: Add testing considerations to all prompts
- prefer_bullet_commands: Format commands as bullet points
- extra_instructions: Custom instructions added to every rewrite
- streaming: Enable/disable streaming output (default: true)
See ctx_config.example.json for a template.
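Loading the preferences might look roughly like this; the defaults for always_include_tests and prefer_bullet_commands are assumptions, and load_ctx_config is not ctx.py's real helper name:

```python
import json
from pathlib import Path

DEFAULTS = {
    "always_include_tests": False,    # assumed default
    "prefer_bullet_commands": False,  # assumed default
    "extra_instructions": "",
    "streaming": True,                # documented default
}

def load_ctx_config() -> dict:
    """Merge ~/.ctx_config.json over the defaults above; ignore unknown keys."""
    config = dict(DEFAULTS)
    path = Path.home() / ".ctx_config.json"
    if path.exists():
        try:
            user = json.loads(path.read_text())
        except json.JSONDecodeError:
            return config             # malformed file: fall back to defaults
        config.update({k: v for k, v in user.items() if k in DEFAULTS})
    return config
```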
GPU Acceleration
For faster prompt rewriting, use the native Metal-accelerated decoder:
# Start the native llama.cpp server with Metal GPU
scripts/gpu_toggle.sh start
# Now ctx.py will automatically use the GPU decoder on port 8081
make ctx Q="Explain the caching logic"
# Stop the native GPU server
scripts/gpu_toggle.sh stop
Configuration
| Setting | Description | Default |
|---|---|---|
| MCP_INDEXER_URL | Indexer HTTP RMCP endpoint | http://localhost:8003/mcp |
| USE_GPU_DECODER | Auto-detect GPU mode | 0 |
| LLAMACPP_URL | Docker decoder endpoint | http://localhost:8080 |
GPU decoder (after gpu_toggle.sh gpu): http://localhost:8081/completion
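As a rough sketch of how these settings fit together (the detection logic here is an assumption, not ctx.py's actual code):

```python
import os

def decoder_url() -> str:
    """Resolve the completion endpoint from the settings in the table above."""
    if os.environ.get("USE_GPU_DECODER", "0") == "1":
        return "http://localhost:8081/completion"              # native Metal llama.cpp server
    base = os.environ.get("LLAMACPP_URL", "http://localhost:8080")
    return f"{base}/completion"                                 # Docker decoder
```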