Context Engine Architecture
Overview
Context Engine is a production-ready MCP (Model Context Protocol) retrieval stack that unifies code indexing, hybrid search, and optional LLM decoding. It lets teams ship context-aware AI agents with semantic and lexical search and dual-transport compatibility.
Core Principles
- Research-Grade Retrieval: ReFRAG-inspired micro-chunking and span budgeting
- Dual-Transport Support: SSE (legacy) and HTTP RMCP (modern) protocols
- Performance-First: Intelligent caching, connection pooling, and async I/O
- Production-Ready: Comprehensive health checks, monitoring, and operational tooling
System Architecture
Component Diagram
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Client Apps   │◄──►│   MCP Servers   │◄──►│   Qdrant DB     │
│ (IDE, CLI, Web) │    │  (SSE + HTTP)   │    │ (Vector Store)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   LLM Decoder   │
                       │   (llama.cpp)   │
                       │   (Optional)    │
                       └─────────────────┘
```
Core Components
1. MCP Servers
Memory Server (scripts/mcp_memory_server.py)
- Purpose: Knowledge base storage and retrieval
- Transport: SSE (port 8000) + HTTP RMCP (port 8002)
- Key Features:
- Structured memory storage with rich metadata
- Hybrid search (dense + lexical)
- Dual vector support for embedding and lexical hashes
- Automatic collection management
Indexer Server (scripts/mcp_indexer_server.py)
- Purpose: Code search, indexing, and management
- Transport: SSE (port 8001) + HTTP RMCP (port 8003)
- Key Features:
- Hybrid code search with multiple filtering options
- ReFRAG-inspired micro-chunking (16-token windows)
- Context-aware Q&A with local LLM integration
- Workspace and collection management
- Live indexing and pruning capabilities
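Both servers accept standard MCP JSON-RPC on their HTTP RMCP ports. As a rough sketch of the request shape (the `/mcp` path and the `code_search` tool name are illustrative assumptions, and real clients perform the MCP `initialize` handshake before calling tools):

```python
# Shape of an MCP tools/call request against the indexer's HTTP RMCP port.
import httpx

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "code_search",  # hypothetical tool name
        "arguments": {"query": "connection pooling"},
    },
}
response = httpx.post("http://localhost:8003/mcp", json=request, timeout=30.0)
response.raise_for_status()
print(response.json()["result"])
```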
2. Search Pipeline
Hybrid Search Engine (scripts/hybrid_search.py)
Multi-Vector Architecture:
- Dense Vectors: Semantic embeddings (BAAI/bge-base-en-v1.5)
- Lexical Vectors: BM25-style hashing (4096 dimensions)
- Mini Vectors: ReFRAG gating (64 dimensions, optional)
Retrieval Process:
1. Query Expansion: Generate multiple query variations
2. Parallel Search: Dense + lexical search with RRF fusion (sketched after this list)
3. Optional Reranking: Cross-encoder neural reranking
4. Result Assembly: Format with citations and metadata
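Step 2's RRF fusion is small enough to sketch directly. This is the textbook Reciprocal Rank Fusion formula with the conventional smoothing constant k = 60, offered as an illustration rather than the project's exact implementation:

```python
# Reciprocal Rank Fusion: merge dense and lexical rankings into one list.
# k=60 is the conventional smoothing constant, not a value from this codebase.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_a", "chunk_b", "chunk_c"]
lexical = ["chunk_b", "chunk_c", "chunk_d"]
print(rrf_fuse([dense, lexical]))  # chunk_b wins: ranked high in both lists
```

Documents that appear near the top of both rankings dominate the fused order, while a document ranked highly by only one retriever still survives with a smaller score.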
Advanced Features:
- Request deduplication
- Intelligent caching (multi-policy: LRU, LFU, TTL, FIFO)
- Connection pooling to Qdrant
- Batch processing support
ReFRAG Implementation
- Micro-chunking: Token-level windows (16 tokens, 8 stride)
- Span Budgeting: Global token budget management
- Gate-First Filtering: Mini-vector pre-filtering for efficiency
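The micro-chunking step reduces to a sliding window. A minimal sketch using the 16-token window and 8-token stride quoted above (whitespace tokenization stands in for the real tokenizer):

```python
# Token-level sliding windows: 16-token chunks advancing 8 tokens at a time,
# so adjacent chunks overlap by half a window.
def micro_chunks(tokens: list[str], window: int = 16, stride: int = 8):
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield start, tokens[start:start + window]

tokens = "def connect(host, port): return pool.acquire(host, port)".split()
for offset, chunk in micro_chunks(tokens):
    print(offset, " ".join(chunk))
```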
3. Storage Layer
Qdrant Vector Database
- Primary Storage: Embeddings and metadata
- Collection Management: Automatic creation and configuration
- Named Vectors: Separate storage for different embedding types
- Performance: HNSW indexing for fast approximate nearest neighbor search
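A hedged sketch of declaring such a collection with the qdrant-client library. The vector names and sizes follow the dimensions listed earlier (bge-base-en-v1.5 embeddings are 768-dimensional); the collection name and distance metrics are illustrative assumptions:

```python
# Create a Qdrant collection with named vectors for each embedding type.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="code_chunks",  # illustrative name
    vectors_config={
        "dense":   VectorParams(size=768,  distance=Distance.COSINE),
        "lexical": VectorParams(size=4096, distance=Distance.DOT),
        "mini":    VectorParams(size=64,   distance=Distance.COSINE),
    },
)
```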
Unified Cache System (scripts/cache_manager.py)
- Eviction Policies: LRU, LFU, TTL, FIFO
- Memory Management: Configurable size limits and monitoring
- Thread Safety: Proper locking for concurrent access
- Statistics Tracking: Hit rates, memory usage, eviction counts
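For a feel of the eviction machinery, here is a TTL policy reduced to its core, with lazy eviction on read. The real cache_manager layers LRU/LFU/FIFO policies, locking, and statistics on top; this sketch is illustrative only:

```python
# Minimal TTL cache: entries expire ttl_seconds after being set and are
# evicted lazily the next time they are read.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict on read
            return None
        return value
```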
4. Supporting Infrastructure
Async Subprocess Manager (scripts/async_subprocess_manager.py)
- Process Management: Async subprocess execution with resource cleanup
- Connection Pooling: Reused HTTP connections
- Timeout Handling: Configurable timeouts with graceful degradation
- Resource Tracking: Active process monitoring and statistics
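The timeout-with-cleanup behavior looks roughly like this asyncio sketch (the function name is illustrative, not the module's API):

```python
# Run a subprocess asynchronously; on timeout, kill it and reap the process
# so no zombie is left behind.
import asyncio

async def run_with_timeout(cmd: list[str], timeout: float = 30.0) -> bytes:
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
        return stdout
    except asyncio.TimeoutError:
        proc.kill()        # graceful degradation: terminate the stuck process
        await proc.wait()  # reap it before propagating the timeout
        raise

print(asyncio.run(run_with_timeout(["echo", "hello"])))
```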
Deduplication System (scripts/deduplication.py)
- Request Deduplication: Prevent redundant processing
- Cache Integration: Works with unified cache system
- Performance Impact: Significant reduction in duplicate work
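The core trick, sketched with asyncio (illustrative, not deduplication.py's exact shape): identical in-flight requests share a single task, so only the first caller pays for the work:

```python
# Coalesce concurrent identical requests onto one shared task keyed by a
# request fingerprint; later callers await the same result.
import asyncio
from typing import Awaitable, Callable

_inflight: dict[str, asyncio.Task] = {}

async def deduped(key: str, work: Callable[[], Awaitable]) -> object:
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(work())
        _inflight[key] = task
        task.add_done_callback(lambda _t: _inflight.pop(key, None))
    return await task
```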
Semantic Expansion (scripts/semantic_expansion.py)
- Query Enhancement: LLM-assisted query variation generation
- Local LLM Integration: llama.cpp for offline expansion
- Caching: Expanded query results cached for reuse
Pattern Detection (scripts/pattern_detection/)
- Structural Search: Find similar code patterns across languages via AST analysis
- 64-dim Pattern Vector: WL graph kernel, CFG fingerprint, SimHash, spectral features
- Auto-Detection: Identifies retry patterns, resource cleanup, filter loops
- Requires: `PATTERN_VECTORS=1` to enable
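Of the signals above, SimHash is the simplest to sketch. This is the standard algorithm over token features, not the project's exact feature extraction:

```python
# SimHash: tokens vote bitwise on an 8-byte hash; similar token sets
# produce fingerprints with small Hamming distance.
import hashlib

def simhash(tokens: list[str], bits: int = 64) -> int:
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

print(hex(simhash("for item in items retry on error".split())))
```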
5. Learning Reranker System (Optional)
The Learning Reranker is an optional, self-improving ranking system that learns from search patterns to return increasingly relevant results over time. It is enabled by default and can be disabled by setting the `RERANK_LEARNING=0` and `RERANK_EVENTS_ENABLED=0` environment variables. See Configuration for all options.
Architecture Overview
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Search Query   │────►│  Hybrid Search   │────►│   TinyScorer    │
│                 │     │  (initial rank)  │     │ (learned rank)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                        ┌──────────────────┐              │
                        │   Event Logger   │◄─────────────┘
                        │  (NDJSON files)  │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │ Learning Worker  │
                        │   (background)   │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │   ONNX Teacher   │
                        │ (cross-encoder)  │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │  Weight Updates  │
                        │   (.npz files)   │
                        └──────────────────┘
```
Components
TinyScorer (scripts/rerank_recursive.py)
- 2-layer MLP neural network (~3MB per collection)
- Scores query-document pairs based on learned patterns
- Hot-reloads weights every 60 seconds from disk
- Per-collection weights (each repo learns independently)
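At shape level, a 2-layer MLP scorer with .npz hot-loading might look like the sketch below; the feature dimension, hidden size, and checkpoint key names are assumptions, not the project's exact code:

```python
# Two-layer MLP: features -> ReLU hidden layer -> scalar relevance score.
import numpy as np

class TinyScorer:
    def __init__(self, in_dim: int = 64, hidden: int = 32):
        rng = np.random.default_rng(0)
        self.w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def load(self, path: str) -> None:
        # Hot reload: swap in the latest .npz checkpoint from disk.
        # Key names ("w1", "b1", ...) are illustrative assumptions.
        ckpt = np.load(path)
        self.w1, self.b1 = ckpt["w1"], ckpt["b1"]
        self.w2, self.b2 = ckpt["w2"], ckpt["b2"]

    def score(self, features: np.ndarray) -> float:
        hidden = np.maximum(features @ self.w1 + self.b1, 0.0)  # ReLU
        return float(hidden @ self.w2 + self.b2)
```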
Event Logger (scripts/rerank_events.py)
- Logs every search to NDJSON files under `/tmp/rerank_events/`
- Records: query, candidates, initial scores, timestamps
- Hourly file rotation with configurable retention
Learning Worker (scripts/learning_reranker_worker.py)
- Background daemon that processes logged events
- Uses ONNX cross-encoder as "teacher" model
- Trains TinyScorer via knowledge distillation
- Saves versioned weight checkpoints atomically
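The atomic checkpoint write is what makes the serving path's hot reload safe: readers never observe a half-written file. A sketch of the usual write-then-rename pattern (illustrative, not the worker's exact code):

```python
# Save weights to a temp file in the same directory, then atomically
# rename it into place (os.replace is atomic on POSIX filesystems).
import os
import tempfile
import numpy as np

def save_checkpoint_atomic(path: str, **arrays: np.ndarray) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".npz")
    os.close(fd)
    np.savez(tmp, **arrays)
    os.replace(tmp, path)
```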
Learning Flow
1. Event Capture: Every search logs query + candidates to NDJSON
2. Teacher Scoring: ONNX cross-encoder scores the candidates
3. Student Training: TinyScorer learns to match teacher rankings
4. Weight Update: New weights saved atomically with versioning
5. Hot Reload: Serving path picks up new weights within 60s
6. Score Integration: `learning_score` blends with other signals
Configuration
| Variable | Description | Default |
|---|---|---|
| `RERANKER_WEIGHTS_DIR` | Directory for weight files | `/tmp/rerank_weights` |
| `RERANKER_WEIGHTS_RELOAD_INTERVAL` | Hot-reload check interval (seconds) | `60` |
| `RERANKER_MAX_CHECKPOINTS` | Number of weight versions to keep | `5` |
| `RERANKER_LR_DECAY_STEPS` | Steps between learning rate decay | `1000` |
| `RERANKER_LR_DECAY_RATE` | Learning rate decay multiplier | `0.95` |
| `RERANKER_MIN_LR` | Minimum learning rate | `0.0001` |
| `RERANK_EVENTS_DIR` | Directory for event logs | `/tmp/rerank_events` |
| `RERANK_EVENTS_RETENTION_DAYS` | Days to keep event files | `7` |
| `RERANK_LEARNING_BATCH_SIZE` | Events per training batch | `32` |
| `RERANK_LEARNING_POLL_INTERVAL` | Worker poll interval (seconds) | `30` |
| `RERANK_LEARNING_RATE` | Initial learning rate | `0.001` |
Observability
Search results include learning metrics in the `why` field:

```json
{
  "score": 3.2,
  "why": ["lexical:1.0", "dense_rrf:0.05", "learning:3", "score:3.2"],
  "components": {
    "learning_score": 3.2,
    "learning_iterations": 3
  }
}
```
Worker logs show training progress:

```
[codebase] Processed 5 events | v12 | lr=0.001 | avg_loss=1.8 | converged=False
```
Benefits
- Zero Manual Training: Learns automatically from usage
- Per-Collection Specialization: Each codebase gets tuned rankings
- Fast Inference: TinyScorer adds <1ms to search latency
- Continuous Improvement: Rankings improve over time
- Offline Capable: Teacher runs locally, no external API calls
MCP Router (scripts/mcp_router.py)
- Intent Classification: Determines which MCP tool to call based on query
- Tool Orchestration: Routes to search, answer, memory, or index tools
- HTTP Execution: Executes tools via RMCP/HTTP without extra dependencies
- Plan Mode: Preview tool selection without execution
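A deliberately naive sketch of intent classification over the four tool families named above; the real router's heuristics are certainly richer:

```python
# Keyword-based routing to the search / answer / memory / index tool families.
def route(query: str) -> str:
    q = query.lower().strip()
    if q.startswith(("remember", "note", "store")):
        return "memory"
    if q.startswith(("index", "reindex", "prune")):
        return "index"
    if q.endswith("?") or q.startswith(("why", "how", "what", "where")):
        return "answer"
    return "search"

print(route("how does the cache evict entries?"))  # -> "answer"
```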
Data Flow Architecture
Search Request Flow
1. Client Query → MCP Server
2. Query Expansion (optional) → Multiple Query Variations
3. Parallel Execution → Dense Search + Lexical Search
4. RRF Fusion → Combined Results
5. Reranking (optional) → Enhanced Relevance
6. Result Formatting → Structured Response with Citations
7. Return to Client → MCP Protocol Response
Indexing Flow
1. File Change Detection → File System Watcher
2. Content Processing → Tokenization + Chunking
3. Embedding Generation → Model Inference
4. Vector Creation → Dense + Lexical + Mini
5. Metadata Assembly → Path, symbols, language, etc.
6. Batch Upsert → Qdrant Storage
7. Cache Updates → Local Cache Refresh
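Steps 4-6 map directly onto a Qdrant batch upsert with named vectors. A sketch under the same assumptions as the collection example earlier (collection name and payload fields are illustrative; vector sizes follow the listed dimensions):

```python
# Batch-upsert one chunk with its dense, lexical, and mini vectors plus
# the metadata assembled in step 5.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
points = [
    PointStruct(
        id=1,
        vector={
            "dense":   [0.0] * 768,   # placeholder embeddings
            "lexical": [0.0] * 4096,
            "mini":    [0.0] * 64,
        },
        payload={"path": "scripts/hybrid_search.py", "language": "python",
                 "symbols": ["rrf_fuse"]},
    ),
]
client.upsert(collection_name="code_chunks", points=points)
```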
Configuration Architecture
Environment-Based Configuration
- Docker-Native: All configuration via environment variables
- Development Support: Local .env file configuration
- Production Ready: External secret management integration
Key Configuration Areas
- Service Configuration: Ports, hosts, transport protocols
- Model Configuration: Embedding models, reranker settings
- Performance Tuning: Cache sizes, batch sizes, timeouts
- Feature Flags: Experimental features, debug modes
Transport Layer Architecture
Dual-Transport Design
- SSE (Server-Sent Events): Legacy client compatibility
- HTTP RMCP: Modern JSON-RPC over HTTP
- Simultaneous Operation: Both protocols can run together
- Automatic Fallback: Graceful degradation when transport fails
MCP Protocol Implementation
- FastMCP Framework: Modern MCP server implementation
- Tool Registry: Automatic tool discovery and registration
- Health Endpoints: `/readyz` and `/tools` endpoints
- Error Handling: Structured error responses and logging
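Probing readiness before sending traffic is a one-liner; ports follow the Core Components section above:

```python
# Poll /readyz and treat anything but HTTP 200 (or a transport error) as
# not-ready.
import httpx

def is_ready(base_url: str) -> bool:
    try:
        return httpx.get(f"{base_url}/readyz", timeout=2.0).status_code == 200
    except httpx.HTTPError:
        return False

print(is_ready("http://localhost:8002"))  # memory server, HTTP RMCP port
```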
Performance Architecture
Caching Strategy
- Multi-Level Caching: Embedding cache, search cache, expansion cache
- Intelligent Invalidation: TTL-based and LRU eviction
- Memory Management: Configurable limits and monitoring
- Performance Monitoring: Hit rates, response times, memory usage
Concurrency Model
- Async I/O: Non-blocking operations throughout
- Connection Pooling: Reused connections to external services
- Batch Processing: Efficient bulk operations
- Resource Management: Proper cleanup and resource limits
Security Architecture
Isolation and Safety
- Container-Based: Docker isolation for all services
- Network Segmentation: Internal service communication
- Input Validation: Comprehensive parameter validation
- Resource Limits: Configurable timeouts and memory limits
Data Protection
- No Hardcoded Secrets: Environment-based configuration
- API Key Management: External secret manager integration
- Audit Logging: Structured logging for security events
Operational Architecture
Health Monitoring
- Service Health: `/readyz` endpoints for all services
- Tool Availability: Dynamic tool listing and status
- Performance Metrics: Response times, cache statistics
- Error Tracking: Structured error logging and alerting
Deployment Patterns
- Docker Compose: Multi-service orchestration
- Environment Parity: Development ↔ Production consistency
- Graceful Shutdown: Proper resource cleanup on termination
- Rolling Updates: Zero-downtime deployment support
Extensibility Architecture
Plugin System
- MCP Tool Extension: Easy addition of new tools
- Transport Flexibility: Support for future MCP transports
- Model Pluggability: Support for different embedding models
- Storage Abstraction: Potential for alternative vector stores
Configuration Extension
- Environment-Driven: Easy configuration via environment variables
- Feature Flags: Experimental feature toggling
- A/B Testing: Multiple configuration variants support
This architecture enables Context Engine to serve as a production-ready, scalable context layer for AI applications while maintaining the flexibility to evolve with changing requirements and technologies.