Context Engine Architecture
Overview
Context Engine is a production-ready MCP (Model Context Protocol) retrieval stack that unifies code indexing, hybrid search, and optional LLM decoding. It lets teams ship context-aware AI agents with semantic and lexical search and dual-transport compatibility.
Core Principles
- Research-Grade Retrieval: ReFRAG-inspired micro-chunking and span budgeting
- Dual-Transport Support: SSE (legacy) and HTTP RMCP (modern) protocols
- Performance-First: Intelligent caching, connection pooling, and async I/O
- Production-Ready: Comprehensive health checks, monitoring, and operational tooling
System Architecture
Component Diagram
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Client Apps   │◄──►│   MCP Servers   │◄──►│   Qdrant DB     │
│ (IDE, CLI, Web) │    │  (SSE + HTTP)   │    │ (Vector Store)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   LLM Decoder   │
                       │   (llama.cpp)   │
                       │   (Optional)    │
                       └─────────────────┘
```
Core Components
1. MCP Servers
Memory Server (scripts/mcp_memory_server.py)
- Purpose: Knowledge base storage and retrieval
- Transport: SSE (port 8000) + HTTP RMCP (port 8002)
- Key Features:
- Structured memory storage with rich metadata
- Hybrid search (dense + lexical)
- Dual vector support for embedding and lexical hashes
- Automatic collection management
Indexer Server (scripts/mcp_indexer_server.py)
- Purpose: Code search, indexing, and management
- Transport: SSE (port 8001) + HTTP RMCP (port 8003)
- Key Features:
- Hybrid code search with multiple filtering options
- ReFRAG-inspired micro-chunking (16-token windows)
- Context-aware Q&A with local LLM integration
- Workspace and collection management
- Live indexing and pruning capabilities
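Both servers accept standard MCP JSON-RPC on their HTTP RMCP ports. As a rough sketch of the request shape (the `/mcp` path and the `code_search` tool name are illustrative assumptions, and real clients perform the MCP `initialize` handshake before calling tools):

```python
# Shape of an MCP tools/call request against the indexer's HTTP RMCP port.
import httpx

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "code_search",  # hypothetical tool name
        "arguments": {"query": "connection pooling"},
    },
}
response = httpx.post("http://localhost:8003/mcp", json=request, timeout=30.0)
response.raise_for_status()
print(response.json()["result"])
```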
2. Search Pipeline
Hybrid Search Engine (scripts/hybrid_search.py)
Multi-Vector Architecture:
- Dense Vectors: Semantic embeddings (BAAI/bge-base-en-v1.5)
- Lexical Vectors: BM25-style hashing (4096 dimensions)
- Mini Vectors: ReFRAG gating (64 dimensions, optional)
Retrieval Process:
1. Query Expansion: Generate multiple query variations
2. Parallel Search: Dense + lexical search with RRF fusion (sketched after this list)
3. Optional Reranking: Cross-encoder neural reranking
4. Result Assembly: Format with citations and metadata
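Step 2's RRF fusion is small enough to sketch directly. This is the textbook Reciprocal Rank Fusion formula with the conventional smoothing constant k = 60, offered as an illustration rather than the project's exact implementation:

```python
# Reciprocal Rank Fusion: merge dense and lexical rankings into one list.
# k=60 is the conventional smoothing constant, not a value from this codebase.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_a", "chunk_b", "chunk_c"]
lexical = ["chunk_b", "chunk_c", "chunk_d"]
print(rrf_fuse([dense, lexical]))  # chunk_b wins: ranked high in both lists
```

Documents that appear near the top of both rankings dominate the fused order, while a document ranked highly by only one retriever still survives with a smaller score.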
Advanced Features:
- Request deduplication
- Intelligent caching (multi-policy: LRU, LFU, TTL, FIFO)
- Connection pooling to Qdrant
- Batch processing support
ReFRAG Implementation
- Micro-chunking: Token-level windows (16 tokens, 8 stride)
- Span Budgeting: Global token budget management
- Gate-First Filtering: Mini-vector pre-filtering for efficiency
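The micro-chunking step reduces to a sliding window. A minimal sketch using the 16-token window and 8-token stride quoted above (whitespace tokenization stands in for the real tokenizer):

```python
# Token-level sliding windows: 16-token chunks advancing 8 tokens at a time,
# so adjacent chunks overlap by half a window.
def micro_chunks(tokens: list[str], window: int = 16, stride: int = 8):
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield start, tokens[start:start + window]

tokens = "def connect(host, port): return pool.acquire(host, port)".split()
for offset, chunk in micro_chunks(tokens):
    print(offset, " ".join(chunk))
```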
3. Storage Layer
Qdrant Vector Database
- Primary Storage: Embeddings and metadata
- Collection Management: Automatic creation and configuration
- Named Vectors: Separate storage for different embedding types
- Performance: HNSW indexing for fast approximate nearest neighbor search
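A hedged sketch of declaring such a collection with the qdrant-client library. The vector names and sizes follow the dimensions listed earlier (bge-base-en-v1.5 embeddings are 768-dimensional); the collection name and distance metrics are illustrative assumptions:

```python
# Create a Qdrant collection with named vectors for each embedding type.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="code_chunks",  # illustrative name
    vectors_config={
        "dense":   VectorParams(size=768,  distance=Distance.COSINE),
        "lexical": VectorParams(size=4096, distance=Distance.DOT),
        "mini":    VectorParams(size=64,   distance=Distance.COSINE),
    },
)
```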
Unified Cache System (scripts/cache_manager.py)
- Eviction Policies: LRU, LFU, TTL, FIFO
- Memory Management: Configurable size limits and monitoring
- Thread Safety: Proper locking for concurrent access
- Statistics Tracking: Hit rates, memory usage, eviction counts
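For a feel of the eviction machinery, here is a TTL policy reduced to its core, with lazy eviction on read. The real cache_manager layers LRU/LFU/FIFO policies, locking, and statistics on top; this sketch is illustrative only:

```python
# Minimal TTL cache: entries expire ttl_seconds after being set and are
# evicted lazily the next time they are read.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict on read
            return None
        return value
```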
4. Supporting Infrastructure
Async Subprocess Manager (scripts/async_subprocess_manager.py)
- Process Management: Async subprocess execution with resource cleanup
- Connection Pooling: Reused HTTP connections
- Timeout Handling: Configurable timeouts with graceful degradation
- Resource Tracking: Active process monitoring and statistics
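The timeout-with-cleanup behavior looks roughly like this asyncio sketch (the function name is illustrative, not the module's API):

```python
# Run a subprocess asynchronously; on timeout, kill it and reap the process
# so no zombie is left behind.
import asyncio

async def run_with_timeout(cmd: list[str], timeout: float = 30.0) -> bytes:
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
        return stdout
    except asyncio.TimeoutError:
        proc.kill()        # graceful degradation: terminate the stuck process
        await proc.wait()  # reap it before propagating the timeout
        raise

print(asyncio.run(run_with_timeout(["echo", "hello"])))
```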
Deduplication System (scripts/deduplication.py)
- Request Deduplication: Prevent redundant processing
- Cache Integration: Works with unified cache system
- Performance Impact: Significant reduction in duplicate work
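The core trick, sketched with asyncio (illustrative, not deduplication.py's exact shape): identical in-flight requests share a single task, so only the first caller pays for the work:

```python
# Coalesce concurrent identical requests onto one shared task keyed by a
# request fingerprint; later callers await the same result.
import asyncio
from typing import Awaitable, Callable

_inflight: dict[str, asyncio.Task] = {}

async def deduped(key: str, work: Callable[[], Awaitable]) -> object:
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(work())
        _inflight[key] = task
        task.add_done_callback(lambda _t: _inflight.pop(key, None))
    return await task
```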
Semantic Expansion (scripts/semantic_expansion.py)
- Query Enhancement: LLM-assisted query variation generation
- Local LLM Integration: llama.cpp for offline expansion
- Caching: Expanded query results cached for reuse
Pattern Detection (scripts/pattern_detection/)
- Structural Search: Find similar code patterns across languages via AST analysis
- 64-dim Pattern Vector: WL graph kernel, CFG fingerprint, SimHash, spectral features
- Auto-Detection: Identifies retry patterns, resource cleanup, filter loops
- Requires: `PATTERN_VECTORS=1` to enable
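Of the signals above, SimHash is the simplest to sketch. This is the standard algorithm over token features, not the project's exact feature extraction:

```python
# SimHash: tokens vote bitwise on an 8-byte hash; similar token sets
# produce fingerprints with small Hamming distance.
import hashlib

def simhash(tokens: list[str], bits: int = 64) -> int:
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

print(hex(simhash("for item in items retry on error".split())))
```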
5. Learning Reranker System (Optional)
The Learning Reranker is an optional, self-improving ranking system that learns from search patterns to return increasingly relevant results over time. It is enabled by default and can be disabled by setting the `RERANK_LEARNING=0` and `RERANK_EVENTS_ENABLED=0` environment variables. See Configuration for all options.
Architecture Overview
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Search Query   │────►│  Hybrid Search   │────►│   TinyScorer    │
│                 │     │  (initial rank)  │     │ (learned rank)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                        ┌──────────────────┐              │
                        │   Event Logger   │◄─────────────┘
                        │  (NDJSON files)  │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │ Learning Worker  │
                        │   (background)   │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │   ONNX Teacher   │
                        │ (cross-encoder)  │
                        └────────┬─────────┘
                                 │
                        ┌────────▼─────────┐
                        │  Weight Updates  │
                        │   (.npz files)   │
                        └──────────────────┘
```
Components
TinyScorer (scripts/rerank_recursive.py)
- 2-layer MLP neural network (~3MB per collection)
- Scores query-document pairs based on learned patterns
- Hot-reloads weights every 60 seconds from disk
- Per-collection weights (each repo learns independently)
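At shape level, a 2-layer MLP scorer with .npz hot-loading might look like the sketch below; the feature dimension, hidden size, and checkpoint key names are assumptions, not the project's exact code:

```python
# Two-layer MLP: features -> ReLU hidden layer -> scalar relevance score.
import numpy as np

class TinyScorer:
    def __init__(self, in_dim: int = 64, hidden: int = 32):
        rng = np.random.default_rng(0)
        self.w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def load(self, path: str) -> None:
        # Hot reload: swap in the latest .npz checkpoint from disk.
        # Key names ("w1", "b1", ...) are illustrative assumptions.
        ckpt = np.load(path)
        self.w1, self.b1 = ckpt["w1"], ckpt["b1"]
        self.w2, self.b2 = ckpt["w2"], ckpt["b2"]

    def score(self, features: np.ndarray) -> float:
        hidden = np.maximum(features @ self.w1 + self.b1, 0.0)  # ReLU
        return float(hidden @ self.w2 + self.b2)
```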
Event Logger (scripts/rerank_events.py)
- Logs every search to NDJSON files under `/tmp/rerank_events/`
- Records: query, candidates, initial scores, timestamps
- Hourly file rotation with configurable retention
Learning Worker (scripts/learning_reranker_worker.py)
- Background daemon that processes logged events
- Uses ONNX cross-encoder as "teacher" model
- Trains TinyScorer via knowledge distillation
- Saves versioned weight checkpoints atomically
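The atomic checkpoint write is what makes the serving path's hot reload safe: readers never observe a half-written file. A sketch of the usual write-then-rename pattern (illustrative, not the worker's exact code):

```python
# Save weights to a temp file in the same directory, then atomically
# rename it into place (os.replace is atomic on POSIX filesystems).
import os
import tempfile
import numpy as np

def save_checkpoint_atomic(path: str, **arrays: np.ndarray) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".npz")
    os.close(fd)
    np.savez(tmp, **arrays)
    os.replace(tmp, path)
```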
Learning Flow
1. Event Capture: Every search logs query + candidates to NDJSON
2. Teacher Scoring: ONNX cross-encoder scores the candidates
3. Student Training: TinyScorer learns to match teacher rankings
4. Weight Update: New weights saved atomically with versioning
5. Hot Reload: Serving path picks up new weights within 60s
6. Score Integration: `learning_score` blends with other signals
Configuration
| Variable | Description | Default |
|---|---|---|
| `RERANKER_WEIGHTS_DIR` | Directory for weight files | `/tmp/rerank_weights` |
| `RERANKER_WEIGHTS_RELOAD_INTERVAL` | Hot-reload check interval (seconds) | `60` |
| `RERANKER_MAX_CHECKPOINTS` | Number of weight versions to keep | `5` |
| `RERANKER_LR_DECAY_STEPS` | Steps between learning rate decay | `1000` |
| `RERANKER_LR_DECAY_RATE` | Learning rate decay multiplier | `0.95` |
| `RERANKER_MIN_LR` | Minimum learning rate | `0.0001` |
| `RERANK_EVENTS_DIR` | Directory for event logs | `/tmp/rerank_events` |
| `RERANK_EVENTS_RETENTION_DAYS` | Days to keep event files | `7` |
| `RERANK_LEARNING_BATCH_SIZE` | Events per training batch | `32` |
| `RERANK_LEARNING_POLL_INTERVAL` | Worker poll interval (seconds) | `30` |
| `RERANK_LEARNING_RATE` | Initial learning rate | `0.001` |
Observability
Search results include learning metrics in the `why` field:

```json
{
  "score": 3.2,
  "why": ["lexical:1.0", "dense_rrf:0.05", "learning:3", "score:3.2"],
  "components": {
    "learning_score": 3.2,
    "learning_iterations": 3
  }
}
```
Worker logs show training progress:

```
[codebase] Processed 5 events | v12 | lr=0.001 | avg_loss=1.8 | converged=False
```
Benefits
- Zero Manual Training: Learns automatically from usage
- Per-Collection Specialization: Each codebase gets tuned rankings
- Fast Inference: TinyScorer adds <1ms to search latency
- Continuous Improvement: Rankings improve over time
- Offline Capable: Teacher runs locally, no external API calls
MCP Router (scripts/mcp_router.py)
- Intent Classification: Determines which MCP tool to call based on query
- Tool Orchestration: Routes to search, answer, memory, or index tools
- HTTP Execution: Executes tools via RMCP/HTTP without extra dependencies
- Plan Mode: Preview tool selection without execution
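A deliberately naive sketch of intent classification over the four tool families named above; the real router's heuristics are certainly richer:

```python
# Keyword-based routing to the search / answer / memory / index tool families.
def route(query: str) -> str:
    q = query.lower().strip()
    if q.startswith(("remember", "note", "store")):
        return "memory"
    if q.startswith(("index", "reindex", "prune")):
        return "index"
    if q.endswith("?") or q.startswith(("why", "how", "what", "where")):
        return "answer"
    return "search"

print(route("how does the cache evict entries?"))  # -> "answer"
```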
Data Flow Architecture
Search Request Flow
1. Client Query → MCP Server
2. Query Expansion (optional) → Multiple Query Variations
3. Parallel Execution → Dense Search + Lexical Search
4. RRF Fusion → Combined Results
5. Reranking (optional) → Enhanced Relevance
6. Result Formatting → Structured Response with Citations
7. Return to Client → MCP Protocol Response
Indexing Flow
1. File Change Detection → File System Watcher
2. Content Processing → Tokenization + Chunking
3. Embedding Generation → Model Inference
4. Vector Creation → Dense + Lexical + Mini
5. Metadata Assembly → Path, symbols, language, etc.
6. Batch Upsert → Qdrant Storage
7. Cache Updates → Local Cache Refresh
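Steps 4-6 map directly onto a Qdrant batch upsert with named vectors. A sketch under the same assumptions as the collection example earlier (collection name and payload fields are illustrative; vector sizes follow the listed dimensions):

```python
# Batch-upsert one chunk with its dense, lexical, and mini vectors plus
# the metadata assembled in step 5.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
points = [
    PointStruct(
        id=1,
        vector={
            "dense":   [0.0] * 768,   # placeholder embeddings
            "lexical": [0.0] * 4096,
            "mini":    [0.0] * 64,
        },
        payload={"path": "scripts/hybrid_search.py", "language": "python",
                 "symbols": ["rrf_fuse"]},
    ),
]
client.upsert(collection_name="code_chunks", points=points)
```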
Configuration Architecture
Environment-Based Configuration
- Docker-Native: All configuration via environment variables
- Development Support: Local .env file configuration
- Production Ready: External secret management integration
Key Configuration Areas
- Service Configuration: Ports, hosts, transport protocols
- Model Configuration: Embedding models, reranker settings
- Performance Tuning: Cache sizes, batch sizes, timeouts
- Feature Flags: Experimental features, debug modes
Transport Layer Architecture
Dual-Transport Design
- SSE (Server-Sent Events): Legacy client compatibility
- HTTP RMCP: Modern JSON-RPC over HTTP
- Simultaneous Operation: Both protocols can run together
- Automatic Fallback: Graceful degradation when transport fails
MCP Protocol Implementation
- FastMCP Framework: Modern MCP server implementation
- Tool Registry: Automatic tool discovery and registration
- Health Endpoints: `/readyz` and `/tools` endpoints
- Error Handling: Structured error responses and logging
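Probing readiness before sending traffic is a one-liner; ports follow the Core Components section above:

```python
# Poll /readyz and treat anything but HTTP 200 (or a transport error) as
# not-ready.
import httpx

def is_ready(base_url: str) -> bool:
    try:
        return httpx.get(f"{base_url}/readyz", timeout=2.0).status_code == 200
    except httpx.HTTPError:
        return False

print(is_ready("http://localhost:8002"))  # memory server, HTTP RMCP port
```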
Performance Architecture
Caching Strategy
- Multi-Level Caching: Embedding cache, search cache, expansion cache
- Intelligent Invalidation: TTL-based and LRU eviction
- Memory Management: Configurable limits and monitoring
- Performance Monitoring: Hit rates, response times, memory usage
Concurrency Model
- Async I/O: Non-blocking operations throughout
- Connection Pooling: Reused connections to external services
- Batch Processing: Efficient bulk operations
- Resource Management: Proper cleanup and resource limits
Security Architecture
Isolation and Safety
- Container-Based: Docker isolation for all services
- Network Segmentation: Internal service communication
- Input Validation: Comprehensive parameter validation
- Resource Limits: Configurable timeouts and memory limits
Data Protection
- No Hardcoded Secrets: Environment-based configuration
- API Key Management: External secret manager integration
- Audit Logging: Structured logging for security events
Operational Architecture
Health Monitoring
- Service Health: `/readyz` endpoints for all services
- Tool Availability: Dynamic tool listing and status
- Performance Metrics: Response times, cache statistics
- Error Tracking: Structured error logging and alerting
Deployment Patterns
- Docker Compose: Multi-service orchestration
- Environment Parity: Development ↔ Production consistency
- Graceful Shutdown: Proper resource cleanup on termination
- Rolling Updates: Zero-downtime deployment support
Extensibility Architecture
Plugin System
- MCP Tool Extension: Easy addition of new tools
- Transport Flexibility: Support for future MCP transports
- Model Pluggability: Support for different embedding models
- Storage Abstraction: Potential for alternative vector stores
Configuration Extension
- Environment-Driven: Easy configuration via environment variables
- Feature Flags: Experimental feature toggling
- A/B Testing: Multiple configuration variants support
This architecture enables Context Engine to serve as a production-ready, scalable context layer for AI applications while maintaining the flexibility to evolve with changing requirements and technologies.