YAMS Search System Architecture¶
This document describes the architecture and design of the YAMS search system, providing insights into its components, data flow, and design decisions.
Table of Contents¶
- Overview
- System Components
- Data Flow
- Query Processing Pipeline
- Indexing Architecture
- Cache Architecture
- Storage Integration
- Performance Characteristics
- Scalability Considerations
- Design Decisions
Overview¶
The YAMS search system is a high-performance, full-text search engine designed for document management and content retrieval. It provides advanced query capabilities, intelligent caching, and seamless integration with the YAMS storage system.
Key Features¶
- Full-Text Search: Powered by SQLite FTS5 for fast and accurate text retrieval
- Advanced Query Language: Support for boolean operators, phrase queries, field searches, and wildcards
- Intelligent Caching: LRU-based result caching with pattern-based invalidation
- Faceted Search: Dynamic faceting for content type, author, date, and custom metadata
- Real-time Indexing: Incremental indexing with automatic content extraction
- High Performance: Optimized for low latency and high throughput
- Thread Safety: Concurrent access support with fine-grained locking
Design Goals¶
- Performance: Sub-50ms query response times for typical workloads
- Scalability: Support for millions of documents with graceful degradation
- Reliability: Robust error handling and recovery mechanisms
- Flexibility: Extensible architecture for custom analyzers and filters
- Integration: Seamless integration with YAMS storage and metadata systems
System Components¶
The search system consists of several interconnected components:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Search API │ │ Query Parser │ │ Search Cache │
│ │ │ │ │ │
│ • REST-style │ │ • AST Generation│ │ • LRU Eviction │
│ • Request/Resp │ │ • Validation │ │ • TTL Support │
│ • Error Handling│ │ • Optimization │ │ • Pattern Inval │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ Search Executor │
│ │
│ • Query Execution • Result Ranking • Filter Application │
│ • Cache Integration • Highlighting • Pagination │
│ • Facet Generation • Spell Correction • Performance Metrics │
└─────────────────────────────────────────────────────────────────┘
│ │ │
┌─────────▼─────────┐ ┌───────▼──────────┐ ┌──────▼──────────┐
│ Document Indexer │ │ Metadata Repository│ │ Text Extractor│
│ │ │ │ │ │
│ • FTS5 Integration│ │ • Document Metadata│ │ • Content Extraction│
│ • Batch Processing│ │ • Schema Management│ │ • Format Detection │
│ • Incremental Sync│ │ • Query Interface │ │ • Language Detection│
└───────────────────┘ └────────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ SQLite Database │
│ │
│ • FTS5 Index • Metadata Tables • Connection Pooling │
│ • Document Content • Schema Migrations • Transaction Support │
│ • Statistics Tables • Query Optimization • WAL Mode │
└───────────────────────────────────────────────────────────────────┘
Component Details¶
Search Executor¶
The central orchestrator that coordinates query execution across all subsystems.
Responsibilities:
- Query parsing and validation
- Cache lookup and management
- FTS5 query execution
- Result ranking and scoring
- Filter and facet application
- Result highlighting and snippet generation
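The sketch below shows how such an executor might be shaped; the type and method names (`SearchRequest`, `SearchResponse`, `SearchExecutor`) are illustrative and are not taken from the actual YAMS headers.

```cpp
// Hypothetical executor interface illustrating the orchestration above;
// not the actual YAMS API.
#include <optional>
#include <string>
#include <vector>

struct SearchRequest {
    std::string query;   // raw query string from the caller
    size_t offset = 0;   // pagination offset
    size_t limit = 20;   // maximum number of results to return
};

struct SearchResponse {
    std::vector<std::string> documentIds; // ranked document identifiers
    size_t totalMatches = 0;              // total hits before pagination
};

class SearchExecutor {
public:
    SearchResponse search(const SearchRequest& req) {
        if (auto cached = lookupCache(req))   // serve from cache when possible
            return *cached;
        SearchResponse resp = runQuery(req);  // parse, execute, rank, paginate
        storeCache(req, resp);                // remember for identical requests
        return resp;
    }

private:
    // Placeholder bodies; a real implementation delegates to the cache,
    // query parser, and FTS5 layers described in this document.
    std::optional<SearchResponse> lookupCache(const SearchRequest&) { return std::nullopt; }
    SearchResponse runQuery(const SearchRequest&) { return {}; }
    void storeCache(const SearchRequest&, const SearchResponse&) {}
};
```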
Query Parser¶
Converts query strings into structured Abstract Syntax Trees (ASTs).
Features:
- Recursive descent parser
- Support for complex boolean expressions
- Field-specific query handling
- Wildcard and proximity queries
- Error recovery and validation
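A minimal sketch of the kind of AST the parser could emit; the node type names are illustrative rather than the actual YAMS classes.

```cpp
// Illustrative AST node hierarchy for the query grammar; names are hypothetical.
#include <memory>
#include <string>
#include <vector>

struct QueryNode {
    virtual ~QueryNode() = default;
};

struct TermNode : QueryNode {
    std::string word;                 // simple term, e.g. database
};

struct PhraseNode : QueryNode {
    std::vector<std::string> words;   // quoted phrase, e.g. "full text search"
};

struct FieldNode : QueryNode {
    std::string field;                // e.g. author
    std::unique_ptr<QueryNode> value; // term applied to that field
};

struct BooleanNode : QueryNode {
    enum class Op { And, Or, Not } op;
    std::unique_ptr<QueryNode> left;
    std::unique_ptr<QueryNode> right; // unused for Not
};
```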
Search Cache¶
High-performance caching layer with intelligent eviction and invalidation.
Architecture:
- Thread-safe LRU implementation
- Configurable TTL and memory limits
- Pattern-based cache invalidation
- Statistics and monitoring
- Persistent cache support
Data Flow¶
The following diagram illustrates the typical search request flow:
[User Query] → [API Request] → [Cache Check] → [Query Parse] → [Execute Search]
                                                                      ↓
[Response] ← [Format Results] ← [Cache Store] ← [Rank Results] ← [FTS5 Query]
Detailed Flow¶
- Request Reception: API receives search request with query and parameters
- Cache Lookup: Check if results are already cached
- Query Parsing: Parse query string into executable AST
- Query Validation: Validate syntax and semantics
- Filter Processing: Apply metadata and content filters
- FTS5 Execution: Execute full-text search query
- Result Ranking: Apply ranking algorithms (TF-IDF, BM25)
- Facet Generation: Calculate facets if requested
- Highlighting: Generate result snippets with highlights
- Pagination: Apply offset and limit
- Cache Storage: Store results for future requests
- Response Formatting: Format and return results
Query Processing Pipeline¶
The query processing pipeline transforms user queries into database operations:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Tokenization │ → │ Parsing │ → │ Optimization │ → │ Execution │
│ │ │ │ │ │ │ │
│ • Word Split │ │ • AST Build │ │ • Query Plan │ │ • FTS5 Query │
│ • Normalize │ │ • Validate │ │ • Index Sel │ │ • Filter App │
│ • Stop Words │ │ • Transform │ │ • Cost Est │ │ • Result Gen │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
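As a rough illustration of the tokenization stage, the sketch below performs word splitting, case normalization, and stop-word removal; the stop-word list and function name are made up for the example and do not reflect the YAMS tokenizer.

```cpp
// Simple tokenization sketch: word split, lowercase normalization,
// and stop-word removal. Illustrative only.
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> tokenize(const std::string& query) {
    static const std::set<std::string> kStopWords = {"a", "an", "the", "of"};
    std::vector<std::string> tokens;
    std::istringstream stream(query);
    std::string word;
    while (stream >> word) {                       // split on whitespace
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (!kStopWords.count(word))               // drop stop words
            tokens.push_back(word);
    }
    return tokens;
}
```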
Query Language Grammar¶
query = expression
expression = term | compound_expr | group_expr
compound_expr = expression operator expression
group_expr = "(" expression ")"
term = simple_term | phrase_term | field_term | range_term | wildcard_term
simple_term = WORD
phrase_term = '"' WORD+ '"'
field_term = FIELD ":" term
range_term = FIELD ":" "[" VALUE "TO" VALUE "]"
wildcard_term = WORD "*" | WORD "?"
operator = "AND" | "OR" | "NOT"
WORD = [a-zA-Z0-9_]+
FIELD = [a-zA-Z][a-zA-Z0-9_]*
VALUE = [^\]\s]+
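A few illustrative queries that conform to this grammar (field names are examples):
- `database AND (index OR cache)`: boolean operators with grouping
- `"full text search"`: exact phrase match
- `author:smith AND title:architecture`: field-specific terms
- `created:[2023-01-01 TO 2023-12-31]`: range query over a field
- `index*`: wildcard prefix match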
Indexing Architecture¶
The indexing system maintains up-to-date search indexes for all documents:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ File Monitor │ → │ Content Extract │ → │ Index Builder │
│ │ │ │ │ │
│ • Change Detect │ │ • Format Parse │ │ • FTS5 Insert │
│ • Batch Queue │ │ • Text Extract │ │ • Metadata Link │
│ • Event Filter │ │ • Language Det │ │ • Statistics │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Indexing Strategies¶
- Incremental Indexing: Process only changed documents
- Batch Processing: Group multiple updates for efficiency
- Parallel Processing: Utilize multiple threads for large batches
- Priority Queuing: Process important documents first
- Delta Compression: Minimize index update overhead
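The following sketch shows one way batched, incremental updates might be queued and flushed; the types and thresholds are illustrative, not the actual YAMS indexer API.

```cpp
// Illustrative batching of incremental index updates.
#include <string>
#include <vector>

struct DocumentUpdate {
    std::string documentId;  // identifier of the changed document
    std::string content;     // freshly extracted text
};

class BatchIndexer {
public:
    explicit BatchIndexer(size_t batchSize) : batchSize_(batchSize) {}

    // Queue a changed document; flush once the batch threshold is reached.
    void enqueue(DocumentUpdate update) {
        pending_.push_back(std::move(update));
        if (pending_.size() >= batchSize_)
            flush();
    }

    // Write all pending updates to the FTS5 index in a single transaction.
    void flush() {
        if (pending_.empty())
            return;
        // beginTransaction(); insert each pending update; commitTransaction();
        pending_.clear();
    }

private:
    size_t batchSize_;
    std::vector<DocumentUpdate> pending_;
};
```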
Content Extraction Pipeline¶
# Simplified content extraction flow
def extract_content(file_path):
    # 1. Detect file type
    content_type = detect_content_type(file_path)

    # 2. Select appropriate extractor
    extractor = get_extractor(content_type)

    # 3. Extract text content
    content = extractor.extract(file_path)

    # 4. Normalize and clean
    content = normalize_whitespace(content)
    content = remove_control_chars(content)

    # 5. Detect language
    language = detect_language(content)

    # 6. Apply language-specific processing
    if language_processor := get_processor(language):
        content = language_processor.process(content)

    return content, language
Cache Architecture¶
The caching system provides multiple layers of optimization:
┌─────────────────────────────────────────────────────────────────┐
│ Cache Hierarchy │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Query Cache │ Result Cache │ Metadata Cache │
│ │ │ │
│ • Parsed AST │ • Search Results│ • Document Metadata │
│ • Query Plans │ • Ranked Items │ • Index Statistics │
│ • Validation │ • Facet Data │ • Schema Information │
└─────────────────┴─────────────────┴─────────────────────────────┘
Cache Policies¶
- LRU Eviction: Remove least recently used entries when full
- TTL Expiration: Automatic expiration based on time-to-live
- Memory Pressure: Proactive eviction under memory constraints
- Pattern Invalidation: Invalidate related entries on updates
- Refresh Ahead: Proactively refresh expiring popular entries
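A self-contained sketch of how LRU eviction and TTL expiration can be combined in one structure; this is illustrative only, since the production cache also handles thread safety, memory accounting, and pattern invalidation.

```cpp
// Sketch of combining LRU eviction with TTL expiration on lookup.
#include <chrono>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

class TtlLruCache {
    using Clock = std::chrono::steady_clock;
    struct Entry {
        std::string key;
        std::string value;
        Clock::time_point expiresAt;
    };

public:
    TtlLruCache(size_t maxEntries, std::chrono::seconds ttl)
        : maxEntries_(maxEntries), ttl_(ttl) {}

    void put(const std::string& key, std::string value) {
        if (auto it = index_.find(key); it != index_.end())
            entries_.erase(it->second);          // replace an existing entry
        entries_.push_front({key, std::move(value), Clock::now() + ttl_});
        index_[key] = entries_.begin();
        if (entries_.size() > maxEntries_) {     // LRU eviction when full
            index_.erase(entries_.back().key);
            entries_.pop_back();
        }
    }

    std::optional<std::string> get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end())
            return std::nullopt;                 // miss
        if (Clock::now() >= it->second->expiresAt) {
            entries_.erase(it->second);          // expired: drop the entry
            index_.erase(it);
            return std::nullopt;
        }
        entries_.splice(entries_.begin(), entries_, it->second); // mark as recently used
        return it->second->value;
    }

private:
    size_t maxEntries_;
    std::chrono::seconds ttl_;
    std::list<Entry> entries_;                   // most recently used at the front
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```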
Cache Key Design¶
// Cache key components
struct CacheKey {
    std::string queryHash;   // Hash of normalized query
    std::string filterHash;  // Hash of applied filters
    size_t offset;           // Pagination offset
    size_t limit;            // Result limit
    std::string versionHash; // Data version for invalidation
};
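One plausible way such a key could be flattened into a single lookup string; the exact scheme used by YAMS may differ.

```cpp
#include <string>

// Assumes the CacheKey struct shown above. Any change in query, filters,
// pagination, or data version yields a different cache entry.
std::string toLookupString(const CacheKey& key) {
    return key.queryHash + '|' + key.filterHash + '|' +
           std::to_string(key.offset) + ':' + std::to_string(key.limit) +
           '|' + key.versionHash;
}
```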
Storage Integration¶
The search system integrates deeply with the YAMS storage architecture:
┌──────────────────────────────────────────────────────────────────┐
│ YAMS Storage System │
├──────────────────┬─────────────────────────┬────────────────────┤
│ Content Store │ Metadata Store │ Search Index │
│ │ │ │
│ • Chunked Data │ • Document Properties │ • FTS5 Tables │
│ • Compression │ • Relationships │ • Term Indexes │
│ • Deduplication │ • Version History │ • Statistics │
└──────────────────┴─────────────────────────┴────────────────────┘
Data Synchronization¶
- Event-Driven Updates: React to storage system events
- Consistency Guarantees: Ensure search index reflects actual content
- Conflict Resolution: Handle concurrent updates gracefully
- Recovery Mechanisms: Rebuild indexes from authoritative storage
- Validation Checks: Periodic consistency validation
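A sketch of the event-driven update path; the event names and handler are hypothetical, not the actual YAMS event API.

```cpp
// Illustrative reaction to storage events.
#include <string>

enum class StorageEventType { DocumentAdded, DocumentUpdated, DocumentRemoved };

struct StorageEvent {
    StorageEventType type;
    std::string documentId;
};

class SearchIndexSynchronizer {
public:
    void onStorageEvent(const StorageEvent& event) {
        switch (event.type) {
        case StorageEventType::DocumentAdded:
        case StorageEventType::DocumentUpdated:
            reindex(event.documentId);             // re-extract and update index rows
            break;
        case StorageEventType::DocumentRemoved:
            removeFromIndex(event.documentId);     // drop stale index entries
            break;
        }
        invalidateCacheFor(event.documentId);      // keep cached results consistent
    }

private:
    // Placeholder bodies standing in for the indexing and cache layers.
    void reindex(const std::string&) {}
    void removeFromIndex(const std::string&) {}
    void invalidateCacheFor(const std::string&) {}
};
```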
Performance Characteristics¶
Benchmarks¶
| Operation      | Typical Latency | Throughput       |
|----------------|-----------------|------------------|
| Simple Query   | 5-15 ms         | 1000+ QPS        |
| Complex Query  | 20-50 ms        | 500+ QPS         |
| Cache Hit      | <1 ms           | 10,000+ QPS      |
| Faceted Search | 10-30 ms        | 300+ QPS         |
| Index Update   | 1-5 ms          | 200+ documents/s |
Scaling Characteristics¶
- Document Count: Linear scaling to 10M+ documents
- Query Complexity: Logarithmic degradation with term count
- Concurrent Users: Linear scaling with available CPU cores
- Memory Usage: 100-500MB for typical workloads
- Storage Overhead: 20-40% of original document size
Scalability Considerations¶
Horizontal Scaling¶
- Read Replicas: Multiple read-only search instances
- Shard Distribution: Partition documents across instances
- Load Balancing: Distribute queries across available nodes
- Cache Coherence: Coordinate cache invalidation across nodes
Vertical Scaling¶
- Memory Optimization: Increase cache sizes for better hit rates
- CPU Utilization: Multi-threading for parallel query processing
- Storage Performance: SSD storage for index data
- Network Optimization: Connection pooling and keep-alives
Performance Tuning¶
// Example configuration for high-performance deployment
SearchConfig config;
config.maxResults = 1000;
config.defaultLimit = 20;
config.timeout = std::chrono::milliseconds(30000);
// Cache configuration
SearchCacheConfig cacheConfig;
cacheConfig.maxEntries = 10000;
cacheConfig.maxMemoryMB = 1000;
cacheConfig.defaultTTL = std::chrono::minutes(10);
cacheConfig.evictionPolicy = EvictionPolicy::LRU;
// Index configuration
IndexingConfig indexConfig;
indexConfig.batchSize = 1000;
indexConfig.maxWorkerThreads = 8;
indexConfig.enableIncrementalSync = true;
Design Decisions¶
SQLite FTS5 Choice¶
Rationale: SQLite FTS5 provides excellent performance for medium-scale deployments while maintaining simplicity and reliability.
Alternatives Considered:
- Elasticsearch: Too heavy for embedded use cases
- Lucene: Complex integration and maintenance
- Custom index: Development overhead and reliability concerns
Trade-offs:
- ✅ Simple deployment and maintenance
- ✅ ACID transactions and consistency
- ✅ Excellent performance for <10M documents
- ❌ Limited horizontal scaling options
- ❌ Memory usage scales with corpus size
Query Language Design¶
Philosophy: Balance expressiveness with usability
Decisions:
1. Boolean Logic: AND/OR/NOT operators for complex queries
2. Field Queries: Support for field-specific searches
3. Phrase Queries: Exact phrase matching with quotes
4. Wildcards: Star and question mark wildcards
5. Range Queries: Numeric and date range queries
Caching Strategy¶
Approach: Multi-level caching with intelligent invalidation
Decisions:
1. LRU Eviction: Simple and effective for typical access patterns
2. TTL Support: Balance freshness with performance
3. Pattern Invalidation: Efficient bulk invalidation on updates
4. Memory Limits: Prevent OOM conditions under load
5. Statistics Tracking: Enable performance monitoring and tuning
Thread Safety¶
Model: Reader-writer locks with fine-grained locking
Decisions:
1. Shared Reads: Multiple concurrent readers for queries
2. Exclusive Writes: Single writer for index updates
3. Lock-Free Paths: Atomic operations for statistics
4. Deadlock Prevention: Consistent lock ordering
5. Performance Priority: Optimize for read-heavy workloads
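For example, a shared-mutex pattern along these lines supports many concurrent readers and a single writer; this is a generic sketch, not the YAMS locking code.

```cpp
// Generic reader-writer locking sketch using std::shared_mutex.
#include <shared_mutex>
#include <string>
#include <unordered_map>

class IndexStatistics {
public:
    // Queries take a shared lock, so many readers can proceed concurrently.
    size_t documentCount() const {
        std::shared_lock lock(mutex_);
        return counts_.size();
    }

    // Index updates take an exclusive lock, briefly blocking readers.
    void recordDocument(const std::string& id, size_t termCount) {
        std::unique_lock lock(mutex_);
        counts_[id] = termCount;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, size_t> counts_;
};
```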
Conclusion¶
The YAMS search system provides a robust, high-performance foundation for document search and retrieval. Its modular architecture enables easy extension and customization while maintaining excellent performance characteristics and reliability.
Key strengths:
- High performance with sub-50ms typical query times
- Comprehensive query language supporting complex searches
- Intelligent caching with 80%+ hit rates in typical workloads
- Thread-safe design supporting high concurrency
- Seamless integration with YAMS storage systems
The architecture balances simplicity and performance, making it suitable for a wide range of deployment scenarios from embedded applications to large-scale document management systems.