YAMS Search System Architecture¶
This document describes the architecture and design of the YAMS search system, providing insights into its components, data flow, and design decisions.
Table of Contents¶
- Overview
- System Components
- Data Flow
- Query Processing Pipeline
- Indexing Architecture
- Cache Architecture
- Storage Integration
- Performance Characteristics
- Scalability Considerations
- Design Decisions
Overview¶
The YAMS search system is a high-performance, full-text search engine designed for document management and content retrieval. It provides advanced query capabilities, intelligent caching, and seamless integration with the YAMS storage system.
Key Features¶
- Full-Text Search: Powered by SQLite FTS5 for fast and accurate text retrieval
- Advanced Query Language: Support for boolean operators, phrase queries, field searches, and wildcards
- Intelligent Caching: LRU-based result caching with pattern-based invalidation
- Faceted Search: Dynamic faceting for content type, author, date, and custom metadata
- Real-time Indexing: Incremental indexing with automatic content extraction
- High Performance: Optimized for low latency and high throughput
- Thread Safety: Concurrent access support with fine-grained locking
Design Goals¶
- Performance: Sub-50ms query response times for typical workloads
- Scalability: Support for millions of documents with graceful degradation
- Reliability: Robust error handling and recovery mechanisms
- Flexibility: Extensible architecture for custom analyzers and filters
- Integration: Seamless integration with YAMS storage and metadata systems
System Components¶
The search system consists of several interconnected components:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Search API │ │ Query Parser │ │ Search Cache │
│ │ │ │ │ │
│ • REST-style │ │ • AST Generation│ │ • LRU Eviction │
│ • Request/Resp │ │ • Validation │ │ • TTL Support │
│ • Error Handling│ │ • Optimization │ │ • Pattern Inval │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ Search Executor │
│ │
│ • Query Execution • Result Ranking • Filter Application │
│ • Cache Integration • Highlighting • Pagination │
│ • Facet Generation • Spell Correction • Performance Metrics │
└─────────────────────────────────────────────────────────────────┘
│ │ │
┌─────────▼─────────┐ ┌───────▼──────────┐ ┌──────▼──────────┐
│ Document Indexer │ │ Metadata Repository│ │ Text Extractor│
│ │ │ │ │ │
│ • FTS5 Integration│ │ • Document Metadata│ │ • Content Extraction│
│ • Batch Processing│ │ • Schema Management│ │ • Format Detection │
│ • Incremental Sync│ │ • Query Interface │ │ • Language Detection│
└───────────────────┘ └────────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ SQLite Database │
│ │
│ • FTS5 Index • Metadata Tables • Connection Pooling │
│ • Document Content • Schema Migrations • Transaction Support │
│ • Statistics Tables • Query Optimization • WAL Mode │
└───────────────────────────────────────────────────────────────────┘
Component Details¶
Search Executor¶
The central orchestrator that coordinates query execution across all subsystems.
Responsibilities:
- Query parsing and validation
- Cache lookup and management
- FTS5 query execution
- Result ranking and scoring
- Filter and facet application
- Result highlighting and snippet generation
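The sketch below shows how such an executor might be shaped; the type and method names (`SearchRequest`, `SearchResponse`, `SearchExecutor`) are illustrative and are not taken from the actual YAMS headers.

```cpp
// Hypothetical executor interface illustrating the orchestration above;
// not the actual YAMS API.
#include <optional>
#include <string>
#include <vector>

struct SearchRequest {
    std::string query;   // raw query string from the caller
    size_t offset = 0;   // pagination offset
    size_t limit = 20;   // maximum number of results to return
};

struct SearchResponse {
    std::vector<std::string> documentIds; // ranked document identifiers
    size_t totalMatches = 0;              // total hits before pagination
};

class SearchExecutor {
public:
    SearchResponse search(const SearchRequest& req) {
        if (auto cached = lookupCache(req))   // serve from cache when possible
            return *cached;
        SearchResponse resp = runQuery(req);  // parse, execute, rank, paginate
        storeCache(req, resp);                // remember for identical requests
        return resp;
    }

private:
    // Placeholder bodies; a real implementation delegates to the cache,
    // query parser, and FTS5 layers described in this document.
    std::optional<SearchResponse> lookupCache(const SearchRequest&) { return std::nullopt; }
    SearchResponse runQuery(const SearchRequest&) { return {}; }
    void storeCache(const SearchRequest&, const SearchResponse&) {}
};
```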
Query Parser¶
Converts query strings into structured Abstract Syntax Trees (ASTs).
Features:
- Recursive descent parser
- Support for complex boolean expressions
- Field-specific query handling
- Wildcard and proximity queries
- Error recovery and validation
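A minimal sketch of the kind of AST the parser could emit; the node type names are illustrative rather than the actual YAMS classes.

```cpp
// Illustrative AST node hierarchy for the query grammar; names are hypothetical.
#include <memory>
#include <string>
#include <vector>

struct QueryNode {
    virtual ~QueryNode() = default;
};

struct TermNode : QueryNode {
    std::string word;                 // simple term, e.g. database
};

struct PhraseNode : QueryNode {
    std::vector<std::string> words;   // quoted phrase, e.g. "full text search"
};

struct FieldNode : QueryNode {
    std::string field;                // e.g. author
    std::unique_ptr<QueryNode> value; // term applied to that field
};

struct BooleanNode : QueryNode {
    enum class Op { And, Or, Not } op;
    std::unique_ptr<QueryNode> left;
    std::unique_ptr<QueryNode> right; // unused for Not
};
```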
Search Cache¶
High-performance caching layer with intelligent eviction and invalidation.
Architecture:
- Thread-safe LRU implementation
- Configurable TTL and memory limits
- Pattern-based cache invalidation
- Statistics and monitoring
- Persistent cache support
Data Flow¶
The following diagram illustrates the typical search request flow:
[User Query] → [API Request] → [Cache Check] → [Query Parse] → [Execute Search]
                                                                      ↓
[Response] ← [Format Results] ← [Cache Store] ← [Rank Results] ← [FTS5 Query]
Detailed Flow¶
- Request Reception: API receives search request with query and parameters
- Cache Lookup: Check if results are already cached
- Query Parsing: Parse query string into executable AST
- Query Validation: Validate syntax and semantics
- Filter Processing: Apply metadata and content filters
- FTS5 Execution: Execute full-text search query
- Result Ranking: Apply ranking algorithms (TF-IDF, BM25)
- Facet Generation: Calculate facets if requested
- Highlighting: Generate result snippets with highlights
- Pagination: Apply offset and limit
- Cache Storage: Store results for future requests
- Response Formatting: Format and return results
Query Processing Pipeline¶
The query processing pipeline transforms user queries into database operations:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Tokenization │ → │ Parsing │ → │ Optimization │ → │ Execution │
│ │ │ │ │ │ │ │
│ • Word Split │ │ • AST Build │ │ • Query Plan │ │ • FTS5 Query │
│ • Normalize │ │ • Validate │ │ • Index Sel │ │ • Filter App │
│ • Stop Words │ │ • Transform │ │ • Cost Est │ │ • Result Gen │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
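As a rough illustration of the tokenization stage, the sketch below performs word splitting, case normalization, and stop-word removal; the stop-word list and function name are made up for the example and do not reflect the YAMS tokenizer.

```cpp
// Simple tokenization sketch: word split, lowercase normalization,
// and stop-word removal. Illustrative only.
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> tokenize(const std::string& query) {
    static const std::set<std::string> kStopWords = {"a", "an", "the", "of"};
    std::vector<std::string> tokens;
    std::istringstream stream(query);
    std::string word;
    while (stream >> word) {                       // split on whitespace
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (!kStopWords.count(word))               // drop stop words
            tokens.push_back(word);
    }
    return tokens;
}
```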
Query Language Grammar¶
query = expression
expression = term | compound_expr | group_expr
compound_expr = expression operator expression
group_expr = "(" expression ")"
term = simple_term | phrase_term | field_term | range_term | wildcard_term
simple_term = WORD
phrase_term = '"' WORD+ '"'
field_term = FIELD ":" term
range_term = FIELD ":" "[" VALUE "TO" VALUE "]"
wildcard_term = WORD "*" | WORD "?"
operator = "AND" | "OR" | "NOT"
WORD = [a-zA-Z0-9_]+
FIELD = [a-zA-Z][a-zA-Z0-9_]*
VALUE = [^\]\s]+
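A few illustrative queries that conform to this grammar (field names are examples):
- `database AND (index OR cache)`: boolean operators with grouping
- `"full text search"`: exact phrase match
- `author:smith AND title:architecture`: field-specific terms
- `created:[2023-01-01 TO 2023-12-31]`: range query over a field
- `index*`: wildcard prefix match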
Indexing Architecture¶
The indexing system maintains up-to-date search indexes for all documents:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ File Monitor │ → │ Content Extract │ → │ Index Builder │
│ │ │ │ │ │
│ • Change Detect │ │ • Format Parse │ │ • FTS5 Insert │
│ • Batch Queue │ │ • Text Extract │ │ • Metadata Link │
│ • Event Filter │ │ • Language Det │ │ • Statistics │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Indexing Strategies¶
- Incremental Indexing: Process only changed documents
- Batch Processing: Group multiple updates for efficiency
- Parallel Processing: Utilize multiple threads for large batches
- Priority Queuing: Process important documents first
- Delta Compression: Minimize index update overhead
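The following sketch shows one way batched, incremental updates might be queued and flushed; the types and thresholds are illustrative, not the actual YAMS indexer API.

```cpp
// Illustrative batching of incremental index updates.
#include <string>
#include <vector>

struct DocumentUpdate {
    std::string documentId;  // identifier of the changed document
    std::string content;     // freshly extracted text
};

class BatchIndexer {
public:
    explicit BatchIndexer(size_t batchSize) : batchSize_(batchSize) {}

    // Queue a changed document; flush once the batch threshold is reached.
    void enqueue(DocumentUpdate update) {
        pending_.push_back(std::move(update));
        if (pending_.size() >= batchSize_)
            flush();
    }

    // Write all pending updates to the FTS5 index in a single transaction.
    void flush() {
        if (pending_.empty())
            return;
        // beginTransaction(); insert each pending update; commitTransaction();
        pending_.clear();
    }

private:
    size_t batchSize_;
    std::vector<DocumentUpdate> pending_;
};
```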
Content Extraction Pipeline¶
# Simplified content extraction flow
def extract_content(file_path):
    # 1. Detect file type
    content_type = detect_content_type(file_path)

    # 2. Select appropriate extractor
    extractor = get_extractor(content_type)

    # 3. Extract text content
    content = extractor.extract(file_path)

    # 4. Normalize and clean
    content = normalize_whitespace(content)
    content = remove_control_chars(content)

    # 5. Detect language
    language = detect_language(content)

    # 6. Apply language-specific processing
    if language_processor := get_processor(language):
        content = language_processor.process(content)

    return content, language
Cache Architecture¶
The caching system provides multiple layers of optimization:
┌─────────────────────────────────────────────────────────────────┐
│ Cache Hierarchy │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Query Cache │ Result Cache │ Metadata Cache │
│ │ │ │
│ • Parsed AST │ • Search Results│ • Document Metadata │
│ • Query Plans │ • Ranked Items │ • Index Statistics │
│ • Validation │ • Facet Data │ • Schema Information │
└─────────────────┴─────────────────┴─────────────────────────────┘
Cache Policies¶
- LRU Eviction: Remove least recently used entries when full
- TTL Expiration: Automatic expiration based on time-to-live
- Memory Pressure: Proactive eviction under memory constraints
- Pattern Invalidation: Invalidate related entries on updates
- Refresh Ahead: Proactively refresh expiring popular entries
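A self-contained sketch of how LRU eviction and TTL expiration can be combined in one structure; this is illustrative only, since the production cache also handles thread safety, memory accounting, and pattern invalidation.

```cpp
// Sketch of combining LRU eviction with TTL expiration on lookup.
#include <chrono>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

class TtlLruCache {
    using Clock = std::chrono::steady_clock;
    struct Entry {
        std::string key;
        std::string value;
        Clock::time_point expiresAt;
    };

public:
    TtlLruCache(size_t maxEntries, std::chrono::seconds ttl)
        : maxEntries_(maxEntries), ttl_(ttl) {}

    void put(const std::string& key, std::string value) {
        if (auto it = index_.find(key); it != index_.end())
            entries_.erase(it->second);          // replace an existing entry
        entries_.push_front({key, std::move(value), Clock::now() + ttl_});
        index_[key] = entries_.begin();
        if (entries_.size() > maxEntries_) {     // LRU eviction when full
            index_.erase(entries_.back().key);
            entries_.pop_back();
        }
    }

    std::optional<std::string> get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end())
            return std::nullopt;                 // miss
        if (Clock::now() >= it->second->expiresAt) {
            entries_.erase(it->second);          // expired: drop the entry
            index_.erase(it);
            return std::nullopt;
        }
        entries_.splice(entries_.begin(), entries_, it->second); // mark as recently used
        return it->second->value;
    }

private:
    size_t maxEntries_;
    std::chrono::seconds ttl_;
    std::list<Entry> entries_;                   // most recently used at the front
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```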
Cache Key Design¶
// Cache key components
struct CacheKey {
    std::string queryHash;   // Hash of normalized query
    std::string filterHash;  // Hash of applied filters
    size_t offset;           // Pagination offset
    size_t limit;            // Result limit
    std::string versionHash; // Data version for invalidation
};
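One plausible way such a key could be flattened into a single lookup string; the exact scheme used by YAMS may differ.

```cpp
#include <string>

// Assumes the CacheKey struct shown above. Any change in query, filters,
// pagination, or data version yields a different cache entry.
std::string toLookupString(const CacheKey& key) {
    return key.queryHash + '|' + key.filterHash + '|' +
           std::to_string(key.offset) + ':' + std::to_string(key.limit) +
           '|' + key.versionHash;
}
```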
Storage Integration¶
The search system integrates deeply with the YAMS storage architecture:
┌──────────────────────────────────────────────────────────────────┐
│ YAMS Storage System │
├──────────────────┬─────────────────────────┬────────────────────┤
│ Content Store │ Metadata Store │ Search Index │
│ │ │ │
│ • Chunked Data │ • Document Properties │ • FTS5 Tables │
│ • Compression │ • Relationships │ • Term Indexes │
│ • Deduplication │ • Version History │ • Statistics │
└──────────────────┴─────────────────────────┴────────────────────┘
Data Synchronization¶
- Event-Driven Updates: React to storage system events
- Consistency Guarantees: Ensure search index reflects actual content
- Conflict Resolution: Handle concurrent updates gracefully
- Recovery Mechanisms: Rebuild indexes from authoritative storage
- Validation Checks: Periodic consistency validation
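A sketch of the event-driven update path; the event names and handler are hypothetical, not the actual YAMS event API.

```cpp
// Illustrative reaction to storage events.
#include <string>

enum class StorageEventType { DocumentAdded, DocumentUpdated, DocumentRemoved };

struct StorageEvent {
    StorageEventType type;
    std::string documentId;
};

class SearchIndexSynchronizer {
public:
    void onStorageEvent(const StorageEvent& event) {
        switch (event.type) {
        case StorageEventType::DocumentAdded:
        case StorageEventType::DocumentUpdated:
            reindex(event.documentId);             // re-extract and update index rows
            break;
        case StorageEventType::DocumentRemoved:
            removeFromIndex(event.documentId);     // drop stale index entries
            break;
        }
        invalidateCacheFor(event.documentId);      // keep cached results consistent
    }

private:
    // Placeholder bodies standing in for the indexing and cache layers.
    void reindex(const std::string&) {}
    void removeFromIndex(const std::string&) {}
    void invalidateCacheFor(const std::string&) {}
};
```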
Performance Characteristics¶
Benchmarks¶
| Operation      | Typical Latency | Throughput       |
|----------------|-----------------|------------------|
| Simple Query   | 5-15 ms         | 1000+ QPS        |
| Complex Query  | 20-50 ms        | 500+ QPS         |
| Cache Hit      | <1 ms           | 10,000+ QPS      |
| Faceted Search | 10-30 ms        | 300+ QPS         |
| Index Update   | 1-5 ms          | 200+ documents/s |
Scaling Characteristics¶
- Document Count: Linear scaling to 10M+ documents
- Query Complexity: Logarithmic degradation with term count
- Concurrent Users: Linear scaling with available CPU cores
- Memory Usage: 100-500MB for typical workloads
- Storage Overhead: 20-40% of original document size
Scalability Considerations¶
Horizontal Scaling¶
- Read Replicas: Multiple read-only search instances
- Shard Distribution: Partition documents across instances
- Load Balancing: Distribute queries across available nodes
- Cache Coherence: Coordinate cache invalidation across nodes
Vertical Scaling¶
- Memory Optimization: Increase cache sizes for better hit rates
- CPU Utilization: Multi-threading for parallel query processing
- Storage Performance: SSD storage for index data
- Network Optimization: Connection pooling and keep-alives
Performance Tuning¶
// Example configuration for high-performance deployment
SearchConfig config;
config.maxResults = 1000;
config.defaultLimit = 20;
config.timeout = std::chrono::milliseconds(30000);
// Cache configuration
SearchCacheConfig cacheConfig;
cacheConfig.maxEntries = 10000;
cacheConfig.maxMemoryMB = 1000;
cacheConfig.defaultTTL = std::chrono::minutes(10);
cacheConfig.evictionPolicy = EvictionPolicy::LRU;
// Index configuration
IndexingConfig indexConfig;
indexConfig.batchSize = 1000;
indexConfig.maxWorkerThreads = 8;
indexConfig.enableIncrementalSync = true;
Design Decisions¶
SQLite FTS5 Choice¶
Rationale: SQLite FTS5 provides excellent performance for medium-scale deployments while maintaining simplicity and reliability.
Alternatives Considered:
- Elasticsearch: Too heavy for embedded use cases
- Lucene: Complex integration and maintenance
- Custom index: Development overhead and reliability concerns
Trade-offs:
- ✅ Simple deployment and maintenance
- ✅ ACID transactions and consistency
- ✅ Excellent performance for <10M documents
- ❌ Limited horizontal scaling options
- ❌ Memory usage scales with corpus size
Query Language Design¶
Philosophy: Balance expressiveness with usability
Decisions:
1. Boolean Logic: AND/OR/NOT operators for complex queries
2. Field Queries: Support for field-specific searches
3. Phrase Queries: Exact phrase matching with quotes
4. Wildcards: Star and question mark wildcards
5. Range Queries: Numeric and date range queries
Caching Strategy¶
Approach: Multi-level caching with intelligent invalidation
Decisions:
1. LRU Eviction: Simple and effective for typical access patterns
2. TTL Support: Balance freshness with performance
3. Pattern Invalidation: Efficient bulk invalidation on updates
4. Memory Limits: Prevent OOM conditions under load
5. Statistics Tracking: Enable performance monitoring and tuning
Thread Safety¶
Model: Reader-writer locks with fine-grained locking
Decisions:
1. Shared Reads: Multiple concurrent readers for queries
2. Exclusive Writes: Single writer for index updates
3. Lock-Free Paths: Atomic operations for statistics
4. Deadlock Prevention: Consistent lock ordering
5. Performance Priority: Optimize for read-heavy workloads
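For example, a shared-mutex pattern along these lines supports many concurrent readers and a single writer; this is a generic sketch, not the YAMS locking code.

```cpp
// Generic reader-writer locking sketch using std::shared_mutex.
#include <shared_mutex>
#include <string>
#include <unordered_map>

class IndexStatistics {
public:
    // Queries take a shared lock, so many readers can proceed concurrently.
    size_t documentCount() const {
        std::shared_lock lock(mutex_);
        return counts_.size();
    }

    // Index updates take an exclusive lock, briefly blocking readers.
    void recordDocument(const std::string& id, size_t termCount) {
        std::unique_lock lock(mutex_);
        counts_[id] = termCount;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, size_t> counts_;
};
```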
Conclusion¶
The YAMS search system provides a robust, high-performance foundation for document search and retrieval. Its modular architecture enables easy extension and customization while maintaining excellent performance characteristics and reliability.
Key strengths:
- High performance with sub-50ms typical query times
- Comprehensive query language supporting complex searches
- Intelligent caching with 80%+ hit rates in typical workloads
- Thread-safe design supporting high concurrency
- Seamless integration with YAMS storage systems
The architecture balances simplicity and performance, making it suitable for a wide range of deployment scenarios from embedded applications to large-scale document management systems.