Skip to content

YAMS Search System Architecture

This document describes the architecture and design of the YAMS search system, providing insights into its components, data flow, and design decisions.

Table of Contents

  1. Overview
  2. System Components
  3. Data Flow
  4. Query Processing Pipeline
  5. Indexing Architecture
  6. Cache Architecture
  7. Storage Integration
  8. Performance Characteristics
  9. Scalability Considerations
  10. Design Decisions

Overview

The YAMS search system is a high-performance, full-text search engine designed for document management and content retrieval. It provides advanced query capabilities, intelligent caching, and seamless integration with the YAMS storage system.

Key Features

  • Full-Text Search: Powered by SQLite FTS5 for fast and accurate text retrieval
  • Advanced Query Language: Support for boolean operators, phrase queries, field searches, and wildcards
  • Intelligent Caching: LRU-based result caching with pattern-based invalidation
  • Faceted Search: Dynamic faceting for content type, author, date, and custom metadata
  • Real-time Indexing: Incremental indexing with automatic content extraction
  • High Performance: Optimized for low latency and high throughput
  • Thread Safety: Concurrent access support with fine-grained locking

Design Goals

  1. Performance: Sub-50ms query response times for typical workloads
  2. Scalability: Support for millions of documents with graceful degradation
  3. Reliability: Robust error handling and recovery mechanisms
  4. Flexibility: Extensible architecture for custom analyzers and filters
  5. Integration: Seamless integration with YAMS storage and metadata systems

System Components

The search system consists of several interconnected components:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Search API    │    │   Query Parser  │    │  Search Cache   │
│                 │    │                 │    │                 │
│ • REST-style    │    │ • AST Generation│    │ • LRU Eviction  │
│ • Request/Resp  │    │ • Validation    │    │ • TTL Support   │
│ • Error Handling│    │ • Optimization  │    │ • Pattern Inval │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
┌─────────────────────────────────▼─────────────────────────────────┐
│                      Search Executor                              │
│                                                                   │
│ • Query Execution     • Result Ranking     • Filter Application  │
│ • Cache Integration   • Highlighting       • Pagination          │
│ • Facet Generation    • Spell Correction   • Performance Metrics │
└─────────────────────────────────────────────────────────────────┘
         │                       │                       │
┌─────────▼─────────┐    ┌───────▼──────────┐    ┌──────▼──────────┐
│  Document Indexer │    │ Metadata Repository│    │   Text Extractor│
│                   │    │                    │    │                 │
│ • FTS5 Integration│    │ • Document Metadata│   │ • Content Extraction│
│ • Batch Processing│    │ • Schema Management│    │ • Format Detection  │
│ • Incremental Sync│    │ • Query Interface  │    │ • Language Detection│
└───────────────────┘    └────────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
┌─────────────────────────────────▼─────────────────────────────────┐
│                         SQLite Database                           │
│                                                                   │
│ • FTS5 Index           • Metadata Tables    • Connection Pooling  │
│ • Document Content     • Schema Migrations  • Transaction Support │
│ • Statistics Tables    • Query Optimization • WAL Mode           │
└───────────────────────────────────────────────────────────────────┘

Component Details

Search Executor

The central orchestrator that coordinates query execution across all subsystems.

Responsibilities: - Query parsing and validation - Cache lookup and management - FTS5 query execution - Result ranking and scoring - Filter and facet application - Result highlighting and snippet generation

Query Parser

Converts natural language queries into structured Abstract Syntax Trees (AST).

Features: - Recursive descent parser - Support for complex boolean expressions - Field-specific query handling - Wildcard and proximity queries - Error recovery and validation

Search Cache

High-performance caching layer with intelligent eviction and invalidation.

Architecture: - Thread-safe LRU implementation - Configurable TTL and memory limits - Pattern-based cache invalidation - Statistics and monitoring - Persistent cache support

Data Flow

The following diagram illustrates the typical search request flow:

[User Query] → [API Request] → [Cache Check] → [Query Parse] → [Execute Search]
      ↑                                ↓              ↓              ↓
[Response]   ← [Format Results] ← [Cache Store] ← [Rank Results] ← [FTS5 Query]

Detailed Flow

  1. Request Reception: API receives search request with query and parameters
  2. Cache Lookup: Check if results are already cached
  3. Query Parsing: Parse query string into executable AST
  4. Query Validation: Validate syntax and semantics
  5. Filter Processing: Apply metadata and content filters
  6. FTS5 Execution: Execute full-text search query
  7. Result Ranking: Apply ranking algorithms (TF-IDF, BM25)
  8. Facet Generation: Calculate facets if requested
  9. Highlighting: Generate result snippets with highlights
  10. Pagination: Apply offset and limit
  11. Cache Storage: Store results for future requests
  12. Response Formatting: Format and return results

Query Processing Pipeline

The query processing pipeline transforms user queries into database operations:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Tokenization │ → │   Parsing    │ → │ Optimization │ → │  Execution   │
│              │    │              │    │              │    │              │
│ • Word Split │    │ • AST Build  │    │ • Query Plan │    │ • FTS5 Query │
│ • Normalize  │    │ • Validate   │    │ • Index Sel  │    │ • Filter App │
│ • Stop Words │    │ • Transform  │    │ • Cost Est   │    │ • Result Gen │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Query Language Grammar

query          = expression
expression     = term | compound_expr | group_expr
compound_expr  = expression operator expression
group_expr     = "(" expression ")"
term           = simple_term | phrase_term | field_term | range_term | wildcard_term
simple_term    = WORD
phrase_term    = '"' WORD+ '"'
field_term     = FIELD ":" term
range_term     = FIELD ":" "[" VALUE "TO" VALUE "]"
wildcard_term  = WORD "*" | WORD "?"
operator       = "AND" | "OR" | "NOT"
WORD           = [a-zA-Z0-9_]+
FIELD          = [a-zA-Z][a-zA-Z0-9_]*
VALUE          = [^\]\s]+

Indexing Architecture

The indexing system maintains up-to-date search indexes for all documents:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  File Monitor   │ → │ Content Extract │ → │  Index Builder  │
│                 │    │                 │    │                 │
│ • Change Detect │    │ • Format Parse  │    │ • FTS5 Insert   │
│ • Batch Queue   │    │ • Text Extract  │    │ • Metadata Link │
│ • Event Filter  │    │ • Language Det  │    │ • Statistics    │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Indexing Strategies

  1. Incremental Indexing: Process only changed documents
  2. Batch Processing: Group multiple updates for efficiency
  3. Parallel Processing: Utilize multiple threads for large batches
  4. Priority Queuing: Process important documents first
  5. Delta Compression: Minimize index update overhead

Content Extraction Pipeline

# Simplified content extraction flow
def extract_content(file_path):
    # 1. Detect file type
    content_type = detect_content_type(file_path)

    # 2. Select appropriate extractor
    extractor = get_extractor(content_type)

    # 3. Extract text content
    content = extractor.extract(file_path)

    # 4. Normalize and clean
    content = normalize_whitespace(content)
    content = remove_control_chars(content)

    # 5. Detect language
    language = detect_language(content)

    # 6. Apply language-specific processing
    if language_processor := get_processor(language):
        content = language_processor.process(content)

    return content, language

Cache Architecture

The caching system provides multiple layers of optimization:

┌─────────────────────────────────────────────────────────────────┐
│                        Cache Hierarchy                         │
├─────────────────┬─────────────────┬─────────────────────────────┤
│   Query Cache   │  Result Cache   │      Metadata Cache         │
│                 │                 │                             │
│ • Parsed AST    │ • Search Results│ • Document Metadata         │
│ • Query Plans   │ • Ranked Items  │ • Index Statistics          │
│ • Validation    │ • Facet Data    │ • Schema Information        │
└─────────────────┴─────────────────┴─────────────────────────────┘

Cache Policies

  1. LRU Eviction: Remove least recently used entries when full
  2. TTL Expiration: Automatic expiration based on time-to-live
  3. Memory Pressure: Proactive eviction under memory constraints
  4. Pattern Invalidation: Invalidate related entries on updates
  5. Refresh Ahead: Proactively refresh expiring popular entries

Cache Key Design

// Cache key components
struct CacheKey {
    std::string queryHash;        // Hash of normalized query
    std::string filterHash;       // Hash of applied filters
    size_t offset;                // Pagination offset
    size_t limit;                 // Result limit
    std::string versionHash;      // Data version for invalidation
};

Storage Integration

The search system integrates deeply with the YAMS storage architecture:

┌──────────────────────────────────────────────────────────────────┐
│                     YAMS Storage System                         │
├──────────────────┬─────────────────────────┬────────────────────┤
│   Content Store  │    Metadata Store       │   Search Index     │
│                  │                         │                    │
│ • Chunked Data   │ • Document Properties   │ • FTS5 Tables      │
│ • Compression    │ • Relationships         │ • Term Indexes     │
│ • Deduplication  │ • Version History       │ • Statistics       │
└──────────────────┴─────────────────────────┴────────────────────┘

Data Synchronization

  1. Event-Driven Updates: React to storage system events
  2. Consistency Guarantees: Ensure search index reflects actual content
  3. Conflict Resolution: Handle concurrent updates gracefully
  4. Recovery Mechanisms: Rebuild indexes from authoritative storage
  5. Validation Checks: Periodic consistency validation

Performance Characteristics

Benchmarks

Operation Typical Latency Throughput
Simple Query 5-15ms 1000+ QPS
Complex Query 20-50ms 500+ QPS
Cache Hit <1ms 10000+ QPS
Faceted Search 10-30ms 300+ QPS
Index Update 1-5ms 200+ DPS

Scaling Characteristics

  • Document Count: Linear scaling to 10M+ documents
  • Query Complexity: Logarithmic degradation with term count
  • Concurrent Users: Linear scaling with available CPU cores
  • Memory Usage: 100-500MB for typical workloads
  • Storage Overhead: 20-40% of original document size

Scalability Considerations

Horizontal Scaling

  1. Read Replicas: Multiple read-only search instances
  2. Shard Distribution: Partition documents across instances
  3. Load Balancing: Distribute queries across available nodes
  4. Cache Coherence: Coordinate cache invalidation across nodes

Vertical Scaling

  1. Memory Optimization: Increase cache sizes for better hit rates
  2. CPU Utilization: Multi-threading for parallel query processing
  3. Storage Performance: SSD storage for index data
  4. Network Optimization: Connection pooling and keep-alives

Performance Tuning

// Example configuration for high-performance deployment
SearchConfig config;
config.maxResults = 1000;
config.defaultLimit = 20;
config.timeout = std::chrono::milliseconds(30000);

// Cache configuration
SearchCacheConfig cacheConfig;
cacheConfig.maxEntries = 10000;
cacheConfig.maxMemoryMB = 1000;
cacheConfig.defaultTTL = std::chrono::minutes(10);
cacheConfig.evictionPolicy = EvictionPolicy::LRU;

// Index configuration
IndexingConfig indexConfig;
indexConfig.batchSize = 1000;
indexConfig.maxWorkerThreads = 8;
indexConfig.enableIncrementalSync = true;

Design Decisions

SQLite FTS5 Choice

Rationale: SQLite FTS5 provides excellent performance for medium-scale deployments while maintaining simplicity and reliability.

Alternatives Considered: - Elasticsearch: Too heavy for embedded use cases - Lucene: Complex integration and maintenance - Custom index: Development overhead and reliability concerns

Trade-offs: - ✅ Simple deployment and maintenance - ✅ ACID transactions and consistency - ✅ Excellent performance for <10M documents - ❌ Limited horizontal scaling options - ❌ Memory usage scales with corpus size

Query Language Design

Philosophy: Balance expressiveness with usability

Decisions: 1. Boolean Logic: AND/OR/NOT operators for complex queries 2. Field Queries: Support for field-specific searches 3. Phrase Queries: Exact phrase matching with quotes 4. Wildcards: Star and question mark wildcards 5. Range Queries: Numeric and date range queries

Caching Strategy

Approach: Multi-level caching with intelligent invalidation

Decisions: 1. LRU Eviction: Simple and effective for typical access patterns 2. TTL Support: Balance freshness with performance 3. Pattern Invalidation: Efficient bulk invalidation on updates 4. Memory Limits: Prevent OOM conditions under load 5. Statistics Tracking: Enable performance monitoring and tuning

Thread Safety

Model: Reader-writer locks with fine-grained locking

Decisions: 1. Shared Reads: Multiple concurrent readers for queries 2. Exclusive Writes: Single writer for index updates 3. Lock-Free Paths: Atomic operations for statistics 4. Deadlock Prevention: Consistent lock ordering 5. Performance Priority: Optimize for read-heavy workloads

Conclusion

The YAMS search system provides a robust, high-performance foundation for document search and retrieval. Its modular architecture enables easy extension and customization while maintaining excellent performance characteristics and reliability.

Key strengths: - High performance with sub-50ms typical query times - Comprehensive query language supporting complex searches - Intelligent caching with 80%+ hit rates in typical workloads - Thread-safe design supporting high concurrency - Seamless integration with YAMS storage systems

The architecture balances simplicity and performance, making it suitable for a wide range of deployment scenarios from embedded applications to large-scale document management systems.