
qndx: Building a Benchmark-Driven Regex Search Indexer in Rust


Regex search over a large codebase is a linear scan. ripgrep, the current gold standard, is extraordinarily fast at that scan – but it is still a scan. Every file gets read, every byte gets matched. On a 75,000-file Linux kernel tree, that takes measurable time no matter how good your SIMD vectorization is.

Trigram indexing is the established alternative. Google Code Search pioneered it, Zoekt refined it, Livegrep productionized it. The idea is simple: precompute an inverted index mapping 3-byte substrings to the files that contain them, then use the index to eliminate files that cannot possibly match before running the real regex. For selective queries, this turns O(corpus) into O(candidates).

qndx is my take on this, built as a local CLI tool in Rust. It builds a memory-mapped n-gram index over source files and uses it to narrow the search space before running the actual regex. For selective queries on large codebases, this is significantly faster than scanning every file – while guaranteeing no false negatives.

What makes this project unusual is not the core idea (trigram indexing is well-understood) but the approach to architecture: every design decision is backed by Criterion benchmarks, with explicit decision gates, CI-enforced performance budgets, and regression tracking built in from the start.

The three-stage pipeline #

qndx operates in three stages: index, search, and freshness.

Index #

Walk the repository (respecting .gitignore), extract overlapping trigrams and sparse n-grams from every file, and store them in a sorted lookup table with posting lists that map each n-gram hash to the files containing it.
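The extraction step can be sketched with a minimal trigram-only builder (sparse n-grams, hashing details, and on-disk layout omitted). The hash function here is an illustrative stand-in, not qndx's actual one:

```rust
use std::collections::BTreeMap;

type FileId = u32;

/// Simplified stand-in for qndx's n-gram hash: pack the 3 bytes into a u32.
fn trigram_hash(gram: &[u8]) -> u32 {
    (gram[0] as u32) << 16 | (gram[1] as u32) << 8 | gram[2] as u32
}

/// Build an in-memory inverted index: trigram hash -> sorted list of file ids.
fn build_trigram_index(files: &[&[u8]]) -> BTreeMap<u32, Vec<FileId>> {
    let mut index: BTreeMap<u32, Vec<FileId>> = BTreeMap::new();
    for (id, content) in files.iter().enumerate() {
        for gram in content.windows(3) {
            let postings = index.entry(trigram_hash(gram)).or_default();
            // Files are processed in id order, so checking the last entry
            // is enough to avoid duplicate ids in a posting list.
            if postings.last() != Some(&(id as FileId)) {
                postings.push(id as FileId);
            }
        }
    }
    index
}
```

The sorted `BTreeMap` mirrors what eventually becomes the sorted hash table on disk.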

The index is three files under .qndx/index/v1/:

| File | Magic | Contents |
|------|-------|----------|
| ngrams.tbl | QXNG | Sorted n-gram hash table (20 bytes per entry: hash, offset, length, flags) |
| postings.dat | QXPO | Concatenated posting blocks (tagged: varint-delta for small lists, Roaring for large) |
| manifest.bin | QXMF | Metadata and file paths (postcard-serialized) |

Each file has a 20-byte header: 4-byte magic, u32 version, u64 payload length, u32 CRC32 checksum.
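A sketch of parsing that 20-byte header from a mapped byte slice. The struct and field names are mine, little-endian byte order is an assumption, and the real reader also validates the CRC against the payload:

```rust
/// Parsed form of the 20-byte file header: magic, version, payload length, CRC.
#[derive(Debug, PartialEq)]
struct Header {
    magic: [u8; 4],
    version: u32,
    payload_len: u64,
    crc32: u32,
}

/// Parse a header from the start of a mapped file. Returns None if the
/// buffer is too short. Fields are assumed little-endian.
fn parse_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < 20 {
        return None;
    }
    Some(Header {
        magic: buf[0..4].try_into().ok()?,
        version: u32::from_le_bytes(buf[4..8].try_into().ok()?),
        payload_len: u64::from_le_bytes(buf[8..16].try_into().ok()?),
        crc32: u32::from_le_bytes(buf[16..20].try_into().ok()?),
    })
}
```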

At query time, qndx decomposes the regex into required literal fragments, plans trigram vs sparse strategy, resolves candidate files from posting lists, then verifies those candidates with the full regex.

Build pipeline:

flowchart TD
    A[files] --> B[extract grams]
    B --> C[ngrams.tbl]
    B --> D[postings.dat]
    B --> E[manifest.bin]

Search pipeline:

flowchart TD
    A[pattern] --> B[decompose]
    B --> C[plan]
    C --> D[candidates]
    D --> E[verify]
    E --> F[matches]

Sparse extraction now has two distinct modes:

  • extract_sparse_ngrams_all at index time (superset for coverage)
  • extract_sparse_ngrams_covering at query time (smaller lookup set)

The key correctness property is covering(L) ⊆ all(F) for any literal L found in indexed content F: every gram the query planner looks up was also extracted at index time, so sparse planning cannot introduce false negatives.

Every match returned by the index path is verified against the actual file content. The index only eliminates files that provably cannot match – it never introduces false negatives.

Freshness #

Track Git working tree state. Modified, added, and untracked files are re-indexed into a lightweight overlay that merges with the baseline index at query time, giving read-your-writes semantics without a full rebuild.
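The merge semantics can be sketched as simple set algebra: dirty files are masked out of the baseline (their on-disk content may have changed since indexing) and replaced by the overlay's own candidates, built from the working tree. The function and types here are hypothetical, not qndx's actual API:

```rust
use std::collections::BTreeSet;

type FileId = u32;

/// Merge baseline candidates with a freshness overlay (a sketch of the
/// semantics only). Baseline hits on dirty files are dropped, because the
/// baseline index no longer reflects their content; the overlay's
/// candidates, indexed from the working tree, are added in.
fn merge_candidates(
    baseline: &BTreeSet<FileId>,
    dirty: &BTreeSet<FileId>,
    overlay: &BTreeSet<FileId>,
) -> BTreeSet<FileId> {
    baseline
        .difference(dirty)
        .copied()
        .chain(overlay.iter().copied())
        .collect()
}
```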

Inside the implementation #

The memory-mapped reader #

The index reader is the hot path. It uses memmap2 to memory-map ngrams.tbl and postings.dat, so opening a 2 GB index (Linux kernel) has near-zero startup cost. Only pages touched during a query consume physical RAM – a typical query touches about 100 KB.

The binary search over the n-gram table is designed to minimize cache pressure. It reads only the 4-byte hash field at each step, not the full 20-byte entry:

/// Read just the hash field from an ngram entry at the given index.
/// Avoids deserializing the full entry during binary search.
#[inline]
fn ngram_hash_at(&self, idx: usize) -> NgramHash {
    let payload = self.ngram_payload();
    let start = idx * NGRAM_ENTRY_SIZE;
    u32::from_le_bytes(payload[start..start + 4].try_into().unwrap())
}

/// Binary search the ngram table for a hash.
fn binary_search_hash(&self, hash: NgramHash) -> Option<usize> {
    let mut lo = 0usize;
    let mut hi = self.ngram_count;
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let mid_hash = self.ngram_hash_at(mid);
        match mid_hash.cmp(&hash) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => return Some(mid),
        }
    }
    None
}

Once a hash is found, the full entry is deserialized to get the offset and length into postings.dat, and the posting list is decoded directly from the mmap’d region with no copying.
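Resolving the posting bytes after a successful search is just two field reads and a sub-slice, with no copy. A sketch, assuming the entry layout implied by the table above (u32 hash, u64 offset, u32 length, u32 flags, little-endian):

```rust
const NGRAM_ENTRY_SIZE: usize = 20;

/// Borrow the posting bytes for the entry at `idx` without copying:
/// read offset and length out of the table slice, then sub-slice the
/// postings region. Layout assumed: hash u32, offset u64, length u32,
/// flags u32, all little-endian.
fn posting_bytes<'a>(table: &[u8], postings: &'a [u8], idx: usize) -> &'a [u8] {
    let e = idx * NGRAM_ENTRY_SIZE;
    let offset = u64::from_le_bytes(table[e + 4..e + 12].try_into().unwrap()) as usize;
    let len = u32::from_le_bytes(table[e + 12..e + 16].try_into().unwrap()) as usize;
    &postings[offset..offset + len]
}
```

The returned slice borrows directly from the mmap'd region, so decoding can proceed in place.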

Hybrid posting lists #

Posting lists are the main data structure. A naive approach would use Vec<u32> everywhere, but that wastes space for large lists. Roaring bitmaps are more compact for dense sets, but have overhead for small ones. qndx uses both:

/// A posting list that can be either a simple sorted vec or a Roaring bitmap.
#[derive(Debug, Clone)]
pub enum PostingList {
    /// Small posting list stored as a sorted Vec<u32>.
    Vec(Vec<FileId>),
    /// Large posting list stored as a Roaring bitmap.
    Roaring(RoaringBitmap),
}

The threshold is 64 entries: below that, varint-delta-encoded Vec; above, Roaring bitmap. The on-disk format uses a 1-byte tag prefix for auto-detection:

  • 0x01: fixed-width delta-encoded u32s
  • 0x02: varint delta-encoded (LEB128-style) – the default for small lists
  • 0x03: Roaring bitmap native serialization
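The tag-0x02 encoding can be sketched as LEB128 varints over the gaps between consecutive ids. This is my sketch of the scheme, with the tag byte and any count prefix omitted:

```rust
/// Encode a sorted id list as delta gaps, each gap as a LEB128-style varint:
/// 7 payload bits per byte, high bit set on all but the last byte of a gap.
fn encode_varint_delta(ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &id in ids {
        let mut gap = id - prev;
        prev = id;
        loop {
            let byte = (gap & 0x7f) as u8;
            gap >>= 7;
            if gap == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit
        }
    }
    out
}

/// Decode back into the sorted id list by accumulating gaps.
fn decode_varint_delta(bytes: &[u8]) -> Vec<u32> {
    let mut ids = Vec::new();
    let mut prev = 0u32;
    let mut gap = 0u32;
    let mut shift = 0;
    for &b in bytes {
        gap |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            prev += gap;
            ids.push(prev);
            gap = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    ids
}
```

Because posting lists are sorted, gaps are small, and most ids fit in one or two bytes instead of four.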

Intersection handles the mixed case efficiently by filtering the smaller Vec against the larger Roaring:

pub fn intersect(&self, other: &PostingList) -> PostingList {
    match (self, other) {
        (PostingList::Vec(a), PostingList::Vec(b)) =>
            PostingList::Vec(vec_intersect(a, b)),
        (PostingList::Roaring(a), PostingList::Roaring(b)) =>
            PostingList::Roaring(a & b),
        (PostingList::Vec(v), PostingList::Roaring(r))
        | (PostingList::Roaring(r), PostingList::Vec(v)) => {
            let result: Vec<FileId> = v.iter()
                .copied()
                .filter(|id| r.contains(*id))
                .collect();
            PostingList::from_vec(result)
        }
    }
}

The query planner #

Not every regex benefits equally from indexing. A query like enum AgentMode produces 12 highly selective trigrams and narrows 722 files down to 8 candidates. A query like pub fn is common enough that the candidate set is still 240/722 files – still a win, but a smaller one.

qndx stores two kinds of n-grams: classic overlapping trigrams and longer sparse n-grams. The planner evaluates both strategies per query and picks the cheaper one:

pub trait SelectivityEstimator {
    fn estimate(&self, hash: NgramHash, gram_len: usize) -> f64;
}

/// Hash-based selectivity estimator (default).
/// Assumes longer n-grams are more selective.
pub struct HashSelectivity;

impl SelectivityEstimator for HashSelectivity {
    fn estimate(&self, _hash: NgramHash, gram_len: usize) -> f64 {
        // Cost decreases with gram length: a trigram (len=3) has cost 1.0,
        // a 6-gram has cost 0.5, etc.
        3.0 / gram_len.max(1) as f64
    }
}
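Plugging that heuristic into a plan cost is just a sum over the required grams. The helpers below are my illustration, not qndx's planner: "enum AgentMode" is 14 bytes, so it yields 12 overlapping trigrams at cost 1.0 each, for a total of 12.0:

```rust
/// Sum the per-gram heuristic cost (3.0 / gram_len) over a set of gram
/// lengths -- a sketch of how a plan's estimated cost could be aggregated.
fn plan_cost(gram_lens: &[usize]) -> f64 {
    gram_lens.iter().map(|&len| 3.0 / len.max(1) as f64).sum()
}

/// Number of overlapping trigrams in a literal (one per 3-byte window).
fn trigram_count(literal: &str) -> usize {
    literal.len().saturating_sub(2)
}
```

Under this model a sparse plan with fewer, longer grams can beat a trigram plan even with a comparable lookup count, because each long gram contributes less cost.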

The planner produces a QueryPlan with strategy, hash lookups, and estimated cost. Internally it now scores sparse first and only materializes sparse hashes if sparse wins, which avoids allocation on trigram-dominant queries.

use std::error::Error;
use std::path::Path;

use qndx_index::IndexReader;
use qndx_query::{index_search_with_strategy_and_timing, StrategyOverride};

fn main() -> Result<(), Box<dyn Error>> {
    let root = Path::new(".");
    let reader = IndexReader::open(Path::new(".qndx/index/v1"))?;

    let result = index_search_with_strategy_and_timing(
        &reader,
        root,
        "enum AgentMode",
        StrategyOverride::Auto,
        true, // collect per-stage timings
    )?;

    println!("strategy: {}", result.stats.strategy);
    println!("candidates: {}", result.stats.candidate_count);
    println!(
        "timing(ms): plan={:.3}, candidates={:.3}, verify={:.3}",
        result.stats.plan_time_ms,
        result.stats.candidate_time_ms,
        result.stats.verify_time_ms,
    );

    Ok(())
}

The qndx plan command exposes strategy diagnostics without running a search:

Pattern: enum AgentMode

Literals: ["enum AgentMode"]

Trigram plan:
  lookups: 12
  cost:    12.00

Sparse plan:
  lookups: 13
  cost:    11.53

Selected:  sparse
Lookups:   13
Cost:      11.53

Regex decomposition #

The decomposition step extracts literal segments from a regex pattern. It handles top-level alternation (a|b) by producing OR branches, with AND semantics within each branch.

A minimal copy-paste example that mirrors the public API:

use qndx_query::plan_diagnostics;

let diag = plan_diagnostics("foo|bar");
println!("literals: {:?}", diag.literals);
println!("trigram lookups: {}", diag.trigram_lookups);
println!("sparse lookups: {:?}", diag.sparse_lookups);
println!("selected strategy: {}", diag.selected.strategy);

Both trigram and sparse candidates are derived from the same extracted literals: trigrams use overlapping 3-byte windows, while sparse uses query-time covering grams that are guaranteed to be represented by the index-time extract_sparse_ngrams_all extraction. The planner picks the cheaper strategy based on estimated selectivity.

Performance #

Measured on a 722-file / 8.6 MB Rust codebase:

| Query | Strategy | Candidates | Scan | Indexed | Speedup |
|-------|----------|------------|------|---------|---------|
| enum AgentMode | trigram | 8 / 722 | 27 ms | 0.008 ms | 3375x |
| TODO | trigram | 45 / 722 | 27 ms | 1.4 ms | 19x |
| self\.\w+ | trigram | 214 / 722 | 28 ms | 6.9 ms | 4x |
| pub fn | trigram | 240 / 722 | 27 ms | 6.7 ms | 4x |
| impl.*for | trigram | 427 / 722 | 28 ms | 9.9 ms | 3x |

The pattern is clear: selectivity drives speedup. A rare identifier like enum AgentMode narrows 722 files to 8 candidates and achieves a 3375x speedup. A broad pattern like impl.*for still filters out 40% of files, yielding a 3x improvement.

Even in the worst case the index path is never slower than a scan, because the verification stage is the same regex engine over file content – the index only decides which files to verify.

With --stats, qndx now also prints stage timings for indexed search:

3 matches in 8 files (174185 bytes, 8 candidates / 722 total, 12 lookups, strategy: trigram) in 0.008s [indexed]
  timing: open=3.412ms, plan=0.071ms, candidates=0.204ms, verify=4.033ms

Timing collection is opt-in: without --stats, the query path skips timing instrumentation overhead.
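The opt-in pattern can be sketched with Option<Instant>: when timing is disabled, nothing is allocated and no clock is read. This is a hypothetical shape, not qndx's actual stats struct:

```rust
use std::time::Instant;

/// Per-stage timings, collected only when --stats is requested.
#[derive(Default)]
struct StageTimings {
    plan_ms: f64,
}

/// Sketch of an opt-in timed stage: the clock is only read when
/// `collect_timing` is true, so the default path pays no instrumentation cost.
fn search(collect_timing: bool) -> Option<StageTimings> {
    let mut timings = collect_timing.then(StageTimings::default);
    let t0 = collect_timing.then(Instant::now);
    // ... planning work would happen here ...
    if let (Some(t), Some(start)) = (timings.as_mut(), t0) {
        t.plan_ms = start.elapsed().as_secs_f64() * 1e3;
    }
    // ... candidate resolution and verification would happen here ...
    timings
}
```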

Memory characteristics #

The index reader uses memory-mapped I/O via memmap2. This means:

  • Opening a 2 GB index is near-instant (no allocation, no read)
  • Only pages touched during the search are paged into physical RAM
  • A typical query on the Linux kernel index (~2 GB on disk) requires only ~100 KB of resident memory

The manifest (file paths and metadata) is the only part fully deserialized into heap memory, and it is typically a few KB even for large corpora.

Benchmark-driven architecture decisions #

qndx defines three explicit decision gates, each resolved by running Criterion benchmarks and comparing against measurable thresholds.

Gate A: Manifest serializer #

Candidates: postcard vs wincode vs serde_json

The serializer_choice benchmark compares all three at four manifest sizes (10, 100, 1K, 10K files). The serializer is used for manifest.bin – the file that stores metadata and paths.

Outcome: postcard. wincode did not show the required 15% decode throughput advantage on manifest-heavy workloads. Postcard is simpler, has a stable wire format, and is well-maintained. The chart uses a log y-axis because decode latency spans roughly 4 orders of magnitude across manifest sizes.

Gate B: Postings representation #

Candidates: Vec<u32> (fixed-width delta), Vec<u32> (varint delta), Roaring bitmap, hybrid (auto-threshold)

The postings_choice benchmark measures intersection and union latency at three cardinalities (20, 500, 5000 entries), plus encode/decode throughput and encoded sizes.

Outcome: hybrid with threshold at 64. Varint-delta Vec for small lists, Roaring for large lists. The hybrid approach achieves the best intersection latency across all cardinalities because it avoids Roaring’s fixed overhead for small lists and Vec’s scaling issues for large ones. The chart is normalized to vec so relative behavior stays readable across cardinalities.

The encoded-size comparison at high cardinality (5000 IDs):

| Format | Encoded size |
|--------|--------------|
| Vec fixed-width | ~20,000 bytes |
| Vec varint | ~5,000 bytes |
| Roaring | ~2,500 bytes |
| Hybrid (auto) | ~2,500 bytes (selects Roaring) |

Gate C: N-gram strategy #

Candidates: pure trigram decomposition vs sparse n-gram covering

The planner evaluates both strategies per query. Sparse grams are longer (often more selective), but lookups and selectivity both affect cost. The current planner computes a sparse score first and only materializes sparse hashes if sparse is selected, reducing allocation overhead on trigram-dominant queries.

Outcome: planner-selected per query. Neither strategy dominates across all query types. The planner uses a selectivity model (default: hash/length heuristic) to pick the lower-cost option automatically. For long literals sparse often wins on cost; for short or fragmented patterns trigram commonly wins. The chart is normalized to lit_simple to emphasize relative planner cost while keeping raw microseconds in tooltips.

The query_planner benchmark prints a comparison table:

  pattern                   strategy  lookups tri_grams  spr_grams       cost
  ---------------------------------------------------------------------------
  literal_simple            Trigram       11       11         12     11.000
  literal_underscore        Trigram       12       12         13     12.000
  literal_camel             Trigram       10       10         11     10.000
  regex_alternation         Trigram       22        0          0     22.000
  regex_class               Sparse        7        6          7      5.975
  regex_wildcard            Trigram       10       10         12     10.000
  regex_digit               Trigram        5        5          6      5.000
  regex_word_boundary       Trigram        0        0          0     -0.000
  regex_complex             Trigram       14       14         17     14.000
  regex_broad               Trigram        0        0          0     -0.000

Standard benchmark corpora #

qndx benchmarks against a tiered set of well-known open-source repositories – the same ones used by Zoekt, ripgrep, and other code-search tools, making results directly comparable:

| Tier | Corpus | Files | Size | Language | Use case |
|------|--------|-------|------|----------|----------|
| small | rust (rust-lang/rust) | ~35K | ~500 MB | Rust | Fast local iteration |
| medium | linux (torvalds/linux) | ~75K | ~1.2 GB | C | Industry-standard grep benchmark |
| medium | kubernetes (kubernetes/kubernetes) | ~15K | ~200 MB | Go | Multi-language coverage |
| large | chromium (chromium/src) | ~400K | ~20 GB | C++ | Stress testing |

Each corpus has its own pattern file tuned to the language and codebase. For example, the Linux kernel patterns include printk, EXPORT_SYMBOL, #ifdef CONFIG_\w+, and static int __init \w+. The Kubernetes patterns include type \w+ interface, fmt\.Errorf\(.*%w, and ^func Test\w+.

The real_corpus benchmark runs the full pipeline – index build, indexed search (auto/trigram/sparse strategies), and scan-only baseline – then prints a summary table comparing scan vs indexed performance with speedup factors for every pattern.

Recent additions also benchmark:

  • real_{name}/index_open: index open cost (mmap + header/checksum validation)
  • real_{name}/indexed/timing_overhead: timing disabled vs enabled (--stats-style instrumentation)

Linux data is now included as a first-class corpus. For cross-corpus charts below, I use the shared 10-pattern core set (literal_* + regex_*) so Rust vs Linux is apples-to-apples.

I intentionally keep corpus-specific patterns (for example kernel-only symbols) out of the cross-corpus comparisons, because they are still useful for per-corpus profiling but not directly comparable across languages and repositories.

# Download standard corpora (excludes large tier by default)
./benchmarks/fetch_corpora.sh

# Run benchmarks against all downloaded corpora
./benchmarks/run_standard_benches.sh

# Run against a specific corpus with its patterns
QNDX_BENCH_CORPUS=benchmarks/corpora/linux \
QNDX_BENCH_PATTERNS=benchmarks/patterns/linux.txt \
cargo bench --bench real_corpus

Index-open latency is in the same order of magnitude for both corpora, with Linux slightly lower in this run. This is startup overhead: it matters most for one-shot CLI invocations and less for repeated searches in the same process.

The summary chart shows the distribution shape: Linux has stronger central speedup on this shared set, while both corpora have heavy tails where selective patterns produce very large wins.

Pattern-level speedup varies by selectivity, not just language. Highly selective patterns create outsized gains, while broad patterns (like regex_broad) converge toward scan-like performance.

Sparse dominates the shared core set in both corpora, while auto and trigram still win in a few cases. That is exactly what the planner is supposed to do: choose strategy per-pattern, not globally.

Current status and what is next #

qndx is at v0.1.0 (MVP). The core pipeline works: index build, memory-mapped reading, trigram/sparse search with planner-selected strategy, and scan-only fallback. 202 tests pass across all crates, including differential tests that verify index results match scan results for every query pattern.

What remains:

  • Freshness hardening: Keep improving dirty-file detection and overlay merge behavior, especially around edge cases and very large repos.
  • External-sort builder: The inverted index (BTreeMap<NgramHash, Vec<FileId>>) currently requires full in-memory residence. For truly huge repos (Chromium-scale), an external-sort-based builder would cap peak memory.
  • Frequency-based selectivity: The FrequencySelectivity estimator is implemented but not yet wired into the default path. It uses actual document-frequency counts from the index, which should improve planner decisions for skewed corpora.
  • Incremental index updates: Currently the index is rebuilt from scratch. Incremental updates (add/remove files without full rebuild) would improve the developer experience.

The code is MIT-licensed. The benchmark infrastructure – budgets, CI workflow, regression triage, corpus management – was arguably more work than the indexer itself. But it means every future change is measurable, and regressions are caught before they land.