
Delta: Lossless Token Sequence Compression for Large Language Models


A Technical Exposition on Dictionary-Based Compression, Algorithmic Guarantees, and the Economics of Inference Cost Reduction

Nikhil Srivastava
Feb 2, 2026 · 22 min read

Abstract

We present Delta, an open-source implementation of Lossless Token Sequence Compression (LTSC) designed to reduce the computational and economic cost of large language model inference. Delta identifies repeated multi-token subsequences within input sequences and replaces them with compact meta-token references backed by a learnable dictionary format, achieving 30-60% compression on structured inputs while guaranteeing perfect reconstruction. This technical exposition details the algorithmic foundations, multiple discovery strategies, selection algorithms, and advanced features that underpin Delta's design.

GitHub: Triage-Sec/delta

1. Introduction: The Redundancy Problem

Modern LLM deployments increasingly rely on context augmentation—the practice of prepending relevant information to input prompts to ground model responses. This encompasses retrieval-augmented generation (RAG), tool schema injection, code repository context, and multi-turn conversation history. While effective for improving output quality, these techniques produce token sequences with substantial redundancy.

Consider the structure of a typical agentic workflow:

$$\text{Request}_i = S \oplus T \oplus C_i \oplus Q_i$$

where $S$ denotes the system prompt, $T$ the tool schemas, $C_i$ the context for request $i$, and $Q_i$ the user query. Across $N$ requests in a session, $S$ and $T$ are transmitted $N$ times, and $C_i$ often shares significant overlap with $C_j$ for $i \ne j$.

The economic implications are substantial. Given a model with input pricing $p$ (USD per million tokens) and average context length $L$ tokens across $R$ requests per day, daily input costs are:

$$\text{Cost}_{\text{daily}} = \frac{p \cdot L \cdot R}{10^6}$$

For Claude Opus 4.5 at $p = 5.00$, with $L = 50{,}000$ and $R = 10{,}000$:

$$\text{Cost}_{\text{daily}} = \frac{5.00 \times 50{,}000 \times 10{,}000}{10^6} = \$2{,}500$$

This yields monthly costs exceeding $75,000—a significant fraction of which is spent retransmitting identical subsequences.
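
As a sanity check on those numbers, the cost model is trivial to script. The helper below is ours for illustration, not part of Delta:

def daily_input_cost(price_per_mtok, context_tokens, requests_per_day):
    # Cost_daily = p * L * R / 1e6, in USD
    return price_per_mtok * context_tokens * requests_per_day / 1e6

daily = daily_input_cost(5.00, 50_000, 10_000)
print(f"${daily:,.0f}/day  ~${daily * 30:,.0f}/month")  # $2,500/day  ~$75,000/month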

2. Theoretical Foundation: The Compressibility Constraint

Delta's core operation is the replacement of repeated token patterns with dictionary references. The fundamental constraint governing this replacement ensures that compression never increases sequence length.

2.1 Formal Definition

Let $\pi = (t_1, t_2, \ldots, t_L)$ be a token pattern of length $L$ that occurs at $C$ non-overlapping positions within the input sequence $x = (x_1, x_2, \ldots, x_n)$.

Definition 2.1 (Original Token Cost)

$$\text{Cost}_{\text{original}}(\pi, C) = L \cdot C$$

Definition 2.2 (Compressed Token Cost)

$$\text{Cost}_{\text{compressed}}(\pi, C) = 1 + L + C + \delta$$

where the 1 accounts for the meta-token, $L$ for the definition tokens, $C$ for the reference tokens, and $\delta$ for format overhead.

Theorem 2.1 (Compressibility Constraint)

Pattern $\pi$ is compressible if and only if:

$$L \cdot C > 1 + L + C + \delta$$

This constraint ensures compression never increases sequence length. Rearranging yields the minimum occurrence threshold:

$$C_{\min} = \left\lfloor \frac{1 + L + \delta}{L - 1} \right\rfloor + 1$$

2.2 Compressibility Table

For the default configuration with length tokens enabled ($\delta = 1$):

| Pattern Length $L$ | $C_{\min}$ | Savings per Occurrence |
|---|---|---|
| 2 | 5 | 1 token |
| 3 | 3 | 2 tokens |
| 4 | 3 | 3 tokens |
| 5 | 2 | 4 tokens |
| 8 | 2 | 7 tokens |
| 16 | 2 | 15 tokens |

Observation: Longer patterns amortize dictionary overhead more efficiently, requiring fewer occurrences to achieve compressibility. However, longer patterns are statistically less likely to repeat, creating a fundamental tension that the selection algorithm must navigate.
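
Both the constraint and the threshold are easy to evaluate mechanically. The short sketch below (helper names are ours, not Delta's API) reproduces the table above:

import math

def is_compressible(L, C, delta=1):
    # Theorem 2.1: replacing C occurrences of a length-L pattern must win.
    return L * C > 1 + L + C + delta

def min_occurrences(L, delta=1):
    # Smallest C satisfying the constraint above.
    return math.floor((1 + L + delta) / (L - 1)) + 1

for L in (2, 3, 4, 5, 8, 16):
    print(L, min_occurrences(L), L - 1)  # length, C_min, savings per occurrence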

3. Algorithmic Architecture

Delta's compression pipeline consists of five sequential stages: Discovery, Filtering, Selection, Hierarchical Compression, and Serialization.

3.1 Stage 1: Pattern Discovery

Delta implements multiple discovery strategies optimized for different input characteristics:

3.1.1 Suffix Array Discovery (Default)

The primary discovery method uses suffix arrays with LCP (Longest Common Prefix) arrays, achieving $O(n)$ to $O(n \log n)$ construction time depending on the implementation. This approach efficiently identifies all repeated subsequences of length $L \in [L_{\min}, L_{\max}]$.

def lcp_intervals(sa, lcp, min_len):
    """Enumerate LCP intervals of length >= min_len; each (lo, hi, l) means
    the suffixes sa[lo-1:hi] share a common prefix (pattern) of length l."""
    intervals = []
    stack = []  # (start_index, lcp_value) pairs, lcp values strictly increasing
    for i in range(len(lcp)):
        start = i
        # Close every interval whose LCP value exceeds the current one.
        while stack and stack[-1][1] > lcp[i]:
            start, prev_lcp = stack.pop()
            if prev_lcp >= min_len:
                intervals.append((start, i, prev_lcp))
        if not stack or stack[-1][1] < lcp[i]:
            stack.append((start, lcp[i]))
    # Flush intervals still open when the LCP array ends.
    while stack:
        start, prev_lcp = stack.pop()
        if prev_lcp >= min_len:
            intervals.append((start, len(lcp), prev_lcp))
    return intervals
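
For completeness, here is one way to produce the sa and lcp inputs consumed above. This is a deliberately naive construction for illustration; as noted, production implementations use $O(n)$ to $O(n \log n)$ algorithms, and this helper is not part of Delta's API:

def build_suffix_and_lcp(tokens):
    # Naive O(n^2 log n) construction, fine for small examples.
    n = len(tokens)
    sa = sorted(range(n), key=lambda i: tokens[i:])
    lcp = [0] * n  # lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        lcp[i] = k
    return sa, lcp

sa, lcp = build_suffix_and_lcp(["a", "b", "a", "b", "a"])
print(lcp_intervals(sa, lcp, min_len=2))  # -> [(2, 3, 3), (4, 5, 2)]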

3.1.2 BPE-Style Iterative Discovery

For hierarchical pattern discovery, Delta supports a byte-pair encoding inspired approach that iteratively merges the most frequent adjacent token pairs:

$$\text{merge}(t_i, t_{i+1}) \rightarrow t_{\text{new}} \quad \text{where} \quad \text{freq}(t_i, t_{i+1}) = \max$$
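
A bare-bones version of that loop, assuming nothing about Delta's internals (symbol names like <M0> are purely illustrative):

from collections import Counter

def bpe_discover(tokens, num_merges=10):
    # Repeatedly replace the most frequent adjacent pair with a new symbol.
    seq, merges = list(tokens), []
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats anymore
        new_sym = f"<M{step}>"
        merges.append((new_sym, (a, b)))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(new_sym)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges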

3.1.3 AST-Aware Discovery

For code inputs, Delta can parse abstract syntax trees to identify structurally repeated subtrees—function signatures, import blocks, class definitions—that may not appear as exact token subsequences but represent semantic duplicates.

3.1.4 Fuzzy Discovery

For inputs with near-duplicate sequences (e.g., templated content with minor variations), fuzzy discovery clusters patterns with small Hamming distances and patches differences, enabling compression of approximately-repeated content.
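
One way to picture the clustering step, sketched under the assumption that candidate windows have equal length (the rule here is illustrative and omits the difference-patching step; it is not Delta's exact algorithm):

def hamming(a, b):
    # Positions at which two equal-length token windows differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_near_duplicates(windows, max_dist=2):
    # Greedily assign each window to the first cluster whose representative
    # is within max_dist token substitutions; otherwise start a new cluster.
    clusters = []  # (representative, members)
    for w in windows:
        for rep, members in clusters:
            if len(w) == len(rep) and hamming(w, rep) <= max_dist:
                members.append(w)
                break
        else:
            clusters.append((w, [w]))
    return clusters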

3.2 Stage 2: Candidate Filtering

3.2.1 Compressibility Check

Each candidate pattern is evaluated against Theorem 2.1. Patterns failing the compressibility constraint are immediately discarded.

3.2.2 Subsumption Analysis

A pattern $\pi_1$ is subsumed by $\pi_2$ if $\pi_1 \subset \pi_2$ and all occurrences of $\pi_1$ appear within occurrences of $\pi_2$. Subsumed patterns are pruned unless they have independent occurrences meeting the compressibility threshold:

$$\text{keep}(\pi_1) \iff \exists \text{ an occurrence of } \pi_1 \text{ not contained in any occurrence of } \pi_2$$
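
In code, the keep test amounts to checking containment of start positions (a sketch; the helper name is ours):

def has_independent_occurrence(small_starts, large_starts, len_small, len_large):
    # keep(pi_1): some occurrence of pi_1 lies outside every occurrence of pi_2.
    def contained(s):
        return any(t <= s and s + len_small <= t + len_large for t in large_starts)
    return any(not contained(s) for s in small_starts)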

3.3 Stage 3: Pattern Selection

Given filtered candidates, we select a subset of non-overlapping occurrences that maximizes total savings. This is formulated as a weighted interval scheduling problem. Delta implements four selection strategies:

3.3.1 Greedy Selection

Occurrences are sorted by savings density—the ratio of net savings to pattern length—and selected greedily while respecting non-overlap constraints:

$$\text{density}(\pi) = \frac{L \cdot C - (1 + L + C + \delta)}{L}$$

Complexity: $O(n \log n)$ due to sorting.
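
A minimal sketch of the greedy pass, assuming each candidate occurrence arrives as a (start, length, density) triple (illustrative, not Delta's internals):

def greedy_select(candidates):
    # Highest savings density first; keep an occurrence only if it does not
    # overlap anything already selected. An interval tree would speed this up.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    chosen, taken = [], []
    for start, length, density in ranked:
        end = start + length
        if all(end <= s or start >= e for s, e in taken):
            chosen.append((start, length, density))
            taken.append((start, end))
    return chosen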

3.3.2 Optimal Dynamic Programming

For exact solutions, Delta uses weighted interval scheduling via dynamic programming. Let $w_i$ be the weight (savings) of occurrence $i$ and $p(i)$ the largest index $j$ such that occurrence $j$ does not overlap occurrence $i$:

$$\text{OPT}(i) = \max\bigl(\text{OPT}(i-1),\ w_i + \text{OPT}(p(i))\bigr)$$

Complexity: $O(n \log n)$ with binary search optimization.
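
The same recurrence in a few lines, assuming occurrences are given as (start, end, savings) triples with half-open spans (a sketch, not Delta's implementation):

import bisect

def optimal_select(occurrences):
    # Weighted interval scheduling: maximize total savings over
    # non-overlapping occurrences.
    occs = sorted(occurrences, key=lambda o: o[1])      # sort by end position
    ends = [o[1] for o in occs]
    opt = [0] * (len(occs) + 1)
    for i in range(1, len(occs) + 1):
        start, end, w = occs[i - 1]
        p = bisect.bisect_right(ends, start, 0, i - 1)  # last compatible index
        opt[i] = max(opt[i - 1], w + opt[p])
    return opt[-1]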

3.3.3 Beam Search

Maintains the top $W$ partial solutions at each step, expanding by marginal savings:

$$\text{marginal}(\pi, S) = \text{savings}(S \cup \{\pi\}) - \text{savings}(S)$$

Complexity: $O(n \cdot W)$, where $W$ is the beam width.

3.3.4 ILP Solver

For globally optimal solutions on smaller inputs, Delta supports integer linear programming formulation with binary decision variables for each occurrence. Complexity is exponential but practical for inputs under 10K tokens.

| Mode | Complexity | Use Case |
|---|---|---|
| greedy | $O(n \log n)$ | Production default, fast |
| optimal | $O(n \log n)$ | Better compression ratio |
| beam | $O(n \cdot W)$ | Tunable quality/speed |
| ilp | Exponential | Global optimum, small inputs |

3.4 Stage 4: Hierarchical Compression

Delta supports multi-pass compression where the compressed output becomes input for subsequent passes. This captures nested redundancy (patterns within patterns). Early stopping criteria prevent diminishing returns:

$$\text{stop if } \frac{\text{savings}_k}{\text{savings}_{k-1}} < \theta \ \text{ or } \ k > k_{\max}$$
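
The multi-pass loop with both stopping criteria looks roughly like this (compress_pass stands in for a single Delta pass; the function and defaults are hypothetical):

def compress_hierarchical(tokens, compress_pass, theta=0.1, k_max=3):
    # Stop when a pass saves nothing, when marginal savings fall below theta
    # relative to the previous pass, or after k_max passes.
    seq, prev_savings = list(tokens), None
    for _ in range(k_max):
        new_seq = compress_pass(seq)
        savings = len(seq) - len(new_seq)
        if savings <= 0:
            break
        if prev_savings is not None and savings / prev_savings < theta:
            break
        seq, prev_savings = new_seq, savings
    return seq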

3.5 Stage 5: Serialization

The output format places the dictionary first, followed by the compressed body:

Original:  [the, quick, fox, jumped, the, quick, fox, over, the, quick, fox]
                ^^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^^^

Compressed: [<Dict>, <MT_0>, <Len:3>, the, quick, fox, </Dict>,
             <MT_0>, jumped, <MT_0>, over, <MT_0>]

// L=3, C=3: Original cost = 3×3 = 9 tokens
// Compressed cost = 1 + 3 + 3 + 1 = 8 tokens (meta-token + definition + references + length token)
// Savings: 1 token ✓ (the <Dict>/</Dict> wrappers are fixed per-sequence overhead,
// shared by every dictionary entry and excluded from per-pattern accounting)

Round-trip verification (verify=True) ensures decompression restores the original sequence exactly.
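
As a concrete illustration of decompression and round-trip checking for the toy format shown above (the real wire format may differ; this decoder is ours):

def decompress(seq):
    # Parse the dictionary prefix, then expand every meta-token reference.
    end = seq.index("</Dict>")
    defs, i = {}, 1                         # skip the leading <Dict>
    while i < end:
        meta = seq[i]                       # e.g. "<MT_0>"
        length = int(seq[i + 1][5:-1])      # e.g. "<Len:3>" -> 3
        defs[meta] = seq[i + 2 : i + 2 + length]
        i += 2 + length
    out = []
    for tok in seq[end + 1:]:
        out.extend(defs.get(tok, [tok]))
    return out

original = ["the", "quick", "fox", "jumped", "the", "quick", "fox",
            "over", "the", "quick", "fox"]
compressed = ["<Dict>", "<MT_0>", "<Len:3>", "the", "quick", "fox", "</Dict>",
              "<MT_0>", "jumped", "<MT_0>", "over", "<MT_0>"]
assert decompress(compressed) == original   # exact reconstruction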

4. Advanced Features

4.1 Pattern Importance Scoring

Not all patterns should be compressed equally. Delta computes a composite importance score to optionally preserve high-salience content:

$$I(\pi) = \alpha \cdot I_{\text{pos}} + \beta \cdot I_{\text{freq}} + \gamma \cdot I_{\text{len}} + \epsilon \cdot I_{\text{emb}}$$

Where:

  • $I_{\text{pos}}$: Positional importance (patterns near the start/end weighted higher)
  • $I_{\text{freq}}$: Frequency importance (rare patterns may be more salient)
  • $I_{\text{len}}$: Length importance (very short patterns may be structural)
  • $I_{\text{emb}}$: Embedding-based importance (semantic diversity across occurrences)

4.2 Region-Aware Compression

Different prompt regions have different compression tolerances. Delta detects semantic regions and applies per-region limits:

| Region | Max Compression | Rationale |
|---|---|---|
| SYSTEM | 20% | Instructions are critical |
| USER | 30% | Query intent must be preserved |
| CONTEXT | 70% | RAG chunks often highly redundant |
| CODE | 60% | Structural repetition is common |

4.3 Quality Prediction

Delta includes an ML-based quality predictor that estimates whether compression might degrade model performance for a given task. The predictor outputs a recommendation (compress, caution, or avoid) based on:

  • Task type (code generation, summarization, QA, etc.)
  • Compression ratio achieved
  • Pattern characteristics (length distribution, positional spread)

4.4 Static Dictionaries

For domain-specific content, pre-built dictionaries eliminate discovery overhead:

  • python-v1
  • typescript-v1
  • markdown-v1
  • json-v1
  • sql-v1

4.5 Cross-Document Pattern Cache

Delta can cache discovered patterns across compressions to warm-start future discovery. The cache implements frequency decay to prevent stale patterns from dominating and reinjects top patterns into the candidate pool for subsequent inputs.
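
A simplified picture of such a cache with exponential decay (class and method names are illustrative, not Delta's actual API):

class PatternCache:
    def __init__(self, decay=0.9, top_k=100):
        self.decay = decay
        self.top_k = top_k
        self.scores = {}  # pattern (tuple of tokens) -> decayed frequency

    def update(self, patterns):
        # Age every entry, then reinforce the patterns seen in this document.
        for p in self.scores:
            self.scores[p] *= self.decay
        for p in patterns:
            self.scores[p] = self.scores.get(p, 0.0) + 1.0

    def warm_start_candidates(self):
        # Reinject the highest-scoring patterns into the next candidate pool.
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return [p for p, _ in ranked[: self.top_k]]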

4.6 Streaming Compression

For inputs exceeding memory limits, Delta processes chunks with overlapping windows. Memory usage is bounded by chunk size rather than total input length, enabling compression of arbitrarily large sequences.
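
The chunking idea reduces to yielding overlapping windows so that patterns straddling a boundary remain visible to discovery (window sizes below are illustrative defaults, not Delta's):

def stream_chunks(tokens, chunk_size=8192, overlap=256):
    # Memory stays bounded by chunk_size; the overlap lets patterns that
    # straddle a chunk boundary be discovered in at least one window.
    step = chunk_size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start : start + chunk_size]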

5. MCP Integration

Delta provides a Model Context Protocol (MCP) server for seamless integration with AI coding assistants like Cursor and Claude Desktop. The MCP server runs as a background service, enabling on-the-fly compression of prompts before they're sent to the model.

# Install with MCP support
pip install "delta-ltsc[mcp]"

# Run the server
delta-mcp

Configure in ~/.cursor/mcp.json:

{
  "mcpServers": {
    "delta-ltsc": {
      "command": "/path/to/venv/bin/python",
      "args": ["-m", "delta.mcp"]
    }
  }
}

See the MCP documentation for available tools and configuration options.

6. Empirical Performance

6.1 Compression Ratios

| Input Type | Typical Savings | Characteristics |
|---|---|---|
| Highly repetitive | 50-70% | Same pattern >50 times |
| Structured code | 35-50% | Repeated signatures, imports |
| RAG with overlap | 30-45% | Retrieved chunks share structure |
| Multi-turn | 25-40% | Prior turns in context |

6.2 Latency Benchmarks

Approximate latencies, single-threaded:

| Input Size (tokens) | Python | TypeScript/WASM |
|---|---|---|
| 1,000 | 5 ms | 2 ms |
| 8,000 | 40 ms | 8 ms |
| 32,000 | 150 ms | 25 ms |
| 128,000 | 600 ms | 100 ms |

7. Installation

Delta is available as both a Python package and TypeScript SDK:

# Python
pip install delta-ltsc
pip install "delta-ltsc[all]"  # With ML features, MCP server

# TypeScript/JavaScript
npm install @delta-ltsc/sdk

For detailed API documentation, configuration options, and integration guides, see the GitHub repository.

8. Case Study: Delta + Claude Code on the Fern Codebase

To validate Delta's real-world performance, we equipped a Claude Code agent with delta-ltsc MCP tools and ran comprehensive compression benchmarks against the Fern codebase. Fern is an open-source platform that transforms API definitions (OpenAPI, AsyncAPI, Protobuf) into production-ready SDKs and documentation—a Y Combinator-backed project with over 3,500 GitHub stars. Its diverse codebase spanning TypeScript, Python, Java, Go, and generated SDK code made it an ideal test subject.

8.1 Built-in Benchmark Results

| Content Type | Input Size | Avg Time (ms) | Savings | Tokens/sec |
|---|---|---|---|---|
| Repeated Pattern (1k) | 1,000 | 10.11 | 49.4% | 98,914 |
| Code-like (1k) | 1,000 | 4.01 | 27.6% | 249,376 |
| Random (1k) | 1,000 | 1.43 | 0.0% | 697,062 |
| Repeated Pattern (5k) | 5,000 | 228.24 | 37.5% | 21,906 |
| Code-like (5k) | 5,000 | 40.07 | 20.8% | 124,796 |
| Random (5k) | 5,000 | 7.33 | 0.0% | 682,437 |

8.2 Real Codebase Content Results

| Content Type | File | Tokens | Compressed | Savings | Patterns |
|---|---|---|---|---|---|
| TypeScript CLI | cli.ts | 294 | 248 | 15.6% | 4 |
| JSON Schema | generators-yml.schema.json | 171 | 153 | 10.5% | 6 |
| Markdown Docs | CLAUDE.md | 202 | 202 | 0.0% | 2 |
| Python Code | code_writer.py | 207 | 200 | 3.4% | 4 |
| Java Code | SampleAppGenerator.java | 581 | 524 | 9.8% | 15 |
| Go Code | sdk.go | 423 | 365 | 13.7% | 19 |
| Generated TS SDK | Client.ts | 633 | 599 | 5.4% | 16 |
| Serialization Code | ContainerType.ts | 527 | 384 | 27.1% | 12 |

8.3 Session Summary

Total Operations: 14
Input Tokens: 4,369
Output Tokens: 3,782
Tokens Saved: 587 (13.4%)
Avg Compression: 87.7%
Processing Speed: ~50,908 tok/s
Best Savings: 36.3%
Worst Savings: 0.0%

8.4 Key Findings

Best compression (27-37% savings):

  • Generated/serialization code with repetitive patterns
  • Code with repeated method signatures or boilerplate
  • Files with many similar struct/type definitions

Good compression (10-16% savings):

  • Import-heavy TypeScript files
  • Go code with embedded file directives
  • JSON schemas with repeated structures

Moderate compression (3-10% savings):

  • Java boilerplate code
  • Generated SDK client code
  • Python with type annotations

Minimal compression (0% savings):

  • Prose documentation (Markdown)
  • Highly varied natural language text
  • Random/unique content

9. Why We Open-Sourced Delta

The decision to release Delta as open-source software reflects our perspective on infrastructure-level tooling and the current state of AI deployment economics.

Inference cost remains one of the most underappreciated constraints on AI adoption. Teams routinely make architectural compromises to manage token budgets: truncating context windows, limiting agent autonomy, reducing RAG retrieval depth, or avoiding multi-turn interactions entirely. These compromises directly impact capability.

By releasing Delta openly, we aim to:

  • Lower barriers to AI adoption. Teams with limited budgets can deploy more capable systems.
  • Enable architectural experimentation. Reduced costs make longer context windows and deeper RAG viable.
  • Establish interoperability standards. A common format enables cross-tool compatibility.
  • Accelerate research. Compression-aware training requires accessible tooling.

Get Started with Delta

Delta is available under the MIT License. Enterprise support and compression-aware observability are available through Triage.