Delta: Lossless Token Sequence Compression for Large Language Models

A Technical Exposition on Dictionary-Based Compression, Algorithmic Guarantees, and the Economics of Inference Cost Reduction
Abstract
We present Delta, an open-source implementation of Lossless Token Sequence Compression (LTSC) designed to reduce the computational and economic cost of large language model inference. Delta identifies repeated multi-token subsequences within input sequences and replaces them with compact meta-token references backed by a learnable dictionary format, achieving 30-60% compression on structured inputs while guaranteeing perfect reconstruction. This technical exposition details the algorithmic foundations, multiple discovery strategies, selection algorithms, and advanced features that underpin Delta's design.
1. Introduction: The Redundancy Problem
Modern LLM deployments increasingly rely on context augmentation—the practice of prepending relevant information to input prompts to ground model responses. This encompasses retrieval-augmented generation (RAG), tool schema injection, code repository context, and multi-turn conversation history. While effective for improving output quality, these techniques produce token sequences with substantial redundancy.
Consider the structure of a typical agentic workflow:
$$\text{Prompt}_i = S \,\|\, T \,\|\, C_i \,\|\, Q_i$$

Where $S$ denotes the system prompt, $T$ the tool schemas, $C_i$ the context for request $i$, and $Q_i$ the user query. Across $n$ requests in a session, $S$ and $T$ are transmitted $n$ times, and $C_i$ often shares significant overlap with $C_j$ for $j < i$.
The economic implications are substantial. Given a model with input pricing $p$ (USD per million tokens) and average context length $\bar{L}$ tokens across $n$ requests per day, daily input costs are:

$$\text{Cost}_{\text{daily}} = \frac{p \cdot \bar{L} \cdot n}{10^6}$$
At Claude Opus 4.5 input pricing, a deployment with long average contexts and a high daily request volume can exceed $75,000 per month in input costs, a significant fraction of which is spent retransmitting identical subsequences.
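As a back-of-the-envelope check of the formula above, the helper below computes daily and monthly spend. The pricing, context length, and request volume plugged in are illustrative placeholders rather than figures from the text.

```python
def input_token_cost(price_per_mtok: float, avg_context_tokens: int,
                     requests_per_day: int, days: int = 30) -> tuple[float, float]:
    """Daily and monthly input-token spend: p * L_bar * n / 1e6, then * days."""
    daily = price_per_mtok * avg_context_tokens * requests_per_day / 1_000_000
    return daily, daily * days

# Placeholder figures for illustration only; substitute real pricing and traffic.
daily, monthly = input_token_cost(price_per_mtok=5.0,
                                  avg_context_tokens=50_000,
                                  requests_per_day=10_000)
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")  # $2,500/day, $75,000/month
```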
2. Theoretical Foundation: The Compressibility Constraint
Delta's core operation is the replacement of repeated token patterns with dictionary references. The fundamental constraint governing this replacement ensures that compression never increases sequence length.
2.1 Formal Definition
Let $P$ be a token pattern of length $L$ that occurs at $C$ non-overlapping positions within input sequence $X$.

Definition 2.1 (Original Token Cost)

$$\text{Cost}_{\text{orig}}(P) = C \cdot L$$

Definition 2.2 (Compressed Token Cost)

$$\text{Cost}_{\text{comp}}(P) = 1 + L + C + \delta$$

Where: 1 = meta-token, $L$ = definition tokens, $C$ = reference tokens, $\delta$ = format overhead.

Theorem 2.1 (Compressibility Constraint)

Pattern $P$ is compressible if and only if:

$$C \cdot L \;\geq\; 1 + L + C + \delta$$

This constraint ensures compression never increases sequence length. Rearranging yields the minimum occurrence threshold:

$$C \;\geq\; \left\lceil \frac{L + 1 + \delta}{L - 1} \right\rceil \qquad (L \geq 2)$$
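A minimal sketch of the constraint and the rearranged threshold; `delta` is the format overhead from Definition 2.2, and the value used in the demo line is illustrative.

```python
import math

def is_compressible(L: int, C: int, delta: int) -> bool:
    """Theorem 2.1: C occurrences of a length-L pattern pay for a dictionary
    entry when the compressed cost (1 meta-token + L definition tokens +
    C references + delta overhead) does not exceed the original cost C*L."""
    return C * L >= 1 + L + C + delta

def min_occurrences(L: int, delta: int) -> int:
    """Rearranged threshold: C >= ceil((L + 1 + delta) / (L - 1)), for L >= 2."""
    if L < 2:
        raise ValueError("single-token patterns never amortize a dictionary entry")
    return math.ceil((L + 1 + delta) / (L - 1))

# With an illustrative overhead of 2, this reproduces the minimum-occurrence
# column of the table in Section 2.2.
print([min_occurrences(L, delta=2) for L in (2, 3, 4, 5, 8, 16)])  # [5, 3, 3, 2, 2, 2]
```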
2.2 Compressibility Table
For the default configuration with length tokens enabled:
| Pattern Length (L) | Minimum Occurrences (C) | Savings per Occurrence |
|---|---|---|
| 2 | 5 | 1 token |
| 3 | 3 | 2 tokens |
| 4 | 3 | 3 tokens |
| 5 | 2 | 4 tokens |
| 8 | 2 | 7 tokens |
| 16 | 2 | 15 tokens |
Observation: Longer patterns amortize dictionary overhead more efficiently, requiring fewer occurrences to achieve compressibility. However, longer patterns are statistically less likely to repeat, creating a fundamental tension that the selection algorithm must navigate.
3. Algorithmic Architecture
Delta's compression pipeline consists of five sequential stages: Discovery, Filtering, Selection, Hierarchical Compression, and Serialization.
3.1 Stage 1: Pattern Discovery
Delta implements multiple discovery strategies optimized for different input characteristics:
3.1.1 Suffix Array Discovery (Default)
The primary discovery method uses suffix arrays with LCP (Longest Common Prefix) arrays, with construction time ranging from O(n) to O(n log n) depending on the implementation. This approach efficiently identifies all repeated subsequences of length at least `min_len`.
```python
def lcp_intervals(lcp, min_len):
    """Enumerate LCP intervals (start, end, lcp): ranges of the suffix array SA
    whose suffixes share a common prefix of at least `lcp` tokens."""
    intervals = []
    stack = []  # (interval start, lcp value); lcp values strictly increasing
    for i, h in enumerate(lcp):
        start = i
        # Close any interval whose shared-prefix length exceeds the current value.
        while stack and stack[-1][1] > h:
            start, prev_lcp = stack.pop()
            if prev_lcp >= min_len:
                intervals.append((start, i, prev_lcp))
        if not stack or stack[-1][1] < h:
            stack.append((start, h))
    # Flush intervals still open when the LCP array ends.
    while stack:
        start, prev_lcp = stack.pop()
        if prev_lcp >= min_len:
            intervals.append((start, len(lcp), prev_lcp))
    return intervals
```

3.1.2 BPE-Style Iterative Discovery
For hierarchical pattern discovery, Delta supports a byte-pair-encoding-inspired approach that iteratively merges the most frequent adjacent token pairs.
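The sketch below illustrates the idea under simple assumptions: pairs are counted over a flat token list, merging stops once the best pair falls below a frequency floor, and placeholder symbols stand in for Delta's meta-tokens.

```python
from collections import Counter

def bpe_merge_pass(tokens, min_count=2, max_merges=100):
    """Repeatedly replace the most frequent adjacent token pair with a fresh
    placeholder symbol, recording each merge for later expansion."""
    merges = []
    for merge_id in range(max_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        symbol = ("<merge>", merge_id)          # placeholder meta-symbol
        merges.append((symbol, (a, b)))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```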
3.1.3 AST-Aware Discovery
For code inputs, Delta can parse abstract syntax trees to identify structurally repeated subtrees—function signatures, import blocks, class definitions—that may not appear as exact token subsequences but represent semantic duplicates.
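As a rough illustration of the idea for Python sources, using only the standard `ast` module, the sketch below hashes definition and import nodes by their dumped structure and reports shapes that recur; Delta's actual AST handling and language coverage may differ.

```python
import ast
from collections import Counter

def repeated_ast_shapes(source: str, min_count: int = 2):
    """Count structurally identical definition/import nodes in a Python file.
    ast.dump() omits line and column info by default, so identical code in
    different locations hashes to the same shape."""
    tree = ast.parse(source)
    shapes = Counter(
        ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.ClassDef,
                             ast.Import, ast.ImportFrom))
    )
    return [shape for shape, count in shapes.items() if count >= min_count]
```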
3.1.4 Fuzzy Discovery
For inputs with near-duplicate sequences (e.g., templated content with minor variations), fuzzy discovery clusters patterns with small Hamming distances and patches differences, enabling compression of approximately-repeated content.
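A minimal clustering sketch, assuming equal-length candidate sequences and token-level Hamming distance; the representative would be stored in the dictionary while per-occurrence patches record the differing positions.

```python
def hamming(a, b):
    """Token-level Hamming distance for equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster_near_duplicates(sequences, max_distance=2):
    """Group sequences around a representative when they differ in at most
    `max_distance` token positions."""
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for rep, members in clusters:
            if len(rep) == len(seq) and hamming(rep, seq) <= max_distance:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters
```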
3.2 Stage 2: Candidate Filtering
3.2.1 Compressibility Check
Each candidate pattern is evaluated against Theorem 2.1. Patterns failing the compressibility constraint are immediately discarded.
3.2.2 Subsumption Analysis
A pattern $P_1$ is subsumed by a longer pattern $P_2$ if $P_1$ is a contiguous subsequence of $P_2$ and all occurrences of $P_1$ appear within occurrences of $P_2$. Subsumed patterns are pruned unless their independent occurrences, those falling outside every occurrence of the longer pattern, meet the compressibility threshold on their own.
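A small sketch of the occurrence bookkeeping behind this rule, with occurrences represented as start offsets; the split into covered and independent occurrences feeds directly into the Theorem 2.1 check.

```python
def split_occurrences(inner_occs, inner_len, outer_occs, outer_len):
    """Partition a shorter pattern's occurrences into those lying inside some
    occurrence of a longer pattern and those that are independent."""
    spans = [(s, s + outer_len) for s in outer_occs]
    covered, independent = [], []
    for s in inner_occs:
        if any(o <= s and s + inner_len <= e for o, e in spans):
            covered.append(s)
        else:
            independent.append(s)
    return covered, independent
    # The shorter pattern is pruned unless `independent` alone satisfies Theorem 2.1.
```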
3.3 Stage 3: Pattern Selection
Given filtered candidates, we select a subset of non-overlapping occurrences that maximizes total savings. This is formulated as a weighted interval scheduling problem. Delta implements four selection strategies:
3.3.1 Greedy Selection
Occurrences are sorted by savings density—the ratio of net savings to pattern length—and selected greedily while respecting non-overlap constraints.
Complexity: O(n log n) due to sorting.
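A compact sketch of this strategy, with each occurrence given as a `(start, end, net_savings)` triple; the overlap check here is a naive linear scan, whereas an interval structure would be needed to keep the overall cost at the stated O(n log n).

```python
def greedy_select(occurrences):
    """Pick occurrences by savings density (net savings per covered token),
    skipping any candidate that overlaps an already selected one."""
    ranked = sorted(occurrences,
                    key=lambda o: o[2] / (o[1] - o[0]),
                    reverse=True)
    selected = []
    for start, end, savings in ranked:
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, savings))
    return selected
```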
3.3.2 Optimal Dynamic Programming
For exact solutions, Delta uses weighted interval scheduling via dynamic programming. Let $w_j$ be the weight (savings) of occurrence $j$, and $p(j)$ the largest index $i < j$ such that occurrence $i$ does not overlap occurrence $j$:

$$\text{OPT}(j) = \max\bigl(\text{OPT}(j-1),\; w_j + \text{OPT}(p(j))\bigr), \qquad \text{OPT}(0) = 0$$

Complexity: O(n log n) with binary search optimization.
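The recurrence above translates directly into the standard weighted-interval-scheduling routine; the sketch below returns only the optimal total savings (recovering the chosen occurrences would add a back-pointer pass).

```python
import bisect

def optimal_savings(occurrences):
    """Maximum total savings over any non-overlapping subset of occurrences,
    each given as (start, end, savings). O(n log n) via sorting + binary search."""
    occs = sorted(occurrences, key=lambda o: o[1])      # sort by end position
    ends = [end for _, end, _ in occs]
    opt = [0] * (len(occs) + 1)                         # opt[j]: best over first j
    for j, (start, end, savings) in enumerate(occs, start=1):
        p = bisect.bisect_right(ends, start, 0, j - 1)  # p(j): compatible prefix
        opt[j] = max(opt[j - 1], savings + opt[p])
    return opt[-1]
```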
3.3.3 Beam Search
Maintains the top $k$ partial solutions at each step, expanding by marginal savings.
Complexity: O(k · n log n), where $k$ is the beam width.
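One plausible formulation of this search, assuming occurrences are scanned in order of end position and each beam state records only the rightmost covered position and accumulated savings; Delta's actual state representation may differ.

```python
def beam_select_savings(occurrences, beam_width=8):
    """Approximate selection: at each occurrence every surviving state either
    takes it (if it does not overlap what the state already covers) or skips
    it; only the `beam_width` highest-savings states are kept."""
    occs = sorted(occurrences, key=lambda o: o[1])
    beam = [(0, 0)]                              # (rightmost covered end, savings)
    for start, end, savings in occs:
        candidates = list(beam)                  # "skip" branch for every state
        candidates += [(end, total + savings)
                       for last_end, total in beam if start >= last_end]
        beam = sorted(candidates, key=lambda s: s[1], reverse=True)[:beam_width]
    return max(total for _, total in beam)
```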
3.3.4 ILP Solver
For globally optimal solutions on smaller inputs, Delta supports integer linear programming formulation with binary decision variables for each occurrence. Complexity is exponential but practical for inputs under 10K tokens.
| Mode | Complexity | Use Case |
|---|---|---|
| greedy | O(n log n) | Production default, fast |
| optimal | O(n log n) | Better compression ratio |
| beam | O(k · n log n) | Tunable quality/speed |
| ilp | Exponential | Global optimum, small inputs |
3.4 Stage 4: Hierarchical Compression
Delta supports multi-pass compression where the compressed output becomes the input for subsequent passes. This captures nested redundancy (patterns within patterns). Early-stopping criteria halt additional passes once marginal savings become negligible.
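A schematic of the multi-pass loop; `compress_pass` is a stand-in for a single Delta pass, and the pass budget and minimum-gain fraction are illustrative knobs, not the library's actual defaults.

```python
def hierarchical_compress(tokens, compress_pass, max_passes=3, min_gain=0.01):
    """Repeatedly re-compress the previous pass's output, stopping once a pass
    saves nothing or saves less than `min_gain` of the current length."""
    for _ in range(max_passes):
        compressed = compress_pass(tokens)
        saved = len(tokens) - len(compressed)
        if saved <= 0:
            break                                  # no further redundancy found
        previous_len, tokens = len(tokens), compressed
        if saved < min_gain * previous_len:
            break                                  # diminishing returns
    return tokens
```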
3.5 Stage 5: Serialization
The output format places the dictionary first, followed by the compressed body:
```
Original:   [the, quick, fox, jumped, the, quick, fox, over, the, quick, fox]
             ^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^
Compressed: [<Dict>, <MT_0>, <Len:3>, the, quick, fox, </Dict>,
             <MT_0>, jumped, <MT_0>, over, <MT_0>]

// L=3, C=3: Original cost = 3×3 = 9 tokens
// Compressed cost = 1 + 3 + 3 + 1 = 8 tokens (dictionary overhead + references)
// Savings: 1 token ✓
```

Round-trip verification (verify=True) ensures decompression restores the original sequence exactly.
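To make the round trip concrete, the sketch below decodes the illustrated layout: it reads dictionary entries up to the closing marker and then expands every meta-token reference in the body. The string spellings (`<Dict>`, `<MT_0>`, `<Len:3>`) mirror the example above, not Delta's actual wire format.

```python
def decompress(tokens):
    """Decode the dictionary-first layout shown above and expand references."""
    assert tokens[0] == "<Dict>"
    table, i = {}, 1
    while tokens[i] != "</Dict>":
        meta = tokens[i]                                       # e.g. "<MT_0>"
        length = int(tokens[i + 1].strip("<>").split(":")[1])  # "<Len:3>" -> 3
        table[meta] = tokens[i + 2:i + 2 + length]             # definition tokens
        i += 2 + length
    expanded = []
    for tok in tokens[i + 1:]:                                 # compressed body
        expanded.extend(table.get(tok, [tok]))
    return expanded

original = ["the", "quick", "fox", "jumped", "the", "quick", "fox",
            "over", "the", "quick", "fox"]
compressed = ["<Dict>", "<MT_0>", "<Len:3>", "the", "quick", "fox", "</Dict>",
              "<MT_0>", "jumped", "<MT_0>", "over", "<MT_0>"]
assert decompress(compressed) == original                      # lossless round trip
```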
4. Advanced Features
4.1 Pattern Importance Scoring
Not all patterns should be compressed equally. Delta computes a composite importance score to optionally preserve high-salience content, combining:
- Positional importance (patterns near the start/end weighted higher)
- Frequency importance (rare patterns may be more salient)
- Length importance (very short patterns may be structural)
- Embedding-based importance (semantic diversity across occurrences)
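A minimal sketch of one way to combine the four components into a single score, assuming each component is normalized to [0, 1]; the weights are placeholders, not Delta's calibrated values.

```python
def importance_score(i_pos: float, i_freq: float, i_len: float, i_emb: float,
                     weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Weighted combination of the four component scores; patterns scoring
    above a preservation threshold are excluded from compression."""
    w_pos, w_freq, w_len, w_emb = weights
    return w_pos * i_pos + w_freq * i_freq + w_len * i_len + w_emb * i_emb
```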
4.2 Region-Aware Compression
Different prompt regions have different compression tolerances. Delta detects semantic regions and applies per-region limits:
| Region | Max Compression | Rationale |
|---|---|---|
| SYSTEM | 20% | Instructions are critical |
| USER | 30% | Query intent must be preserved |
| CONTEXT | 70% | RAG chunks often highly redundant |
| CODE | 60% | Structural repetition is common |
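In configuration terms, the table above amounts to a per-region cap on removable tokens, roughly as sketched here; region detection itself is outside the scope of the sketch.

```python
# Per-region caps mirroring the table above (fraction of tokens that may be removed).
REGION_MAX_COMPRESSION = {"SYSTEM": 0.20, "USER": 0.30, "CONTEXT": 0.70, "CODE": 0.60}

def region_token_budget(region: str, region_length: int) -> int:
    """Maximum number of tokens that may be removed from a detected region;
    unknown regions default to no compression."""
    return int(REGION_MAX_COMPRESSION.get(region, 0.0) * region_length)
```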
4.3 Quality Prediction
Delta includes an ML-based quality predictor that estimates whether compression might degrade model performance for a given task. The predictor outputs a recommendation (compress, caution, or avoid) based on:
- Task type (code generation, summarization, QA, etc.)
- Compression ratio achieved
- Pattern characteristics (length distribution, positional spread)
4.4 Static Dictionaries
For domain-specific content, pre-built dictionaries eliminate discovery overhead:
- python-v1
- typescript-v1
- markdown-v1
- json-v1
- sql-v1
4.5 Cross-Document Pattern Cache
Delta can cache discovered patterns across compressions to warm-start future discovery. The cache implements frequency decay to prevent stale patterns from dominating and reinjects top patterns into the candidate pool for subsequent inputs.
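A sketch of such a cache under simple assumptions: patterns are hashable token tuples, scores decay multiplicatively on every update, and the top-scoring patterns are handed back as warm-start candidates.

```python
from collections import defaultdict

class PatternCache:
    """Cross-document pattern memory with frequency decay."""

    def __init__(self, decay: float = 0.9, top_k: int = 64):
        self.scores = defaultdict(float)   # pattern (tuple of tokens) -> score
        self.decay = decay
        self.top_k = top_k

    def update(self, selected_patterns):
        """Decay all scores, then reinforce the patterns just used."""
        for pattern in list(self.scores):
            self.scores[pattern] *= self.decay
        for pattern in selected_patterns:
            self.scores[pattern] += 1.0

    def warm_start(self):
        """Highest-scoring patterns to reinject into the next candidate pool."""
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return [pattern for pattern, _ in ranked[:self.top_k]]
```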
4.6 Streaming Compression
For inputs exceeding memory limits, Delta processes chunks with overlapping windows. Memory usage is bounded by chunk size rather than total input length, enabling compression of arbitrarily large sequences.
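Schematically, the streaming path looks like the loop below: disjoint chunks are compressed one at a time, with a small tail of the previous chunk passed along so discovery can see patterns that straddle a boundary. `compress_chunk` is a stand-in for a single-chunk Delta pass, and the sizes are illustrative.

```python
def stream_compress(tokens, compress_chunk, chunk_size=8192, overlap=256):
    """Compress an arbitrarily long token list chunk by chunk; memory use is
    bounded by chunk_size + overlap rather than the total input length."""
    pieces, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        context = tokens[max(0, start - overlap):start]  # discovery-only overlap
        pieces.append(compress_chunk(tokens[start:end], context))
        start = end
    return pieces
```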
5. MCP Integration
Delta provides a Model Context Protocol (MCP) server for seamless integration with AI coding assistants like Cursor and Claude Desktop. The MCP server runs as a background service, enabling on-the-fly compression of prompts before they're sent to the model.
```bash
# Install with MCP support
pip install "delta-ltsc[mcp]"

# Run the server
delta-mcp
```
Configure in ~/.cursor/mcp.json:
```json
{
  "mcpServers": {
    "delta-ltsc": {
      "command": "/path/to/venv/bin/python",
      "args": ["-m", "delta.mcp"]
    }
  }
}
```

See the MCP documentation for available tools and configuration options.
6. Empirical Performance
6.1 Compression Ratios
| Input Type | Typical Savings | Characteristics |
|---|---|---|
| Highly repetitive | 50-70% | Same pattern >50 times |
| Structured code | 35-50% | Repeated signatures, imports |
| RAG with overlap | 30-45% | Retrieved chunks share structure |
| Multi-turn | 25-40% | Prior turns in context |
6.2 Latency Benchmarks
Approximate latencies, single-threaded:
| Input Size (tokens) | Python | TypeScript/WASM |
|---|---|---|
| 1,000 | 5ms | 2ms |
| 8,000 | 40ms | 8ms |
| 32,000 | 150ms | 25ms |
| 128,000 | 600ms | 100ms |
7. Installation
Delta is available as both a Python package and TypeScript SDK:
```bash
# Python
pip install delta-ltsc
pip install "delta-ltsc[all]"  # With ML features, MCP server

# TypeScript/JavaScript
npm install @delta-ltsc/sdk
```
For detailed API documentation, configuration options, and integration guides, see the GitHub repository.
8. Case Study: Delta + Claude Code on the Fern Codebase
To validate Delta's real-world performance, we equipped a Claude Code agent with delta-ltsc MCP tools and ran comprehensive compression benchmarks against the Fern codebase. Fern is an open-source platform that transforms API definitions (OpenAPI, AsyncAPI, Protobuf) into production-ready SDKs and documentation—a Y Combinator-backed project with over 3,500 GitHub stars. Its diverse codebase spanning TypeScript, Python, Java, Go, and generated SDK code made it an ideal test subject.
8.1 Built-in Benchmark Results
| Content Type | Input Size | Avg Time (ms) | Savings | Tokens/sec |
|---|---|---|---|---|
| Repeated Pattern (1k) | 1,000 | 10.11 | 49.4% | 98,914 |
| Code-like (1k) | 1,000 | 4.01 | 27.6% | 249,376 |
| Random (1k) | 1,000 | 1.43 | 0.0% | 697,062 |
| Repeated Pattern (5k) | 5,000 | 228.24 | 37.5% | 21,906 |
| Code-like (5k) | 5,000 | 40.07 | 20.8% | 124,796 |
| Random (5k) | 5,000 | 7.33 | 0.0% | 682,437 |
8.2 Real Codebase Content Results
| Content Type | File | Tokens | Compressed | Savings | Patterns |
|---|---|---|---|---|---|
| TypeScript CLI | cli.ts | 294 | 248 | 15.6% | 4 |
| JSON Schema | generators-yml.schema.json | 171 | 153 | 10.5% | 6 |
| Markdown Docs | CLAUDE.md | 202 | 202 | 0.0% | 2 |
| Python Code | code_writer.py | 207 | 200 | 3.4% | 4 |
| Java Code | SampleAppGenerator.java | 581 | 524 | 9.8% | 15 |
| Go Code | sdk.go | 423 | 365 | 13.7% | 19 |
| Generated TS SDK | Client.ts | 633 | 599 | 5.4% | 16 |
| Serialization Code | ContainerType.ts | 527 | 384 | 27.1% | 12 |
8.3 Session Summary
8.4 Key Findings
Best compression (27-37% savings):
- Generated/serialization code with repetitive patterns
- Code with repeated method signatures or boilerplate
- Files with many similar struct/type definitions
Good compression (10-16% savings):
- Import-heavy TypeScript files
- Go code with embedded file directives
- JSON schemas with repeated structures
Moderate compression (3-10% savings):
- Java boilerplate code
- Generated SDK client code
- Python with type annotations
Minimal compression (0% savings):
- Prose documentation (Markdown)
- Highly varied natural language text
- Random/unique content
9. Why We Open-Sourced Delta
The decision to release Delta as open-source software reflects our perspective on infrastructure-level tooling and the current state of AI deployment economics.
Inference cost remains one of the most underappreciated constraints on AI adoption. Teams routinely make architectural compromises to manage token budgets: truncating context windows, limiting agent autonomy, reducing RAG retrieval depth, or avoiding multi-turn interactions entirely. These compromises directly impact capability.
By releasing Delta openly, we aim to:
- Lower barriers to AI adoption. Teams with limited budgets can deploy more capable systems.
- Enable architectural experimentation. Reduced costs make longer context windows and deeper RAG viable.
- Establish interoperability standards. A common format enables cross-tool compatibility.
- Accelerate research. Compression-aware training requires accessible tooling.