Delta: Lossless Token Sequence Compression for Large Language Models

A Technical Exposition on Dictionary-Based Compression, Algorithmic Guarantees, and the Economics of Inference Cost Reduction
Abstract
We present Delta, an open-source implementation of Lossless Token Sequence Compression (LTSC) designed to reduce the computational and economic cost of large language model inference. Delta identifies repeated multi-token subsequences within input sequences and replaces them with compact meta-token references backed by a learnable dictionary format, achieving 30-60% compression on structured inputs while guaranteeing perfect reconstruction. This technical exposition details the algorithmic foundations, multiple discovery strategies, selection algorithms, and advanced features that underpin Delta's design.
1. Introduction: The Redundancy Problem
Modern LLM deployments increasingly rely on context augmentation—the practice of prepending relevant information to input prompts to ground model responses. This encompasses retrieval-augmented generation (RAG), tool schema injection, code repository context, and multi-turn conversation history. While effective for improving output quality, these techniques produce token sequences with substantial redundancy.
Consider the structure of a typical agentic workflow:
$$\text{Prompt}_i = S \,\|\, T \,\|\, C_i \,\|\, Q_i$$

Where $S$ denotes the system prompt, $T$ the tool schemas, $C_i$ the context for request $i$, and $Q_i$ the user query. Across $n$ requests in a session, $S$ and $T$ are transmitted $n$ times, and $C_i$ often shares significant overlap with $C_j$ for $j < i$.
The economic implications are substantial. Given a model with input pricing $p$ (USD per million tokens) and average context length $\bar{L}$ tokens across $n$ requests per day, daily input costs are:

$$\text{Cost}_{\text{daily}} = \frac{p \cdot \bar{L} \cdot n}{10^6}$$
At Claude Opus 4.5 input pricing, a deployment with long average contexts and a high daily request volume can exceed $75,000 per month in input costs, a significant fraction of which is spent retransmitting identical subsequences.
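As a back-of-the-envelope check of the formula above, the helper below computes daily and monthly spend. The pricing, context length, and request volume plugged in are illustrative placeholders rather than figures from the text.

```python
def input_token_cost(price_per_mtok: float, avg_context_tokens: int,
                     requests_per_day: int, days: int = 30) -> tuple[float, float]:
    """Daily and monthly input-token spend: p * L_bar * n / 1e6, then * days."""
    daily = price_per_mtok * avg_context_tokens * requests_per_day / 1_000_000
    return daily, daily * days

# Placeholder figures for illustration only; substitute real pricing and traffic.
daily, monthly = input_token_cost(price_per_mtok=5.0,
                                  avg_context_tokens=50_000,
                                  requests_per_day=10_000)
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")  # $2,500/day, $75,000/month
```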
2. Theoretical Foundation: The Compressibility Constraint
Delta's core operation is the replacement of repeated token patterns with dictionary references. The fundamental constraint governing this replacement ensures that compression never increases sequence length.
2.1 Formal Definition
Let $P$ be a token pattern of length $L$ that occurs at $C$ non-overlapping positions within input sequence $X$.

Definition 2.1 (Original Token Cost)

$$\text{Cost}_{\text{orig}}(P) = C \cdot L$$

Definition 2.2 (Compressed Token Cost)

$$\text{Cost}_{\text{comp}}(P) = 1 + L + C + \delta$$

Where: 1 = meta-token, $L$ = definition tokens, $C$ = reference tokens, $\delta$ = format overhead.

Theorem 2.1 (Compressibility Constraint)

Pattern $P$ is compressible if and only if:

$$C \cdot L \;\geq\; 1 + L + C + \delta$$

This constraint ensures compression never increases sequence length. Rearranging yields the minimum occurrence threshold:

$$C \;\geq\; \left\lceil \frac{L + 1 + \delta}{L - 1} \right\rceil \qquad (L \geq 2)$$
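A minimal sketch of the constraint and the rearranged threshold; `delta` is the format overhead from Definition 2.2, and the value used in the demo line is illustrative.

```python
import math

def is_compressible(L: int, C: int, delta: int) -> bool:
    """Theorem 2.1: C occurrences of a length-L pattern pay for a dictionary
    entry when the compressed cost (1 meta-token + L definition tokens +
    C references + delta overhead) does not exceed the original cost C*L."""
    return C * L >= 1 + L + C + delta

def min_occurrences(L: int, delta: int) -> int:
    """Rearranged threshold: C >= ceil((L + 1 + delta) / (L - 1)), for L >= 2."""
    if L < 2:
        raise ValueError("single-token patterns never amortize a dictionary entry")
    return math.ceil((L + 1 + delta) / (L - 1))

# With an illustrative overhead of 2, this reproduces the minimum-occurrence
# column of the table in Section 2.2.
print([min_occurrences(L, delta=2) for L in (2, 3, 4, 5, 8, 16)])  # [5, 3, 3, 2, 2, 2]
```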
2.2 Compressibility Table
For the default configuration with length tokens enabled:
| Pattern Length (L) | Minimum Occurrences (C) | Savings per Occurrence |
|---|---|---|
| 2 | 5 | 1 token |
| 3 | 3 | 2 tokens |
| 4 | 3 | 3 tokens |
| 5 | 2 | 4 tokens |
| 8 | 2 | 7 tokens |
| 16 | 2 | 15 tokens |
Observation: Longer patterns amortize dictionary overhead more efficiently, requiring fewer occurrences to achieve compressibility. However, longer patterns are statistically less likely to repeat, creating a fundamental tension that the selection algorithm must navigate.
3. Algorithmic Architecture
Delta's compression pipeline consists of five sequential stages: Discovery, Filtering, Selection, Hierarchical Compression, and Serialization.
3.1 Stage 1: Pattern Discovery
Delta implements multiple discovery strategies optimized for different input characteristics:
3.1.1 Suffix Array Discovery (Default)
The primary discovery method uses suffix arrays with LCP (Longest Common Prefix) arrays, with construction time ranging from O(n) to O(n log n) depending on the implementation. This approach efficiently identifies all repeated subsequences of length at least `min_len`.
```python
def lcp_intervals(lcp, min_len):
    """Enumerate LCP intervals (start, end, lcp): ranges of the suffix array SA
    whose suffixes share a common prefix of at least `lcp` tokens."""
    intervals = []
    stack = []  # (interval start, lcp value); lcp values strictly increasing
    for i, h in enumerate(lcp):
        start = i
        # Close any interval whose shared-prefix length exceeds the current value.
        while stack and stack[-1][1] > h:
            start, prev_lcp = stack.pop()
            if prev_lcp >= min_len:
                intervals.append((start, i, prev_lcp))
        if not stack or stack[-1][1] < h:
            stack.append((start, h))
    # Flush intervals still open when the LCP array ends.
    while stack:
        start, prev_lcp = stack.pop()
        if prev_lcp >= min_len:
            intervals.append((start, len(lcp), prev_lcp))
    return intervals
```

3.1.2 BPE-Style Iterative Discovery
For hierarchical pattern discovery, Delta supports a byte-pair-encoding-inspired approach that iteratively merges the most frequent adjacent token pairs.
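The sketch below illustrates the idea under simple assumptions: pairs are counted over a flat token list, merging stops once the best pair falls below a frequency floor, and placeholder symbols stand in for Delta's meta-tokens.

```python
from collections import Counter

def bpe_merge_pass(tokens, min_count=2, max_merges=100):
    """Repeatedly replace the most frequent adjacent token pair with a fresh
    placeholder symbol, recording each merge for later expansion."""
    merges = []
    for merge_id in range(max_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        symbol = ("<merge>", merge_id)          # placeholder meta-symbol
        merges.append((symbol, (a, b)))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```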
3.1.3 AST-Aware Discovery
For code inputs, Delta can parse abstract syntax trees to identify structurally repeated subtrees—function signatures, import blocks, class definitions—that may not appear as exact token subsequences but represent semantic duplicates.
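As a rough illustration of the idea for Python sources, using only the standard `ast` module, the sketch below hashes definition and import nodes by their dumped structure and reports shapes that recur; Delta's actual AST handling and language coverage may differ.

```python
import ast
from collections import Counter

def repeated_ast_shapes(source: str, min_count: int = 2):
    """Count structurally identical definition/import nodes in a Python file.
    ast.dump() omits line and column info by default, so identical code in
    different locations hashes to the same shape."""
    tree = ast.parse(source)
    shapes = Counter(
        ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.ClassDef,
                             ast.Import, ast.ImportFrom))
    )
    return [shape for shape, count in shapes.items() if count >= min_count]
```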
3.1.4 Fuzzy Discovery
For inputs with near-duplicate sequences (e.g., templated content with minor variations), fuzzy discovery clusters patterns with small Hamming distances and patches differences, enabling compression of approximately-repeated content.
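A minimal clustering sketch, assuming equal-length candidate sequences and token-level Hamming distance; the representative would be stored in the dictionary while per-occurrence patches record the differing positions.

```python
def hamming(a, b):
    """Token-level Hamming distance for equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster_near_duplicates(sequences, max_distance=2):
    """Group sequences around a representative when they differ in at most
    `max_distance` token positions."""
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for rep, members in clusters:
            if len(rep) == len(seq) and hamming(rep, seq) <= max_distance:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters
```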
3.2 Stage 2: Candidate Filtering
3.2.1 Compressibility Check
Each candidate pattern is evaluated against Theorem 2.1. Patterns failing the compressibility constraint are immediately discarded.
3.2.2 Subsumption Analysis
A pattern $P_1$ is subsumed by a longer pattern $P_2$ if $P_1$ is a contiguous subsequence of $P_2$ and all occurrences of $P_1$ appear within occurrences of $P_2$. Subsumed patterns are pruned unless their independent occurrences, those falling outside every occurrence of the longer pattern, meet the compressibility threshold on their own.
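A small sketch of the occurrence bookkeeping behind this rule, with occurrences represented as start offsets; the split into covered and independent occurrences feeds directly into the Theorem 2.1 check.

```python
def split_occurrences(inner_occs, inner_len, outer_occs, outer_len):
    """Partition a shorter pattern's occurrences into those lying inside some
    occurrence of a longer pattern and those that are independent."""
    spans = [(s, s + outer_len) for s in outer_occs]
    covered, independent = [], []
    for s in inner_occs:
        if any(o <= s and s + inner_len <= e for o, e in spans):
            covered.append(s)
        else:
            independent.append(s)
    return covered, independent
    # The shorter pattern is pruned unless `independent` alone satisfies Theorem 2.1.
```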
3.3 Stage 3: Pattern Selection
Given filtered candidates, we select a subset of non-overlapping occurrences that maximizes total savings. This is formulated as a weighted interval scheduling problem. Delta implements four selection strategies:
3.3.1 Greedy Selection
Occurrences are sorted by savings density—the ratio of net savings to pattern length—and selected greedily while respecting non-overlap constraints.
Complexity: O(n log n) due to sorting.
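A compact sketch of this strategy, with each occurrence given as a `(start, end, net_savings)` triple; the overlap check here is a naive linear scan, whereas an interval structure would be needed to keep the overall cost at the stated O(n log n).

```python
def greedy_select(occurrences):
    """Pick occurrences by savings density (net savings per covered token),
    skipping any candidate that overlaps an already selected one."""
    ranked = sorted(occurrences,
                    key=lambda o: o[2] / (o[1] - o[0]),
                    reverse=True)
    selected = []
    for start, end, savings in ranked:
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, savings))
    return selected
```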
3.3.2 Optimal Dynamic Programming
For exact solutions, Delta uses weighted interval scheduling via dynamic programming. Let $w_j$ be the weight (savings) of occurrence $j$, and $p(j)$ the largest index $i < j$ such that occurrence $i$ does not overlap occurrence $j$:

$$\text{OPT}(j) = \max\bigl(\text{OPT}(j-1),\; w_j + \text{OPT}(p(j))\bigr), \qquad \text{OPT}(0) = 0$$

Complexity: O(n log n) with binary search optimization.
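The recurrence above translates directly into the standard weighted-interval-scheduling routine; the sketch below returns only the optimal total savings (recovering the chosen occurrences would add a back-pointer pass).

```python
import bisect

def optimal_savings(occurrences):
    """Maximum total savings over any non-overlapping subset of occurrences,
    each given as (start, end, savings). O(n log n) via sorting + binary search."""
    occs = sorted(occurrences, key=lambda o: o[1])      # sort by end position
    ends = [end for _, end, _ in occs]
    opt = [0] * (len(occs) + 1)                         # opt[j]: best over first j
    for j, (start, end, savings) in enumerate(occs, start=1):
        p = bisect.bisect_right(ends, start, 0, j - 1)  # p(j): compatible prefix
        opt[j] = max(opt[j - 1], savings + opt[p])
    return opt[-1]
```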
3.3.3 Beam Search
Maintains the top $k$ partial solutions at each step, expanding by marginal savings.
Complexity: O(k · n log n), where $k$ is the beam width.
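One plausible formulation of this search, assuming occurrences are scanned in order of end position and each beam state records only the rightmost covered position and accumulated savings; Delta's actual state representation may differ.

```python
def beam_select_savings(occurrences, beam_width=8):
    """Approximate selection: at each occurrence every surviving state either
    takes it (if it does not overlap what the state already covers) or skips
    it; only the `beam_width` highest-savings states are kept."""
    occs = sorted(occurrences, key=lambda o: o[1])
    beam = [(0, 0)]                              # (rightmost covered end, savings)
    for start, end, savings in occs:
        candidates = list(beam)                  # "skip" branch for every state
        candidates += [(end, total + savings)
                       for last_end, total in beam if start >= last_end]
        beam = sorted(candidates, key=lambda s: s[1], reverse=True)[:beam_width]
    return max(total for _, total in beam)
```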
3.3.4 ILP Solver
For globally optimal solutions on smaller inputs, Delta supports integer linear programming formulation with binary decision variables for each occurrence. Complexity is exponential but practical for inputs under 10K tokens.
| Mode | Complexity | Use Case |
|---|---|---|
| greedy | O(n log n) | Production default, fast |
| optimal | O(n log n) | Better compression ratio |
| beam | O(k · n log n) | Tunable quality/speed |
| ilp | Exponential | Global optimum, small inputs |
3.4 Stage 4: Hierarchical Compression
Delta supports multi-pass compression where the compressed output becomes the input for subsequent passes. This captures nested redundancy (patterns within patterns). Early-stopping criteria halt additional passes once marginal savings become negligible.
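A schematic of the multi-pass loop; `compress_pass` is a stand-in for a single Delta pass, and the pass budget and minimum-gain fraction are illustrative knobs, not the library's actual defaults.

```python
def hierarchical_compress(tokens, compress_pass, max_passes=3, min_gain=0.01):
    """Repeatedly re-compress the previous pass's output, stopping once a pass
    saves nothing or saves less than `min_gain` of the current length."""
    for _ in range(max_passes):
        compressed = compress_pass(tokens)
        saved = len(tokens) - len(compressed)
        if saved <= 0:
            break                                  # no further redundancy found
        previous_len, tokens = len(tokens), compressed
        if saved < min_gain * previous_len:
            break                                  # diminishing returns
    return tokens
```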
3.5 Stage 5: Serialization
The output format places the dictionary first, followed by the compressed body:
```
Original:   [the, quick, fox, jumped, the, quick, fox, over, the, quick, fox]
             ^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^
Compressed: [<Dict>, <MT_0>, <Len:3>, the, quick, fox, </Dict>,
             <MT_0>, jumped, <MT_0>, over, <MT_0>]

// L=3, C=3: Original cost = 3×3 = 9 tokens
// Compressed cost = 1 + 3 + 3 + 1 = 8 tokens (dictionary overhead + references)
// Savings: 1 token ✓
```

Round-trip verification (verify=True) ensures decompression restores the original sequence exactly.
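To make the round trip concrete, the sketch below decodes the illustrated layout: it reads dictionary entries up to the closing marker and then expands every meta-token reference in the body. The string spellings (`<Dict>`, `<MT_0>`, `<Len:3>`) mirror the example above, not Delta's actual wire format.

```python
def decompress(tokens):
    """Decode the dictionary-first layout shown above and expand references."""
    assert tokens[0] == "<Dict>"
    table, i = {}, 1
    while tokens[i] != "</Dict>":
        meta = tokens[i]                                       # e.g. "<MT_0>"
        length = int(tokens[i + 1].strip("<>").split(":")[1])  # "<Len:3>" -> 3
        table[meta] = tokens[i + 2:i + 2 + length]             # definition tokens
        i += 2 + length
    expanded = []
    for tok in tokens[i + 1:]:                                 # compressed body
        expanded.extend(table.get(tok, [tok]))
    return expanded

original = ["the", "quick", "fox", "jumped", "the", "quick", "fox",
            "over", "the", "quick", "fox"]
compressed = ["<Dict>", "<MT_0>", "<Len:3>", "the", "quick", "fox", "</Dict>",
              "<MT_0>", "jumped", "<MT_0>", "over", "<MT_0>"]
assert decompress(compressed) == original                      # lossless round trip
```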
4. Advanced Features
4.1 Pattern Importance Scoring
Not all patterns should be compressed equally. Delta computes a composite importance score to optionally preserve high-salience content, combining:
- Positional importance (patterns near the start/end weighted higher)
- Frequency importance (rare patterns may be more salient)
- Length importance (very short patterns may be structural)
- Embedding-based importance (semantic diversity across occurrences)
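A minimal sketch of one way to combine the four components into a single score, assuming each component is normalized to [0, 1]; the weights are placeholders, not Delta's calibrated values.

```python
def importance_score(i_pos: float, i_freq: float, i_len: float, i_emb: float,
                     weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Weighted combination of the four component scores; patterns scoring
    above a preservation threshold are excluded from compression."""
    w_pos, w_freq, w_len, w_emb = weights
    return w_pos * i_pos + w_freq * i_freq + w_len * i_len + w_emb * i_emb
```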
4.2 Region-Aware Compression
Different prompt regions have different compression tolerances. Delta detects semantic regions and applies per-region limits:
| Region | Max Compression | Rationale |
|---|---|---|
| SYSTEM | 20% | Instructions are critical |
| USER | 30% | Query intent must be preserved |
| CONTEXT | 70% | RAG chunks often highly redundant |
| CODE | 60% | Structural repetition is common |
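In configuration terms, the table above amounts to a per-region cap on removable tokens, roughly as sketched here; region detection itself is outside the scope of the sketch.

```python
# Per-region caps mirroring the table above (fraction of tokens that may be removed).
REGION_MAX_COMPRESSION = {"SYSTEM": 0.20, "USER": 0.30, "CONTEXT": 0.70, "CODE": 0.60}

def region_token_budget(region: str, region_length: int) -> int:
    """Maximum number of tokens that may be removed from a detected region;
    unknown regions default to no compression."""
    return int(REGION_MAX_COMPRESSION.get(region, 0.0) * region_length)
```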
4.3 Quality Prediction
Delta includes an ML-based quality predictor that estimates whether compression might degrade model performance for a given task. The predictor outputs a recommendation (compress, caution, or avoid) based on:
- Task type (code generation, summarization, QA, etc.)
- Compression ratio achieved
- Pattern characteristics (length distribution, positional spread)
4.4 Static Dictionaries
For domain-specific content, pre-built dictionaries eliminate discovery overhead:
- python-v1
- typescript-v1
- markdown-v1
- json-v1
- sql-v1
4.5 Cross-Document Pattern Cache
Delta can cache discovered patterns across compressions to warm-start future discovery. The cache implements frequency decay to prevent stale patterns from dominating and reinjects top patterns into the candidate pool for subsequent inputs.
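A sketch of such a cache under simple assumptions: patterns are hashable token tuples, scores decay multiplicatively on every update, and the top-scoring patterns are handed back as warm-start candidates.

```python
from collections import defaultdict

class PatternCache:
    """Cross-document pattern memory with frequency decay."""

    def __init__(self, decay: float = 0.9, top_k: int = 64):
        self.scores = defaultdict(float)   # pattern (tuple of tokens) -> score
        self.decay = decay
        self.top_k = top_k

    def update(self, selected_patterns):
        """Decay all scores, then reinforce the patterns just used."""
        for pattern in list(self.scores):
            self.scores[pattern] *= self.decay
        for pattern in selected_patterns:
            self.scores[pattern] += 1.0

    def warm_start(self):
        """Highest-scoring patterns to reinject into the next candidate pool."""
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return [pattern for pattern, _ in ranked[:self.top_k]]
```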
4.6 Streaming Compression
For inputs exceeding memory limits, Delta processes chunks with overlapping windows. Memory usage is bounded by chunk size rather than total input length, enabling compression of arbitrarily large sequences.
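Schematically, the streaming path looks like the loop below: disjoint chunks are compressed one at a time, with a small tail of the previous chunk passed along so discovery can see patterns that straddle a boundary. `compress_chunk` is a stand-in for a single-chunk Delta pass, and the sizes are illustrative.

```python
def stream_compress(tokens, compress_chunk, chunk_size=8192, overlap=256):
    """Compress an arbitrarily long token list chunk by chunk; memory use is
    bounded by chunk_size + overlap rather than the total input length."""
    pieces, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        context = tokens[max(0, start - overlap):start]  # discovery-only overlap
        pieces.append(compress_chunk(tokens[start:end], context))
        start = end
    return pieces
```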
5. MCP Integration
Delta provides a Model Context Protocol (MCP) server for seamless integration with AI coding assistants like Cursor and Claude Desktop. The MCP server runs as a background service, enabling on-the-fly compression of prompts before they're sent to the model.
```bash
# Install with MCP support
pip install "delta-ltsc[mcp]"

# Run the server
delta-mcp
```
Configure in ~/.cursor/mcp.json:
```json
{
  "mcpServers": {
    "delta-ltsc": {
      "command": "/path/to/venv/bin/python",
      "args": ["-m", "delta.mcp"]
    }
  }
}
```

See the MCP documentation for available tools and configuration options.
6. Empirical Performance
6.1 Compression Ratios
| Input Type | Typical Savings | Characteristics |
|---|---|---|
| Highly repetitive | 50-70% | Same pattern >50 times |
| Structured code | 35-50% | Repeated signatures, imports |
| RAG with overlap | 30-45% | Retrieved chunks share structure |
| Multi-turn | 25-40% | Prior turns in context |
6.2 Latency Benchmarks
Approximate latencies, single-threaded:
| Input Size (tokens) | Python | TypeScript/WASM |
|---|---|---|
| 1,000 | 5ms | 2ms |
| 8,000 | 40ms | 8ms |
| 32,000 | 150ms | 25ms |
| 128,000 | 600ms | 100ms |
7. Installation
Delta is available as both a Python package and TypeScript SDK:
```bash
# Python
pip install delta-ltsc
pip install "delta-ltsc[all]"  # With ML features, MCP server

# TypeScript/JavaScript
npm install @delta-ltsc/sdk
```
For detailed API documentation, configuration options, and integration guides, see the GitHub repository.
8. Case Study: Delta + Claude Code on the Fern Codebase
To validate Delta's real-world performance, we equipped a Claude Code agent with delta-ltsc MCP tools and ran comprehensive compression benchmarks against the Fern codebase. Fern is an open-source platform that transforms API definitions (OpenAPI, AsyncAPI, Protobuf) into production-ready SDKs and documentation—a Y Combinator-backed project with over 3,500 GitHub stars. Its diverse codebase spanning TypeScript, Python, Java, Go, and generated SDK code made it an ideal test subject.
8.1 Built-in Benchmark Results
| Content Type | Input Size | Avg Time (ms) | Savings | Tokens/sec |
|---|---|---|---|---|
| Repeated Pattern (1k) | 1,000 | 10.11 | 49.4% | 98,914 |
| Code-like (1k) | 1,000 | 4.01 | 27.6% | 249,376 |
| Random (1k) | 1,000 | 1.43 | 0.0% | 697,062 |
| Repeated Pattern (5k) | 5,000 | 228.24 | 37.5% | 21,906 |
| Code-like (5k) | 5,000 | 40.07 | 20.8% | 124,796 |
| Random (5k) | 5,000 | 7.33 | 0.0% | 682,437 |
8.2 Real Codebase Content Results
| Content Type | File | Tokens | Compressed | Savings | Patterns |
|---|---|---|---|---|---|
| TypeScript CLI | cli.ts | 294 | 248 | 15.6% | 4 |
| JSON Schema | generators-yml.schema.json | 171 | 153 | 10.5% | 6 |
| Markdown Docs | CLAUDE.md | 202 | 202 | 0.0% | 2 |
| Python Code | code_writer.py | 207 | 200 | 3.4% | 4 |
| Java Code | SampleAppGenerator.java | 581 | 524 | 9.8% | 15 |
| Go Code | sdk.go | 423 | 365 | 13.7% | 19 |
| Generated TS SDK | Client.ts | 633 | 599 | 5.4% | 16 |
| Serialization Code | ContainerType.ts | 527 | 384 | 27.1% | 12 |
8.3 Session Summary
8.4 Key Findings
Best compression (27-37% savings):
- Generated/serialization code with repetitive patterns
- Code with repeated method signatures or boilerplate
- Files with many similar struct/type definitions
Good compression (10-16% savings):
- Import-heavy TypeScript files
- Go code with embedded file directives
- JSON schemas with repeated structures
Moderate compression (3-10% savings):
- Java boilerplate code
- Generated SDK client code
- Python with type annotations
Minimal compression (0% savings):
- Prose documentation (Markdown)
- Highly varied natural language text
- Random/unique content
9. Why We Open-Sourced Delta
The decision to release Delta as open-source software reflects our perspective on infrastructure-level tooling and the current state of AI deployment economics.
Inference cost remains one of the most underappreciated constraints on AI adoption. Teams routinely make architectural compromises to manage token budgets: truncating context windows, limiting agent autonomy, reducing RAG retrieval depth, or avoiding multi-turn interactions entirely. These compromises directly impact capability.
By releasing Delta openly, we aim to:
- Lower barriers to AI adoption. Teams with limited budgets can deploy more capable systems.
- Enable architectural experimentation. Reduced costs make longer context windows and deeper RAG viable.
- Establish interoperability standards. A common format enables cross-tool compatibility.
- Accelerate research. Compression-aware training requires accessible tooling.