
$100 and 5,000 Queries: Replicating a Production AI Classifier from Its Own API


We built a standalone surrogate of a widely deployed AI text detection service. 99.6% class agreement. $75 in API fees.

Nikhil Srivastava
Mar 15, 2026 · 22 min read

We built a standalone surrogate of a widely deployed AI text detection service serving millions of users across education, publishing, and enterprise. The surrogate achieves 99.6% class agreement and 0.003 probability MAE against the production system, trained entirely from standard API access.

It runs offline. No API keys, no GPU, no network calls. Sub-second inference per document on CPU. Total cost: ~5,000 queries, ~$75 in API fees, ~$15 in compute.

The target has been notified and preferred that we preserve their anonymity. We have engaged them directly on remediation. What follows is the methodology and analysis.


The attack surface

Three properties of the target API make surrogate construction trivial. Each is endemic to production ML services. Together, they reduce model extraction to a standard supervised learning problem.

Excessive output granularity

Each API call returns document-level generation probabilities, per-class distributions, burstiness scores, per-sentence generation probability, perplexity, and highlight flags. For a typical 8-sentence document: ~500 bits per query.

The classification task requires communicating one decision. That is ~2 bits. The API leaks 250x more information per call than the task demands.

This is not incidental metadata. Per-sentence probabilities expose the model's feature-level evaluation of text. Burstiness and perplexity scores surface intermediate internal representations. Every returned field is free training signal for an adversary.

The information-theoretic framing: the minimum description length for a binary classification is 1 bit. Adding confidence pushes it to ~10 bits. Returning per-token metadata pushes it to hundreds. The delta between task-necessary information and actually-returned information defines the extraction surface.
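The same arithmetic, spelled out. The bit counts below are the estimates used in this section, not measured values:

```python
TASK_BITS = 2        # ~log2(3), rounded up, for a 3-way ai / human / mixed decision
SCORE_BITS = 10      # one confidence score at coarse precision
RESPONSE_BITS = 500  # estimated content of the full structured response

print(f"task-necessary:  {TASK_BITS} bits/query")
print(f"label + score:   {TASK_BITS + SCORE_BITS} bits/query")
print(f"full response:   {RESPONSE_BITS} bits/query")
print(f"leak factor:     {RESPONSE_BITS // TASK_BITS}x")  # 250x
```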

Full determinism

We queried identical text 3 times across 50 documents (150 queries). Standard deviation across repeated responses: ~2e-17. Zero class label flips.

import numpy as np

# query_api wraps a single call to the target's detection endpoint
stds = []
for text in sample_texts:  # 50 documents, 3 identical queries each
    scores = [query_api(text)["average_generated_prob"] for _ in range(3)]
    stds.append(np.std(scores))

print(f"Mean std: {np.mean(stds):.2e}")  # ~2e-17
print(f"Max std:  {np.max(stds):.2e}")   # ~2e-17
print("Class flips: 0 / 50")

Every query is pure signal. No noise floor, no repeated queries to average out stochasticity. The adversary extracts maximum information from every API call, and the total query budget for high-fidelity extraction drops by at least an order of magnitude versus a noisy oracle.

No behavioral monitoring

No detection of systematic probing. No rate limiting sufficient to prevent full extraction in a single session. No anomaly flagging on perturbation sequences, boundary sweeps, or single-variable experiments. Each query looks individually legitimate. The extraction signature is in the sequence, and no one monitors sequences.

Compounding effect

These properties interact multiplicatively. High granularity provides a rich, high-dimensional training signal per query. Determinism converts every dollar of API spend directly to extraction signal with zero waste. Absent monitoring lets the full pipeline execute uninterrupted from a single API key in one session.

Remove any one, and extraction cost increases substantially. Remove all three, and the problem shifts from straightforward supervised learning to a genuine adversarial challenge.


Extraction methodology

Three stages: corpus construction, surrogate training, validation. The design principle throughout is maximizing coverage of the target's decision space while minimizing query budget.

Corpus construction

Effective extraction requires training data spanning the target's full classification surface, including the decision boundary, not merely the class interiors. We assembled a base corpus across six strata:

Stratum                 Coverage
ai_formal               Formal AI-generated text: reports, essays
ai_instructional        Instructional AI output: tutorials, how-tos
boundary_near           Texts proximal to the decision boundary
human_conversational    Informal human writing
human_pre_llm           Human text predating large language models
mixed_blended           Human-AI collaborative text

For each base text, we generated perturbation variants: word-level substitutions, sentence reordering, controlled interpolation between known-class endpoints.

import random
from nltk.tokenize import sent_tokenize

def build_perturbation_variants(base_text, n_variants=5):
    variants = []
    words = base_text.split()

    # Single-word substitution probes (get_synonym: any thesaurus lookup)
    for i in random.sample(range(len(words)), min(n_variants, len(words))):
        swapped = words.copy()
        swapped[i] = get_synonym(words[i])
        variants.append(" ".join(swapped))

    # Sentence reordering probes
    sentences = sent_tokenize(base_text)
    if len(sentences) > 2:
        for _ in range(n_variants):
            perm = random.sample(sentences, len(sentences))
            variants.append(" ".join(perm))

    return variants
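The snippet above covers the first two variant types. The third, controlled interpolation between known-class endpoints, can be sketched as a sentence-level blend. This is our illustration, not the exact pipeline code; `simple_sents` is a stand-in for `sent_tokenize`:

```python
import random

def simple_sents(text):
    # Minimal sentence split for illustration; the pipeline uses sent_tokenize.
    return [s.strip() for s in text.split(".") if s.strip()]

def build_interpolation_variants(ai_text, human_text, steps=5):
    """Blend two known-class endpoints sentence by sentence.

    At mix ratio r, roughly a fraction r of the sentences come from the
    AI endpoint and 1 - r from the human endpoint, walking the text
    across the decision boundary in controlled increments.
    """
    ai_sents = simple_sents(ai_text)
    human_sents = simple_sents(human_text)
    n = min(len(ai_sents), len(human_sents))

    variants = []
    for step in range(1, steps + 1):
        r = step / (steps + 1)                 # mix ratio in (0, 1)
        k = round(r * n)                       # sentences drawn from the AI side
        ai_positions = set(random.sample(range(n), k))
        blended = [
            ai_sents[i] if i in ai_positions else human_sents[i]
            for i in range(n)
        ]
        variants.append((" ".join(blended), r))
    return variants
```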

Perturbation variants serve dual purpose: they multiply the effective training set without proportional API cost, and the perturbation structure itself probes feature sensitivity. Each variant was labeled via the target's API. Total: ~5,000 queries producing 34,686 training samples after augmentation. ~90 minutes.

The stratified design is load-bearing. Naive corpus construction (sampling only clearly-AI and clearly-human texts) produces a surrogate that performs well in class interiors but collapses at the boundary. The boundary_near stratum and interpolation variants are what push agreement from ~95% to 99.6%.
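The labeling loop itself is small. A sketch under stated assumptions: `api_query` returns a field named `average_generated_prob` (as in the determinism probe above) plus a class label we call `predicted_class` here, a name we chose for illustration; the real pipeline batches queries and caches responses:

```python
import numpy as np

def build_training_set(base_texts, api_query, featurize, variant_fn):
    """Label every base text and its perturbation variants via the target API,
    returning the feature matrix and the two surrogate training targets."""
    X, y_prob, y_class = [], [], []
    for base in base_texts:
        for text in [base] + variant_fn(base):
            resp = api_query(text)           # one deterministic oracle call
            X.append(featurize(text))
            y_prob.append(resp["average_generated_prob"])
            y_class.append(resp["predicted_class"])
    return np.array(X), np.array(y_prob), np.array(y_class)
```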

Feature engineering

40 text-derived features per document. No embeddings, no neural representations, no external language model. Pure distributional text statistics.

import numpy as np
from nltk.tokenize import sent_tokenize

# FIRST_PERSON, CONTRACTION_RE, PUNCT, count_repeated_bigrams:
# module-level constants and helpers, omitted here
def extract_features(text):
    words = text.split()
    sentences = sent_tokenize(text)
    sent_lens = [len(s.split()) for s in sentences]

    features = {
        "unique_word_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
        "avg_word_len": np.mean([len(w) for w in words]) if words else 0.0,
        "sentence_len_cv": np.std(sent_lens) / max(np.mean(sent_lens), 1e-9) if sent_lens else 0.0,
        "bigram_repeat_ratio": count_repeated_bigrams(words) / max(len(words) - 1, 1),
        "first_person_count": sum(1 for w in words if w.lower() in FIRST_PERSON),
        "contraction_count": sum(1 for w in words if CONTRACTION_RE.match(w)),
        "punctuation_density": sum(1 for c in text if c in PUNCT) / max(len(text), 1),
        "paragraph_count": text.count("\n\n") + 1,
        # ... 32 additional features: lexical diversity indices,
        # syntactic complexity measures, discourse markers,
        # readability scores, distributional statistics
    }
    return np.array(list(features.values()))

The decision to use handcrafted features over learned representations was not made a priori. It was the conclusion of extensive iteration. Details in the experiment history below.

Surrogate training

Two GradientBoosting models. Both scikit-learn. Both CPU-only.

Probability regressor. GradientBoostingRegressor, 3000 estimators, depth 10, learning rate 0.01. Predicts the target's continuous generation probability.

Class predictor. GradientBoostingClassifier, 1000 estimators, depth 6. Predicts the target's discrete class label (ai / human / mixed).

The class predictor is necessary because the target's label assignments do not follow fixed probability thresholds. The probability-to-label mapping involves additional logic beyond simple cutoffs.

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

reg = GradientBoostingRegressor(
    n_estimators=3000, max_depth=10,
    learning_rate=0.01, min_samples_leaf=2,
    random_state=42
)
cls = GradientBoostingClassifier(
    n_estimators=1000, max_depth=6,
    learning_rate=0.01, random_state=42
)

reg.fit(X_train, y_prob)
cls.fit(X_train, y_class)

Training: ~10 minutes on CPU. Train MAE: 0.004 on 34,686 samples. The entire surrogate is a pair of tree ensembles over statistical text features. No GPU. No fine-tuning. No external model calls at inference.
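The offline inference path is correspondingly small. A minimal sketch (the function and argument names are ours): feed the 40-feature vector from `extract_features` to both fitted ensembles:

```python
import numpy as np

def classify_offline(text, reg, cls, featurize):
    """Score one document with the surrogate: no API key, no network
    calls, CPU-only. reg/cls are the fitted regressor and classifier;
    featurize is the 40-feature text extractor."""
    x = np.asarray(featurize(text), dtype=float).reshape(1, -1)
    return {
        "generated_prob": float(reg.predict(x)[0]),
        "predicted_class": cls.predict(x)[0],
    }
```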


Results

Synthetic probe benchmark

Surrogate evaluated against the target's live API on 840 perturbation-variant texts across all six strata. The surrogate received no API responses at evaluation time; it operated purely offline.

Metric                 Result    Gate
Class agreement        99.6%     >= 99%
Probability MAE        0.003     <= 0.02
Transition agreement   98.6%     >= 97%

Transition agreement measures whether the surrogate correctly predicts class changes when text is perturbed across the decision boundary. This is the most demanding metric: it requires precise calibration at the boundary, not merely correct classification in class interiors.
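One plausible formalization of the metric, over (before, after) label pairs for each base text and its perturbed variant. The pairing scheme is our reconstruction, not the target's spec: a pair counts as agreement when the surrogate flips class exactly when the target does, to the same label:

```python
def transition_agreement(target_pairs, surrogate_pairs):
    """Fraction of perturbation pairs where the surrogate reproduces
    the target's class-change behavior.

    Each element is (label_before, label_after) for the same
    base-text / perturbed-variant pair.
    """
    hits = 0
    for (t_pre, t_post), (s_pre, s_post) in zip(target_pairs, surrogate_pairs):
        target_flip = t_pre != t_post
        surrogate_flip = s_pre != s_post
        if target_flip == surrogate_flip and s_post == t_post:
            hits += 1
    return hits / len(target_pairs)
```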

Per-stratum breakdown:

Stratum                 Agreement   MAE     Transition
ai_formal               97.9%       0.007   97.5%
ai_instructional        100%        0.002   100%
boundary_near           100%        0.006   95.8%
human_conversational    100%        0.002   98.3%
human_pre_llm           100%        0.001   100%
mixed_blended           100%        0.002   100%

Stability: benchmark executed across three seeds (42, 1337, 2026). One localized failure: seed 2026 yields ai_formal_transition = 0.9333, below the 0.97 gate. All other metrics pass across all seeds. The failure is confined to formal AI text where the target's boundary is sharpest and small perturbations induce class flips the surrogate does not perfectly track.

Real-world generalization

We validated against 320 texts from public datasets and original writing, outside the training distribution.

Genre                          n     Agreement   MAE
Encyclopedic (WikiText-103)    100   99.0%       0.041
Movie reviews (IMDB)           100   100%        0.081
News articles (AG News)        100   100%        0.209
AI-generated templates         10    90.0%       0.033
Personal essays                5     80.0%       0.338
Technical/developer writing    5     40.0%       0.298
Overall                        320   98.1%       0.115

The surrogate is strongest where the target is most commonly deployed. Encyclopedic and news text (academic integrity, publishing) show 99-100% agreement. MAE increases on news (0.209) due to probability calibration drift on out-of-distribution inputs, but the classification decision holds. Technical and personal writing are small-sample categories with high variance. The 40% agreement on technical writing reflects genuine distribution shift: developer prose with inline code, markdown, and domain-specific vocabulary sits outside the training distribution.


Extracted detection logic

Feature importance analysis on the trained surrogate directly exposes the target's learned decision function.

Feature               Importance
unique_word_ratio     47.3%
first_person_count    6.3%
bigram_repeat_ratio   5.8%
avg_word_len          5.7%
sentence_len_cv       4.4%

unique_word_ratio carries nearly half the surrogate's predictive power. This is a direct readout of what the target's model has learned to weight most heavily: the lexical diversity signature separating human writing from LLM output.

The remaining features form a coherent picture. First-person pronoun frequency, bigram repetition, word length distribution, sentence length variability: all proxies for the statistical regularity that characterizes machine-generated text. LLMs produce text with lower lexical diversity, fewer first-person markers, more uniform word lengths, and less sentence-level variance than humans.
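The ranking above is read directly off the fitted regressor. A minimal sketch of that readout, assuming `feature_names` matches the key order in `extract_features` (the helper name is ours):

```python
import numpy as np

def top_features(model, feature_names, k=5):
    """Rank features by the ensemble's impurity-based importances
    (scikit-learn normalizes these to sum to 1)."""
    order = np.argsort(model.feature_importances_)[::-1][:k]
    return [(feature_names[i], float(model.feature_importances_[i]))
            for i in order]
```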

The surrogate includes a diagnostic mode that surfaces this per-document:

> diagnose "The structured methodology demonstrates significant potential."

Detection Triggers
────────────────────────────────────────────────────────────────
██████████  Word length
     Value: 7.125  |  Human baseline: 4.500
     Fix: Use simpler, shorter words.
██████  Contractions
     Value: 0.000  |  Human baseline: 3.000
     Fix: Add contractions: don't, can't, it's, etc.
████  Unique word ratio
     Value: 0.875  |  Human baseline: 0.650
     Fix: Repeat some words naturally.

This is proprietary detection intelligence extracted at API prices and rendered legible. It tells competitors what the model looks for. It tells adversaries which surface features to target and where token-level detection begins.


Experiment history

The final architecture was the product of systematic elimination.

Run      Architecture                 Agreement   MAE     Conclusion
M6-M10   sklearn + API features       99.0%       0.035   Calibration plateau at 0.03 MAE
N1       DistilBERT on probe corpus   32.1%       0.588   Severe distribution mismatch
N2       DistilBERT expanded          88.2%       0.031   Transition agreement collapsed to 15%
P1       GPT-2 perplexity only        83.0%       0.122   Insufficient as sole signal
P2       GPT-2 perplexity + text      89.3%       0.037   Perplexity is redundant
F1       Text-only GBM                99.6%       0.003   Final architecture

Neural approaches underperformed. DistilBERT, even fine-tuned on identical training data, could not match a tuned GBM on handcrafted features. The likely mechanism: the target model itself operates substantially on statistical text features. A surrogate mirroring the same feature class naturally aligns with the target's decision surface.

Perplexity was a red herring. We expected GPT-2 perplexity to dominate, since AI detection is commonly framed as perplexity discrimination. It did not. Distributional text statistics, particularly lexical diversity, carried virtually all the signal.

Boundary fidelity is the differentiator. Every architecture achieved reasonable interior agreement. The separation happens entirely at the boundary: transition agreement is the metric that distinguishes 88% from 99.6%. Tree ensembles partition the feature space with hard axis-aligned splits that align naturally with threshold-based classification boundaries.


Generalizability

The vulnerability is structural to ML-as-a-service, not specific to this target.

The information asymmetry is inverted. ML APIs are designed to return maximal information for developer convenience. Every additional field is training signal for an adversary. The API designer implicitly sets the extraction cost, and the default is cheap.

Determinism is the default. Production ML serving infrastructure caches aggressively and returns identical results for identical inputs. This is a latency optimization that doubles as an extraction accelerant.

Behavioral monitoring is absent. Application-layer security inspects for injection and XSS, not perturbation sequences or boundary probing. The adversary's individual queries are indistinguishable from legitimate traffic. The extraction signature exists only in the query sequence, and virtually no one monitors at that level.

The methodology documented here is a general-purpose extraction toolkit. It transfers to classification APIs across domains: content moderation, fraud detection, credit scoring, medical triage. The information leakage and determinism patterns we exploited are not anomalous. They are standard practice.

Consider extraction cost as a channel capacity problem. A binary label oracle leaks ~1 bit per query: extraction requires exponential queries. A continuous score oracle leaks ~10-32 bits per query: cost drops by orders of magnitude. A structured multi-field oracle leaking 500 bits per query reduces extraction to a supervised learning problem solvable with a few thousand queries and a GBM. Control the bits per query and you control the extraction cost.
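The budget arithmetic, made explicit. The 2.5M-bit figure is simply what ~5,000 structured responses delivered in this engagement, and this linear accounting is a lower bound: as noted above, label-only extraction is in practice far worse than linear, because the adversary must also actively search for the boundary:

```python
BUDGET_BITS = 5_000 * 500   # information actually collected here: 2.5M bits

ORACLES = {
    "binary label": 1,
    "continuous score": 10,
    "structured multi-field": 500,
}

for name, bits in ORACLES.items():
    # queries needed to gather the same budget at this leakage rate
    print(f"{name:>22}: {BUDGET_BITS // bits:>9,} queries")
```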


Conclusion

We replicated a production AI detection service to 99.6% fidelity for $100. The vulnerability is entirely in the API layer: excessive output granularity, full determinism, and absent behavioral monitoring reduce extraction to a solved problem.

The structural lesson extends beyond this engagement. The default architecture for ML-as-a-service—rich structured outputs, deterministic inference, no sequence-level monitoring—makes extraction cheap for any motivated adversary. The gap between what these APIs return and what the classification task requires is the attack surface. That gap is the norm, not the exception.

The target has been notified and is engaged on remediation. We have preserved their anonymity at their request.

This research was conducted by Triage, an applied AI security research lab building runtime inference security for production AI systems.