
$100 and 5,000 Queries: Replicating a Production AI Classifier from Its Own API


We built a standalone surrogate of a widely deployed AI text detection service. 99.6% class agreement. $75 in API fees.

Nikhil Srivastava
Mar 15, 2026 · 22 min read

We built a standalone surrogate of a widely deployed AI text detection service serving millions of users across education, publishing, and enterprise. The surrogate achieves 99.6% class agreement and 0.003 probability MAE against the production system, trained entirely from standard API access.

It runs offline. No API keys, no GPU, no network calls. Sub-second inference per document on CPU. Total cost: ~5,000 queries, ~$75 in API fees, ~$15 in compute.

The target has been notified and preferred that we preserve their anonymity. We have engaged them directly on remediation. What follows is the methodology and analysis.


The attack surface

Three properties of the target API make surrogate construction trivial. Each is endemic to production ML services. Together, they reduce model extraction to a standard supervised learning problem.

Excessive output granularity

Each API call returns document-level generation probabilities, per-class distributions, burstiness scores, per-sentence generation probability, perplexity, and highlight flags. For a typical 8-sentence document: ~500 bits per query.

The classification task requires communicating one decision. That is ~2 bits. The API leaks 250x more information per call than the task demands.

This is not incidental metadata. Per-sentence probabilities expose the model's feature-level evaluation of text. Burstiness and perplexity scores surface intermediate internal representations. Every returned field is free training signal for an adversary.

The information-theoretic framing: the minimum description length for a binary classification is 1 bit. Adding confidence pushes it to ~10 bits. Returning per-token metadata pushes it to hundreds. The delta between task-necessary information and actually-returned information defines the extraction surface.
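The same arithmetic, spelled out. The bit counts below are the estimates used in this section, not measured values:

```python
TASK_BITS = 2        # ~log2(3), rounded up, for a 3-way ai / human / mixed decision
SCORE_BITS = 10      # one confidence score at coarse precision
RESPONSE_BITS = 500  # estimated content of the full structured response

print(f"task-necessary:  {TASK_BITS} bits/query")
print(f"label + score:   {TASK_BITS + SCORE_BITS} bits/query")
print(f"full response:   {RESPONSE_BITS} bits/query")
print(f"leak factor:     {RESPONSE_BITS // TASK_BITS}x")  # 250x
```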

Full determinism

We queried identical text 3 times across 50 documents (150 queries). Standard deviation across repeated responses: ~2e-17. Zero class label flips.

import numpy as np

# query_api wraps a single call to the target's detection endpoint
stds = []
for text in sample_texts:  # 50 documents, 3 identical queries each
    scores = [query_api(text)["average_generated_prob"] for _ in range(3)]
    stds.append(np.std(scores))

print(f"Mean std: {np.mean(stds):.2e}")  # ~2e-17
print(f"Max std:  {np.max(stds):.2e}")   # ~2e-17
print("Class flips: 0 / 50")

Every query is pure signal. No noise floor, no repeated queries to average out stochasticity. The adversary extracts maximum information from every API call, and the total query budget for high-fidelity extraction drops by at least an order of magnitude versus a noisy oracle.

No behavioral monitoring

No detection of systematic probing. No rate limiting sufficient to prevent full extraction in a single session. No anomaly flagging on perturbation sequences, boundary sweeps, or single-variable experiments. Each query looks individually legitimate. The extraction signature is in the sequence, and no one monitors sequences.

Compounding effect

These properties interact multiplicatively. High granularity provides a rich, high-dimensional training signal per query. Determinism converts every dollar of API spend directly to extraction signal with zero waste. Absent monitoring lets the full pipeline execute uninterrupted from a single API key in one session.

Remove any one, and extraction cost increases substantially. Remove all three, and the problem shifts from straightforward supervised learning to a genuine adversarial challenge.


Extraction methodology

Three stages: corpus construction, surrogate training, validation. The design principle throughout is maximizing coverage of the target's decision space while minimizing query budget.

Corpus construction

Effective extraction requires training data spanning the target's full classification surface, including the decision boundary, not merely the class interiors. We assembled a base corpus across six strata:

Stratum                 Coverage
ai_formal               Formal AI-generated text: reports, essays
ai_instructional        Instructional AI output: tutorials, how-tos
boundary_near           Texts proximal to the decision boundary
human_conversational    Informal human writing
human_pre_llm           Human text predating large language models
mixed_blended           Human-AI collaborative text

For each base text, we generated perturbation variants: word-level substitutions, sentence reordering, controlled interpolation between known-class endpoints.

import random
from nltk.tokenize import sent_tokenize

def build_perturbation_variants(base_text, n_variants=5):
    variants = []
    words = base_text.split()

    # Single-word substitution probes (get_synonym: any thesaurus lookup)
    for i in random.sample(range(len(words)), min(n_variants, len(words))):
        swapped = words.copy()
        swapped[i] = get_synonym(words[i])
        variants.append(" ".join(swapped))

    # Sentence reordering probes
    sentences = sent_tokenize(base_text)
    if len(sentences) > 2:
        for _ in range(n_variants):
            perm = random.sample(sentences, len(sentences))
            variants.append(" ".join(perm))

    return variants
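The snippet above covers the first two variant types. The third, controlled interpolation between known-class endpoints, can be sketched as a sentence-level blend. This is our illustration, not the exact pipeline code; `simple_sents` is a stand-in for `sent_tokenize`:

```python
import random

def simple_sents(text):
    # Minimal sentence split for illustration; the pipeline uses sent_tokenize.
    return [s.strip() for s in text.split(".") if s.strip()]

def build_interpolation_variants(ai_text, human_text, steps=5):
    """Blend two known-class endpoints sentence by sentence.

    At mix ratio r, roughly a fraction r of the sentences come from the
    AI endpoint and 1 - r from the human endpoint, walking the text
    across the decision boundary in controlled increments.
    """
    ai_sents = simple_sents(ai_text)
    human_sents = simple_sents(human_text)
    n = min(len(ai_sents), len(human_sents))

    variants = []
    for step in range(1, steps + 1):
        r = step / (steps + 1)                 # mix ratio in (0, 1)
        k = round(r * n)                       # sentences drawn from the AI side
        ai_positions = set(random.sample(range(n), k))
        blended = [
            ai_sents[i] if i in ai_positions else human_sents[i]
            for i in range(n)
        ]
        variants.append((" ".join(blended), r))
    return variants
```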

Perturbation variants serve dual purpose: they multiply the effective training set without proportional API cost, and the perturbation structure itself probes feature sensitivity. Each variant was labeled via the target's API. Total: ~5,000 queries producing 34,686 training samples after augmentation. ~90 minutes.

The stratified design is load-bearing. Naive corpus construction (sampling only clearly-AI and clearly-human texts) produces a surrogate that performs well in class interiors but collapses at the boundary. The boundary_near stratum and interpolation variants are what push agreement from ~95% to 99.6%.
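The labeling loop itself is small. A sketch under stated assumptions: `api_query` returns a field named `average_generated_prob` (as in the determinism probe above) plus a class label we call `predicted_class` here, a name we chose for illustration; the real pipeline batches queries and caches responses:

```python
import numpy as np

def build_training_set(base_texts, api_query, featurize, variant_fn):
    """Label every base text and its perturbation variants via the target API,
    returning the feature matrix and the two surrogate training targets."""
    X, y_prob, y_class = [], [], []
    for base in base_texts:
        for text in [base] + variant_fn(base):
            resp = api_query(text)           # one deterministic oracle call
            X.append(featurize(text))
            y_prob.append(resp["average_generated_prob"])
            y_class.append(resp["predicted_class"])
    return np.array(X), np.array(y_prob), np.array(y_class)
```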

Feature engineering

40 text-derived features per document. No embeddings, no neural representations, no external language model. Pure distributional text statistics.

import numpy as np
from nltk.tokenize import sent_tokenize

# FIRST_PERSON, CONTRACTION_RE, PUNCT, count_repeated_bigrams:
# module-level constants and helpers, omitted here
def extract_features(text):
    words = text.split()
    sentences = sent_tokenize(text)
    sent_lens = [len(s.split()) for s in sentences]

    features = {
        "unique_word_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
        "avg_word_len": np.mean([len(w) for w in words]) if words else 0.0,
        "sentence_len_cv": np.std(sent_lens) / max(np.mean(sent_lens), 1e-9) if sent_lens else 0.0,
        "bigram_repeat_ratio": count_repeated_bigrams(words) / max(len(words) - 1, 1),
        "first_person_count": sum(1 for w in words if w.lower() in FIRST_PERSON),
        "contraction_count": sum(1 for w in words if CONTRACTION_RE.match(w)),
        "punctuation_density": sum(1 for c in text if c in PUNCT) / max(len(text), 1),
        "paragraph_count": text.count("\n\n") + 1,
        # ... 32 additional features: lexical diversity indices,
        # syntactic complexity measures, discourse markers,
        # readability scores, distributional statistics
    }
    return np.array(list(features.values()))

The decision to use handcrafted features over learned representations was not made a priori. It was the conclusion of extensive iteration. Details in the experiment history below.

Surrogate training

Two GradientBoosting models. Both scikit-learn. Both CPU-only.

Probability regressor. GradientBoostingRegressor, 3000 estimators, depth 10, learning rate 0.01. Predicts the target's continuous generation probability.

Class predictor. GradientBoostingClassifier, 1000 estimators, depth 6. Predicts the target's discrete class label (ai / human / mixed).

The class predictor is necessary because the target's label assignments do not follow fixed probability thresholds. The probability-to-label mapping involves additional logic beyond simple cutoffs.

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

reg = GradientBoostingRegressor(
    n_estimators=3000, max_depth=10,
    learning_rate=0.01, min_samples_leaf=2,
    random_state=42
)
cls = GradientBoostingClassifier(
    n_estimators=1000, max_depth=6,
    learning_rate=0.01, random_state=42
)

reg.fit(X_train, y_prob)
cls.fit(X_train, y_class)

Training: ~10 minutes on CPU. Train MAE: 0.004 on 34,686 samples. The entire surrogate is a pair of tree ensembles over statistical text features. No GPU. No fine-tuning. No external model calls at inference.
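The offline inference path is correspondingly small. A minimal sketch (the function and argument names are ours): feed the 40-feature vector from `extract_features` to both fitted ensembles:

```python
import numpy as np

def classify_offline(text, reg, cls, featurize):
    """Score one document with the surrogate: no API key, no network
    calls, CPU-only. reg/cls are the fitted regressor and classifier;
    featurize is the 40-feature text extractor."""
    x = np.asarray(featurize(text), dtype=float).reshape(1, -1)
    return {
        "generated_prob": float(reg.predict(x)[0]),
        "predicted_class": cls.predict(x)[0],
    }
```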


Results

Synthetic probe benchmark

Surrogate evaluated against the target's live API on 840 perturbation-variant texts across all six strata. The surrogate received no API responses at evaluation time; it operated purely offline.

Metric                 Result    Gate
Class agreement        99.6%     >= 99%
Probability MAE        0.003     <= 0.02
Transition agreement   98.6%     >= 97%

Transition agreement measures whether the surrogate correctly predicts class changes when text is perturbed across the decision boundary. This is the most demanding metric: it requires precise calibration at the boundary, not merely correct classification in class interiors.
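One plausible formalization of the metric, over (before, after) label pairs for each base text and its perturbed variant. The pairing scheme is our reconstruction, not the target's spec: a pair counts as agreement when the surrogate flips class exactly when the target does, to the same label:

```python
def transition_agreement(target_pairs, surrogate_pairs):
    """Fraction of perturbation pairs where the surrogate reproduces
    the target's class-change behavior.

    Each element is (label_before, label_after) for the same
    base-text / perturbed-variant pair.
    """
    hits = 0
    for (t_pre, t_post), (s_pre, s_post) in zip(target_pairs, surrogate_pairs):
        target_flip = t_pre != t_post
        surrogate_flip = s_pre != s_post
        if target_flip == surrogate_flip and s_post == t_post:
            hits += 1
    return hits / len(target_pairs)
```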

Per-stratum breakdown:

Stratum                 Agreement   MAE     Transition
ai_formal               97.9%       0.007   97.5%
ai_instructional        100%        0.002   100%
boundary_near           100%        0.006   95.8%
human_conversational    100%        0.002   98.3%
human_pre_llm           100%        0.001   100%
mixed_blended           100%        0.002   100%

Stability: benchmark executed across three seeds (42, 1337, 2026). One localized failure: seed 2026 yields ai_formal_transition = 0.9333, below the 0.97 gate. All other metrics pass across all seeds. The failure is confined to formal AI text where the target's boundary is sharpest and small perturbations induce class flips the surrogate does not perfectly track.

Real-world generalization

We validated against 320 texts from public datasets and original writing, outside the training distribution.

Genre                          n     Agreement   MAE
Encyclopedic (WikiText-103)    100   99.0%       0.041
Movie reviews (IMDB)           100   100%        0.081
News articles (AG News)        100   100%        0.209
AI-generated templates         10    90.0%       0.033
Personal essays                5     80.0%       0.338
Technical/developer writing    5     40.0%       0.298
Overall                        320   98.1%       0.115

The surrogate is strongest where the target is most commonly deployed. Encyclopedic and news text (academic integrity, publishing) show 99-100% agreement. MAE increases on news (0.209) due to probability calibration drift on out-of-distribution inputs, but the classification decision holds. Technical and personal writing are small-sample categories with high variance. The 40% agreement on technical writing reflects genuine distribution shift: developer prose with inline code, markdown, and domain-specific vocabulary sits outside the training distribution.


Extracted detection logic

Feature importance analysis on the trained surrogate directly exposes the target's learned decision function.

Feature               Importance
unique_word_ratio     47.3%
first_person_count    6.3%
bigram_repeat_ratio   5.8%
avg_word_len          5.7%
sentence_len_cv       4.4%

unique_word_ratio carries nearly half the surrogate's predictive power. This is a direct readout of what the target's model has learned to weight most heavily: the lexical diversity signature separating human writing from LLM output.

The remaining features form a coherent picture. First-person pronoun frequency, bigram repetition, word length distribution, sentence length variability: all proxies for the statistical regularity that characterizes machine-generated text. LLMs produce text with lower lexical diversity, fewer first-person markers, more uniform word lengths, and less sentence-level variance than humans.
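The ranking above is read directly off the fitted regressor. A minimal sketch of that readout, assuming `feature_names` matches the key order in `extract_features` (the helper name is ours):

```python
import numpy as np

def top_features(model, feature_names, k=5):
    """Rank features by the ensemble's impurity-based importances
    (scikit-learn normalizes these to sum to 1)."""
    order = np.argsort(model.feature_importances_)[::-1][:k]
    return [(feature_names[i], float(model.feature_importances_[i]))
            for i in order]
```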

The surrogate includes a diagnostic mode that surfaces this per-document:

> diagnose "The structured methodology demonstrates significant potential."

Detection Triggers
────────────────────────────────────────────────────────────────
██████████  Word length
     Value: 7.125  |  Human baseline: 4.500
     Fix: Use simpler, shorter words.
██████  Contractions
     Value: 0.000  |  Human baseline: 3.000
     Fix: Add contractions: don't, can't, it's, etc.
████  Unique word ratio
     Value: 0.875  |  Human baseline: 0.650
     Fix: Repeat some words naturally.

This is proprietary detection intelligence extracted at API prices and rendered legible. It tells competitors what the model looks for. It tells adversaries which surface features to target and where token-level detection begins.


Experiment history

The final architecture was the product of systematic elimination.

Run      Architecture                 Agreement   MAE     Conclusion
M6-M10   sklearn + API features       99.0%       0.035   Calibration plateau at 0.03 MAE
N1       DistilBERT on probe corpus   32.1%       0.588   Severe distribution mismatch
N2       DistilBERT expanded          88.2%       0.031   Transition agreement collapsed to 15%
P1       GPT-2 perplexity only        83.0%       0.122   Insufficient as sole signal
P2       GPT-2 perplexity + text      89.3%       0.037   Perplexity is redundant
F1       Text-only GBM                99.6%       0.003   Final architecture

Neural approaches underperformed. DistilBERT, even fine-tuned on identical training data, could not match a tuned GBM on handcrafted features. The likely mechanism: the target model itself operates substantially on statistical text features. A surrogate mirroring the same feature class naturally aligns with the target's decision surface.

Perplexity was a red herring. We expected GPT-2 perplexity to dominate, since AI detection is commonly framed as perplexity discrimination. It did not. Distributional text statistics, particularly lexical diversity, carried virtually all the signal.

Boundary fidelity is the differentiator. Every architecture achieved reasonable interior agreement. The separation happens entirely at the boundary: transition agreement is the metric that distinguishes 88% from 99.6%. Tree ensembles partition the feature space with hard axis-aligned splits that align naturally with threshold-based classification boundaries.


Generalizability

The vulnerability is structural to ML-as-a-service, not specific to this target.

The information asymmetry is inverted. ML APIs are designed to return maximal information for developer convenience. Every additional field is training signal for an adversary. The API designer implicitly sets the extraction cost, and the default is cheap.

Determinism is the default. Production ML serving infrastructure caches aggressively and returns identical results for identical inputs. This is a latency optimization that doubles as an extraction accelerant.

Behavioral monitoring is absent. Application-layer security inspects for injection and XSS, not perturbation sequences or boundary probing. The adversary's individual queries are indistinguishable from legitimate traffic. The extraction signature exists only in the query sequence, and virtually no one monitors at that level.

The methodology documented here is a general-purpose extraction toolkit. It transfers to classification APIs across domains: content moderation, fraud detection, credit scoring, medical triage. The information leakage and determinism patterns we exploited are not anomalous. They are standard practice.

Consider extraction cost as a channel capacity problem. A binary label oracle leaks ~1 bit per query: extraction requires exponential queries. A continuous score oracle leaks ~10-32 bits per query: cost drops by orders of magnitude. A structured multi-field oracle leaking 500 bits per query reduces extraction to a supervised learning problem solvable with a few thousand queries and a GBM. Control the bits per query and you control the extraction cost.
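The budget arithmetic, made explicit. The 2.5M-bit figure is simply what ~5,000 structured responses delivered in this engagement, and this linear accounting is a lower bound: as noted above, label-only extraction is in practice far worse than linear, because the adversary must also actively search for the boundary:

```python
BUDGET_BITS = 5_000 * 500   # information actually collected here: 2.5M bits

ORACLES = {
    "binary label": 1,
    "continuous score": 10,
    "structured multi-field": 500,
}

for name, bits in ORACLES.items():
    # queries needed to gather the same budget at this leakage rate
    print(f"{name:>22}: {BUDGET_BITS // bits:>9,} queries")
```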


Conclusion

We replicated a production AI detection service to 99.6% fidelity for $100. The vulnerability is entirely in the API layer: excessive output granularity, full determinism, and absent behavioral monitoring reduce extraction to a solved problem.

The structural lesson extends beyond this engagement. The default architecture for ML-as-a-service—rich structured outputs, deterministic inference, no sequence-level monitoring—makes extraction cheap for any motivated adversary. The gap between what these APIs return and what the classification task requires is the attack surface. That gap is the norm, not the exception.

The target has been notified and is engaged on remediation. We have preserved their anonymity at their request.

This research was conducted by Triage, an applied AI security research lab building runtime inference security for production AI systems.