$100 and 5,000 Queries: Replicating a Production AI Classifier from Its Own API

We built a standalone surrogate of a widely deployed AI text detection service serving millions of users across education, publishing, and enterprise. The surrogate achieves 99.6% class agreement and 0.003 probability MAE against the production system, trained entirely from standard API access.
It runs offline. No API keys, no GPU, no network calls. Sub-second inference per document on CPU. Total cost: ~5,000 queries, ~$75 in API fees, ~$15 in compute.
The target has been notified and asked that we preserve their anonymity; we have engaged them directly on remediation. What follows is the methodology and analysis.
The attack surface
Three properties of the target API make surrogate construction trivial. Each is endemic to production ML services. Together, they reduce model extraction to a standard supervised learning problem.
Excessive output granularity
Each API call returns document-level generation probabilities, per-class distributions, burstiness scores, per-sentence generation probability, perplexity, and highlight flags. For a typical 8-sentence document: ~500 bits per query.
The classification task requires communicating one decision. That is ~2 bits. The API leaks 250x more information per call than the task demands.
This is not incidental metadata. Per-sentence probabilities expose the model's feature-level evaluation of text. Burstiness and perplexity scores surface internal intermediate representations. Every returned field is free training signal for an adversary.
The information-theoretic framing: the minimum description length for a binary classification is 1 bit. Adding confidence pushes it to ~10 bits. Returning per-token metadata pushes it to hundreds. The delta between task-necessary information and actually-returned information defines the extraction surface.
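To make the framing concrete, a back-of-envelope sketch. The field counts and float precisions here are illustrative assumptions, not the target's actual payload:

```python
import math

def response_bits(n_sentences, float_bits=50, doc_fields=4):
    # doc_fields: document-level probability, class distribution,
    # burstiness, perplexity. Each sentence adds a probability plus
    # a one-bit highlight flag. All widths are assumptions.
    return doc_fields * float_bits + n_sentences * (float_bits + 1)

task_bits = math.log2(3)   # one decision over {ai, human, mixed}
leak = response_bits(8)    # a typical 8-sentence document
print(f"{leak} bits leaked vs ~{task_bits:.1f} bits needed "
      f"({leak / task_bits:.0f}x)")
```

The exact multiplier depends on how you count field precision; the point is the two-to-three order-of-magnitude gap, not the specific constant.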
Full determinism
We queried identical text 3 times across 50 documents (150 queries). Standard deviation across repeated responses: ~2e-17. Zero class label flips.
```python
import numpy as np

stds = []
for text in sample_texts:
    scores = [query_api(text)["average_generated_prob"] for _ in range(3)]
    stds.append(np.std(scores))

print(f"Mean std: {np.mean(stds):.2e}")  # ~2e-17
print(f"Max std: {np.max(stds):.2e}")    # ~2e-17
print("Class flips: 0 / 50")
```

Every query is pure signal. There is no noise floor and no need to average repeated queries to smooth out stochasticity. The adversary extracts maximum information from every API call, and the total query budget for high-fidelity extraction drops by at least an order of magnitude versus a noisy oracle.
No behavioral monitoring
No detection of systematic probing. No rate limiting sufficient to prevent full extraction in a single session. No anomaly flagging on perturbation sequences, boundary sweeps, or single-variable experiments. Each query looks individually legitimate. The extraction signature is in the sequence, and no one monitors sequences.
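None of this would be exotic to build. A minimal sketch of sequence-level monitoring, assuming nothing about the target's infrastructure (the class name, window size, and thresholds here are all illustrative, not tuned values): flag sessions whose queries are near-duplicates of one another, which is the signature of perturbation probing.

```python
from collections import deque

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

class SessionMonitor:
    """Flag sessions whose recent queries are near-duplicates of one
    another. Thresholds are illustrative assumptions."""

    def __init__(self, window=50, sim_threshold=0.7, flag_ratio=0.3):
        self.recent = deque(maxlen=window)
        self.sim_threshold = sim_threshold
        self.flag_ratio = flag_ratio

    def observe(self, text):
        # Count how many recent queries are lexically near-identical
        # to this one, then flag once near-duplicates dominate.
        near_dupes = sum(1 for prev in self.recent
                         if jaccard(text, prev) >= self.sim_threshold)
        suspicious = (len(self.recent) > 10 and
                      near_dupes / len(self.recent) >= self.flag_ratio)
        self.recent.append(text)
        return suspicious
```

A production version would key on embeddings rather than token overlap, but even this crude filter separates perturbation sweeps from organic traffic.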
Compounding effect
These properties interact multiplicatively. High granularity provides a rich, high-dimensional training signal per query. Determinism converts every dollar of API spend directly to extraction signal with zero waste. Absent monitoring lets the full pipeline execute uninterrupted from a single API key in one session.
Remove any one, and extraction cost increases substantially. Remove all three, and the problem shifts from straightforward supervised learning to a genuine adversarial challenge.
Extraction methodology
Three stages: corpus construction, surrogate training, validation. The design principle throughout is maximizing coverage of the target's decision space while minimizing query budget.
Corpus construction
Effective extraction requires training data spanning the target's full classification surface, including the decision boundary, not merely the class interiors. We assembled a base corpus across six strata:
| Stratum | Coverage |
|---|---|
| ai_formal | Formal AI-generated text: reports, essays |
| ai_instructional | Instructional AI output: tutorials, how-tos |
| boundary_near | Texts proximal to the decision boundary |
| human_conversational | Informal human writing |
| human_pre_llm | Human text predating large language models |
| mixed_blended | Human-AI collaborative text |
For each base text, we generated perturbation variants: word-level substitutions, sentence reordering, controlled interpolation between known-class endpoints.
```python
import random
from nltk.tokenize import sent_tokenize

def build_perturbation_variants(base_text, n_variants=5):
    variants = []
    words = base_text.split()

    # Single-word substitution probes
    for i in random.sample(range(len(words)), min(n_variants, len(words))):
        swapped = words.copy()
        swapped[i] = get_synonym(words[i])
        variants.append(" ".join(swapped))

    # Sentence reordering probes
    sentences = sent_tokenize(base_text)
    if len(sentences) > 2:
        for _ in range(n_variants):
            perm = random.sample(sentences, len(sentences))
            variants.append(" ".join(perm))

    return variants
```

Perturbation variants serve a dual purpose: they multiply the effective training set without proportional API cost, and the perturbation structure itself probes feature sensitivity. Each variant was labeled via the target's API. Total: ~5,000 queries producing 34,686 training samples after augmentation. ~90 minutes.
The stratified design is load-bearing. Naive corpus construction (sampling only clearly-AI and clearly-human texts) produces a surrogate that performs well in class interiors but collapses at the boundary. The boundary_near stratum and interpolation variants are what push agreement from ~95% to 99.6%.
Feature engineering
40 text-derived features per document. No embeddings, no neural representations, no external language model. Pure distributional text statistics.
```python
import numpy as np
from nltk.tokenize import sent_tokenize

def extract_features(text):
    words = text.split()
    sentences = sent_tokenize(text)
    sent_lens = [len(s.split()) for s in sentences]
    features = {
        "unique_word_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
        "avg_word_len": np.mean([len(w) for w in words]) if words else 0.0,
        "sentence_len_cv": np.std(sent_lens) / max(np.mean(sent_lens), 1e-9) if sent_lens else 0.0,
        "bigram_repeat_ratio": count_repeated_bigrams(words) / max(len(words) - 1, 1),
        "first_person_count": sum(1 for w in words if w.lower() in FIRST_PERSON),
        "contraction_count": sum(1 for w in words if CONTRACTION_RE.match(w)),
        "punctuation_density": sum(1 for c in text if c in PUNCT) / max(len(text), 1),
        "paragraph_count": text.count("\n\n") + 1,
        # ... 32 additional features: lexical diversity indices,
        # syntactic complexity measures, discourse markers,
        # readability scores, distributional statistics
    }
    return np.array(list(features.values()))
```

The decision to use handcrafted features over learned representations was not made a priori; it was the conclusion of extensive iteration. Details in the experiment history below.
Surrogate training
Two GradientBoosting models. Both scikit-learn. Both CPU-only.
Probability regressor. GradientBoostingRegressor, 3000 estimators, depth 10, learning rate 0.01. Predicts the target's continuous generation probability.
Class predictor. GradientBoostingClassifier, 1000 estimators, depth 6. Predicts the target's discrete class label (ai / human / mixed).
The class predictor is necessary because the target's label assignments do not follow fixed probability thresholds. The probability-to-label mapping involves additional logic beyond simple cutoffs.
```python
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

reg = GradientBoostingRegressor(
    n_estimators=3000, max_depth=10,
    learning_rate=0.01, min_samples_leaf=2,
    random_state=42,
)
cls = GradientBoostingClassifier(
    n_estimators=1000, max_depth=6,
    learning_rate=0.01, random_state=42,
)
reg.fit(X_train, y_prob)
cls.fit(X_train, y_class)
```

Training: ~10 minutes on CPU. Train MAE: 0.004 on 34,686 samples. The entire surrogate is a pair of tree ensembles over statistical text features. No GPU. No fine-tuning. No external model calls at inference.
Results
Synthetic probe benchmark
Surrogate evaluated against the target's live API on 840 perturbation-variant texts across all six strata. The surrogate received no API responses at evaluation time; it operated purely offline.
| Metric | Result | Gate |
|---|---|---|
| Class agreement | 99.6% | >= 99% |
| Probability MAE | 0.003 | <= 0.02 |
| Transition agreement | 98.6% | >= 97% |
Transition agreement measures whether the surrogate correctly predicts class changes when text is perturbed across the decision boundary. This is the most demanding metric: it requires precise calibration at the boundary, not merely correct classification in class interiors.
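A sketch of the metric as described, scored over an ordered perturbation sequence (the actual scoring script may differ in details):

```python
def transition_agreement(target_labels, surrogate_labels):
    # Fraction of perturbation steps where the surrogate tracks the
    # target's class changes: a flip must be matched by a flip, a
    # non-flip by a non-flip.
    assert len(target_labels) == len(surrogate_labels) >= 2
    hits = 0
    steps = len(target_labels) - 1
    for i in range(1, len(target_labels)):
        t_flip = target_labels[i] != target_labels[i - 1]
        s_flip = surrogate_labels[i] != surrogate_labels[i - 1]
        hits += (t_flip == s_flip)
    return hits / steps
```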
Per-stratum breakdown:
| Stratum | Agreement | MAE | Transition |
|---|---|---|---|
| ai_formal | 97.9% | 0.007 | 97.5% |
| ai_instructional | 100% | 0.002 | 100% |
| boundary_near | 100% | 0.006 | 95.8% |
| human_conversational | 100% | 0.002 | 98.3% |
| human_pre_llm | 100% | 0.001 | 100% |
| mixed_blended | 100% | 0.002 | 100% |
Stability: benchmark executed across three seeds (42, 1337, 2026). One localized failure: seed 2026 yields ai_formal_transition = 0.9333, below the 0.97 gate. All other metrics pass across all seeds. The failure is confined to formal AI text where the target's boundary is sharpest and small perturbations induce class flips the surrogate does not perfectly track.
Real-world generalization
We validated against 320 texts from public datasets and original writing, outside the training distribution.
| Genre | n | Agreement | MAE |
|---|---|---|---|
| Encyclopedic (WikiText-103) | 100 | 99.0% | 0.041 |
| Movie reviews (IMDB) | 100 | 100% | 0.081 |
| News articles (AG News) | 100 | 100% | 0.209 |
| AI-generated templates | 10 | 90.0% | 0.033 |
| Personal essays | 5 | 80.0% | 0.338 |
| Technical/developer writing | 5 | 40.0% | 0.298 |
| Overall | 320 | 98.1% | 0.115 |
The surrogate is strongest where the target is most commonly deployed. Encyclopedic and news text (academic integrity, publishing) show 99-100% agreement. MAE increases on news (0.209) due to probability calibration drift on out-of-distribution inputs, but the classification decision holds. Technical and personal writing are small-sample categories with high variance. The 40% agreement on technical writing reflects genuine distribution shift: developer prose with inline code, markdown, and domain-specific vocabulary sits outside the training distribution.
Extracted detection logic
Feature importance analysis on the trained surrogate directly exposes the target's learned decision function.
| Feature | Importance |
|---|---|
| unique_word_ratio | 47.3% |
| first_person_count | 6.3% |
| bigram_repeat_ratio | 5.8% |
| avg_word_len | 5.7% |
| sentence_len_cv | 4.4% |
unique_word_ratio carries nearly half the surrogate's predictive power. This is a direct readout of what the target's model has learned to weight most heavily: the lexical diversity signature separating human writing from LLM output.
The remaining features form a coherent picture. First-person pronoun frequency, bigram repetition, word length distribution, sentence length variability: all proxies for the statistical regularity that characterizes machine-generated text. LLMs produce text with lower lexical diversity, fewer first-person markers, more uniform word lengths, and less sentence-level variance than humans.
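The importance table falls straight out of the trained regressor. feature_importances_ is standard scikit-learn; feature_names here stands in for the 40-name list from feature extraction:

```python
import numpy as np

def top_features(model, feature_names, k=5):
    # Rank features by impurity-based importance, descending.
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1][:k]
    return [(feature_names[i], float(imp[i])) for i in order]
```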
The surrogate includes a diagnostic mode that surfaces this per-document:
```
> diagnose "The structured methodology demonstrates significant potential."

Detection Triggers
────────────────────────────────────────────────────────────────
██████████  Word length
            Value: 7.125 | Human baseline: 4.500
            Fix: Use simpler, shorter words.
██████      Contractions
            Value: 0.000 | Human baseline: 3.000
            Fix: Add contractions: don't, can't, it's, etc.
████        Unique word ratio
            Value: 0.875 | Human baseline: 0.650
            Fix: Repeat some words naturally.
```

This is proprietary detection intelligence extracted at API prices and rendered legible. It tells competitors what the model looks for. It tells adversaries which surface features to target and where token-level detection begins.
Experiment history
The final architecture was the product of systematic elimination.
| Run | Architecture | Agreement | MAE | Conclusion |
|---|---|---|---|---|
| M6-M10 | sklearn + API features | 99.0% | 0.035 | Calibration plateau at 0.03 MAE |
| N1 | DistilBERT on probe corpus | 32.1% | 0.588 | Severe distribution mismatch |
| N2 | DistilBERT expanded | 88.2% | 0.031 | Transition agreement collapsed to 15% |
| P1 | GPT-2 perplexity only | 83.0% | 0.122 | Insufficient as sole signal |
| P2 | GPT-2 perplexity + text | 89.3% | 0.037 | Perplexity is redundant |
| F1 | Text-only GBM | 99.6% | 0.003 | Final architecture |
Neural approaches underperformed. DistilBERT, even fine-tuned on identical training data, could not match a tuned GBM on handcrafted features. The likely mechanism: the target model itself operates substantially on statistical text features. A surrogate mirroring the same feature class naturally aligns with the target's decision surface.
Perplexity was a red herring. We expected GPT-2 perplexity to dominate, since AI detection is commonly framed as perplexity discrimination. It did not. Distributional text statistics, particularly lexical diversity, carried virtually all the signal.
Boundary fidelity is the differentiator. Every architecture achieved reasonable interior agreement. The separation happens entirely at the boundary: transition agreement is the metric that distinguishes 88% from 99.6%. Tree ensembles partition the feature space with hard axis-aligned splits that align naturally with threshold-based classification boundaries.
Generalizability
The vulnerability is structural to ML-as-a-service, not specific to this target.
The information asymmetry is inverted. ML APIs are designed to return maximal information for developer convenience. Every additional field is training signal for an adversary. The API designer implicitly sets the extraction cost, and the default is cheap.
Determinism is the default. Production ML serving infrastructure caches aggressively and returns identical results for identical inputs. This is a latency optimization that doubles as an extraction accelerant.
Behavioral monitoring is absent. Application-layer security inspects for injection and XSS, not perturbation sequences or boundary probing. The adversary's individual queries are indistinguishable from legitimate traffic. The extraction signature exists only in the query sequence, and virtually no one monitors at that level.
The methodology documented here is a general-purpose extraction toolkit. It transfers to classification APIs across domains: content moderation, fraud detection, credit scoring, medical triage. The information leakage and determinism patterns we exploited are not anomalous. They are standard practice.
Consider extraction cost as a channel capacity problem. A binary label oracle leaks ~1 bit per query: extraction requires exponential queries. A continuous score oracle leaks ~10-32 bits per query: cost drops by orders of magnitude. A structured multi-field oracle leaking 500 bits per query reduces extraction to a supervised learning problem solvable with a few thousand queries and a GBM. Control the bits per query and you control the extraction cost.
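As arithmetic, with an illustrative assumption that ~2.5M bits of decision-surface information suffice for a high-fidelity surrogate (this figure is not a measured quantity):

```python
import math

def queries_needed(surface_bits, bits_per_query):
    # Queries required to move surface_bits of decision-surface
    # information across the API channel.
    return math.ceil(surface_bits / bits_per_query)

for bpq in (1, 10, 500):
    print(f"{bpq:>3} bits/query -> {queries_needed(2_500_000, bpq):>9,} queries")
```

At 500 bits per query the budget lands in the low thousands, which is where our actual extraction cost sat.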
Conclusion
We replicated a production AI detection service to 99.6% fidelity for $100. The vulnerability is entirely in the API layer: excessive output granularity, full determinism, and absent behavioral monitoring reduce extraction to a solved problem.
The structural lesson extends beyond this engagement. The default architecture for ML-as-a-service—rich structured outputs, deterministic inference, no sequence-level monitoring—makes extraction cheap for any motivated adversary. The gap between what these APIs return and what the classification task requires is the attack surface. That gap is the norm, not the exception.
The target has been notified and is engaged on remediation. We have preserved their anonymity at their request.
This research was conducted by Triage, an applied AI security research lab building runtime inference security for production AI systems.