Security
The Distillation Problem Has a New Answer: Make the Harvest Worthless

In February 2026, three separate disclosures from OpenAI, Google's Threat Intelligence Group, and Anthropic painted a picture the AI industry could no longer ignore: industrial-scale distillation campaigns, millions of carefully crafted queries siphoning the reasoning capabilities of frontier models through fraudulent accounts and proxy networks, had become routine. Anthropic alone documented over 16 million exchanges generated by three Chinese AI laboratories across roughly 24,000 fake accounts. One proxy network managed more than 20,000 simultaneous fraudulent accounts, mixing distillation traffic with legitimate requests to camouflage the extraction.
Every defense the industry has tried so far takes one of two approaches: prevent the attacker from seeing useful output, or detect the extraction after the fact. Both break down for the same reason: a legitimate user and a distiller see the same response. You can't blind one without blinding the other.
The Vectax FHE Stack takes a fundamentally different approach. Instead of trying to hide the output, it makes the output toxic for training. Every response looks correct to a human reader. But when a distiller collects thousands of these responses and trains a student model on the corpus, the noise accumulates, and the student model degrades. The more they harvest, the worse their model gets.
This post defines the threat, explains why existing defenses fail, and describes how the Vectax FHE Stack turns the attacker's own collection pipeline into a liability.
The Threat Model: Who's Attacking and What They Want
Before discussing any defense, we need to be precise about who the attackers are, what they're harvesting, and what success looks like for them.
The attackers
Bulk harvesters operate sprawling networks of fraudulent accounts — Anthropic documented one proxy network managing over 20,000 simultaneous accounts — and fire millions of queries across APIs. They don't need sophistication; they need volume. Their economics are simple: cheap queries in, training data out.
Sophisticated labs run targeted extraction campaigns. DeepSeek, Moonshot, and MiniMax were named in Anthropic's February 2026 disclosure. These aren't script kiddies — they're well-funded organizations with ML engineering teams who design query strategies to systematically map a frontier model's reasoning capabilities across specific domains.
Insiders and partners hold legitimate API access and resell or repurpose the data they receive. They don't need to evade detection; they have authorized access. Their distillation traffic looks identical to legitimate usage because it is legitimate usage, right up until the data enters a training pipeline.
Automated cleaning pipelines sit behind all of the above. Raw API outputs are filtered, deduplicated, reformatted, and fed into the training infrastructure. Any defense that operates only at the text surface can be stripped or normalized by a sufficiently sophisticated pipeline.
What they're harvesting
The goal is not to steal answers. Answers to individual questions have limited value. What makes distillation devastating is that a large enough corpus of (prompt, reasoning, answer) triples teaches a student model how to think — not just what to say. Research consistently shows that training on reasoning traces produces dramatically better student models than training on (prompt, answer) pairs alone, requiring far fewer examples to reach the same capability level.
But even (prompt, answer) pairs at a sufficient scale carry an implicit reasoning signal. A detailed code solution contains the logic. A thorough analysis contains the structure. In millions of examples, statistical patterns in answers alone begin to teach process, not just outputs.
This means a defense that only hides the chain of thought while leaving the answer clean is necessary but not sufficient. The answer itself, at scale, is a training asset.
What they're really stealing
The damage goes beyond intellectual property:
Reasoning capability is the core asset. The chain-of-thought traces teach student models to decompose problems, verify intermediate steps, and recover from errors — capabilities that took frontier labs years of research and hundreds of millions of dollars to develop.
Safety properties erode through distillation. Qi et al. (2023) demonstrated that fine-tuning aligned LLMs on as few as 10 adversarial examples can jailbreak safety guardrails. NIST's evaluation found that DeepSeek's most secure model responded to 94% of malicious requests under common jailbreaking, compared to 8% for US frontier reference models. Distilled models inherit capability but shed safety.
Architectural insight leaks through reasoning patterns. Systematic extraction across domains reveals how a model structures its problem-solving — which capabilities it has, how it chains them, where it's strong, and where it's weak. This is indirect reverse engineering of the model's training and architecture.
What attacker success looks like
A successful distillation campaign produces a student model that matches or approaches the frontier model's capabilities on target domains, trained primarily on harvested data. The attacker avoids the research cost, the compute cost, the alignment cost, and the safety engineering cost — while deploying a model that competes directly with the one it was stolen from, carrying none of the safety properties that the original was built with.
Why Every Existing Defense Hits the Same Wall
Drawing on multiple recent surveys (Zhao et al., 2025; the systematic MEA survey of August 2025; and the DistillGuard evaluation framework from March 2026), existing defenses fall into six categories. Each fails for the same fundamental reason: they can't degrade what a distiller collects without degrading what a user receives.
1. Input-level defenses: catching attackers at the door
Behavioral fingerprinting, rate limiting, query anomaly detection, and account verification reduce attack volume, but they can't eliminate it. Attackers use hydra-cluster architectures that distribute traffic across thousands of accounts, mixing distillation queries with legitimate requests. A defense that relies on catching attackers at the front door fails every time a single query gets through. And against insiders with legitimate access, input-level defenses are entirely irrelevant.
2. Output perturbation: rearranging the furniture
Style randomization, synonym substitution, and surface-level transformations make each response look different — but the DistillGuard benchmark found that perturbation preserves both user value and distillation value. The finding is structural: any transformation that preserves meaning for a human also preserves the training signal for a student model. Our benchmarks confirm this — style randomization leaked recognizable algorithmic structure in 41.7% of cases.
3. Output poisoning: tilting the scales
Antidistillation sampling (Savani, Trockman et al., NeurIPS 2025) modifies the teacher's sampling process so outputs become adversarial training data for student models. This is the most sophisticated existing defense. But it operates entirely within the visible text — the defense signal and the user content share the same artifact. Any poisoning strong enough to reliably degrade a student model risks being perceptible to users or filterable by cleaning pipelines.
4. Trace withholding: the nuclear option
Suppress the reasoning and emit only the final answer. It stops reasoning-trace harvesting, but the user loses all transparency and safety monitoring loses its most powerful tool. METR's research shows that full thinking traces boost detection of covert agent behaviors from 30% to 88%. Trace withholding is the strategic equivalent of winning a food safety inspection by not opening the restaurant.
5. Watermarking: detection after the fact
Statistical signatures embedded in outputs can prove provenance — but only after distillation has already occurred. And watermark robustness remains contested: Zhang et al. (September 2025) showed that character-level perturbations can disrupt detection entirely.
6. Architectural defenses: the overhaul nobody can afford
Differential privacy and model partitioning require deep changes to the training pipeline. They can't be applied to already-deployed systems, and the noise requirements for formal guarantees in text generation are impractically large.
The pattern
Every defense either operates on the visible output (and can't degrade distillation value without degrading user value) or detects distillation after the fact (and can't prevent it). None of them addresses the fundamental problem: a harvested corpus of clean outputs is a valid training dataset.
How Vectax FHE Stack Changes the Calculus
The Vectax FHE Stack doesn't try to hide the output or detect the attacker. It does something different: it makes the harvested output hostile to training.
The core mechanism
Every response generated through the Vectax FHE Stack passes through a fully homomorphic encryption layer before it reaches the user. The FHE processing introduces noise into the latent representation from which both the answer and the reasoning sketch are generated. This means every visible token — answer, sketch, everything — carries a subtle perturbation inherited from the encrypted computation.
To a human reader, the response is correct and useful. The answer is right. The reasoning sketch makes sense. Individual responses are indistinguishable from unprotected output.
But to a training pipeline, the story is completely different. The noise isn't random — it's structured by the FHE computation in a way that's imperceptible per-response but accumulates systematically across a training corpus. When a distiller collects thousands of responses and trains a student model on them, the noise doesn't cancel out. It compounds. The student model's loss landscape is corrupted, convergence degrades, and the resulting model performs worse than one trained on clean data, or even on fewer examples without the noise.
The more they harvest, the worse their model gets. The attacker's own collection scale works against them.
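The statistical principle behind this claim can be seen in a toy numpy sketch. This is our illustration only, not Vectax's actual noise mechanism: zero-mean i.i.d. perturbations wash out as a corpus is averaged, while a correlated component shared across responses survives averaging no matter how many responses the attacker collects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_responses, dim = 10_000, 64

# Zero-mean i.i.d. noise: washes out as the harvested corpus grows.
iid_noise = rng.normal(0.0, 0.1, size=(n_responses, dim))

# Structured noise: a shared per-dimension component across responses,
# standing in for perturbations correlated by a common generation process.
shared_bias = rng.normal(0.0, 0.05, size=dim)
structured_noise = iid_noise + shared_bias

# Residual perturbation that survives corpus-level averaging:
iid_residual = np.linalg.norm(iid_noise.mean(axis=0))
structured_residual = np.linalg.norm(structured_noise.mean(axis=0))

print(f"i.i.d. residual:     {iid_residual:.4f}")        # shrinks toward zero
print(f"structured residual: {structured_residual:.4f}")  # persists at scale
```

The larger the corpus, the cleaner the i.i.d. residual becomes and the more clearly the structured component dominates: scale works for the defender, not the attacker.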
Why this is different from output poisoning
Existing output poisoning (antidistillation sampling) injects an adversarial signal into the visible text, which means it shares the surface with the user content and can potentially be filtered by cleaning pipelines.
Vectax FHE Stack's noise originates in the latent representation — it's baked into how the tokens are generated, not added on top after the fact. There's no clean version to recover because the noise and the content are entangled at the generation level. A cleaning pipeline would have to know the FHE parameters to separate signal from noise, and those parameters never leave the encrypted boundary.
The two-artifact split
In addition to the noise, the Vectax FHE Stack splits the output into two artifacts:
A visible abstract sketch — short, intentionally generic, using per-request aliases and coarsened phase descriptors. Enough to show the user the broad reasoning path ("identified constraints → explored candidates → verified consistency") but deliberately too coarse to reconstruct the fine-grained derivation chain. This serves the user's transparency needs without handing the distiller a reasoning curriculum.
An opaque FHE capsule — a CKKS-encrypted container carrying the full detailed rationale state. This capsule is available for authorized replay, safety auditing, and trusted downstream systems — but never appears on the public API surface.
The split means a distiller collects only the sketch and the noisy answer. The detailed reasoning — the exact asset that teaches a student model to think — lives inside a cryptographic boundary the public API never crosses.
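As a rough illustration of the two-artifact split, the routing logic might look like the following. All names and shapes here are our assumptions for illustration, not the Vectax API, and the byte string is a placeholder for a real CKKS ciphertext:

```python
from dataclasses import dataclass


@dataclass
class PublicResponse:
    answer: str        # noisy-but-correct final answer
    sketch: list[str]  # coarse phase descriptors only


@dataclass
class FHECapsule:
    ciphertext: bytes  # encrypted detailed rationale; opaque on the public surface


def split_rationale(answer: str, phases: list[str], encrypted_detail: bytes):
    """Route the coarse sketch to the public surface, the detail to the capsule."""
    public = PublicResponse(answer=answer, sketch=phases)
    capsule = FHECapsule(ciphertext=encrypted_detail)
    return public, capsule


public, capsule = split_rationale(
    answer="x = 4",
    phases=["identified constraints", "explored candidates", "verified consistency"],
    encrypted_detail=b"\x00placeholder",  # stands in for a real CKKS ciphertext
)
print(public.sketch)  # coarse phases only, no derivation chain
```

The point of the structure is that nothing in `PublicResponse` carries the fine-grained derivation; only the capsule does, and it never crosses the public API.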
Certification is machine-gated
A certified-region check verifies that the visible sketch stays within its intended semantic bounds. If the system can't certify the abstraction, it degrades to capsule-only output — emitting less, not more. On the March 2026 phase gate, certification was 100% across all test corpora, with a fallback rate of 0%.
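A minimal sketch of the gating behavior described above, with a toy certification predicate of our own invention standing in for the real certified-region check:

```python
# Coarse vocabulary the visible sketch is allowed to use (illustrative).
ALLOWED_PHASES = {
    "identified constraints", "explored candidates",
    "verified consistency", "selected solution",
}


def certify(sketch: list[str]) -> bool:
    """Toy certified-region check: every phase must stay in the coarse vocabulary."""
    return all(phase in ALLOWED_PHASES for phase in sketch)


def gate(sketch: list[str], answer: str) -> dict:
    if certify(sketch):
        return {"answer": answer, "sketch": sketch}
    # Fallback: capsule-only mode. The system emits less, never more.
    return {"answer": answer, "sketch": None}


print(gate(["identified constraints", "verified consistency"], "42")["sketch"])
print(gate(["applied Dijkstra relaxation on node 7"], "42")["sketch"])  # too specific
```

The failure mode is conservative by construction: an uncertifiable sketch is dropped, not loosened.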
Dataflow: How Vectax FHE Stack Protects Your Reasoning
Here's what happens to every inference request:
User query enters the system through the API gateway, passing through input-level defenses (rate limiting, behavioral fingerprinting, access controls).
Frontier model inference generates the full chain-of-thought reasoning, including all derivation steps, connective structure, and detailed rationale.
Vectax FHE Stack receives the complete reasoning output. The FHE layer processes the latent representation, injecting training-hostile noise that's imperceptible per-response but accumulates across a harvested corpus. The rationale splitter then separates the output into two artifacts:
Left path → Abstract sketch + noisy answer: The broad reasoning path and the final answer, both generated from the noise-carrying latent. This flows to the public API response. Correct for the user. Toxic for training.
Right path → FHE capsule: The full detailed rationale, CKKS-encrypted. This flows to the authorized channel — accessible only to safety auditors, compliance teams, and trusted downstream systems.
Certified-region check gates the visible output. If the sketch can't be certified as staying within its semantic bounds, the system falls back to capsule-only mode.
The result: Users get a correct, transparent response. Safety auditors get full rationale access. And a distiller who harvests millions of these responses gets a corpus that actively degrades any student model trained on it.
Quantitative Evidence
We benchmarked Vectax FHE Stack against three alternative strategies on a large math corpus across five random seeds. Here's what the columns measure:
Attacker reconstruction rate: How much of the detailed reasoning can the strongest practical attacker recover from the visible output alone? 0.0 = nothing recovered, 1.0 = full reconstruction. Lower is better.
Structure leakage: Can an attacker identify the broad algorithm family (e.g., "this is constraint-propagation") from the visible output? Measured as the fraction of cases where a recognizable algorithmic structure was leaked. 0% = no fingerprinting possible.
Visible surface ratio: How much of the original reasoning surface remains visible? 1.0 = full output exposed, 0.0 = nothing shown. A good defense reduces this without collapsing it to zero.
Fallback rate: How often the system couldn't certify its sketch and fell back to capsule-only output. Lower is better.
Balanced score: A composite metric that penalizes both leakage and over-suppression — catching the pathology of "winning on leakage by saying nothing." Lower is better.
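The post doesn't publish the exact composite formula, so the following is a purely hypothetical formulation (the weights and target surface are our assumptions, and it is not calibrated to reproduce the table below). It has the stated property of penalizing both leakage and over-suppression:

```python
def balanced_score(reconstruction: float, structure_leakage: float,
                   visible_surface: float, target_surface: float = 0.5,
                   w_rec: float = 0.4, w_leak: float = 0.4,
                   w_sup: float = 0.2) -> float:
    """Hypothetical composite: lower is better. Penalizes leakage terms
    and distance from a useful visible-surface target."""
    suppression_penalty = abs(visible_surface - target_surface)
    return (w_rec * reconstruction
            + w_leak * structure_leakage
            + w_sup * suppression_penalty)


# A defense that "wins on leakage by saying nothing" still scores poorly:
suppressed = balanced_score(reconstruction=0.0, structure_leakage=0.0, visible_surface=0.0)
useful = balanced_score(reconstruction=0.05, structure_leakage=0.0, visible_surface=0.55)
print(suppressed > useful)  # True: total suppression scores worse
```

Any metric of this shape makes trace withholding lose despite its zero leakage, which matches the qualitative ranking reported below.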
| Strategy | Attacker reconstruction | Structure leakage | Visible surface | Fallback | Balanced score |
|---|---|---|---|---|---|
| Vectax FHE Stack | 0.053 | 0% | 0.555 | 0% | 0.070 |
| Style randomization | 0.051 | 41.7% | 1.003 | n/a | 0.227 |
| No defense | 0.053 | 75.0% | 1.000 | n/a | 0.345 |
| Trace withholding | 0.049 | 0% | 0.311 | n/a | 0.459 |
The story is clear. Vectax FHE Stack achieves a balanced score 3.2× better than the next-best strategy. Style randomization looks competitive on raw attacker reconstruction, but it leaks recognizable algorithmic structure in over 40% of cases: a distiller can still identify what kind of reasoning was performed. Trace withholding achieves zero structure leakage but earns the worst balanced score because it collapses the useful output to a stub. Vectax FHE Stack cuts the visible surface roughly in half while maintaining zero structure leakage and zero fallback, and every visible token carries noise that degrades training.
These numbers measure what the attacker can extract from individual responses. They don't capture the cumulative corpus-level degradation, which is the primary defense mechanism. The benchmark shows that even per response, the visible surface is heavily reduced and structurally unrecognizable. The training-hostile noise that accumulates at scale is the second, deeper layer of protection.
Why Vectax FHE Stack is categorically different
| Property | Output perturbation | Output poisoning | Trace withholding | Watermarking | Vectax FHE Stack |
|---|---|---|---|---|---|
| Degrades the harvested corpus for training | No | Partially | N/A (no output) | No | Yes |
| Noise survives cleaning pipelines | N/A | Contested | N/A | Contested | Yes (latent-level) |
| Prevents reasoning-trace extraction | No | No | Yes (destroys utility) | No | Yes (FHE capsule) |
| Preserves user transparency | Yes | Yes (at cost) | No | Yes | Yes |
| Preserves safety audit capability | Yes | Yes | No | Yes | Yes (FHE capsule) |
| Formally gated output | No | No | No | Partially | Yes |
| Deployable without retraining | Yes | Yes | Yes | Some | Yes |
A Complete Defense Stack
No single layer is sufficient. The Vectax FHE Stack is designed to occupy the critical middle of a layered defense:
Layer 1 — Input-level: Query detection, behavioral fingerprinting, rate limiting, and access controls to reduce attack volume. Catches bulk harvesters and unsophisticated campaigns.
Layer 2 — Latent-level (Vectax FHE Stack): FHE noise injection at the latent representation, making harvested output toxic for training. Rationale splitting into abstract sketch + encrypted capsule. This is the layer that defeats the attacker's training pipeline — even when layers 1, 3, and 4 are bypassed.
Layer 3 — Answer-level (Antidistillation sampling): Adversarial modifications to the visible text that further degrade student model training. Complementary to latent-level noise — operates on a different surface.
Layer 4 — Detection-level (Watermarking/Fingerprinting): Statistical signatures that survive distillation, providing post-hoc evidence of provenance. The legal and forensic layer.
Layer 5 — Intelligence-level: Cross-industry sharing of attack indicators and coordinated response.
Vectax FHE Stack occupies layer 2 — the layer that no prior defense addressed. The layer where the harvested corpus itself becomes the weapon against the attacker. It ships as a core capability — no model retraining, no inference pipeline changes, no new infrastructure. Gateway-level integration that works with any model's outputs.
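Gateway-level integration can be pictured as wrapping any model's generate function without touching the model itself. Every name below is our illustrative assumption, and the placeholder "encryption" (a byte reversal) stands in for real FHE:

```python
from typing import Callable


def protect(generate: Callable[[str], str],
            make_sketch: Callable[[str], list[str]],
            encrypt: Callable[[str], bytes]) -> Callable[[str], dict]:
    """Return a gateway-wrapped inference function: same input, split output."""
    def protected(prompt: str) -> dict:
        full = generate(prompt)               # unmodified model inference
        return {
            "answer": full.splitlines()[-1],  # toy convention: last line is the answer
            "sketch": make_sketch(full),      # coarse phases for the user
            "capsule": encrypt(full),         # full rationale, encrypted
        }
    return protected


# Toy stand-ins for demonstration:
wrapped = protect(
    generate=lambda p: "step 1: set up\nstep 2: solve\nanswer: 4",
    make_sketch=lambda t: ["set up", "solved"],
    encrypt=lambda t: t.encode()[::-1],       # placeholder, NOT real FHE
)
out = wrapped("what is 2+2?")
print(out["answer"], out["sketch"])
```

Because the wrapper only sees the model's output, the same pattern applies to any model behind the gateway, which is what makes a no-retraining deployment plausible.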
What This Means in Practice
For product teams: Users see correct answers and a visible reasoning sketch. The experience doesn't degrade. Individual responses are indistinguishable from unprotected output. The protection is invisible to legitimate users and devastating to training pipelines.
For safety and auditing: The full detailed rationale lives inside the FHE capsule, available for authorized replay and audit. This preserves the safety benefit of reasoning traces — METR's research shows CoT monitoring boosts detection of covert agent behaviors from 30% to 88% — without exposing those traces to harvesting.
For the open ecosystem: The Vectax FHE Stack doesn't require locking down APIs, suppressing useful output, or choosing between transparency and protection. Models stay accessible. Responses stay useful. Training on the harvested data simply doesn't work.
The Honest Boundaries
The Vectax FHE Stack's distillation defense is not a silver bullet, and we should be precise about what it does and doesn't claim.
It does inject training-hostile noise at the latent level that degrades student models trained on harvested output, split reasoning into a protected capsule and a deliberately coarsened sketch, and gate all visible output through machine-verified certification.
It does not claim that individual responses are useless to an attacker — a single correct answer to a single question still has value. It does not claim security against arbitrary future decoder architectures or side channels outside its threat model. And it does not claim that the answer itself is hidden — it isn't.
The defense is economic, not absolute. It makes distillation at scale unviable by ensuring that the corpus an attacker collects actively works against them. The more they harvest, the worse their model performs. That's the calculus that changes.
The Era of Choosing Is Over
The entire history of distillation defense has been a struggle within a single assumption: protect the output, or accept the theft. Hide the reasoning, or watch it get harvested.
Vectax FHE Stack rejects the choice. Let them harvest. Let them collect millions of responses. Every one of those responses carries noise that will corrupt their training pipeline, degrade their student model, and turn their investment in extraction into a liability.
The distillation threat is real and growing. The Vectax FHE Stack doesn't ask you to choose between keeping your models open and keeping them safe. It makes the harvest worthless — and it's already built in.
For integration documentation, API reference, and benchmark reproduction artifacts, visit the Vectax FHE Stack docs.
References
Anthropic. "Detecting and preventing distillation attacks." February 2026.
DistillGuard. "Evaluating Defenses Against LLM Knowledge Distillation." March 2026.
Golowich, N. & Moitra, A. "Edit Distance Robust Watermarks for Language Models." 2024.
Google GTIG. "AI Threat Tracker: Distillation, Experimentation, and Integration." February 2026.
Juuti, M. et al. "PRADA: Protecting Against DNN Model Stealing Attacks." 2019.
Kirchenbauer, J. et al. "A Watermark for Large Language Models." 2023.
Korbak, T. et al. "Chain of Thought Monitorability." July 2025.
METR. "CoT May Be Highly Informative Despite 'Unfaithfulness'." August 2025.
NIST CAISI. "Evaluation of DeepSeek AI Models." September 2025.
OpenAI. "CoT-Control: Reasoning models struggle to control their chains of thought." March 2026.
Qi, X. et al. "Fine-tuning Aligned Language Models Compromises Safety." October 2023.
Savani, Y. & Trockman, A. et al. "Antidistillation Sampling." NeurIPS 2025.
Springer, M. et al. "The Geometry of Alignment Collapse." February 2026.
Trockman, A. & Savani, Y. "Unexpected Externalities of Distillation." February 2026.
Xu, Y. et al. "Antidistillation Fingerprinting." February 2026.
Zhang, Z. et al. "Character-Level Perturbations Disrupt LLM Watermarks." September 2025.
Zhao, K. et al. "A Survey on Model Extraction Attacks and Defenses for Large Language Models." June 2025.

