Security

DeepSeek R1 & R1-Zero: A New Milestone in Language Model Reasoning & Safe AI Adoption

Mirror Security

DeepSeek has introduced two significant models—DeepSeek-R1-Zero and DeepSeek-R1—both aimed at improving the reasoning capabilities of large language models. The journey involved multiple phases of training, reinforcement learning (RL), and strategic fine-tuning, culminating in high accuracy and strong interpretability across various tasks. This post also covers the security considerations organizations should address to more safely deploy R1, R1-Zero, their variants, or any other reasoning models.

The Two Approaches 

DeepSeek-R1-Zero

  • Goal: Achieve strong reasoning purely via reinforcement learning, without any supervised fine-tuning

  • Method

    • Trained directly from a base model with a novel technique called Group Relative Policy Optimization (GRPO).

    • Used rule-based verifiers for rewards—primarily checking correctness (for tasks like math or coding) and ensuring outputs followed a specific <think>...</think> and <answer>...</answer> format. 

  • Outcome

    • High raw accuracy (71% on AIME 2024), but the model occasionally mixed languages or produced less fluent text due to the absence of a supervised fine-tuning stage.

DeepSeek-R1 

  • Goal: Combine best-in-class reasoning with top-tier readability and coherence.

  • Method

    • Collected high-quality chain-of-thought examples to supervise the base model. 

    • Applied large-scale RL using accuracy and language consistency as rewards.

    • Performed rejection sampling to gather improved outputs from the partially trained model, then mixed in more supervised data for a final fine-tuning.

    • Applied another RL phase for user preferences and refined reasoning.

  • Outcome

    • Achieved performance comparable to OpenAI’s o1-1217 model, with far better readability and fluency than R1-Zero.

Key Difference: R1-Zero is RL-only (no initial supervised stage). R1 uses a multi-stage process that blends supervised fine-tuning and RL, which leads to more natural language outputs.

Group Relative Policy Optimization (GRPO) Explained

Traditional RL methods typically require two models:

  • A policy model: the main AI system being trained.

  • A critic model: to estimate how good or bad the policy’s output is.

This setup nearly doubles computational overhead and can be unstable. In contrast, GRPO uses a different tactic:

  1. Generate multiple candidate solutions (e.g., 16) for each prompt.

  2. Compare each solution to the average performance.

  3. Reinforce solutions that beat the average; penalize those that don’t.

By relativizing rewards, GRPO sidesteps the need for a large critic model and achieves more stable, cost-efficient training.
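
As a concrete illustration, here is a minimal sketch of the group-relative scoring step in Python; the policy.generate and reward_fn interfaces are hypothetical stand-ins for illustration, not DeepSeek's training code.

```python
import statistics

def grpo_group_advantages(prompt, policy, reward_fn, group_size=16):
    """Score a group of sampled completions relative to the group mean."""
    # 1. Generate multiple candidate solutions for the same prompt.
    completions = [policy.generate(prompt) for _ in range(group_size)]

    # 2. Score each candidate (e.g., with a rule-based verifier).
    rewards = [reward_fn(prompt, c) for c in completions]

    # 3. Normalize against the group: above-average candidates get a positive
    #    advantage, below-average ones a negative one. No critic model needed.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(c, (r - mean_r) / std_r) for c, r in zip(completions, rewards)]
```

These advantages would then drive a policy-gradient update on the policy model alone, which is what removes the cost and instability of training a separate critic network.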

Verifying Correctness With Simple, Rule-Based Rewards 

DeepSeek uses a suite of verifiers that assign rewards based on:

  1. Accuracy: For math tasks, compare the final answer with a known solution. For coding tasks, run test cases.

  2. Format: Ensure the chain-of-thought remains within <think> tags and the final answer within <answer> tags.

While limited in nuance (e.g., no partial credit, no verification of each intermediate step), these verifiers still proved surprisingly effective at driving improvements in performance.
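
A minimal sketch of such rule-based rewards, assuming the <think>/<answer> format above and an exact string match against a known reference answer (no partial credit, no checking of intermediate steps):

```python
import re

THINK_ANSWER = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def format_reward(output: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> format."""
    return 1.0 if THINK_ANSWER.fullmatch(output.strip()) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the known solution."""
    match = THINK_ANSWER.fullmatch(output.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def total_reward(output: str, reference: str) -> float:
    # No partial credit and no verification of intermediate steps, as noted above.
    return format_reward(output) + accuracy_reward(output, reference)
```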

Why Avoid More Complex Techniques (PRM, MCTS)?

DeepSeek explored more complex techniques—like Process Reward Models (PRM) and Monte Carlo Tree Search (MCTS)—but encountered:

  • Difficulty Defining “Good” Reasoning Steps (PRM):

    • Intermediate steps can be correct or incorrect in subtle ways, making them hard to auto-verify.

    • Neural reward models introduced “reward hacking,” where the model learned to trick the reward function.

  • Scalability Issues (MCTS):

    • The token generation space is huge, unlike well-bounded domains such as board games.

    • Truncated search often converged to local optima.

Instead, simple rule-based verifiers combined with GRPO were more stable, cost-effective, and easier to scale.

Building Your Own R1: Two Paths 

  • Distillation Path 

    1. Use the original DeepSeek-R1 to generate high-quality reasoning data (a data-generation sketch follows this list).

    2. Fine-tune your own model (e.g., LLaMA or Qwen) on this distilled dataset.

    3. The resulting model—“YOUR R1-Distill”—inherits much of DeepSeek-R1’s capabilities. 

  • RL Training Path 

    1. Start with a base model and a reasoning-focused dataset for RL. 

    2. Train via GRPO (with simple verifiers) to create “YOUR R1-Zero.” 

    3. Use YOUR R1-Zero to generate new supervised data. 

    4. Fine-tune a fresh base model on that new data. 

    5. Apply GRPO again for final RL tuning → “YOUR R1.” 
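
A minimal sketch of the distillation path's data-generation step, assuming a hypothetical teacher.generate client for DeepSeek-R1 and reusing a rule-based total_reward verifier like the one sketched earlier; the JSONL output format is an illustrative choice, not DeepSeek's published pipeline.

```python
import json

def build_distill_dataset(prompts, references, teacher, total_reward,
                          out_path="r1_distill.jsonl", samples_per_prompt=4):
    """Generate verified reasoning traces from the teacher for supervised fine-tuning."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            candidates = [teacher.generate(prompt) for _ in range(samples_per_prompt)]
            for response in candidates:
                # Keep only well-formatted, correct traces (simple rejection sampling).
                if total_reward(response, references[prompt]) >= 2.0:
                    f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```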

Going Beyond <think> Tags: Marker-Based Reasoning 

To handle more specialized domains (e.g., cyber, legal, medical), one can replace <think>/<answer> with richer markers. For instance: 

  • Cyber/Red Teaming 

    • [ASSESS] for initial analysis 

    • [PLAN] for strategy development 

    • [PIVOT] for backtracking and improving the approach

    • [VERIFY] for self-checks 

    • [OUTPUT] for the final result 

  • Legal

    • [ISSUE] for identifying the question 

    • [PRECEDENT] for referencing case law 

    • [ANALYSIS] for applying laws to facts 

  • Medical 

    • [SYMPTOMS], [HISTORY], [DIFFERENTIAL], [DIAGNOSIS], etc. 

Rewarding Markers: Each marker can have its own success criteria—encouraging thorough [PLAN], accurate [ANALYSIS], or correct [OUTPUT]. 
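
A minimal sketch of a marker-based reward for the cyber/red-teaming markers above; the per-marker weights and the [PLAN] thoroughness threshold are illustrative assumptions.

```python
import re

CYBER_MARKERS = ["[ASSESS]", "[PLAN]", "[PIVOT]", "[VERIFY]", "[OUTPUT]"]

def marker_sections(output: str) -> dict:
    """Split an output into {marker: section text} for whichever markers appear."""
    pattern = "|".join(re.escape(m) for m in CYBER_MARKERS)
    parts = re.split(f"({pattern})", output)
    sections, current = {}, None
    for part in parts:
        if part in CYBER_MARKERS:
            current, sections[part] = part, ""
        elif current is not None:
            sections[current] += part
    return sections

def marker_reward(output: str, min_plan_chars: int = 200) -> float:
    """Reward the presence of every marker plus a thoroughness check on [PLAN]."""
    sections = marker_sections(output)
    reward = 0.2 * sum(1 for m in CYBER_MARKERS if m in sections)
    if len(sections.get("[PLAN]", "").strip()) >= min_plan_chars:
        reward += 0.5  # bonus for a substantive plan
    return reward
```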

Security Considerations 

The deployment of R1 and R1-Zero models introduces several security considerations that must be carefully evaluated, particularly given their enhanced reasoning capabilities. Three key aspects require special attention: 

Securing Reasoning Tokens

Token Protection Mechanisms 
  • Encryption of chain-of-thought sequences during both training and inference. 

  • Secure storage for intermediate reasoning steps to prevent unauthorized access. 

  • Access control policies restricting visibility into internal model states. 

  • Real-time monitoring of token exposure risks and suspicious access patterns. 

Data Leakage Prevention 
  • Automated encryption or masking of critical data points before output (see the masking sketch after this list).

  • Audit logging of token access events for traceability. 

  • Isolation of high-risk tasks or reasoning processes to minimize lateral movement. 
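
A minimal sketch of masking sensitive values in a chain-of-thought trace before it is stored or logged; the regex patterns and placeholder scheme are illustrative assumptions, not a specific product API.

```python
import hashlib
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def mask_reasoning_trace(trace: str) -> str:
    """Replace sensitive matches with a stable, non-reversible placeholder."""
    def _mask(match: re.Match) -> str:
        digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
        return f"[REDACTED:{digest}]"

    for pattern in SENSITIVE_PATTERNS.values():
        trace = pattern.sub(_mask, trace)
    return trace

# Mask before the trace ever reaches storage, logs, or downstream tools.
safe_trace = mask_reasoning_trace("<think>contact alice@example.com about ...</think>")
```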

Domain Alignment

Validation Framework
  • Real-time checking of model outputs against domain-specific compliance and operational rules. 

  • Integration with established policy or legal rule sets for immediate feedback. 

  • Dynamic adjustment of validation thresholds based on context or risk level. 

  • Continuous monitoring of alignment metrics, ensuring outputs remain within acceptable bounds. 

Implementation Strategies 
  • Pre-deployment validation that the model can handle domain constraints (e.g., legal disclaimers, medical guidelines). 

  • Runtime enforcement of safety boundaries to block or revise outputs that violate constraints (a minimal enforcement sketch follows this list).

  • Regular updates to constraint definitions as regulations or best practices evolve. 

  • Integration with existing security frameworks (IAM, network segmentation, etc.) for seamless oversight. 
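
A minimal sketch of runtime enforcement against domain-specific rules; the example legal rules and severity thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DomainRule:
    name: str
    violates: Callable[[str], bool]  # returns True if the output breaks this rule
    severity: int                    # 1 = flag for review, 2 = block outright

# Example legal-domain rules (illustrative only).
LEGAL_RULES: List[DomainRule] = [
    DomainRule("missing_disclaimer",
               lambda out: "not legal advice" not in out.lower(), severity=1),
    DomainRule("guarantees_outcome",
               lambda out: "guaranteed to win" in out.lower(), severity=2),
]

def enforce(output: str, rules: List[DomainRule] = LEGAL_RULES) -> dict:
    """Block outputs with severe violations; flag the rest for review."""
    violations = [r for r in rules if r.violates(output)]
    blocked = any(r.severity >= 2 for r in violations)
    return {"allowed": not blocked, "violations": [r.name for r in violations]}
```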

Adversarial Protection

Detection Capabilities 
  • Identification of potential reward-gaming tactics where the model exploits loopholes in verifiers (see the detection sketch after this list). 

  • Analysis of output patterns to catch subtle manipulation or malicious red-teaming attempts. 

  • Monitoring the integrity of verification systems for tampering. 

  • Early warning system for suspicious behaviors or anomalies in reasoning traces. 
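
A minimal sketch of a reward-gaming detector over reasoning traces, flagging outputs that satisfy the format verifier while showing degenerate patterns; the thresholds are illustrative assumptions.

```python
import re
from collections import Counter

def gaming_signals(output: str) -> dict:
    """Heuristics for outputs that pass the format check without real reasoning."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    tokens = think.group(1).split() if think else []
    repetition = max(Counter(tokens).values()) / len(tokens) if tokens else 0.0
    return {
        "empty_reasoning": len(tokens) < 5,   # trivially short chain of thought
        "high_repetition": repetition > 0.3,  # token loops that pad the trace
    }

def is_suspicious(output: str) -> bool:
    """Raise an early warning if any degenerate pattern is present."""
    return any(gaming_signals(output).values())
```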

Mitigation Approaches 
  • Implementation of multi-layer validation steps, ensuring no single checkpoint can be gamed. 

  • Regular security assessments to evaluate and strengthen verifier robustness against adversarial attacks.

By integrating these security measures—ranging from token-level protection (Mirror VectaX) to domain-aligned output checks (Mirror AgentIQ) and adversarial monitoring (Mirror Discover)—organizations can more safely deploy R1 and R1-Zero models and their variants or any other reasoning models. These safeguards help ensure that advanced reasoning capabilities do not inadvertently expose sensitive information or produce non-compliant outputs, thereby maintaining the delicate balance between innovation and security.

Lessons Learned in Our Internal Reproduction

  • Rule-Based Verifiers: Surprisingly robust for large-scale training but limited in nuance. 

  • Strong Base Model: Use a strong base model; smaller models (<3B parameters) don’t pick up much signal.

  • Marker-Specific Rewards: Promising approach to systematically guide each step of reasoning. 

  • Generalization: Verifiable approaches may yield “superhuman” performance in narrow tasks, but it’s uncertain how well they generalize to broader, open-ended problems. 

  • Practical Gains: Our internal benchmarks suggest significant performance boosts (e.g., +32 points on a cyber red-teaming task), validating these strategies in specialized domains.

Final Thoughts

DeepSeek’s R1 and R1-Zero highlight a pragmatic approach to RL-based language model training. By focusing on simple, verifiable reward signals and avoiding overly complex reward models, they sidestepped the high compute and instability issues often seen in RL for large LMs. The introduction of marker-based reasoning also shows a path forward for domain-specific solutions—whether in cybersecurity, law, or medicine. As methods to verify and reward intermediate reasoning become more refined, we can expect further breakthroughs in both accuracy and interpretability. 

The Mirror Security team will open-source the implementation based on the full benchmark runs.


Mirror Security

© All rights reserved
