Security

DeepSeek R1 & R1-Zero: A New Milestone in Language Model Reasoning & Safe AI Adoption

Mirror Security

DeepSeek has introduced two significant models—DeepSeek-R1-Zero and DeepSeek-R1—both aimed at improving the reasoning capabilities of large language models. The journey involved multiple phases of training, reinforcement learning (RL), and strategic fine-tuning, culminating in high accuracy and strong interpretability across various tasks. This post also covers the security considerations organizations should address to more safely deploy R1, R1-Zero, their variants, or any other reasoning models.

The Two Approaches 

DeepSeek-R1-Zero

  • Goal: Achieve strong reasoning purely via reinforcement learning, without any supervised fine-tuning

  • Method

    • Trained directly from a base model with a novel technique called Group Relative Policy Optimization (GRPO).

    • Used rule-based verifiers for rewards—primarily checking correctness (for tasks like math or coding) and ensuring outputs followed a specific <think>...</think> and <answer>...</answer> format. 

  • Outcome

    • High raw accuracy (71% on AIME 2024), but the model occasionally mixed languages or produced less fluent text due to the absence of a supervised fine-tuning stage.

DeepSeek-R1 

  • Goal: Combine best-in-class reasoning with top-tier readability and coherence.

  • Method

    • Collected high-quality chain-of-thought examples to supervise the base model. 

    • Applied large-scale RL using accuracy and language consistency as rewards.

    • Performed rejection sampling to gather improved outputs from the partially trained model, then mixed in more supervised data for a final fine-tuning.

    • Applied another RL phase for user preferences and refined reasoning.

  • Outcome

    • Achieved performance comparable to OpenAI’s o1-1217 model, with far better readability and fluency than R1-Zero.

Key Difference: R1-Zero is RL-only (no initial supervised stage). R1 uses a multi-stage process that blends supervised fine-tuning and RL, which leads to more natural language outputs.

Group Relative Policy Optimization (GRPO) Explained

Traditional RL methods typically require two models:

  • A policy model: the main AI system being trained.

  • A critic model: to estimate how good or bad the policy’s output is.

This setup nearly doubles computational overhead and can be unstable. In contrast, GRPO uses a different tactic:

  1. Generate multiple candidate solutions (e.g., 16) for each prompt.

  2. Compare each solution to the average performance.

  3. Reinforce solutions that beat the average; penalize those that don’t.

By relativizing rewards, GRPO sidesteps the need for a large critic model and achieves more stable, cost-efficient training.
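
As a concrete illustration, here is a minimal sketch of the group-relative scoring step in Python; the policy.generate and reward_fn interfaces are hypothetical stand-ins for illustration, not DeepSeek's training code.

```python
import statistics

def grpo_group_advantages(prompt, policy, reward_fn, group_size=16):
    """Score a group of sampled completions relative to the group mean."""
    # 1. Generate multiple candidate solutions for the same prompt.
    completions = [policy.generate(prompt) for _ in range(group_size)]

    # 2. Score each candidate (e.g., with a rule-based verifier).
    rewards = [reward_fn(prompt, c) for c in completions]

    # 3. Normalize against the group: above-average candidates get a positive
    #    advantage, below-average ones a negative one. No critic model needed.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(c, (r - mean_r) / std_r) for c, r in zip(completions, rewards)]
```

These advantages would then drive a policy-gradient update on the policy model alone, which is what removes the cost and instability of training a separate critic network.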

Verifying Correctness With Simple, Rule-Based Rewards 

DeepSeek uses a suite of verifiers that assign rewards based on:

  1. Accuracy: For math tasks, compare the final answer with a known solution. For coding tasks, run test cases.

  2. Format: Ensure the chain-of-thought remains within <think> tags and the final answer within <answer> tags.

While limited in nuance (e.g., no partial credit, no verification of each intermediate step), these verifiers still proved surprisingly effective at driving improvements in performance.
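
A minimal sketch of such rule-based rewards, assuming the <think>/<answer> format above and an exact string match against a known reference answer (no partial credit, no checking of intermediate steps):

```python
import re

THINK_ANSWER = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def format_reward(output: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> format."""
    return 1.0 if THINK_ANSWER.fullmatch(output.strip()) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the known solution."""
    match = THINK_ANSWER.fullmatch(output.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def total_reward(output: str, reference: str) -> float:
    # No partial credit and no verification of intermediate steps, as noted above.
    return format_reward(output) + accuracy_reward(output, reference)
```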

Why Avoid More Complex Techniques (PRM, MCTS)?

DeepSeek explored more complex techniques—like Process Reward Models (PRM) and Monte Carlo Tree Search (MCTS)—but encountered:

  • Difficulty Defining “Good” Reasoning Steps (PRM):

    • Intermediate steps can be correct or incorrect in subtle ways, making them hard to auto-verify.

    • Neural reward models introduced “reward hacking,” where the model learned to trick the reward function.

  • Scalability Issues (MCTS):

    • The token generation space is huge, unlike well-bounded domains such as board games.

    • Truncated search often converged to local optima.

Instead, simple rule-based verifiers combined with GRPO were more stable, cost-effective, and easier to scale.

Building Your Own R1: Two Paths 

  • Distillation Path 

    1. Use the original DeepSeek-R1 to generate high-quality reasoning data (a data-generation sketch follows this list).

    2. Fine-tune your own model (e.g., LLaMA or Qwen) on this distilled dataset.

    3. The resulting model—“YOUR R1-Distill”—inherits much of DeepSeek-R1’s capabilities. 

  • RL Training Path 

    1. Start with a base model and a reasoning-focused dataset for RL. 

    2. Train via GRPO (with simple verifiers) to create “YOUR R1-Zero.” 

    3. Use YOUR R1-Zero to generate new supervised data. 

    4. Fine-tune a fresh base model on that new data. 

    5. Apply GRPO again for final RL tuning → “YOUR R1.” 
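
A minimal sketch of the distillation path's data-generation step, assuming a hypothetical teacher.generate client for DeepSeek-R1 and reusing a rule-based total_reward verifier like the one sketched earlier; the JSONL output format is an illustrative choice, not DeepSeek's published pipeline.

```python
import json

def build_distill_dataset(prompts, references, teacher, total_reward,
                          out_path="r1_distill.jsonl", samples_per_prompt=4):
    """Generate verified reasoning traces from the teacher for supervised fine-tuning."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            candidates = [teacher.generate(prompt) for _ in range(samples_per_prompt)]
            for response in candidates:
                # Keep only well-formatted, correct traces (simple rejection sampling).
                if total_reward(response, references[prompt]) >= 2.0:
                    f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```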

Going Beyond <think> Tags: Marker-Based Reasoning 

To handle more specialized domains (e.g., cyber, legal, medical), one can replace <think>/<answer> with richer markers. For instance: 

  • Cyber/Red Teaming 

    • [ASSESS] for initial analysis 

    • [PLAN] for strategy development 

    • [PIVOT] for backtracking and improving the approach

    • [VERIFY] for self-checks 

    • [OUTPUT] for the final result 

  • Legal

    • [ISSUE] for identifying the question 

    • [PRECEDENT] for referencing case law 

    • [ANALYSIS] for applying laws to facts 

  • Medical 

    • [SYMPTOMS], [HISTORY], [DIFFERENTIAL], [DIAGNOSIS], etc. 

Rewarding Markers: Each marker can have its own success criteria—encouraging thorough [PLAN], accurate [ANALYSIS], or correct [OUTPUT]. 
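
A minimal sketch of a marker-based reward for the cyber/red-teaming markers above; the per-marker weights and the [PLAN] thoroughness threshold are illustrative assumptions.

```python
import re

CYBER_MARKERS = ["[ASSESS]", "[PLAN]", "[PIVOT]", "[VERIFY]", "[OUTPUT]"]

def marker_sections(output: str) -> dict:
    """Split an output into {marker: section text} for whichever markers appear."""
    pattern = "|".join(re.escape(m) for m in CYBER_MARKERS)
    parts = re.split(f"({pattern})", output)
    sections, current = {}, None
    for part in parts:
        if part in CYBER_MARKERS:
            current, sections[part] = part, ""
        elif current is not None:
            sections[current] += part
    return sections

def marker_reward(output: str, min_plan_chars: int = 200) -> float:
    """Reward the presence of every marker plus a thoroughness check on [PLAN]."""
    sections = marker_sections(output)
    reward = 0.2 * sum(1 for m in CYBER_MARKERS if m in sections)
    if len(sections.get("[PLAN]", "").strip()) >= min_plan_chars:
        reward += 0.5  # bonus for a substantive plan
    return reward
```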

Security Considerations 

The deployment of R1 and R1-Zero models introduces several security considerations that must be carefully evaluated, particularly given their enhanced reasoning capabilities. Three key aspects require special attention: 

Securing Reasoning Tokens

Token Protection Mechanisms 
  • Encryption of chain-of-thought sequences during both training and inference. 

  • Secure storage for intermediate reasoning steps to prevent unauthorized access. 

  • Access control policies restricting visibility into internal model states. 

  • Real-time monitoring of token exposure risks and suspicious access patterns. 

Data Leakage Prevention 
  • Automated encryption or masking of critical data points before output (see the masking sketch after this list).

  • Audit logging of token access events for traceability. 

  • Isolation of high-risk tasks or reasoning processes to minimize lateral movement. 
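
A minimal sketch of masking sensitive values in a chain-of-thought trace before it is stored or logged; the regex patterns and placeholder scheme are illustrative assumptions, not a specific product API.

```python
import hashlib
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def mask_reasoning_trace(trace: str) -> str:
    """Replace sensitive matches with a stable, non-reversible placeholder."""
    def _mask(match: re.Match) -> str:
        digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
        return f"[REDACTED:{digest}]"

    for pattern in SENSITIVE_PATTERNS.values():
        trace = pattern.sub(_mask, trace)
    return trace

# Mask before the trace ever reaches storage, logs, or downstream tools.
safe_trace = mask_reasoning_trace("<think>contact alice@example.com about ...</think>")
```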

Domain Alignment

Validation Framework
  • Real-time checking of model outputs against domain-specific compliance and operational rules. 

  • Integration with established policy or legal rule sets for immediate feedback. 

  • Dynamic adjustment of validation thresholds based on context or risk level. 

  • Continuous monitoring of alignment metrics, ensuring outputs remain within acceptable bounds. 

Implementation Strategies 
  • Pre-deployment validation that the model can handle domain constraints (e.g., legal disclaimers, medical guidelines). 

  • Runtime enforcement of safety boundaries to block or revise outputs that violate constraints (a minimal enforcement sketch follows this list).

  • Regular updates to constraint definitions as regulations or best practices evolve. 

  • Integration with existing security frameworks (IAM, network segmentation, etc.) for seamless oversight. 
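
A minimal sketch of runtime enforcement against domain-specific rules; the example legal rules and severity thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DomainRule:
    name: str
    violates: Callable[[str], bool]  # returns True if the output breaks this rule
    severity: int                    # 1 = flag for review, 2 = block outright

# Example legal-domain rules (illustrative only).
LEGAL_RULES: List[DomainRule] = [
    DomainRule("missing_disclaimer",
               lambda out: "not legal advice" not in out.lower(), severity=1),
    DomainRule("guarantees_outcome",
               lambda out: "guaranteed to win" in out.lower(), severity=2),
]

def enforce(output: str, rules: List[DomainRule] = LEGAL_RULES) -> dict:
    """Block outputs with severe violations; flag the rest for review."""
    violations = [r for r in rules if r.violates(output)]
    blocked = any(r.severity >= 2 for r in violations)
    return {"allowed": not blocked, "violations": [r.name for r in violations]}
```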

Adversarial Protection

Detection Capabilities 
  • Identification of potential reward-gaming tactics where the model exploits loopholes in verifiers (see the detection sketch after this list). 

  • Analysis of output patterns to catch subtle manipulation or malicious red-teaming attempts. 

  • Monitoring the integrity of verification systems for tampering. 

  • Early warning system for suspicious behaviors or anomalies in reasoning traces. 
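
A minimal sketch of a reward-gaming detector over reasoning traces, flagging outputs that satisfy the format verifier while showing degenerate patterns; the thresholds are illustrative assumptions.

```python
import re
from collections import Counter

def gaming_signals(output: str) -> dict:
    """Heuristics for outputs that pass the format check without real reasoning."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    tokens = think.group(1).split() if think else []
    repetition = max(Counter(tokens).values()) / len(tokens) if tokens else 0.0
    return {
        "empty_reasoning": len(tokens) < 5,   # trivially short chain of thought
        "high_repetition": repetition > 0.3,  # token loops that pad the trace
    }

def is_suspicious(output: str) -> bool:
    """Raise an early warning if any degenerate pattern is present."""
    return any(gaming_signals(output).values())
```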

Mitigation Approaches 
  • Implementation of multi-layer validation steps, ensuring no single checkpoint can be gamed. 

  • Regular security assessments to evaluate and strengthen verifier robustness against adversarial attacks.

By integrating these security measures—ranging from token-level protection (Mirror VectaX) to domain-aligned output checks (Mirror AgentIQ) and adversarial monitoring (Mirror Discover)—organizations can more safely deploy R1 and R1-Zero models and their variants or any other reasoning models. These safeguards help ensure that advanced reasoning capabilities do not inadvertently expose sensitive information or produce non-compliant outputs, thereby maintaining the delicate balance between innovation and security.

Lessons Learned in Our Internal Reproduction

  • Rule-Based Verifiers: Surprisingly robust for large-scale training but limited in nuance. 

  • Strong Base Model: Use a strong base model; smaller models (<3B parameters) don’t pick up much signal.

  • Marker-Specific Rewards: Promising approach to systematically guide each step of reasoning. 

  • Generalization: Verifiable approaches may yield “superhuman” performance in narrow tasks, but it’s uncertain how well they generalize to broader, open-ended problems. 

  • Practical Gains: Our internal benchmarks suggest significant performance boosts (e.g., +32 points on a cyber red-teaming task), validating these strategies in specialized domains.

Final Thoughts

DeepSeek’s R1 and R1-Zero highlight a pragmatic approach to RL-based language model training. By focusing on simple, verifiable reward signals and avoiding overly complex reward models, they sidestepped the high compute and instability issues often seen in RL for large LMs. The introduction of marker-based reasoning also shows a path forward for domain-specific solutions—whether in cybersecurity, law, or medicine. As methods to verify and reward intermediate reasoning become more refined, we can expect further breakthroughs in both accuracy and interpretability. 

The Mirror Security team will open-source the implementation based on the full benchmark runs.


Mirror Security

© All rights reserved
