Learning from Peers in Reasoning Models

Enabling Cross-path Interaction in Parallel Inference


1The Chinese University of Hong Kong, Shenzhen, 2DualityRL, 3USTB, 4Huawei
*Equal Contributions. †Corresponding Author.

Abstract

Large Reasoning Models (LRMs) possess self-correction capabilities, yet they falter when reasoning paths begin poorly—a phenomenon we term the "Prefix Dominance Trap". Inspired by psychological findings on peer interaction, we introduce Learning from Peers (LeaP). In LeaP, every $T$ tokens, reasoning paths summarize and share intermediate insights via a routing mechanism, enabling collaborative inference. For smaller models struggling with these instructions, we developed the fine-tuned LeaP-T series. Experiments on AIME 2024/2025, AIMO 2025, and GPQA Diamond demonstrate substantial improvements. Notably, QwQ-32B with LeaP surpasses its baseline by nearly 5 absolute points on average and outperforms DeepSeek-R1-671B on three math benchmarks. Our LeaP-T-7B matches DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis confirms LeaP's robust error correction via timely peer insights, showcasing strong error tolerance and adaptability to varied task difficulties. LeaP signifies a step towards enabling collaborative reasoning in LRMs.

How LeaP Works: Fostering Collaborative Reasoning

Our method, Learning from Peers (LeaP), enhances the reasoning capabilities of Large Reasoning Models by enabling them to learn from each other during inference. Instead of reasoning in isolation, multiple parallel reasoning paths engage in structured communication. Every $T$ tokens, each path pauses to summarize its current progress and insights. These summaries are then shared with other paths through a routing mechanism that decides which peers' insights each path receives. This allows individual paths to incorporate diverse perspectives and correct potential errors early on, effectively mitigating the "Prefix Dominance Trap".
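The generate-summarize-route cycle above can be sketched as a simple loop. This is a minimal illustrative sketch, not the authors' released code: `generate`, `summarize`, and `route_to` are toy stand-ins for the model's decoding, the `<summarize>` step, and the routing mechanism, and `interval_T` plays the role of $T$ from the paper.

```python
def generate(path, n_tokens):
    # Toy stand-in: append a marker instead of decoding n_tokens real tokens.
    return path + [f"reasoning({n_tokens})"]

def summarize(path):
    # Toy stand-in for the <summarize> step: distill the path's progress.
    return f"summary-of-{len(path)}-segments"

def route_to(i, summaries, top_k=2):
    # Toy stand-in for routing: give path i the first top_k peer summaries.
    return [s for j, s in enumerate(summaries) if j != i][:top_k]

def leap_inference(prompt, num_paths=4, interval_T=4096, rounds=3):
    """Parallel paths that pause every interval_T tokens to exchange summaries."""
    paths = [[prompt] for _ in range(num_paths)]
    for _ in range(rounds):
        paths = [generate(p, interval_T) for p in paths]    # independent decoding
        summaries = [summarize(p) for p in paths]           # summarization stage
        for i in range(num_paths):                          # routing stage
            peers = route_to(i, summaries)
            paths[i].append("<peer_summary> " + " | ".join(peers) + " </peer_summary>")
    return paths
```

In a real deployment, each round would resume decoding from the path's full context, now containing the injected `<peer_summary>` block.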

Overview of LeaP vs Independent Reasoning

Figure: (a) Traditional Independent Reasoning where paths are generated in parallel without interaction. (b) Our proposed Learning from Peers (LeaP) method, which inserts LeaP blocks to facilitate cross-path communication and learning.

The Anatomy of a LeaP Block

A LeaP block is the core component enabling peer interaction. It consists of two main stages: Summarization, where each path distills its current state, and Routing, where these insights are strategically shared. The model is then prompted to consider these peer summaries before continuing its own reasoning.

... [previous reasoning context] ...

Alright, let’s take a step back and summarize what we’ve figured out so far.

<summarize> In short, my current conclusions are that {Path's own generated summary, e.g., "The rhombus has vertices at (p,q), (r,s) on the hyperbola. Minimal BD squared is 980/119."} </summarize>

Hmm, it seems that my peers have given me some comments, so let me check if anyone’s conclusions are different from mine before I continue my own reasoning.

<peer_summary>
Peer 1: "{Summary from Peer 1, e.g., BD can be expressed as 80 + (22v)/3...}"
Peer 2: "{Summary from Peer 2, e.g., Expressed BD in terms of t... BD approaches 480 as t approaches infinity...}"
...
</peer_summary>

... [model continues reasoning, incorporating peer insights] ...

This structured interaction, exemplified in Figure 3 of our paper, allows paths to dynamically adjust their trajectories based on collective intelligence, leading to more robust and accurate reasoning.
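The LeaP block shown in the template above can be assembled programmatically. The sketch below is a hypothetical helper (not part of the released code) that formats a path's own summary and its routed peer summaries using the tags from the template:

```python
def build_leap_block(own_summary, peer_summaries):
    """Format one LeaP block: the path's own summary, then its peers'."""
    lines = [
        "Alright, let's take a step back and summarize what we've figured out so far.",
        f"<summarize> In short, my current conclusions are that {own_summary} </summarize>",
        "Hmm, it seems that my peers have given me some comments, so let me check "
        "if anyone's conclusions are different from mine before I continue my own reasoning.",
        "<peer_summary>",
        *[f'Peer {i + 1}: "{s}"' for i, s in enumerate(peer_summaries)],
        "</peer_summary>",
    ]
    return "\n".join(lines)

# Example with the hyperbola problem from the template above:
block = build_leap_block(
    "The rhombus has vertices on the hyperbola. Minimal BD squared is 980/119.",
    ["BD can be expressed as 80 + (22v)/3..."],
)
```

The resulting text is appended to the path's context, after which the model continues decoding and can weigh its peers' conclusions against its own.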

Example of LeaP Communication

Figure: An example illustrating how LeaP enables communication between reasoning paths (Path i and Path j).

Key Results & Improvements

Our comprehensive evaluations on challenging benchmarks like AIME 2024/2025, AIMO 2025, and GPQA Diamond demonstrate the significant advantages of LeaP and LeaP-T.

Across four benchmarks, DeepSeek-R1-Distill-Qwen-7B with LeaP (Top-4 Dispersed routing) exceeds the baseline by an average of 6.49 points. Similarly, DeepSeek-R1-Distill-Qwen-14B with LeaP surpasses its baseline by 6.08 points. Impressively, QwQ-32B with LeaP even outperforms the much larger DeepSeek-R1-671B on all three math datasets. Our fine-tuned LeaP-T-7B model achieves performance comparable to DeepSeek-R1-Distill-Qwen-14B on AIME 2024.

Performance on Standard Benchmarks (Pass@1)

Main Results Table

Table: LeaP significantly outperforms baselines across AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond. Different routing strategies (Clustered, Hybrid, Dispersed) and numbers of peers (Top-2, Top-4) are evaluated.
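One way to read the routing strategies in the table is as different rankings over peers by how similar their summaries are: Clustered favors like-minded peers, Dispersed favors diverse ones, and Hybrid mixes both. The sketch below is an assumption-laden illustration, not the paper's implementation; in particular, the word-overlap Jaccard similarity is a toy stand-in for whatever similarity measure is actually used.

```python
def jaccard(a, b):
    # Toy similarity: word-overlap Jaccard between two summaries.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def route(path_idx, summaries, top_k=2, strategy="dispersed"):
    """Pick which top_k peer summaries path `path_idx` receives."""
    peers = [(j, jaccard(summaries[path_idx], s))
             for j, s in enumerate(summaries) if j != path_idx]
    if strategy == "clustered":      # most similar peers: reinforce consensus
        peers.sort(key=lambda x: -x[1])
    elif strategy == "dispersed":    # least similar peers: maximize diversity
        peers.sort(key=lambda x: x[1])
    else:                            # hybrid: half most similar, half least similar
        by_sim = sorted(peers, key=lambda x: -x[1])
        peers = by_sim[: top_k // 2] + by_sim[::-1][: top_k - top_k // 2]
    return [j for j, _ in peers[:top_k]]
```

Under this reading, Top-2 vs. Top-4 simply varies `top_k`, i.e., how many peer summaries each path receives per LeaP block.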

Overcoming the "Prefix Dominance Trap"

Prefix Dominance Trap Results

Figure: Performance drop when starting with poor beginnings (Prefix Dominance Trap).

LeaP Mitigates the Trap

LeaP Mitigates Prefix Dominance Trap

Figure: LeaP consistently reduces the performance gap caused by poor beginnings.

LeaP-T: Fine-tuned for Enhanced Collaboration

LeaP-T Results Table

Table: Evaluation of our LeaP-T models (1.5B to 14B) on three math benchmarks, showing consistent improvements over baselines and non-fine-tuned LeaP.

Case Studies: LeaP in Action

Explore how LeaP helps models correct their reasoning paths by learning from peers.

Case Study 1: Successful Correction
Successful Correction: Initially on an incorrect path, the model receives diverse summaries from peers. By identifying a more promising approach from a peer, it re-evaluates and reaches the correct answer.
Case Study 2: Overcoming Bad Beginning
Overcoming Bad Beginning (Hyperbola Problem): Starting with an incorrect prefix for the hyperbola problem, the model receives a correct summary from Peer 4 regarding $BD^2$ approaching 480. This contradicts its own flawed reasoning, prompting self-verification and leading to the correct answer of 480.

Roadmap & Future Work

Current Progress & TODOs

  • Open-source our code, datasets, and V1 models (Target: May 13, 2025)
  • Publish our LeaP-R1 models trained by RL (Target: August 2025)

Promising Future Directions

Learning from Peers in Reinforcement Learning: Integrating LeaP into RL frameworks could empower models to learn collaborative problem-solving strategies more effectively, potentially unlocking greater capabilities.

Learning from Peers with Different Expertise: Leveraging peers with specialized skills (e.g., some using web search, others using code execution) could significantly enhance reasoning quality, especially for multifaceted problems.

Citation

If you find our work useful, please cite our paper.
@article{luo2025learning,
  title={Learning from Peers in Reasoning Models},
  author={Luo, Tongxu and Du, Wenyu and Bi, Jiaxi and Chung, Stephen and Tang, Zhengyang and Yang, Hao and Zhang, Min and Wang, Benyou},
  journal={arXiv preprint arXiv:2505.07787},
  year={2025}
}