ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Summary
ProRL (Prolonged Reinforcement Learning) is a training methodology designed to expand the reasoning boundaries of large language models (LLMs). The approach demonstrates experimentally that prolonged RL can uncover genuinely new reasoning strategies, rather than merely amplifying outputs already latent in the base model.
Core methodology
- KL divergence control: the gap between the online policy and a reference policy is constrained to keep exploration stable during training.
- Reference policy resets: the reference model is periodically reset to a recent snapshot to sustain exploration.
- Diverse task suite: training spans five domains, covering mathematical reasoning, code generation, STEM, logic puzzles, and instruction following.
Key experimental results (average pass@1 gains over the DeepSeek-R1-Distill-Qwen-1.5B base model)

| Domain | Avg. pass@1 gain |
|---|---|
| Math | +14.7% |
| Code | +13.9% |
| Logic puzzles | +54.8% |
| STEM reasoning | +25.1% |
| Instruction following | +18.1% |

Notable findings:
- The 1.5B-parameter model keeps improving beyond 2,000 RL training steps.
- On several tasks where the base model scores 0% regardless of the sampling budget, the RL-trained model reaches up to 100% pass rates.
- The size of the gain on each task is predictable from the base model's initial competence: RL expands the reasoning boundary most in domains where the base model initially struggles.
Significance
ProRL shows that language models can develop new reasoning strategies through prolonged RL without additional training data, suggesting that models can autonomously improve complex reasoning over time. The team releases the model weights to facilitate follow-up research.
Abstract
Recent advances in reasoning-centric language models have highlighted reinforcement
learning (RL) as a promising method for aligning models with verifiable
rewards. However, it remains contentious whether RL truly expands a model’s
reasoning capabilities or merely amplifies high-reward outputs already latent in the
base model’s distribution, and whether continually scaling up RL compute reliably
leads to improved reasoning performance. In this work, we challenge prevailing
assumptions by demonstrating that prolonged RL (ProRL) training can uncover
novel reasoning strategies that are inaccessible to base models, even under extensive
sampling. We introduce ProRL, a novel training methodology that incorporates
KL divergence control, reference policy resetting, and a diverse suite of tasks. Our
empirical analysis reveals that RL-trained models consistently outperform base
models across a wide range of pass@k evaluations, including scenarios where base
models fail entirely regardless of the number of attempts. We further show that
reasoning-boundary improvements correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate
new regions of solution space over time. These findings offer new insights into
the conditions under which RL meaningfully expands reasoning boundaries in
language models and establish a foundation for future work on long-horizon RL
for reasoning. We release model weights to support further research:
https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
1 Introduction
Recent advances in reasoning-focused language models, exemplified by OpenAI-O1 [1] and
DeepSeek-R1 [2], have marked a paradigm shift in artificial intelligence by scaling test-time computation.
Specifically, test-time scaling enables long-form Chain-of-Thought (CoT) thinking and
induces sophisticated reasoning behaviors, leading to remarkable improvements on complex tasks
such as mathematical problem solving [3–6] and code generation [7, 8]. By continuously expending
compute throughout the reasoning process—via exploration, verification, and backtracking—models
boost their performance at the cost of generating longer reasoning traces.
At the heart of these advances lies reinforcement learning (RL), which has become instrumental in
developing sophisticated reasoning capabilities. By optimizing against verifiable objective rewards
rather than learned reward models, RL-based systems can mitigate the pitfalls of reward hacking [9–
11] and align more closely with correct reasoning processes. However, a fundamental question
remains under active debate within the research community: Does reinforcement learning truly unlock new reasoning capabilities from a base model, or does it merely optimize the sampling
efficiency of solutions already embedded in the base model?
Recent studies [13–15] argue for the latter, claiming on the basis of pass@k metrics that RL-trained models do not acquire new reasoning capabilities beyond what already exists in their base models. We posit
that these conclusions may stem from methodological constraints rather than fundamental limitations
of RL approaches themselves. Specifically, we identify two key limitations in existing research: (1)
an overreliance on specialized domains like mathematics, where models are often overtrained during
both pre-training and post-training phases, thereby restricting the potential for exploration; and (2)
the premature termination of RL training, typically after no more than a few hundred steps [13], before models can fully explore and develop new reasoning capabilities.
In this study, we address these limitations through several key contributions. First, we introduce
ProRL, a recipe designed to enable extended RL training periods that facilitate deeper exploration
of reasoning strategies. It enables more than 2k training steps and scales the training data across
diverse tasks—from traditional math and code tasks to STEM problems, logical puzzles, and instruction
following, which, we hypothesize, are crucial for generalization. Using ProRL, we developed
Nemotron-Research-Reasoning-Qwen-1.5B, the world’s best 1.5B reasoning model
that significantly outperforms its base model, DeepSeek-R1-1.5B, and matches or even surpasses
the performance of DeepSeek-R1-7B across a diverse range of benchmarks. Notably, compared to
DeepSeek-R1-1.5B, we achieve average pass@1 improvements of 14.7% on math benchmarks, 13.9%
on coding, 54.8% on logic puzzles, 25.1% on STEM reasoning, and 18.1% on instruction-following
tasks (Figure 1, Right). More importantly, ProRL demonstrates continued performance improvements
after an unprecedented 2k training steps (Figure 1, Left), suggesting that RL training scales effectively
with increased compute.
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights—RL can
indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient
training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that
our model generates novel insights and performs exceptionally well on increasingly
difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond
its initial training. Most strikingly, we identify many tasks where the base model fails to produce any
correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100%
pass rates (Figure 4). Interestingly, we find the amount of gain from RL on each task is predictable
given the base model’s performance—RL expands a model’s reasoning boundary most effectively in
domains where the base model initially struggles. Moreover, we quantify the novelty of the model’s
reasoning trajectories using the Creativity Index [12], which measures the amount of overlap with
a pretraining corpus. We find that prolonged RL training leads to trajectories with higher novelty
(Figure 1, Middle), indicating the emergence of new reasoning patterns during RL.
Our findings hold significant implications for the broader AI community, demonstrating that RL
approaches can indeed enhance model capabilities without requiring additional training data. Through
sustained exploration, models can develop new knowledge and reasoning strategies that potentially
exceed human insights. This work reaffirms the value of reinforcement learning as a pathway toward
more capable and generalizable AI systems, challenging previous assumptions about the inherent
limitations of these approaches.
2 ProRL: Prolonged Reinforcement Learning
We begin with a brief overview of the GRPO [16] algorithm. We then address key challenges in
prolonged RL training, such as entropy collapse and instability, by introducing a KL divergence
penalty and periodic resets of the reference policy. This ensures stable training across many epochs
and continued performance improvement.
2.1 Background: Group Relative Policy Optimization
We adopt Group Relative Policy Optimization (GRPO) [16] as the core RL algorithm. Compared
with Proximal Policy Optimization (PPO) [17], it removes the value model and instead uses baseline estimates based on group scores. Formally, GRPO maximizes the following objective:
$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\min\Big(r_\theta(\tau)\,A(\tau),\ \operatorname{clip}\big(r_\theta(\tau),\,1-\epsilon,\,1+\epsilon\big)\,A(\tau)\Big)\right],$$

where $\tau$ is a response sampled from the current policy $\pi_\theta$, and $r_\theta(\tau) = \frac{\pi_\theta(\tau)}{\pi_{\text{old}}(\tau)}$ is the probability ratio between the current policy and the old policy before each actor update. The advantage used in GRPO forgoes the critic model of PPO and instead estimates a baseline from the group scores $\{R_i\}_{i \in G(\tau)}$:

$$A(\tau) = \frac{R(\tau) - \operatorname{mean}\big(\{R_i\}_{i \in G(\tau)}\big)}{\operatorname{std}\big(\{R_i\}_{i \in G(\tau)}\big)}.$$
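For illustration, the group-relative advantage can be computed as in the following minimal sketch (not the released training code; the tensor shapes and the small epsilon for numerical stability are illustrative assumptions):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: tensor of shape (num_prompts, group_size) holding the verifiable
             reward R_i of each sampled response, grouped by prompt.
    Returns a tensor of the same shape: (R_i - group mean) / (group std + eps).
    """
    mean = rewards.mean(dim=-1, keepdim=True)   # baseline from group scores
    std = rewards.std(dim=-1, keepdim=True)     # normalizer for reward scale
    return (rewards - mean) / (std + eps)
```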
2.2 Prolonged Reinforcement Learning (ProRL)
2.2.1 Mitigating Entropy Collapse
A key challenge in prolonged policy optimization is entropy collapse, a phenomenon where the
model’s output distribution becomes overly peaked early in training, resulting in sharply reduced
entropy. When entropy collapses, the policy prematurely commits to a narrow set of outputs, severely
limiting exploration. This is particularly detrimental in methods like GRPO, where the learning
signal depends on having a diverse set of sampled outputs to effectively estimate relative advantages.
Without sufficient exploration, policy updates become biased, leading to stagnation in training.
A common mitigation strategy is to increase the sampling temperature during rollouts. However, we
find that this approach only delays the onset of entropy collapse rather than preventing it altogether,
as entropy continues to decline steadily as training progresses. Nonetheless, we still employ a high
rollout temperature, since it encourages exploration by increasing the initial entropy.
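To make the entropy-collapse discussion concrete, one quantity that can be logged during rollouts is the average token-level policy entropy; a steadily shrinking value signals collapsing exploration. A minimal sketch (assumed tensor shapes, not the paper's monitoring code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
    """Average per-token entropy of the rollout policy.

    logits: (batch, seq_len, vocab) raw logits for the sampled responses.
    mask:   (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)   # (batch, seq_len)
    return (token_entropy * mask).sum().item() / mask.sum().item()
```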
2.2.2 Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
To address entropy collapse, we adopt several components from the DAPO algorithm [4], which are
specifically designed to maintain exploration and output diversity. First, DAPO introduces decoupled
clipping, where the lower and upper clipping bounds in the PPO objective are treated as separate
hyperparameters, so the symmetric clip term becomes

$$\operatorname{clip}\big(r_\theta(\tau),\ 1-\epsilon_{\text{low}},\ 1+\epsilon_{\text{high}}\big).$$
By setting a higher value for ϵhigh, the algorithm promotes ‘clip-higher’, uplifting the probabilities
of previously unlikely tokens and encouraging broader exploration. We find that this modification
helps retain entropy and reduces premature mode collapse.
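A minimal sketch of the decoupled "clip-higher" surrogate (a token-level simplification, not the paper's verl implementation; the default bounds mirror the values used in Section 3.2):

```python
import torch

def decoupled_clip_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                        advantages: torch.Tensor,
                        eps_low: float = 0.2, eps_high: float = 0.4) -> torch.Tensor:
    """PPO-style surrogate with asymmetric clipping bounds (DAPO 'clip-higher').

    All tensors share the same shape (e.g., batch x tokens); advantages are
    broadcast from the group-relative GRPO estimates.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()  # negate so minimizing the loss maximizes the objective
```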
Additionally, DAPO employs dynamic sampling, filtering out prompts for which the model consistently
succeeds or fails (i.e., accuracy 1 or 0), as these provide no learning signal. This focus on
intermediate difficulty examples further helps maintain a diverse learning signal during training.
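Dynamic sampling can be sketched as a simple filter over prompt groups (an illustration only; the actual batching logic in the training pipeline is more involved):

```python
def keep_informative_prompts(group_accuracies: list[float]) -> list[int]:
    """Indices of prompts whose sampled group is neither all-correct nor all-wrong.

    group_accuracies[i] is the fraction of the i-th prompt's sampled responses
    that received a positive reward; values of exactly 0.0 or 1.0 carry no
    learning signal under a group-relative baseline, so those prompts are dropped.
    """
    return [i for i, acc in enumerate(group_accuracies) if 0.0 < acc < 1.0]
```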
2.2.3 KL Regularization and Reference Policy Reset
While DAPO and temperature adjustment help slow entropy collapse, we find that explicit regularization
via a KL divergence penalty provides a stronger and more stable solution. Specifically, we
incorporate a KL penalty between the current policy $\pi_\theta$ and a reference policy $\pi_{\text{ref}}$:

$$J_{\text{KL-RL}}(\theta) = J_{\text{GRPO}}(\theta) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big).$$
This penalty not only helps maintain entropy but also serves as a regularizer to prevent the online
policy from drifting too far from a stable reference, stabilizing learning and mitigating overfitting to
spurious reward signals.
Recent works [4, 7, 5, 18] have argued for the removal of the KL penalty, citing that models naturally
diverge during training on chain-of-thought reasoning tasks. We observe that this perspective often
applies to base models prior to any supervised fine-tuning. In contrast, we begin from a well-initialized
checkpoint (DeepSeek-R1-Distill-Qwen-1.5B) already capable of generating coherent CoT outputs.
In this context, retaining a KL penalty is still beneficial for both stability and sustained entropy.
We further observe that as training progresses, the KL term may increasingly dominate the loss,
leading to diminishing policy updates. To alleviate this, we introduce a simple yet effective technique:
reference policy reset. Periodically, we hard-reset the reference policy πref to a more recent snapshot
of the online policy πθ, and reinitialize the optimizer states. This allows the model to continue
improving while maintaining the benefits of KL regularization. We apply this reset strategy throughout
training to avoid premature convergence and encourage prolonged training.
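The interaction between the KL penalty and the periodic reference reset can be sketched as follows (a schematic, not the released training code; the penalty coefficient beta and the reset trigger are illustrative assumptions, with the trigger in practice following the validation-based procedure of Section 3.3):

```python
import torch

def kl_to_ref(log_probs: torch.Tensor, ref_log_probs: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """Sample-based estimate of KL(pi_theta || pi_ref) over response tokens."""
    return ((log_probs - ref_log_probs) * mask).sum() / mask.sum()

def prorl_loss(surrogate: torch.Tensor, log_probs: torch.Tensor,
               ref_log_probs: torch.Tensor, mask: torch.Tensor,
               beta: float = 1e-3) -> torch.Tensor:
    """Clipped surrogate (e.g., the decoupled-clip loss above) plus a KL penalty."""
    return surrogate + beta * kl_to_ref(log_probs, ref_log_probs, mask)

def reset_reference(policy: torch.nn.Module, ref_policy: torch.nn.Module,
                    optimizer: torch.optim.Optimizer) -> None:
    """Hard-reset pi_ref to the current online policy and wipe optimizer state.

    Invoked periodically, e.g. when validation stagnates or the KL term starts
    to dominate the loss, so the penalty regularizes toward a recent snapshot
    rather than the distant original checkpoint.
    """
    ref_policy.load_state_dict(policy.state_dict())
    optimizer.state.clear()  # one simple way to reinitialize AdamW moments
```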
3 Nemotron-Research-Reasoning-Qwen-1.5B: The World’s Best 1.5B
Reasoning Model
We present Nemotron-Research-Reasoning-Qwen-1.5B, a generalist model trained via reinforcement
learning on a diverse, verifiable dataset of 136K problems across math, code, STEM, logic puzzles,
and instruction following. Leveraging stable reward computation, improved GRPO, and prolonged
training, our model achieves strong generalization across domains. It outperforms DeepSeek-R1-
Distill-Qwen-1.5B by +15.7% on math, +14.4% on code, +25.9% on STEM, +22.0% on instruction
following, and +54.8% on text-based logic puzzles from Reasoning Gym. It also surpasses domain-specialized
baselines in both math (+4.6%) and code (+6.5%), demonstrating the effectiveness of
generalist prolonged RL training.
3.1 Training Dataset
We construct a diverse and verifiable training dataset spanning 136K examples across five task domains
(math, code, STEM, logical puzzles, and instruction following) to enable robust reinforcement learning
from a wide range of reasoning problems. Each task type is paired with a clear reward signal (binary
or continuous), allowing for reliable feedback during training. This broad task coverage encourages
generalization beyond narrow domains and enables meaningful comparison of RL algorithms across
diverse reward structures. Details on the composition of the training dataset are presented in Appendix D.
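For example, a rule-based binary reward for a math-style task can be as simple as an exact-match check on the final boxed answer (a deliberately simplified sketch; the verifiers used for training are more robust):

```python
import re

def math_binary_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the model's final boxed answer matches the reference, else 0.0.

    Extracts the last \\boxed{...} expression from the response and compares it
    to the gold answer after whitespace normalization.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    prediction = matches[-1].strip().replace(" ", "")
    return 1.0 if prediction == gold_answer.strip().replace(" ", "") else 0.0
```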
3.2 Training Setup
We use verl [19] for reinforcement learning training. We adopt enhancements of GRPO [16] proposed
by DAPO [4], decoupling clipping hyperparameters with ϵlow = 0.2, ϵhigh = 0.4, and dynamic
sampling to filter out prompts that are too easy or too difficult (i.e., accuracy equal to 1 or 0). For
rollout, we sample n = 16 responses for each prompt with a context window limit of 8096 and use a
high sampling temperature of 1.2. We set batch size to 256 and mini-batch size to 64 (equating to 4
gradient updates per rollout step). For training we use the AdamW [20] optimizer with a constant
learning rate of 2 × 10−6. We conduct training on four nodes of 8 x NVIDIA-H100-80GB GPUs, and the whole
training run takes approximately 16k GPU hours.
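The hyperparameters above can be summarized in a single configuration object (values taken directly from this section; the key names are illustrative and do not follow verl's configuration schema):

```python
PRORL_TRAINING_CONFIG = {
    "algorithm": "GRPO with DAPO enhancements",
    "eps_low": 0.2,                 # lower clipping bound
    "eps_high": 0.4,                # upper clipping bound ("clip-higher")
    "dynamic_sampling": True,       # drop prompts with group accuracy 0 or 1
    "rollouts_per_prompt": 16,      # n sampled responses per prompt
    "rollout_max_tokens": 8096,     # context window limit for most of training
    "rollout_temperature": 1.2,     # high temperature to encourage exploration
    "batch_size": 256,
    "mini_batch_size": 64,          # 4 gradient updates per rollout step
    "optimizer": "AdamW",
    "learning_rate": 2e-6,          # constant schedule
    "hardware": "4 nodes x 8 NVIDIA H100 80GB",
}
```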
3.3 ProRL Training Dynamics
To enable effective long-horizon reinforcement learning, we monitor training progress using a blended
validation set derived from the evaluation benchmark. When validation performance stagnates or
degrades, we perform a hard reset of the reference model and optimizer. This not only restores
training stability but also facilitates greater policy divergence from the base model. Throughout most
of training, we cap response length at 8k tokens to maintain concise and stable generations. In the final stage (~200 steps), we increase the context window to 16k tokens, observing that the model
adapts quickly and achieves measurable improvements. We detail our training recipe in Appendix E.
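The stagnation check that triggers a hard reset can be sketched as follows (the window size and tolerance are illustrative assumptions; the paper does not specify exact values):

```python
def validation_stagnated(history: list[float], window: int = 5,
                         tolerance: float = 1e-3) -> bool:
    """True if the blended validation score has not improved over the last `window` evals."""
    if len(history) <= window:
        return False
    recent_best = max(history[-window:])
    previous_best = max(history[:-window])
    return recent_best <= previous_best + tolerance
```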
Table 1: Performance (pass@1) comparison for benchmarks across Math domain. The best results
are highlighted in bold. The results of DeepSeek-R1-Distill-Qwen-7B are marked as gray and are
provided as a reference (same in all following tables).
Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark
names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
Table 3: Performance comparison on STEM reasoning (GPQA Diamond), instruction following
(IFEval), and logic puzzles (Reasoning Gym) tasks. We also present results on OOD tasks: acre,
boxnet, and game_of_life_halting (game).
Figure 2 illustrates key statistics on training dynamics
over the course of extended reinforcement learning
across multiple stages. The application of various
enhancements proposed by DAPO [4], along
with the inclusion of KL divergence loss, helped the
model avoid entropy collapse. Although we observe
a positive correlation between average response
length and validation scores, this factor does not appear
to be decisive, as there are training stages where
performance improves without requiring longer responses.
Meanwhile, the validation performance,
measured by both pass@1 and pass@16, consistently
improved and scaled with increased training
computation.
3.4 Evaluation
Evaluation Benchmarks. We evaluate models on the breadth of various tasks across math, coding,
reasoning, and instruction following. For math, we follow DeepScaleR [3] and SimpleRL [21], and
evaluate on AIME2024 [22], AIME2025 [23], AMC [24] (composed of AMC2022 and AMC2023),
MATH [25], Minerva Math [26], and Olympiad Bench [27]. For coding, we use the validation set
from PRIME [28], consisting of APPS [29], Codecontests [30], Codeforces, and TACO [31]. We also
include the HumanevalPlus [32] and LiveCodeBench [33] benchmarks. For logic puzzles, we reserve
100 samples from each Reasoning Gym task as a held-out test set. In addition, we use a curated subset of GPQA Diamond [34] and IFEval [35] to evaluate the capability of our models
in STEM reasoning and instruction following [36].
Evaluation Settings. We use vllm [37] as the inference backend, with a sampling temperature of 0.6,
nucleus sampling [38] with top_p = 0.95 and maximum response length of 32k. For math, coding,
and STEM reasoning tasks, we estimate pass@1 from 16 samples per benchmark prompt using
strictly binary rewards. For other tasks (logical puzzles and instruction following), we
calculate the average continuous reward score from our rule-based verifiers. We evaluate and report
benchmark results for open-source models using our own evaluation settings.
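Pass@1 here is simply the expected fraction of correct samples; the pass@k numbers reported later are typically computed with the unbiased estimator of Chen et al. (2021), shown below as a reference sketch rather than the exact evaluation code used in this paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct (Chen et al., 2021).

    Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 16 samples with 4 correct -> pass@1 reduces to plain accuracy 4/16.
assert abs(pass_at_k(16, 4, 1) - 0.25) < 1e-9
```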
Evaluation Results. We provide a detailed comparison between DeepSeek-R1-Distill-Qwen-1.5B
and our final model Nemotron-Research-Reasoning-Qwen-1.5B across multiple domains. In the math
domain shown in Table 1, our model consistently outperforms the base model across benchmarks,
showing an average improvement of 15.7%. For code domain results shown in Table 2, our final model
surpasses the base model in competitive programming tasks as measured by pass@1 accuracy by
14.4%. Our model also demonstrates substantial gains in STEM reasoning and instruction following,
with improvements of 25.9% on GPQA Diamond and 22.0% on IFEval. Our model achieves high
accuracy on Reasoning Gym logic puzzles after training, improving reward by 54.8%, whereas the base
model struggles with formatting and with challenging subtasks. Even compared to a much larger
model, DeepSeek-R1-Distill-Qwen-7B, our model achieves comparable or even better performance
across multiple domains.
Generalization to OOD Tasks. In Table 3, we also present results on out-of-distribution (OOD) tasks
in Reasoning Gym. Our model shows significant improvements on three OOD tasks, demonstrating
stronger generalization beyond the training distribution. This highlights the effectiveness of our
training approach in enabling the model to adapt and perform well on unseen challenges.
Comparison with Domain-Specialized Models. We compare the performance of Nemotron-
Research-Reasoning-Qwen-1.5B with two domain-specialized baselines: DeepScaleR-1.5B [3],
tailored for mathematical reasoning, and DeepCoder-1.5B [7], focused on competitive programming
tasks. Our ProRL trained model enables strong generalization, achieving superior pass@1 scores on
both math (+4.6%) and code (+6.5%) benchmarks. Additionally, ProRL enables deeper exploration
and refinement within a limited response length, whereas prior works often increase the training response
length too early, causing "overthinking" [39] and verbose reasoning.
4 Analysis: Does ProRL Elicit New Reasoning Patterns?
To evaluate whether prolonged ProRL
training enhances reasoning beyond the
base model, we increase inference samples
to 256 and re-evaluate performance.
Due to compute limits, we randomly select
18 Reasoning Gym tasks (out of 96)
and re-run all other benchmarks: math,
code, STEM reasoning, and instruction
following. We compare the base model
(DeepSeek-R1-Distill-Qwen-1.5B), an intermediate checkpoint, and Nemotron-Research-Reasoning-Qwen-1.5B (the final model after extended training).
4.1 The Weaker the Start, the Stronger the Gain with ProRL
A key finding from our study is that the effectiveness of RL in expanding a model’s reasoning
boundary (measured by pass@128) is strongly influenced by the base model’s initial capabilities. As
shown in Figure 3, we observe a significant negative correlation between the base model’s reasoning
boundary and the extent of reasoning improvement after RL training. Specifically, tasks where the
base model already performs well (i.e., high pass@128) tend to exhibit minimal or even negative gains in reasoning breadth post-RL. This indicates a narrowing of the reasoning boundary, where the
model becomes more confident in a subset of solutions it already understands, rather than exploring
new reasoning patterns. In contrast, in domains where the base model struggles, particularly those
with a low initial pass@128, RL training is most effective. Here, ProRL not only improves pass@1,
but also expands the model’s ability to explore and succeed in a broader range of reasoning paths.
To further confirm our intuition that tasks with minimal gains post-RL are those the base model
is familiar with, we compute the creativity index [40] of the base model’s responses for each task
against the largest open-source pretraining corpus, DOLMA [41]. The creativity index quantifies the
degree of overlap between the model's responses and the pretraining corpus. We find that tasks where
the base model already performs well, including some math and code tasks highlighted in the circle,
tend to have lower creativity indices, suggesting the base model has seen a large amount of
similar data during pretraining.
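As a rough intuition for what the creativity index captures, the core ingredient is n-gram coverage against a reference corpus. The sketch below is a simplification for illustration only, not the actual Creativity Index implementation, which aggregates coverage across multiple n against a web-scale index of DOLMA:

```python
def ngram_coverage(response: str, corpus_ngrams: set[tuple[str, ...]], n: int = 5) -> float:
    """Fraction of the response's n-grams that also appear in a reference corpus.

    Higher coverage (i.e., lower 'creativity') suggests the text closely echoes
    pretraining data already seen by the model.
    """
    tokens = response.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(g in corpus_ngrams for g in grams) / len(grams)
```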
4.2 Unpacking ProRL’s Reasoning Boundaries: Diminish, Plateau, and Sustained Gains
We analyze performance trends on individual benchmarks and categorize them based on how pass@k
evolves throughout training. Our analysis reveals that reinforcement learning can meaningfully
expand a model’s reasoning capacity, particularly on challenging tasks that extend beyond the
capability of the base model. While some tasks exhibit early saturation or even regressions in
reasoning breadth, we also observe clear instances where the model’s reasoning capabilities expand
with continued training. Most notably, on some domains such as code generation, ProRL enables
continued gains, suggesting that prolonged training allows the model to explore and internalize more
sophisticated reasoning patterns. This demonstrates that, under the right conditions, ProRL can push
the frontier of a model’s reasoning abilities beyond what the base model achieves.
Diminished Reasoning Boundary In some benchmarks (particularly in the math domain), Nemotron-
Research-Reasoning-Qwen-1.5B exhibits decreased or unchanged reasoning capacity compared to
the base model, aligning with observations of prior work [13]. Although pass@1 improves, the
pass@128 score, which reflects broader reasoning ability, often declines. These tasks tend to have a
high baseline pass@128, suggesting that the base model already possesses sufficient reasoning ability,
and RL training merely sharpens the output distribution at the expense of exploration and generality.
Gains Plateau with RL For these tasks, RL training boosts both pass@1 and pass@128, indicating
improved reasoning. However, these gains are largely achieved early in training. Comparing the
intermediate and final checkpoints shows that ProRL offers negligible additional benefit, implying
that the model quickly saturates its learning potential for these tasks.
Sustained Gains from ProRL In contrast, on some benchmarks, particularly more complex ones such
as coding, Nemotron-Research-Reasoning-Qwen-1.5B shows continued improvements in reasoning
capacity with prolonged RL training. These tasks likely require extensive exploration of diverse
problem instances during training to generalize effectively to the test set. In such cases, ProRL
expands the model’s reasoning boundaries.
4.3 ProRL Enhances Out-of-Distribution Reasoning
We focus on how ProRL influences the model’s ability to generalize beyond the distribution of its
training data. These studies aim to isolate the role of extended RL updates in expanding the model’s
reasoning boundaries, especially on structurally novel or semantically challenging tasks that were not
encountered during initial training.
Out-of-Distribution (OOD) Task We evaluate the model on
Reasoning Gym task boxnet, which was not seen during training.
As shown in Figure 5 (Check Appendix C.3 for an example), the
base model exhibits no capability of solving the task. In contrast,
the model trained with ProRL demonstrates a significant ability to
solve the problem, indicating a clear expansion in the model’s reasoning
boundary, generalizing to out-of-distribution tasks unseen
during training. Furthermore, when comparing an intermediate
RL checkpoint with the final prolonged RL model, we observe
that extended training sustains and amplifies performance gains
consistently across all values of k. These results further support
the conclusion that ProRL enables the model to internalize abstract
reasoning patterns that generalize beyond specific training
distributions or complexity levels.
Increased Task Difficulty We evaluate performance across varying
levels of task difficulty for graph_color task (Check Appendix
C.1 for an example) by generating graph problems with
different numbers of graph nodes. While the training data only
includes graphs of size 10, we test on larger graphs to assess
generalization beyond the training regime. Figure 6 plots the
pass@1 (solid lines) and pass@128 (dashed lines) across different
models. The results reveal a consistent decline in performance
as task difficulty increases, which is expected given the combinatorial
growth in solution space. However, our prolonged RL
model maintains significantly higher accuracy across all graph
sizes compared to both the base and intermediate models. This
indicates that extended RL updates not only enhance pass@1 on
in-distribution tasks but also improve the model’s robustness to more complex, unseen scenarios.
4.4 How Do pass@1 Distributions Evolve as ProRL Progresses?
Dang et al. [14] derived a mathematical upper bound on pass@k in terms of ρx, the pass@1 accuracy
for task x: increasing the expected pass@1 raises this upper bound, while higher variance across
tasks reduces it. In contrast to [14]'s observation of declining pass@k
during training, our results in Figure 1 demonstrate continuous improvement in both pass@1 and
pass@16, reproducing the scaling law patterns reported for OpenAI O1’s RL training [42]. Our
ProRL approach generates substantial performance gains across diverse tasks. Figures 7(a) and 7(b)
illustrate significant rightward distribution shifts in code and logic puzzle tasks. Initially concentrated
near zero with extended tails, the pass@1 distributions evolved markedly after training. Codeforces
problems exhibit broader distribution patterns post-training, while the family_relationships task (see Appendix C.2 for an example), a novel reasoning challenge, demonstrates a dramatic shift
from predominantly zero accuracy to peaking at perfect accuracy, indicating successful solution
discovery across the majority of prompts. These pronounced distribution changes, driven by extended
RL training, produce sufficient improvement in expected pass@1 to overcome any negative effects
from increased variance.
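One way to see the interplay between expected pass@1 and its variance stated above (a hedged sketch of the standard second-order argument, not necessarily the exact bound derived in [14]): for a task $x$ with per-sample success probability $\rho_x$, $\operatorname{pass@}k(x) = 1-(1-\rho_x)^k$, and expanding around $\bar\rho = \mathbb{E}_x[\rho_x]$ gives

$$\mathbb{E}_x\big[\operatorname{pass@}k\big] \;\approx\; 1-(1-\bar\rho)^k \;-\; \binom{k}{2}\,(1-\bar\rho)^{k-2}\,\operatorname{Var}_x(\rho_x),$$

so a higher mean pass@1 raises the attainable pass@k, while greater cross-task variance lowers it.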
5 Related Work
Reasoning Models Reasoning models represent a specialized category of AI systems that engage
in detailed, long chain-of-thought before generating final answers, a concept first introduced by
OpenAI’s o1 series models [43]. Subsequently, DeepSeek [2] and Kimi [44] detail methodologies
for training reasoning models using reinforcement learning with verifiable rewards (RLVR). Both
approaches have popularized RL algorithms like GRPO [16], Mirror Descent [45], RLOO [46], and
other variants. While numerous open-source efforts have attempted to reproduce o1-like models,
most focus on single domains [3, 7, 6] or study test-time compute scaling [47], with few addressing
prolonged reinforcement learning training or examining RL training time scaling laws. As widely
acknowledged in the reinforcement learning community, RL training presents significant challenges
due to its sensitivity to hyperparameters [48]. Various reinforcement learning techniques [5, 4]
have been studied to enhance training stability for sustained optimization periods. Our research
demonstrates that achieving prolonged RL training can substantially expand the boundaries of
reasoning capabilities in these models.
RL Reasoning Boundary Achieving superhuman performance has been the holy grail of machine
learning, with reinforcement learning algorithms successfully delivering on this expectation, starting
with DeepQ networks for Atari games [49, 50]. More recently, AlphaGo and AlphaZero [51] have
demonstrated that AI agents can enhance their performance indefinitely by continuously iterating
between data collection via Monte Carlo Tree Search and policy improvement. These examples show
that RL training helps agents develop novel techniques not present in their base models [52–56].
However, challenging this perspective, several recent studies question whether RL training genuinely
enhances the reasoning capacity of LLMs. One work [13] argues that RLVR fails to
extend this capacity, as evidenced by pass@k metrics showing no improvement, and in some cases
deterioration, compared to the base model, a trend echoed by other researchers [14]. Similarly,
another work [15] finds that RL algorithms tend to converge toward a dominant output distribution,
merely amplifying existing pretraining patterns. Beyond pass@k metrics, alternative measurements
like creativity index [12] can also determine whether models learn new ideas through RL training,
which we employ during our studies.
6 Conclusion
In this work, we address whether reinforcement learning can truly expand language models’ reasoning
boundaries. Through our introduction of ProRL, we provide compelling evidence that extended,
stable RL training develops novel reasoning patterns beyond a base model’s initial capabilities.
ProRL incorporates KL divergence penalties and periodic reference policy resets to maintain training
stability over long durations. Using this approach, we developed a state-of-the-art 1.5B parameter
generalist reasoning model trained on diverse datasets spanning mathematics, coding, STEM, logical
puzzles, and instruction following tasks. Our analysis reveals ProRL is particularly effective for tasks
where the base model initially struggles. Most importantly, ProRL enables strong generalization to
out-of-distribution tasks and increasingly complex problems, demonstrating that extended RL training
helps models internalize abstract reasoning patterns transferable beyond the training distribution.
These results challenge previous assumptions about RL’s limitations and establish that sufficient
training time with appropriate techniques can meaningfully expand reasoning boundaries, providing
valuable direction for the development of more capable reasoning models.