<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>PTRM on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/ptrm/</link>
        <description>Recent content in PTRM on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Wed, 10 Jun 2026 14:54:59 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/ptrm/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Probabilistic TRM: A Little Randomness Makes a Tiny Reasoning Model Much Stronger</title>
        <link>https://knightli.com/en/2026/06/10/probabilistic-tiny-recursive-model-test-time-compute/</link>
        <pubDate>Wed, 10 Jun 2026 14:54:59 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/06/10/probabilistic-tiny-recursive-model-test-time-compute/</guid>
        <description>&lt;p&gt;A new arXiv paper introduces the Probabilistic Tiny Recursive Model, or PTRM. The idea is simple: instead of letting a Tiny Recursive Model follow one fixed reasoning path, add a bit of random noise to its hidden state at inference time, run multiple trajectories in parallel, and use the model&amp;rsquo;s existing Q head to choose the answer most likely to be correct.&lt;/p&gt;
&lt;p&gt;What makes the method interesting is that it does not change training and does not require hand-written augmentation rules for each task. The authors only add test-time compute, yet report clear gains on reasoning benchmarks: accuracy on Sudoku-Extreme rises from 87.4% to 98.75%, and accuracy across multiple Pencil Puzzle Bench tasks rises from 62.6% to 91.2%. On the latter, the 7M-parameter PTRM scores well above the 55.1% frontier LLM baseline cited in the paper, at less than 0.0001x the cost.&lt;/p&gt;
&lt;h2 id=&#34;where-trm-is-strong&#34;&gt;Where TRM Is Strong
&lt;/h2&gt;&lt;p&gt;TRM reasons differently from the usual large language model.&lt;/p&gt;
&lt;p&gt;An LLM typically generates an answer token by token, sometimes with a chain of thought, code, or explanation. TRM repeatedly refines an answer inside a continuous hidden state. The same small network is called many times, updating its internal state and current answer until it reaches a final solution.&lt;/p&gt;
&lt;p&gt;This lets TRM solve structured reasoning problems with very few parameters, such as Sudoku, mazes, and pencil-and-paper logic puzzles. It is not relying on broad language knowledge; it is using recursive updates to push an answer toward a valid state.&lt;/p&gt;
&lt;p&gt;The downside is that deterministic recursion can get stuck. If the model enters a bad basin, more iterations may only keep it circling inside the wrong region.&lt;/p&gt;
&lt;h2 id=&#34;failure-often-means-getting-stuck&#34;&gt;Failure Often Means Getting Stuck
&lt;/h2&gt;&lt;p&gt;The authors analyze TRM trajectories on Pencil Puzzle Bench and find three broad patterns:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Pattern&lt;/th&gt;
          &lt;th&gt;Behavior&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Fast success&lt;/td&gt;
          &lt;td&gt;Quickly enters the correct region; answer accuracy and Q value rise together&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Delayed success&lt;/td&gt;
          &lt;td&gt;Wanders in a wrong region, then jumps into the correct one&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Failure&lt;/td&gt;
          &lt;td&gt;Keeps oscillating in a wrong region and ends with an incorrect answer&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A basin here can be understood as a local region in hidden space. A good basin decodes into the correct answer; a bad basin decodes into a wrong one. TRM&amp;rsquo;s problem is not that it lacks all solving ability, but that a deterministic trajectory has little mechanism for escaping once it lands in a bad basin.&lt;/p&gt;
&lt;p&gt;TRM also already has a Q head. During training, this head estimates whether the current answer is good enough and helps decide whether computation can stop early. The paper finds that Q scores are highly correlated with answer quality: correct trajectories tend to get higher Q values, while failed trajectories remain low.&lt;/p&gt;
&lt;p&gt;In other words, the model already has an internal signal for &amp;ldquo;does this path look right?&amp;rdquo;, but standard inference does not make full use of it.&lt;/p&gt;
&lt;h2 id=&#34;how-ptrm-works&#34;&gt;How PTRM Works
&lt;/h2&gt;&lt;p&gt;PTRM can be summarized in three steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run multiple rollouts for the same problem in parallel;&lt;/li&gt;
&lt;li&gt;Inject Gaussian noise into the hidden state during each deep recursive step;&lt;/li&gt;
&lt;li&gt;Use the Q head to score each trajectory and choose the answer with the highest Q value.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This adds a width dimension to TRM. The usual way to spend more test-time compute is to run more recursive steps, which increases depth. PTRM instead runs several slightly different paths at the same time, which increases width.&lt;/p&gt;
&lt;p&gt;It resembles multi-sampling for LLMs: ask the model for several candidate answers, then choose by voting or verification. The difference is that PTRM does not generate natural-language reasoning chains. It samples trajectories in continuous hidden space, and its verifier is not an external model but TRM&amp;rsquo;s own trained Q head.&lt;/p&gt;
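&lt;p&gt;The three steps above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper&amp;rsquo;s code: &lt;code&gt;ptrm_select&lt;/code&gt;, &lt;code&gt;step_fn&lt;/code&gt;, &lt;code&gt;q_fn&lt;/code&gt;, &lt;code&gt;decode_fn&lt;/code&gt;, and &lt;code&gt;sigma&lt;/code&gt; are hypothetical names, the stand-in functions are arbitrary, and adding the noise just before each update is an assumption about where the perturbation goes.&lt;/p&gt;

```python
import numpy as np


def ptrm_select(step_fn, q_fn, decode_fn, h0, n_rollouts=8, n_steps=16,
                sigma=0.3, seed=0):
    """Toy PTRM-style inference: noisy parallel rollouts, then pick by Q score.

    step_fn, q_fn and decode_fn are hypothetical stand-ins for the recursive
    update, the Q head and the answer decoder. They are not the paper's API.
    """
    rng = np.random.default_rng(seed)
    # Step 1: one copy of the initial hidden state per rollout,
    # giving an array of shape (n_rollouts, dim).
    h = np.tile(h0, (n_rollouts, 1))
    for _ in range(n_steps):
        # Step 2: add Gaussian noise, then apply the shared recursive update.
        h = step_fn(h + sigma * rng.standard_normal(h.shape))
    # Step 3: score every rollout and keep the one with the highest Q value.
    q = q_fn(h)
    best = int(np.argmax(q))
    return decode_fn(h[best]), q


# Arbitrary stand-ins so the sketch runs end to end.
target = np.array([1.0, -1.0, 1.0, -1.0])
answer, q = ptrm_select(
    step_fn=np.tanh,
    q_fn=lambda h: -np.linalg.norm(h - target, axis=1),
    decode_fn=np.sign,
    h0=np.zeros(4),
)
```

&lt;p&gt;Setting &lt;code&gt;sigma=0.0&lt;/code&gt; makes every rollout identical, which corresponds to the single deterministic trajectory of standard inference.&lt;/p&gt;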
&lt;h2 id=&#34;why-random-noise-helps&#34;&gt;Why Random Noise Helps
&lt;/h2&gt;&lt;p&gt;At first glance, adding noise at inference time sounds like it would make the system less stable. For a recursive model like TRM, however, moderate noise can knock a trajectory out of a bad basin that a deterministic run would never leave.&lt;/p&gt;
&lt;p&gt;The paper gives an example where deterministic TRM cannot solve a puzzle. Among 100 random rollouts, 92 still stay in bad basins, but 8 escape into the correct region and produce the right answer. If the Q head can identify those 8, the final output changes from wrong to correct.&lt;/p&gt;
&lt;p&gt;That is the main benefit of PTRM. It does not require every trajectory to improve. It only needs some parallel trajectories to find a correct solution, and the Q head to select them.&lt;/p&gt;
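&lt;p&gt;A rough back-of-the-envelope calculation shows why a small per-rollout success rate can be enough. The sketch below assumes, purely for illustration, that rollouts are independent and that the 8-in-100 figure from the example can be read as a fixed per-rollout success probability; neither assumption comes from the paper, and &lt;code&gt;chance_any_correct&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def chance_any_correct(p_single, n_rollouts):
    """Probability that at least one of n independent rollouts is correct."""
    return 1.0 - (1.0 - p_single) ** n_rollouts


# Illustrative per-rollout success rate taken from the 8-in-100 example.
p = 8 / 100
for k in (1, 8, 32, 100):
    print(k, round(chance_any_correct(p, k), 4))
```

&lt;p&gt;Under these assumptions the chance that some rollout is correct grows quickly with the rollout count. Whether that turns into a correct final output still depends on the Q head picking the right rollout.&lt;/p&gt;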
&lt;h2 id=&#34;width-is-more-practical-than-depth&#34;&gt;Width Is More Practical Than Depth
&lt;/h2&gt;&lt;p&gt;TRM can also use more test-time compute by increasing recursive steps, but deeper recursion is sequential: the next step depends on the previous one. PTRM&amp;rsquo;s rollouts are naturally parallel, which fits GPUs better.&lt;/p&gt;
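&lt;p&gt;On the Pencil Puzzle Bench (PPBench) validation set, the paper observes that pass@K and best-Q@K both improve as the number of rollouts K increases. Here pass@K counts a problem as solved if any of the K rollouts is correct, while best-Q@K scores only the rollout with the highest Q value. More interestingly, best-Q@K stays close to oracle pass@K, suggesting that the Q head acts almost like a correct-answer selector in these tests.&lt;/p&gt;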
&lt;p&gt;Simply choosing the most frequent answer helps much less. PTRM&amp;rsquo;s gain is not just &amp;ldquo;run it several times and vote&amp;rdquo;; it depends on the Q head&amp;rsquo;s ability to recognize rare correct trajectories.&lt;/p&gt;
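&lt;p&gt;The difference between the two selection rules is easy to see on a made-up example. In the sketch below the answers and Q values are invented for illustration, and &lt;code&gt;majority_vote&lt;/code&gt; and &lt;code&gt;best_q&lt;/code&gt; are hypothetical helper names, not functions from the paper.&lt;/p&gt;

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the rollouts."""
    return Counter(answers).most_common(1)[0][0]


def best_q(answers, q_values):
    """Return the answer of the rollout with the highest Q value."""
    best_index = max(range(len(answers)), key=lambda i: q_values[i])
    return answers[best_index]


# Ten invented rollouts: eight land on wrong answers, two on the correct one.
answers = ["wrong_a"] * 5 + ["wrong_b"] * 3 + ["correct"] * 2
q_values = [0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 0.2, 0.1, 0.9, 0.8]

print(majority_vote(answers))     # the most common answer, here a wrong one
print(best_q(answers, q_values))  # the answer the Q head scores highest
```

&lt;p&gt;Voting can only recover the correct answer when it is also the most common one. Q-based selection can recover it even when it is rare, provided the Q head ranks it above the wrong answers. For grid-shaped answers, each answer would first need to be converted to something hashable, such as a tuple, before counting.&lt;/p&gt;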
&lt;h2 id=&#34;how-strong-are-the-results&#34;&gt;How Strong Are the Results?
&lt;/h2&gt;&lt;p&gt;The key numbers are:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Benchmark&lt;/th&gt;
          &lt;th&gt;Standard TRM&lt;/th&gt;
          &lt;th&gt;PTRM&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Sudoku-Extreme&lt;/td&gt;
          &lt;td&gt;87.4%&lt;/td&gt;
          &lt;td&gt;98.75%&lt;/td&gt;
          &lt;td&gt;No retraining; only random rollouts at test time&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pencil Puzzle Bench tasks&lt;/td&gt;
          &lt;td&gt;62.6%&lt;/td&gt;
          &lt;td&gt;91.2%&lt;/td&gt;
          &lt;td&gt;7M parameters&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;PPBench vs frontier LLM&lt;/td&gt;
          &lt;td&gt;55.1% (frontier LLM baseline, not TRM)&lt;/td&gt;
          &lt;td&gt;91.2%&lt;/td&gt;
          &lt;td&gt;Paper reports PTRM costs less than 0.0001x as much as the LLM&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These results should not be read as &amp;ldquo;small models beat large models everywhere.&amp;rdquo; PTRM targets structured, verifiable reasoning tasks with clear training distributions. Its performance on Sudoku and pencil puzzles does not mean it can replace general LLMs for open-ended QA, writing, coding collaboration, or tool use.&lt;/p&gt;
&lt;p&gt;But it does show that, for some reasoning tasks, architecture and test-time search can matter more than simply adding parameters.&lt;/p&gt;
&lt;h2 id=&#34;scope&#34;&gt;Scope
&lt;/h2&gt;&lt;p&gt;PTRM is most suitable when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the answer space is structured;&lt;/li&gt;
&lt;li&gt;the problem has a clear correct answer;&lt;/li&gt;
&lt;li&gt;the model has already learned most of the solving ability;&lt;/li&gt;
&lt;li&gt;failures mainly come from stuck reasoning trajectories, not missing knowledge;&lt;/li&gt;
&lt;li&gt;there is a reliable internal scoring head or external verifier.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For open-ended generation, such as writing articles, product analysis, or casual chat, PTRM cannot be applied directly. These tasks do not have a single standard answer, and a Q head cannot easily judge correctness from internal state alone.&lt;/p&gt;
&lt;p&gt;The other limitation is compute. PTRM turns one trajectory into many, so the gain comes from extra test-time compute. Even though each rollout of a tiny model is cheap, total cost still grows with the rollout count.&lt;/p&gt;
&lt;h2 id=&#34;what-it-suggests-for-ai-agents&#34;&gt;What It Suggests for AI Agents
&lt;/h2&gt;&lt;p&gt;PTRM is a model paper, but the idea is relevant to agent systems.&lt;/p&gt;
&lt;p&gt;Many agent failures happen not because the first step is impossible, but because the system takes a wrong route and keeps building on a wrong assumption. PTRM is a reminder that, instead of betting on one reasoning path, a system can keep multiple candidate trajectories and use tests, rules, verifiers, or scoring models to choose the best one.&lt;/p&gt;
&lt;p&gt;This echoes the shift from prompt engineering to loop engineering. The focus is not just writing a prettier prompt, but designing a loop of generation, perturbation, validation, selection, and retry.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The value of Probabilistic TRM is not merely the trick of adding noise. It shows a practical point: if a small model already has solving ability, inference-time search and selection can unlock much more of that ability.&lt;/p&gt;
&lt;p&gt;For large models, test-time compute often appears as multi-sampling, reflection, tool verification, and long-chain reasoning. For recursive models like TRM, it can appear as random rollouts in hidden space plus Q-head selection. The forms differ, but both answer the same question: when the model takes a wrong first path, does the system have a way to try another one?&lt;/p&gt;
&lt;p&gt;References: &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2605.19943&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;arXiv: Probabilistic Tiny Recursive Model&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/html/2605.19943v1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HTML version&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
