Recently, our CMU-MATH team proudly clinched 2nd place in the Artificial Intelligence Mathematical Olympiad (AIMO) out of 1,161 participating teams, earning a prize of $65,536! This prestigious competition aims to revolutionize AI in mathematical problem-solving, with the ultimate goal of building a publicly-shared AI model capable of winning a gold medal in the International Mathematical Olympiad (IMO). Dive into our blog to discover the winning recipe that set us apart in this significant contest.

Background: The AIMO competition

The Artificial Intelligence Mathematical Olympiad (AIMO) Prize, initiated by XTX Markets, is a pioneering competition designed to revolutionize AI's role in mathematical problem-solving. It pushes the boundaries of AI by tackling complex mathematical problems akin to those in the International Mathematical Olympiad (IMO). The advisory committee of AIMO includes Timothy Gowers and Terence Tao, both winners of the Fields Medal. Attracting attention from world-class mathematicians as well as machine learning researchers, the AIMO sets a new benchmark for excellence in the field.

AIMO has launched a series of progress prizes. The first of these was a Kaggle competition, with the 50 test problems hidden from competitors. The problems are comparable in difficulty to the AMC12 and AIME exams used in pre-selection for the USA IMO team. The private leaderboard determined the final rankings, which then determined the distribution of $253,952 from the **one-million dollar prize pool** among the top five teams. Each submitted solution was allocated either a P100 GPU or 2xT4 GPUs, with up to 9 hours to solve the 50 problems.

To give an idea of what the problems look like, AIMO provided a 10-problem training set open to the public. Here are two example problems from the set:

- Let \(k, l > 0\) be parameters. The parabola \(y = kx^2 - 2kx + l\) intersects the line \(y = 4\) at two points \(A\) and \(B\). These points are distance 6 apart. What is the sum of the squares of the distances from \(A\) and \(B\) to the origin?

- Each of the three-digit numbers \(111\) to \(999\) is colored blue or yellow in such a way that the sum of any two (not necessarily different) yellow numbers is equal to a blue number. What is the maximum possible number of yellow numbers?

The first problem is about analytic geometry. It requires the model to understand geometric objects from textual descriptions and perform symbolic computations using the distance formula and Vieta's formulas. The second problem falls under extremal combinatorics, a topic beyond the scope of high school math. It is notoriously challenging because there is no general formula to apply; solving it requires creative thinking to exploit the problem's structure. Mastering all these required capabilities is non-trivial even for humans, let alone language models.
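To make the program-aided format concrete, here is the kind of short program the first example problem admits (a hand-written illustration, not actual model output):

```python
# The parabola y = k*x**2 - 2*k*x + l meets the line y = 4 where
# k*x**2 - 2*k*x + (l - 4) = 0. By Vieta's formulas the two roots
# satisfy x1 + x2 = 2 and x1*x2 = (l - 4)/k =: p.
s = 2

# |AB| = 6 gives (x1 - x2)**2 = (x1 + x2)**2 - 4*x1*x2 = 36,
# so p = (s**2 - 36)/4 = -8.
p = (s**2 - 36) / 4

# A and B both lie on y = 4, so
# OA^2 + OB^2 = (x1**2 + 16) + (x2**2 + 16)
#             = (x1 + x2)**2 - 2*x1*x2 + 32.
answer = s**2 - 2 * p + 32
print(int(answer))  # 52
```

Notice that the program never needs the individual values of \(k\) and \(l\); Vieta's formulas and the distance constraint pin down the answer.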

In general, the problems in AIMO were significantly more challenging than those in GSM8K, a standard mathematical reasoning benchmark for LLMs, and about as difficult as the hardest problems in the challenging MATH dataset. The limited computational resources (P100 and T4 GPUs, both over five years old and much slower than more advanced hardware) posed an additional challenge. Thus, it was essential to use appropriate models and inference strategies to maximize accuracy within the constraints of limited memory and FLOPs.

Our winning recipe

Unlike most teams that relied on a single model for the competition, we utilized a dual-model approach. Our final solutions were derived through a **weighted majority voting** system: generating multiple solutions with a **policy model**, assigning a weight to each solution using a **reward model**, and then choosing the answer with the highest total weight. Specifically, we paired a policy model, designed to generate problem solutions in the form of computer code, with a reward model, which scored the outputs of the policy model. In the weighted majority vote, the answers were generated by the policy model and the weights were determined by the scores from the reward model. This strategy stemmed from our research on compute-optimal inference, demonstrating that weighted majority voting with a reward model consistently outperforms naive majority voting given the same inference budget.

Both models in our submission were fine-tuned from the DeepSeek-Math-7B-RL checkpoint. Below, we detail the fine-tuning process and inference strategies for each model.
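As a sketch, weighted majority voting amounts to a few lines of code (the answers and scores below are hypothetical; in our system each weight is the reward model's score for one policy-model solution):

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Pick the answer with the highest total weight.

    `samples` is a list of (answer, weight) pairs: each answer is
    extracted from one policy-model solution, and each weight is the
    reward model's score for that solution.
    """
    totals = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    return max(totals, key=totals.get)

# Hypothetical scores: three low-confidence solutions answer 48, while
# two high-confidence solutions answer 52.
samples = [(48, 0.30), (52, 0.90), (48, 0.25), (52, 0.85), (48, 0.35)]
print(weighted_majority_vote(samples))                       # 52
print(weighted_majority_vote([(a, 1) for a, _ in samples]))  # naive vote: 48
```

Naive majority voting is the special case where every weight equals 1; in this toy example the reward scores flip the outcome from the more frequent wrong answer to the correct one.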

**Policy model: Program-aided problem solver based on self-refinement**

The policy model served as the primary problem solver in our approach. We noted that LLMs can perform mathematical reasoning using both text and programs. Natural language excels at abstract reasoning but falls short in precise computation, symbolic manipulation, and algorithmic processing. Programs, on the other hand, are adept at rigorous operations and can leverage specialized tools like equation solvers for complex calculations. To harness the benefits of both methods, we implemented the Program-Aided Language Models (PAL) or, more precisely, the Tool-Augmented Reasoning (ToRA) approach, originally proposed by CMU & Microsoft. This approach combines natural language reasoning with program-based problem-solving: the model first generates rationales in text form, followed by a computer program which is executed to derive a numerical answer.

To train the model, we needed a suitable problem set (the given "training set" of this competition is too small for fine-tuning) with "ground truth" solutions in ToRA format for supervised fine-tuning. Given the problem difficulty (comparable to the AMC12 and AIME exams) and the specific format (integer answers only), we used a combination of AMC, AIME, and Odyssey-Math as our problem set, removing multiple-choice options and filtering out problems with non-integer answers. This resulted in a dataset of 2,600 problems. We prompted GPT-4o (and DeepSeek-Coder-V2) with few-shot examples to generate 64 solutions for each problem, retaining those that led to correct answers. Our final dataset contained 41,160 problem-solution pairs. We performed supervised fine-tuning on the open-sourced DeepSeek-Math-7B-RL model for 3 epochs with a learning rate of 2e-5.
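The solution-filtering step is essentially rejection sampling, which can be sketched as follows (`generate` and `execute` are hypothetical stand-ins for the few-shot LLM sampler and the code runner):

```python
def build_sft_dataset(problems, generate, execute, n_samples=64):
    """Rejection sampling for the SFT set: keep a sampled solution only
    if executing its program reproduces the ground-truth integer answer.

    `generate(question, n)` and `execute(solution)` are hypothetical
    stand-ins for the few-shot LLM sampler and the code runner.
    """
    dataset = []
    for prob in problems:
        for sol in generate(prob["question"], n_samples):
            if execute(sol) == prob["answer"]:
                dataset.append({"question": prob["question"], "solution": sol})
    return dataset
```

Because only solutions whose executed program reaches the known answer survive, the resulting pairs serve as "ground truth" ToRA-format demonstrations without any manual annotation.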

During inference, we employed the self-refinement technique (another widely adopted technique proposed by CMU!), providing feedback to the policy model on the execution results of the generated program (e.g., invalid output, execution failure) and allowing the model to refine the solution accordingly.
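A minimal sketch of this feedback loop, with `policy` and `run_program` as hypothetical stand-ins for the policy model and the program executor:

```python
def solve_with_refinement(policy, run_program, problem, max_rounds=3):
    """Self-refinement loop (a sketch). `policy` maps a prompt to a
    candidate program; `run_program` returns (ok, result_or_error)."""
    prompt = problem
    for _ in range(max_rounds):
        program = policy(prompt)
        ok, result = run_program(program)
        if ok:
            return result
        # Feed the execution feedback back to the model so it can
        # revise the program in the next round.
        prompt = (f"{problem}\n\nPrevious program:\n{program}\n"
                  f"Execution feedback: {result}\nPlease fix the program.")
    return None  # give up after max_rounds failed attempts
```

The number of rounds is a budget knob: each extra round spends more GPU time on problems whose first program failed to execute cleanly.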

Below we present our ablation studies on the techniques we employed for the policy model. We used accuracy on a particular subset of the MATH test set as the evaluation metric. It is easy to see that the combination of techniques leads to large performance gains compared with naive baselines.

| Model | Output format | Inference strategy | Accuracy |
|---|---|---|---|
| DeepSeek RL 7b | Text-only | Greedy decoding | 54.02% |
| DeepSeek RL 7b | ToRA | Greedy decoding | 58.05% |
| DeepSeek RL 7b | ToRA | Greedy + Self-refine | 60.73% |
| DeepSeek RL 7b | ToRA | Maj@16 + Self-refine | 70.46% |
| DeepSeek RL 7b | ToRA | Maj@64 + Self-refine | 72.48% |
| Our finetuned model | ToRA | Maj@16 + Self-refine | 74.50% |
| Our finetuned model | ToRA | Maj@64 + Self-refine | 76.51% |

Notably, the first-place team also used ToRA with self-refinement. However, they curated a much larger problem set of 60,000 problems and used GPT-4 to generate solutions in the ToRA format. Their dataset was more than **20x** larger than ours, and the cost to generate solutions was far beyond our budget as an academic team (over $100,000 based on our estimation). Our problem set was based purely on publicly available data, and we spent only ~$1,000 on solution generation.

**Reward model: Solution scorer using label-balanced training**

While the policy model was a creative problem solver, it could sometimes hallucinate and produce incorrect solutions. On the publicly available 10-problem training set, our policy model correctly solved only two problems using standard majority voting with 32 sampled solutions. Interestingly, for another 2 problems, the model generated correct solutions that failed to be selected because incorrect answers dominated the majority vote.

This observation highlighted the potential of the reward model. The reward model was a solution scorer that took the policy model's output and produced a score between 0 and 1. Ideally, it assigned high scores to correct solutions and low scores to incorrect ones, aiding the selection of correct answers during weighted majority voting.

The reward model was fine-tuned from a DeepSeek-Math-7B-RL model on a labeled dataset containing both correct and incorrect problem-solution pairs. We utilized the same problem set as for the policy model training and expanded it by incorporating problems from the MATH dataset with integer answers. Simple as it may sound, generating high-quality data and training a strong reward model was non-trivial. We considered the following two essential factors for the reward model training set:

- Label balance: the dataset should contain both correct solutions (positive examples) and incorrect solutions (negative examples) for each problem, with a balanced number of each.

- Diversity: the dataset should include diverse solutions for each problem, encompassing different correct approaches and various failure modes.

Sampling solutions from a single model cannot satisfy both factors. For example, while our fine-tuned policy model achieved very high accuracy on the problem set, it was unable to generate any incorrect solutions and lacked diversity among correct ones. Conversely, sampling from a weaker model, such as DeepSeek-Math-7B-Base, rarely yielded correct solutions. To create a diverse set of models with varying capabilities, we employed two novel strategies:

**Interpolate between strong and weak models.** For MATH problems, we interpolated the model parameters of a strong model (DeepSeek-Math-7B-RL) and a weak model (DeepSeek-Math-7B-Base) to obtain models with different levels of capability. Denote by \(\mathbf{\theta}_{\mathrm{strong}}\) and \(\mathbf{\theta}_{\mathrm{weak}}\) the model parameters of the strong and weak model. We considered interpolated models with parameters \(\mathbf{\theta}_{\alpha}=\alpha\mathbf{\theta}_{\mathrm{strong}}+(1-\alpha)\mathbf{\theta}_{\mathrm{weak}}\) and set \(\alpha\in\{0.3, 0.4, \cdots, 1.0\}\), obtaining 8 models. These models exhibited different problem-solving accuracies on MATH. We sampled two solutions from each model for each problem, yielding diverse outputs with balanced correct and incorrect solutions. This technique was motivated by research on model parameter merging (e.g., model soups) and represents an interesting application of that idea: producing models with different levels of capability.

**Leverage intermediate checkpoints.** For the AMC, AIME, and Odyssey problems, recall that our policy model had been fine-tuned on them for 3 epochs. The final model and its intermediate checkpoints naturally provided multiple models exhibiting different levels of accuracy on these problems. We leveraged these intermediate checkpoints, sampling 12 solutions from each model trained for 0, 1, 2, and 3 epochs.
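The parameter interpolation can be sketched as follows (scalar parameters for illustration; in practice each value is a weight tensor from the corresponding checkpoint):

```python
def interpolate_params(strong, weak, alpha):
    """theta_alpha = alpha * theta_strong + (1 - alpha) * theta_weak,
    applied entry-wise to two parameter dicts with identical keys."""
    return {name: alpha * strong[name] + (1 - alpha) * weak[name]
            for name in strong}

strong = {"w": 1.0, "b": 0.0}  # stand-in for DeepSeek-Math-7B-RL weights
weak = {"w": 0.0, "b": 1.0}    # stand-in for DeepSeek-Math-7B-Base weights

# alpha in {0.3, 0.4, ..., 1.0} yields 8 models of varying capability.
models = [interpolate_params(strong, weak, round(0.3 + 0.1 * i, 1))
          for i in range(8)]
```

At `alpha = 1.0` the interpolation recovers the strong model exactly; decreasing `alpha` slides the accuracy toward the weak model's.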

These strategies allowed us to obtain a diverse set of models almost for free, sampling varied correct and incorrect solutions. We further filtered the generated data by removing incorrect solutions with non-integer answers, since it is trivial to determine during inference that such answers are incorrect. In addition, for each problem, we kept equal numbers of correct and incorrect solutions to ensure label balance and avoid a biased reward model. The final dataset contains 7,000 unique problems and 37,880 labeled problem-solution pairs. We fine-tuned the DeepSeek-Math-7B-RL model for 2 epochs with learning rate 2e-5 on the curated dataset.
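The per-problem label-balancing step can be sketched as (the data layout here is hypothetical):

```python
import random

def balance_labels(per_problem, seed=0):
    """Per-problem label balancing (a sketch). `per_problem` maps a
    problem id to a list of (solution, is_correct) pairs; keep equal
    numbers of correct and incorrect solutions for every problem."""
    rng = random.Random(seed)
    balanced = {}
    for pid, sols in per_problem.items():
        correct = [s for s in sols if s[1]]
        incorrect = [s for s in sols if not s[1]]
        n = min(len(correct), len(incorrect))
        balanced[pid] = rng.sample(correct, n) + rng.sample(incorrect, n)
    return balanced
```

Downsampling the over-represented label per problem prevents the reward model from learning a shortcut such as "this problem's solutions are usually correct" instead of judging each solution on its own merits.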

We validated the effectiveness of our reward model on the public training set. Notably, by pairing the policy model with the reward model and applying weighted majority voting, our method correctly solved 4 of the 10 problems, whereas the policy model alone could only solve 2 using standard majority voting.

Concluding remarks: Towards machine-based mathematical reasoning

With the models and techniques described above, our CMU-MATH team solved 22 out of 50 problems in the private test set, snagging **second place** and establishing **the highest performance of any academic team**. This outcome marks a significant step towards the goal of machine-based mathematical reasoning.

However, we also note that the accuracy achieved by our models still trails behind that of proficient human competitors, who can easily solve over 95% of AIMO problems, indicating substantial room for improvement. There are a number of directions to explore:

**Advanced inference-time algorithms for mathematical reasoning.** Our dual-model approach is a powerful way to enhance model reasoning at inference time. Recent research from our team suggests that more advanced inference-time algorithms, e.g., tree search methods, may even surpass weighted majority voting. Although computational constraints limited our ability to deploy this technique in the AIMO competition, future work on optimizing these inference-time algorithms can potentially lead to better mathematical reasoning approaches.

**Integration of Automated Theorem Proving.** Integrating automated theorem proving (ATP) tools, such as Lean, represents another promising frontier. ATP tools can provide rigorous logical frameworks and support for deeper mathematical analyses, potentially raising the precision and reliability of the problem-solving strategies employed by LLMs. The synergy between LLMs and ATP could lead to breakthroughs in complex problem-solving scenarios where deep logical reasoning is essential.

**Leveraging Larger, More Diverse Datasets.** The competition reinforced a crucial lesson about the pivotal role of data in machine learning. Rich, diverse datasets, particularly those comprising challenging mathematical problems, are essential for training more capable models. We advocate for the creation and release of larger datasets focused on mathematical reasoning, which would benefit not only our research but also the broader AI and mathematics communities.

Finally, we would like to thank Kaggle and XTX Markets for organizing this wonderful competition. We have open-sourced the code and datasets used in our solution to ensure reproducibility and facilitate future research. We invite the community to explore, utilize, and build upon our work, which is available in our GitHub repository. For further details about our results, please feel free to reach out to us!