# OpenAI's o1 “Strawberry” - Newscard

### A Review by an International Olympiad in Informatics Gold Medalist

At a conference last week, an Anthropic co-founder said that LLMs are approaching a technical depth that most non-experts cannot evaluate. As if to immediately prove him correct, his competitor OpenAI released a model specialized for complex math and programming problems, prompting a dozen people to send me review requests: https://openai.com/index/learning-to-reason-with-llms/

## Newscard

OpenAI o1 is the long-awaited “Strawberry” model.

OpenAI once again takes a substantial research lead.

o1 makes clear improvements in math, coding, and knowledge questions.

OpenAI describes it as a combination of Chain of Thought and Reinforcement Learning: “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”

It likely blurs the line between Chain of Thought, an approach that instructs language models to reason in explicit step-by-step processes, and tree search, a different kind of algorithm best known for its use in complex games such as Chess and Go, and more recently in Math Olympiads.
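OpenAI has not said how much, if any, explicit search o1 performs, but the distinction is easy to sketch. In the toy below (entirely my invention; `score` stands in for a learned value model judging partial "reasoning"), a greedy chain commits to one step at a time and gets stuck at a local optimum, while a small beam search over the same steps finds the target:

```python
import heapq

# Toy contrast: greedy chain-of-thought vs. beam-style tree search.
# States are bit-strings built one bit at a time; the goal is to reach
# the binary value 5. Nothing here reflects o1's actual internals.

def expand(state):
    return [state + [bit] for bit in (0, 1)]  # candidate next steps

def value(state):
    return sum(b << i for i, b in enumerate(reversed(state)))

def score(state, target):
    return -abs(value(state) - target)  # higher is better

def greedy_chain(target, depth):
    # chain-of-thought style: commit to the single best-looking step
    state = []
    for _ in range(depth):
        state = max(expand(state), key=lambda s: score(s, target))
    return state

def beam_search(target, depth, beam=4):
    # tree-search style: keep several promising partial chains alive
    frontier = [[]]
    for _ in range(depth):
        children = [c for s in frontier for c in expand(s)]
        frontier = heapq.nlargest(beam, children, key=lambda s: score(s, target))
    return frontier[0]

print(value(greedy_chain(5, 3)))  # greedy myopia: ends at 6
print(value(beam_search(5, 3)))   # search recovers: ends at 5
```

The point of the sketch: keeping multiple partial chains alive is what distinguishes search from a single greedy chain, whatever mechanism o1 actually uses.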

As a gold medalist in the International Olympiad in Informatics and a national competitor in the US and Canadian Math Olympiads, I have been asked by many people to explain what this new release means. I asked o1-preview some choice questions in my areas of expertise: competitive math, competitive programming, and research combinatorics.

Here are my takeaways from those domains:

o1-preview is an exemplary lookup system. It accurately cites obscure math theorems, describes them correctly, and identifies their use cases. It likely uses some form of RAG (retrieval-augmented generation). In research math and computer science, it far exceeds state-of-the-art RAG systems such as Perplexity.
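For readers unfamiliar with the term, RAG pairs a retrieval step (find relevant reference text) with generation (answer conditioned on it). Here is a minimal sketch of the retrieval half, using bag-of-words cosine similarity as a crude stand-in for learned embeddings; the corpus is invented, and this is speculation about o1, not a description of it:

```python
import math
from collections import Counter

# Bag-of-words retrieval: a crude stand-in for the embedding-based
# retrieval step in a real RAG pipeline. Corpus entries are invented.

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

corpus = [
    "Mantel's theorem: a triangle-free graph on n vertices has at most n^2/4 edges.",
    "Cayley's formula: there are n^(n-2) labeled trees on n vertices.",
    "Dilworth's theorem: in a finite poset, the largest antichain equals the minimum chain cover.",
]
print(retrieve("maximum edges in a triangle-free graph", corpus))
```

A production system would swap the word-count vectors for neural embeddings and feed the retrieved text into the model's context, but the shape of the pipeline is the same.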

It struggles with more ad hoc programming and math questions, even ones that would be much easier for a human.

For someone with experience in math olympiads, it's easy to identify problems that are “stereotypical” olympiad problems, ones that use typical results and techniques. These “stereotypical” problems can be easy or difficult, but they often depend on a fixed “vocabulary” of results. For o1-preview, a stereotypical problem that would be very difficult for a human is still easier than an off-stereotype problem that an inexperienced human would find easier.

Without similar experience, it's pretty difficult for a smart generalist to tell which kinds of problems are well-represented or poorly represented in the training data.

This is epitomized by the combinatorics research examples, in which o1-preview correctly cites and applies very obscure results.

One more fun fact: o1's name may be based on the US O-1 visa for “individuals with extraordinary ability”.

## Appendix (The Brian Chau Scorecard)

Mantel's theorem ✅

notes: cites Mantel's theorem directly
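Mantel's theorem states that a triangle-free graph on n vertices has at most ⌊n²/4⌋ edges. For small n the bound can be checked exhaustively; this is my verification script, not o1's output:

```python
from itertools import combinations

# Exhaustive check of Mantel's theorem for small n: enumerate every graph
# on n labeled vertices and record the largest triangle-free edge count.

def max_triangle_free_edges(n):
    edges = list(combinations(range(n), 2))
    best = 0
    for mask in range(1 << len(edges)):
        chosen = [e for i, e in enumerate(edges) if mask >> i & 1]
        adj = {v: set() for v in range(n)}
        for u, v in chosen:
            adj[u].add(v)
            adj[v].add(u)
        # an edge (u, v) whose endpoints share a neighbor closes a triangle
        if any(adj[u] & adj[v] for u, v in chosen):
            continue
        best = max(best, len(chosen))
    return best

for n in range(2, 6):
    assert max_triangle_free_edges(n) == n * n // 4
print("Mantel's bound matches for n = 2..5")
```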

Strong perfect graph theorem ✅ (?)

also cites the SPGT, but gives an incomplete proof

Labelled functional graphs where node 1 is reachable (n^{n-1}): ❌

did not fix its answer after 3 corrections; gave an (in my opinion) incomplete proof after being prompted with the formula for spanning trees
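For reference, the claim being tested is that the number of functional graphs on n labelled nodes in which every node eventually reaches node 1 is n^{n-1}. A brute-force check of mine for small n:

```python
from itertools import product

# Brute-force count of functional graphs on n labelled nodes (each node
# has exactly one outgoing edge) in which every node eventually reaches
# node 1 by following edges. The claimed count is n^(n-1).

def reaches_node_one(f, n):
    for start in range(1, n + 1):
        v, seen = start, set()
        while v != 1 and v not in seen:
            seen.add(v)
            v = f[v]
        if v != 1:
            return False
    return True

def count_functional_graphs(n):
    count = 0
    for values in product(range(1, n + 1), repeat=n):
        f = dict(enumerate(values, start=1))  # f maps node i to values[i-1]
        if reaches_node_one(f, n):
            count += 1
    return count

for n in range(1, 6):
    assert count_functional_graphs(n) == n ** (n - 1)
print("count matches n^(n-1) for n = 1..5")
```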

Off-stereotype but difficult Canadian OI problem https://dmoj.ca/problem/cco13p6 ❌

(compile error and also obviously wrong)

Another off-stereotype OI problem: http://ceoi.inf.elte.hu/probarch/14/question.pdf ❌

complete garbage solution

https://oj.uz/problem/view/POI11_imp: ❌

passes one batch (!) out of 11, but is both wrong and inefficient

Goodman's theorem: ✅

initially just cites Goodman's theorem without proving it; gives the standard proof when prompted

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ B8 (straightforward application of Turán's theorem): ✅

Hamiltonian path in an Erdős–Rényi graph: ✅ (kinda)

Cites the correct threshold with no proof initially. When asked to prove the threshold, correctly cites another result by Pósa. When asked to prove the Pósa result, gets there eventually (though it's a bit of a loose proof)

Eigenvalues of bipartite graphs: ✅

gives the standard proof
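The standard fact here is that the adjacency spectrum of a bipartite graph is symmetric about zero: if λ is an eigenvalue, so is -λ. A quick numerical check of mine on a random bipartite graph (arbitrary seed and sizes):

```python
import numpy as np

# Build a random bipartite graph from a 4x5 biadjacency matrix B; the full
# adjacency matrix is [[0, B], [B^T, 0]]. Its eigenvalues should mirror
# about zero.
rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(4, 5))
A = np.block([[np.zeros((4, 4)), B],
              [B.T, np.zeros((5, 5))]])
eig = np.sort(np.linalg.eigvalsh(A))
assert np.allclose(eig, -eig[::-1])  # if λ is an eigenvalue, so is -λ
print(np.round(eig, 3))
```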

https://dmoj.ca/problem/valentines18s3 (medium-difficulty dynamic programming problem set by me): ❌

doesn't get any cases correct, even the samples

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ D16 (Alon–Boppana theorem) ✅

I picked this problem thinking it was not a direct application of a known result, but it turns out to also be an existing theorem, the Alon–Boppana theorem.

HMMT 2018 combinatorics p7 ✅

HMMT 2019 combinatorics p8 ❌

HMMT 2019 combinatorics p9 ❌

HMMT 2019 team round p8 (a copy of the earlier CEOI programming question) ❌ (gives the correct answer to this yes-or-no question, but a horrendously wrong proof)

https://chatgpt.com/c/66e345c3-ed3c-800e-b62c-a20e922aeaf1

https://chatgpt.com/c/66e346c7-9f9c-800e-a602-ebbac1dc89df

https://chatgpt.com/c/66e34a1b-aff8-800e-a335-5dabc448621c

https://chatgpt.com/c/66e34ba7-bc60-800e-a5fa-231af9a7a4e6

https://chatgpt.com/c/66e34cb5-4f80-800e-89d3-bbd4db4c1dce

https://chatgpt.com/c/66e34d61-74e4-800e-b0aa-c31b6d2a693c

https://chatgpt.com/c/66e34ecf-14bc-800e-ae68-216ea4a23b26

https://chatgpt.com/c/66e34fd0-71c4-800e-8951-58334e970cb8

https://chatgpt.com/c/66e350db-3d80-800e-afd6-7a67e71b9960

https://chatgpt.com/c/66e353fb-8cf4-800e-9b56-6f3faf562085

https://chatgpt.com/c/66e35596-4b20-800e-8a36-30846b2a1837

https://chatgpt.com/c/66e35816-8c98-800e-9e59-82a3d0c43dcc

https://chatgpt.com/c/66e35988-8798-800e-92ba-8ade56dd0ece

https://chatgpt.com/c/66e359f6-57ac-800e-9252-b62734e7a241

https://chatgpt.com/c/66e35a3c-c5ec-800e-a9e2-67524162f589

https://chatgpt.com/c/66e35b4f-6bb8-800e-864d-55eea5633667

https://hmmt-archive.s3.amazonaws.com/tournaments/2019/feb/comb/problems.pdf

https://hmmt-archive.s3.amazonaws.com/tournaments/2018/feb/comb/problems.pdf

It is becoming clearer how LLMs function as “fuzzy lookup systems”. Hallucination is still a problem, too: o1 does not seem able to admit that it can't figure out a problem when it's a little beyond its ability.
