# OpenAI's o1 “Strawberry” - Newscard

### A Review by an International Olympiad in Informatics Gold Medalist

At a conference last week, an Anthropic co-founder said that LLMs are approaching a technical depth that most non-experts cannot evaluate. As if to immediately prove him correct, his competitor OpenAI released a model specialized for complex math and programming problems, prompting a dozen people to send me review requests: https://openai.com/index/learning-to-reason-with-llms/

## Newscard

OpenAI o1 is the long-awaited “Strawberry” model.

OpenAI once again takes a substantial research lead.

o1 makes clear improvements in math, coding, and knowledge questions.

OpenAI describes it as a combination of Chain of Thought and Reinforcement Learning: “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”

It likely blurs the line between Chain of Thought, an approach that instructs language models to reason in explicit step-by-step processes, and tree search, a different kind of algorithm best known for its use in complex games such as Chess and Go, and more recently in Math Olympiads.
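OpenAI has not said how much, if any, explicit search o1 performs, but the distinction is easy to sketch. In the toy below (entirely my invention; `score` stands in for a learned value model judging partial "reasoning"), a greedy chain commits to one step at a time and gets stuck at a local optimum, while a small beam search over the same steps finds the target:

```python
import heapq

# Toy contrast: greedy chain-of-thought vs. beam-style tree search.
# States are bit-strings built one bit at a time; the goal is to reach
# the binary value 5. Nothing here reflects o1's actual internals.

def expand(state):
    return [state + [bit] for bit in (0, 1)]  # candidate next steps

def value(state):
    return sum(b << i for i, b in enumerate(reversed(state)))

def score(state, target):
    return -abs(value(state) - target)  # higher is better

def greedy_chain(target, depth):
    # chain-of-thought style: commit to the single best-looking step
    state = []
    for _ in range(depth):
        state = max(expand(state), key=lambda s: score(s, target))
    return state

def beam_search(target, depth, beam=4):
    # tree-search style: keep several promising partial chains alive
    frontier = [[]]
    for _ in range(depth):
        children = [c for s in frontier for c in expand(s)]
        frontier = heapq.nlargest(beam, children, key=lambda s: score(s, target))
    return frontier[0]

print(value(greedy_chain(5, 3)))  # greedy myopia: ends at 6
print(value(beam_search(5, 3)))   # search recovers: ends at 5
```

The point of the sketch: keeping multiple partial chains alive is what distinguishes search from a single greedy chain, whatever mechanism o1 actually uses.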

As a gold medalist in the International Olympiad in Informatics and a national competitor in the US and Canadian Math Olympiads, I have been asked by many people to explain what this new release means. I asked o1-preview some choice questions in my areas of expertise: competitive math, competitive programming, and research combinatorics.

Here are my takeaways from those domains:

o1-preview is an exemplary lookup system. It accurately cites obscure math theorems, describes them correctly, and identifies their use cases. It likely uses some form of RAG (retrieval-augmented generation). In research math and computer science, it far exceeds state-of-the-art RAG systems such as Perplexity.
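For readers unfamiliar with the term, RAG pairs a retrieval step (find relevant reference text) with generation (answer conditioned on it). Here is a minimal sketch of the retrieval half, using bag-of-words cosine similarity as a crude stand-in for learned embeddings; the corpus is invented, and this is speculation about o1, not a description of it:

```python
import math
from collections import Counter

# Bag-of-words retrieval: a crude stand-in for the embedding-based
# retrieval step in a real RAG pipeline. Corpus entries are invented.

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

corpus = [
    "Mantel's theorem: a triangle-free graph on n vertices has at most n^2/4 edges.",
    "Cayley's formula: there are n^(n-2) labeled trees on n vertices.",
    "Dilworth's theorem: in a finite poset, the largest antichain equals the minimum chain cover.",
]
print(retrieve("maximum edges in a triangle-free graph", corpus))
```

A production system would swap the word-count vectors for neural embeddings and feed the retrieved text into the model's context, but the shape of the pipeline is the same.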

It struggles with more ad hoc programming and math questions, even ones that would be much easier for a human.

For someone with experience in math olympiads, it's easy to identify problems that are “stereotypical” olympiad problems, ones that use typical results and techniques. These “stereotypical” problems can be easy or difficult, but they often depend on a fixed “vocabulary” of results. For o1-preview, a stereotypical problem that would be very difficult for a human is still easier than an off-stereotype problem that an inexperienced human would find easier.

Without similar experience, it's pretty difficult for a smart generalist to tell which kinds of problems are well-represented or poorly represented in the training data.

This is epitomized by the combinatorics research examples, in which o1-preview correctly cites and applies very obscure results.

One more fun fact: o1's name may be based on the US O-1 visa for “individuals with extraordinary ability”.

## Appendix (The Brian Chau Scorecard)

Mantel's theorem ✅

notes: cites Mantel's theorem directly
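Mantel's theorem states that a triangle-free graph on n vertices has at most ⌊n²/4⌋ edges. For small n the bound can be checked exhaustively; this is my verification script, not o1's output:

```python
from itertools import combinations

# Exhaustive check of Mantel's theorem for small n: enumerate every graph
# on n labeled vertices and record the largest triangle-free edge count.

def max_triangle_free_edges(n):
    edges = list(combinations(range(n), 2))
    best = 0
    for mask in range(1 << len(edges)):
        chosen = [e for i, e in enumerate(edges) if mask >> i & 1]
        adj = {v: set() for v in range(n)}
        for u, v in chosen:
            adj[u].add(v)
            adj[v].add(u)
        # an edge (u, v) whose endpoints share a neighbor closes a triangle
        if any(adj[u] & adj[v] for u, v in chosen):
            continue
        best = max(best, len(chosen))
    return best

for n in range(2, 6):
    assert max_triangle_free_edges(n) == n * n // 4
print("Mantel's bound matches for n = 2..5")
```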

Strong perfect graph theorem ✅ (?)

also cites the SPGT, but gives an incomplete proof

Labelled functional graphs where node 1 is reachable (n^{n-1}): ❌

did not fix its answer after 3 corrections; gave an (in my opinion) incomplete proof after being prompted with the formula for spanning trees
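For reference, the claim being tested is that the number of functional graphs on n labelled nodes in which every node eventually reaches node 1 is n^{n-1}. A brute-force check of mine for small n:

```python
from itertools import product

# Brute-force count of functional graphs on n labelled nodes (each node
# has exactly one outgoing edge) in which every node eventually reaches
# node 1 by following edges. The claimed count is n^(n-1).

def reaches_node_one(f, n):
    for start in range(1, n + 1):
        v, seen = start, set()
        while v != 1 and v not in seen:
            seen.add(v)
            v = f[v]
        if v != 1:
            return False
    return True

def count_functional_graphs(n):
    count = 0
    for values in product(range(1, n + 1), repeat=n):
        f = dict(enumerate(values, start=1))  # f maps node i to values[i-1]
        if reaches_node_one(f, n):
            count += 1
    return count

for n in range(1, 6):
    assert count_functional_graphs(n) == n ** (n - 1)
print("count matches n^(n-1) for n = 1..5")
```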

Off-stereotype but difficult Canadian OI problem https://dmoj.ca/problem/cco13p6 ❌

(compile error and also obviously wrong)

Another off-stereotype OI problem: http://ceoi.inf.elte.hu/probarch/14/question.pdf ❌

complete garbage solution

https://oj.uz/problem/view/POI11_imp: ❌

passes one batch (!) out of 11, but is both wrong and inefficient

Goodman's theorem: ✅

initially just cites Goodman's theorem without proving it; gives the standard proof when prompted

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ B8 (straightforward application of Turán's theorem): ✅

Hamiltonian path in an Erdős–Rényi graph: ✅ (kinda)

Cites the correct threshold with no proof initially. When asked to prove the threshold, correctly cites another result by Pósa. When asked to prove the Pósa result, gets there eventually (though it's a bit of a loose proof)

Eigenvalues of bipartite graphs: ✅

gives the standard proof
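The standard fact here is that the adjacency spectrum of a bipartite graph is symmetric about zero: if λ is an eigenvalue, so is -λ. A quick numerical check of mine on a random bipartite graph (arbitrary seed and sizes):

```python
import numpy as np

# Build a random bipartite graph from a 4x5 biadjacency matrix B; the full
# adjacency matrix is [[0, B], [B^T, 0]]. Its eigenvalues should mirror
# about zero.
rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(4, 5))
A = np.block([[np.zeros((4, 4)), B],
              [B.T, np.zeros((5, 5))]])
eig = np.sort(np.linalg.eigvalsh(A))
assert np.allclose(eig, -eig[::-1])  # if λ is an eigenvalue, so is -λ
print(np.round(eig, 3))
```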

https://dmoj.ca/problem/valentines18s3 (medium-difficulty dynamic programming problem set by me): ❌

doesn't get any cases correct, even the samples

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ D16 (Alon–Boppana theorem) ✅

I picked this problem thinking it was not a direct application of a known result, but it turns out to also be an existing theorem, the Alon–Boppana theorem.

HMMT 2018 combinatorics p7 ✅

HMMT 2019 combinatorics p8 ❌

HMMT 2019 combinatorics p9 ❌

HMMT 2019 team round p8 (a copy of the earlier CEOI programming question) ❌ (gives the correct answer to this yes-or-no question, but a horrendously wrong proof)

https://chatgpt.com/c/66e345c3-ed3c-800e-b62c-a20e922aeaf1

https://chatgpt.com/c/66e346c7-9f9c-800e-a602-ebbac1dc89df

https://chatgpt.com/c/66e34a1b-aff8-800e-a335-5dabc448621c

https://chatgpt.com/c/66e34ba7-bc60-800e-a5fa-231af9a7a4e6

https://chatgpt.com/c/66e34cb5-4f80-800e-89d3-bbd4db4c1dce

https://chatgpt.com/c/66e34d61-74e4-800e-b0aa-c31b6d2a693c

https://chatgpt.com/c/66e34ecf-14bc-800e-ae68-216ea4a23b26

https://chatgpt.com/c/66e34fd0-71c4-800e-8951-58334e970cb8

https://chatgpt.com/c/66e350db-3d80-800e-afd6-7a67e71b9960

https://chatgpt.com/c/66e353fb-8cf4-800e-9b56-6f3faf562085

https://chatgpt.com/c/66e35596-4b20-800e-8a36-30846b2a1837

https://chatgpt.com/c/66e35816-8c98-800e-9e59-82a3d0c43dcc

https://chatgpt.com/c/66e35988-8798-800e-92ba-8ade56dd0ece

https://chatgpt.com/c/66e359f6-57ac-800e-9252-b62734e7a241

https://chatgpt.com/c/66e35a3c-c5ec-800e-a9e2-67524162f589

https://chatgpt.com/c/66e35b4f-6bb8-800e-864d-55eea5633667

https://hmmt-archive.s3.amazonaws.com/tournaments/2019/feb/comb/problems.pdf

https://hmmt-archive.s3.amazonaws.com/tournaments/2018/feb/comb/problems.pdf

It is becoming clearer how LLMs function as “fuzzy lookup systems”. Hallucination is still a problem, too: o1 does not seem able to admit that it can't figure out a problem when it's a little beyond its ability.
