Stubsack: weekly thread for sneers not worth an entire post, week ending 19th January 2025 - awful.systems

blakestacey@awful.systems · 4 months ago

aio@awful.systems · 4 months ago

That o3 does well on frontier math held-out set is impressive, no doubt

I think there is plenty of room for doubt still. elliotglazer on reddit writes:

Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.

My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.

(emphasis mine). So there is good reason to doubt that the “held-out dataset” even exists.