Interconnects · Tech & AI

GPT-4o-mini changed ChatBotArena

TIER 4   2024-07-31

<p><em>Audio of this post is available on <a href="https://podcast.interconnects.ai/episodes/gpt-4o-mini-changed-chatbotarena">podcast players here</a>.</em></p><p><a href="https://www.interconnects.ai/p/chatbotarena-the-future-of-llm-evaluation">ChatBotArena</a> is the largest community evaluation tool for language models. The <a href="https://lmsys.org/">LMSYS</a> team, which emerged early in the <a href="https://www.interconnects.ai/p/behind-the-curtain-ai">post-ChatGPT craze</a>, works with most of the model providers to host all of the relevant models. If you’re looking to get to know how multiple models compare to each other, ChatBotArena is the place to start.</p><p>ChatBotArena casts language model evaluation as a wisdom-of-the-crowd problem. For getting an initial ranking of how models stack up and how the models in the ecosystem are getting better, it has been and will remain crucial.</p><p>ChatBotArena is neither a controlled nor an interpretable experiment on language models.</p><p>When evaluating models to learn which are the best at extremely challenging tasks, control over the prompt distribution and careful feedback are necessary. For these reasons, ChatBotArena cannot definitively tell us which models are solving the hardest tasks facing language models. It does not clearly measure how the best models are improving. This type of transparency comes from elsewhere.</p><p>For most of its existence, people equated the general capabilities tested in ChatBotArena with a definitive ranking of <em>which models can do the hardest things for me</em>. This is not true. In both my personal experience reading data and what the community knows about the best models, the ChatBotArena ranking shows the strongest correlations with:</p><ol><li><p>Certain stylistic outputs from language models, and</p></li><li><p>Language models that have high rates of complying with user requests.</p></li></ol><p>Both of these have been open research problems over the last two years. 
<a href="https://www.interconnects.ai/p/how-rlhf-works-2">Style is deeply intertwined with how information is received by the user</a>, and precisely refusing only the most harmful requests is a hard technical problem for which both Meta (with Llama 2) and Anthropic (particularly with earlier versions of Claude) have been heavily criticized.</p><p>Among closed labs, styles have been greatly refined. Meta, OpenAI, and Anthropic all have distinctive styles (admittedly, I haven’t used Google’s Gemini enough to know).</p><ul><li><p>Meta’s AI is succinct and upbeat (something that <a href="https://www.reddit.com/r/singularity/comments/1caj1tb/llama_3_is_now_top5_in_leaderboard_arena/">has been discussed many times on the LocalLlama subreddit</a>).</p></li><li><p>OpenAI’s style is the most robotic to me. It answers as an AI and contains a lot of information.</p></li><li><p>Claude’s style is <a href="https://www.interconnects.ai/p/switched-to-claude-from-chatgpt">intellectual, bordering on curious, and sometimes quick to refuse</a>.</p></li></ul><p>When ChatBotArena was founded, these styles were in flux. Now, they substantially shift the rankings depending on what people like. People seem to like what OpenAI and Meta put out.</p><p>There are clear reasons why OpenAI’s models top the charts on ChatBotArena. They were the originators of modern RLHF, have <a href="https://www.interconnects.ai/p/openai-rlhf-model-spec">most clearly stated their goals with RLHF</a>, continue to publish innovative ideas in the space, and have always been ahead here. Most people just did not realize how important this was to evaluation until the launch of GPT-4o-mini. 
<a href="https://x.com/tszzl/status/1779608670181171504">Culture impacts AI style</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KxYI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KxYI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 424w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 848w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 1272w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KxYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png" width="1456" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KxYI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 424w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 848w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 1272w, https://substackcdn.com/image/fetch/$s_!KxYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a44064a-8226-472a-8598-c3b108e3eb94_2000x851.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Every evaluation tool has weaknesses. 
Here’s the plot LMSYS <a href="https://x.com/lmsysorg/status/1815855136318840970/photo/1">recently shared</a> with early results for GPT-4o-mini, which caused a bit of a stir.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dkKt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dkKt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 424w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 848w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 1272w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dkKt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png" width="1456" height="1063" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1063,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1256965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dkKt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 424w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 848w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 1272w, https://substackcdn.com/image/fetch/$s_!dkKt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31740f35-abba-42b5-9c37-dfbd51a1ae73_2000x1460.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>OpenAI <a href="https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/">announced GPT-4o-mini</a>, their latest model, billed as “intelligence too cheap to meter” <a href="https://x.com/sama/status/1813984333352649087">according to Sam Altman</a>. This model seems very likely to be distilled from current or unreleased versions of OpenAI’s models (<a href="https://www.interconnects.ai/i/145870222/are-gemini-flash-and-claude-haiku-distilled">as Anthropic does with Claude Haiku and Google with Gemini Flash</a>). This is an important model in OpenAI’s line-up and one that will be used in many applications via the popular OpenAI API. 
People following the frontier model market were disappointed to see OpenAI going smaller rather than bigger.</p><p>On an evaluation that would rank models on “absolute peak ability,” people would expect GPT-4o-mini to be nowhere near the top 3.</p><p>LMSYS went so far as to share a <a href="https://x.com/lmsysorg/status/1816838034270150984">thread</a> and a <a href="https://huggingface.co/spaces/lmsys/gpt-4o-mini_battles">demo</a> to specifically show how GPT-4o-mini performed so well on their arena, and it paints a clear picture of what the <em>average</em> ChatBotArena user tests (as usual, there is plenty of similar discussion on <a href="https://www.reddit.com/r/LocalLLaMA/comments/1ed01p8/why_gpt4o_mini_beats_claude_35_sonnet_on_lmsys/">LocalLlama</a>). I went through the battles, particularly those against Claude 3.5 Sonnet, to confirm the discussion above. Mini has a distinctive list-and-line-breaks style, and Claude refuses a tad too many requests.</p><p>For a first example, in style:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P-Bf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P-Bf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 424w, https://substackcdn.com/image/fetch/$s_!P-Bf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 848w, 
https://substackcdn.com/image/fetch/$s_!P-Bf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!P-Bf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P-Bf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png" width="1456" height="1003" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:878426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P-Bf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 424w, https://substackcdn.com/image/fetch/$s_!P-Bf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 848w, 
https://substackcdn.com/image/fetch/$s_!P-Bf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!P-Bf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12a98a7-6d3e-44ea-8796-e99b7f90376d_2000x1378.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Or a refusal:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!XHNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XHNW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 424w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 848w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XHNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:929537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XHNW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 424w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 848w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!XHNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbe29a5-6772-40a1-9e5a-82c478ea9d8e_2000x1378.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>There are plenty more like this. I take the “overall” category of ChatBotArena with large error bars.</p><h3>Llama 3 in the arena</h3><p>The scores for Llama 3.1 in the Arena were <a href="https://x.com/lmsysorg/status/1818321701052276990">recently announced</a>. The 405B model comes in behind the latest Gemini Pro and ahead of Claude 3 Opus. The 70B model is close to older versions of GPT-4.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PV00!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PV00!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 424w, https://substackcdn.com/image/fetch/$s_!PV00!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 848w, https://substackcdn.com/image/fetch/$s_!PV00!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!PV00!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PV00!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png" width="1456" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1404860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PV00!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 424w, https://substackcdn.com/image/fetch/$s_!PV00!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 848w, https://substackcdn.com/image/fetch/$s_!PV00!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 1272w, https://substackcdn.com/image/fetch/$s_!PV00!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7379a1-775e-401d-8d28-e9763c1ca2ba_2000x1484.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This figure slightly distorts things by using a very narrow y-axis range, which makes the gaps between the models look bigger than they are. I very accurately <a href="https://x.com/natolambert/status/1817292317352710272">predicted this Arena positioning for Llama 405B a few days earlier on Twitter</a>:</p><blockquote><ol><li><p>Open weight models get to operate without a safety filter added (e.g. llama gaurd), which is a major boost.</p></li><li><p>Meta ai's concise and slightly different, friendly style will help it. 
Claude's style doesn't appeal to the masses who are voting.</p></li></ol></blockquote><p>Meta team members, being late to join the frontier model party, have shared plenty of interesting comments about ChatBotArena scores when discussing their projects. Some team members explicitly said that the first Llama 3 versions “outperformed expectations on the benchmark.” The <a href="https://www.latent.space/p/llama-3">recent episode of Latent Space</a> with a lead on the Llama alignment team had more details (emphasis mine):</p><blockquote><p>Now the models are getting so good that it's hard to get to some prompts to break them and to compare models and see their edge cases.</p></blockquote><p>And later on:</p><blockquote><p>Because when we did the preview, and I don't know yet what will be the results for this new Llama 3, but <strong>we ended [up] very high in this blind test leaderboard. And to be honest, I didn't expect that.</strong> I knew we had good results internally, but how that will transfer to perception from the community, people like using it in practice and comparing it to the other models, I didn't expect that positive feedback.</p></blockquote><p>He goes on to say of the community scoring the models that “we are limited” (edited slightly for clarity):</p><blockquote><p>We are not good to do that. So, it gives you a very good indicator of how good, helpful, how on the main core of the distribution, simple prompts about the tone of the model compared to the others. 
But for much more complex prompts, much more intelligent reasoning, coding of complex stuff, it doesn't tell the full story.</p></blockquote><h3>Partial solutions and next steps</h3><p>There are a few existing avenues to give us a more dependable signal for comparing top models.</p><ol><li><p>ChatBotArena’s built-in harder categories (Hard Prompts, Reasoning, Math, etc.).</p></li><li><p>Private human evaluation, such as <a href="https://scale.com/leaderboard">Scale AI’s new-ish leaderboard</a>.</p></li></ol><p>Both of these options have issues. Both of these options are better than the default, overall aggregate score on ChatBotArena. Evaluating cutting-edge language models is a domain-expert task. It is not cheap or convenient.</p><p>ChatBotArena’s <a href="https://lmsys.org/blog/2024-05-17-category-hard/">hard categories</a> are curated by training classifiers to route the prompts to different sections. 
The correlation between these prompts is closer to what we expect, as seen in their blog post.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEXP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEXP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEXP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:460841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UEXP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!UEXP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9a6f0f-7d5b-4e60-b48a-dfdd86f2ee07_2000x2000.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Fundamentally, this is only a small step in a direction that is extremely costly to go all the way on. This category requires a more powerful AI than we have to accurately classify the hard prompts. I suspect this category is actually prompts that <em>present as hard</em>, rather than being prompts that are <em>what is currently hard for language models</em>. We’re hoping to evaluate the latter, but the best we can do is a proxy. One solution could be human data from a provider that knows what the top labs are currently trying to overcome.</p><p>In the near future, I expect a mix of Hard Prompts, Math, and Code to become the default on ChatBotArena. It’s not an easy transition to make.</p><p>Here are the early results on Llama 3.1 405B Instruct on Scale’s leaderboard. It puts the model right at the top of instruction following, but below other frontier models on the more challenging tasks. 
This is roughly the shape I would expect for Llama’s performance characteristics, though the model sits closer to the top than I anticipated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YVK4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YVK4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 424w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 848w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 1272w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YVK4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png" width="1456" height="1094" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1094,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:571597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YVK4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 424w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 848w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 1272w, https://substackcdn.com/image/fetch/$s_!YVK4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c79e13-8f65-4720-afec-edd2e47ac146_2000x1503.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Unfortunately, trust in Scale’s leaderboard has a ceiling because of a clear conflict of interest: models from the labs Scale sells training data to likely have an advantage, since they are in-distribution for Scale’s human raters. This follows from the infrastructure used to incentivize human curators, regardless of whether the prompts are indeed fully blind. Even so, the value gained by looking at this leaderboard <em>together with</em> the noisier, more random prompts of ChatBotArena is high.</p><p>No evaluation tool has an infinite lifespan. It’s a sign of progress that the language modeling industry has partially outgrown ChatBotArena. The Arena will still be a central component of model launches, but we need to keep building more tools to give a more diverse and robust representation of language model evaluation. 
As many as possible of these should be run in public and by organizations with simple incentives.</p><p>While we focus on frontier models, prompts that fairly compare the best language models with those a tier below them may not exist. Much like there were plenty of prompts that open models couldn’t begin to solve in the early days of GPT-4, there are likely plenty of prompts that Claude 3.5 Sonnet can ace and few other models can. Global arenas built on model-to-model evaluations can never capture this information. For this, we must continue relying on specific benchmarks. Whenever you average over categories, you lose some signal.</p><div><hr></div><p><strong>Housekeeping</strong></p><ul><li><p>Audio of this post is available (soon) in <a href="https://podcast.interconnects.ai/">podcast</a> form (and sometimes on <a href="https://www.youtube.com/@interconnects">YouTube</a>).</p></li><li><p>My real podcast is at <a href="http://retortai.com">retortai.com</a>.</p></li><li><p><em>Paid subscriber Discord access in email footer.</em></p></li><li><p>Referrals → paid sub: Use the <a href="https://www.interconnects.ai/leaderboard">Interconnects Leaderboard</a>.</p></li><li><p>Student discounts in <a href="https://www.interconnects.ai/about">About page</a>.</p></li></ul>