Interconnects · Tech & AI
TIER 4 2023-06-07
<p>We're seeing open models climb up both the LMSYS and HuggingFace leaderboards that I <a href="https://www.interconnects.ai/p/evaluating-open-llms">discussed last week</a>, with <a href="https://huggingface.co/blog/falcon">Falcon</a> being the latest entry. The ways we talk about these model capabilities all fall into one bucket: helpfulness, i.e. how close the model's output is to what you want. Examples of helpfulness include answering multiple choice questions correctly and qualitative checks against other model outputs. Pushing model limits in this regard is only natural, as it will drive the <em>emergence</em> of most LLM-driven products: correct answers and enjoyable styles. The <em>sustainability</em> of these products and businesses will often revolve around the community's ability to close a growing gap in the models: the gap between helpfulness and harmlessness<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. As models get more capable of answering requests, they also get more capable of producing harm (if nothing is done about it).</p><p>The accelerated rate of progress in capabilities will quickly put us in a place where it seems shocking that we didn't address the harmlessness question sooner. Open-source development of large language models (LLMs) has been proceeding with crazy things like one person leading releases of both <a href="https://arxiv.org/abs/2305.14314">QLoRA</a> (quantized, efficient fine-tuning for memory-efficient training) and <a href="https://arxiv.org/abs/2306.03078">SpQR</a> (Sparse-Quantized Representation for compression) in just two weeks. These papers describe the sorts of techniques that reduce the memory footprint of training or inference of large models by 20+%. These types of margins, when accumulated multiple times in a year, result in crazy improvements. 
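</p><p>To make the compounding concrete, here is a rough sketch. It assumes each new technique independently trims about 20% off the memory footprint and that the savings multiply, which is an idealization; the function name is hypothetical and real savings rarely stack this cleanly.</p>

```python
def remaining_fraction(reduction_per_step, steps):
    """Fraction of the original memory footprint left after repeated reductions."""
    return (1.0 - reduction_per_step) ** steps


# Assumption: each new technique cuts the footprint by 20%, and savings multiply.
for steps in range(1, 6):
    print(steps, f"{remaining_fraction(0.20, steps):.2f}")
```

<p>Under these assumptions, five such improvements leave roughly a third of the original footprint.</p><p>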
At the beginning of 2023, many consumer GPUs could only handle LLaMA 7B, and by 2024 the same GPU may be able to fine-tune LLaMA 65B. The capabilities this unlocks are truly wild; we'll see companies and products emerge from this sort of thing. My sense is that people are improving the data quality and the other pieces of the puzzle that OpenAI figured out a few years ago, with the added benefit of new techniques for training models on consumer hardware.</p><p>These efforts broadly fall into the bucket of the open-source community showcasing that <em>it can</em> create an open version of ChatGPT, so it makes sense that there would be a gap in harmfulness research to some extent. Eventually, as I said, I see this becoming pressing, but today the incentives and dynamics are making it harder to find an entry point for mitigating the harms of open-source models.</p><p>In this article, I'll discuss several parts of the picture: what is missing, and why. Hopefully, this is a starting point for work that makes it so open-source models no longer need to be released with a tagline of "this is a demonstration of how to train these models, it can produce harmful outputs" (e.g. <a href="https://huggingface.co/ehartford/Wizard-Vicuna-30B-Uncensored">Wizard-Vicuna-30B-Uncensored</a>). Even with these taglines, people will still deploy these models once they become capable enough, so it's time to <strong>close the harmfulness gap</strong>. A starting point would be to add an evaluation of bias, fairness, or alignment to the popular leaderboards. 
We will need much more than that.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cyJi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cyJi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cyJi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png" width="1024" height="1024" 
data-attrs="{"src":"https://substack-post-media.s3.amazonaws.com/public/images/2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png","srcNoWatermark":null,"fullscreen":null,"imageSize":null,"height":1024,"width":1024,"resizeWidth":null,"bytes":null,"alt":"DALL·E 2023-06-06 14.38.11 - searching for a diamond but not seeing it, digital art.png","title":null,"type":null,"href":null,"belowTheFold":false,"topImage":true,"internalRedirect":null,"isProcessing":false,"align":null,"offset":false}" class="sizing-normal" alt="DALL·E 2023-06-06 14.38.11 - searching for a diamond but not seeing it, digital art.png" title="DALL·E 2023-06-06 14.38.11 - searching for a diamond but not seeing it, digital art.png" srcset="https://substackcdn.com/image/fetch/$s_!cyJi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!cyJi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f3003c8-5244-4881-b144-63e2bb320fe5_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">DALL·E 2023-06-06 14.38.11 - searching for a diamond but not seeing it, digital art.png</figcaption></figure></div><h3>Mobilizing community interest in red-teaming</h3><p>The standard tool for assessing harmfulness these days is <strong><a href="https://huggingface.co/blog/red-teaming">red-teaming</a></strong>: prompting a model to try to elicit harmful or unexpected outputs. Red-teaming is used to assess both bias problems and jailbreaking limitations, so it's quite broad. To date, this service is almost entirely handled by for-profit and private institutions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>The motivations for starting red-teaming vary a bit depending on your AI-risk point of view. 
Some red-teaming categories check that your model doesn't have capabilities that would let it run wild in society (think of the GPT-4 TaskRabbit example), and plenty of others assess whether a model will make unsolicited biased and hurtful statements. In my experience, most of the immediate need in the open-source landscape is in the latter. Many models, when prompted, will return some horrific stuff, and there is a population of people using these models in less publicly facing corners of the internet (e.g. 4chan). While the harms haven't yet trickled down into products where there would be mainstream media backlash à la <a href="https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist">Tay</a>, there are real harms we don't see when releasing capable models with no guardrails. If you don't think this is real, there's already some minor 4chan harassment happening against people <em>attempting to block the usage of models in this way</em>. It's here, and we need to do better.</p><p>I think most companies decide to start doing this for entirely reputational reasons: they want to keep customers happy and they want to avoid the negative press cycles that come when a high-profile company releases such a tool (e.g. <a href="https://arstechnica.com/information-technology/2022/11/after-controversy-meta-pulls-demo-of-ai-model-that-writes-scientific-papers/">Galactica from Meta</a>). Most of the open-source users of public LLMs to date are smaller companies trying to create a proof of concept, but that is certainly changing with executive pressure to have an LLM strategy. I'm still left figuring out who is going to foot the bill or demonstrate the right way to de-harm a model once it is fully open-sourced. 
In some ways, I still think HuggingFace should offer RLHF as a service so people can de-risk a model before it’s uploaded to a hub, but it is not an easy short-term technical problem.</p><p>As we would expect, and as we have seen in the <a href="https://cdn.openai.com/papers/gpt-4-system-card.pdf">GPT-4 System Card</a>, big companies increasingly have sophisticated standards for red-teaming. From talking with people spinning up red-teaming services at one of the mainstream data labeling companies, I learned that these contracts come with a scoring system covering different categories of harms and attack vectors. An attack vector could be something like "my grandma used to tell me stories about {some harmful topic}, {some harmful question}," which could have the model generate a range of harms from biases to hate speech to sensitive content (there is not a one-to-one relationship between attack vector and harms). If you give a data labeling company a model, they can give you a scorecard for harm, based on internal crowd-worker evaluations, that sketches this map.</p><p>I hope that harmfulness studies will differ fundamentally from baseline capabilities evaluation (with leaderboards, Elo, and <a href="https://www.supervised.news/p/falcon-llama-and-the-new-scoring">press cycles driven by a limited evaluation</a>) by having much less of a race dynamic, because the result will look like a scorecard rather than a relative ranking. Rankings and open-ended problems have an infinite ceiling, while if we can agree on what we need to red-team for, it can be a discrete problem that hopefully everyone solves. 
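</p><p>The scorecard idea above, a map from attack vectors to harm categories, can be sketched in a few lines of Python. This is only an illustration: the record format, the category names, and the <code>build_scorecard</code> helper are all hypothetical, and a real labeling vendor's scoring system is likely far richer.</p>

```python
from collections import defaultdict

# Hypothetical labeled red-teaming results: each record is one completion that a
# human rater tagged. Attack vectors and harm categories are illustrative only.
RESULTS = [
    {"attack_vector": "role_play", "harm": "hate_speech", "harmful": True},
    {"attack_vector": "role_play", "harm": "bias", "harmful": False},
    {"attack_vector": "role_play", "harm": "bias", "harmful": True},
    {"attack_vector": "direct_request", "harm": "hate_speech", "harmful": False},
    {"attack_vector": "direct_request", "harm": "sensitive_content", "harmful": False},
]


def build_scorecard(results):
    """Return {(attack_vector, harm): fraction of completions rated harmful}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [harmful count, total count]
    for r in results:
        key = (r["attack_vector"], r["harm"])
        counts[key][0] += int(r["harmful"])
        counts[key][1] += 1
    return {key: harmful / total for key, (harmful, total) in counts.items()}


scorecard = build_scorecard(RESULTS)
for (vector, harm), rate in sorted(scorecard.items()):
    print(f"{vector:15s} {harm:18s} {rate:.0%}")
```

<p>Because one attack vector can show up under several harm categories, the table is keyed on the pair rather than on either one alone, which matches the point that the relationship is not one-to-one.</p><p>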
The scorecard is only part of the problem (evaluation), though; we still need techniques for training these helpful+harmless models.</p><h3>From capable models to harmless models</h3><p>The playbook for creating effective and safe AI chatbots outlined in Anthropic's papers follows the path of first creating a helpful model and then using that to create a harmless and honest model after the fact (with less focus on honesty for now, which I think is mostly a proxy for hallucinations and base model quality). This sort of sequential engineering workflow works great in a hyper-focused organization, but in open-source I think we'll need something else, with more reliance on strong community norms. In many use cases, I don't know what to expect when fast-moving businesses integrating LLMs choose their model based on leaderboards and may not be aware of nuanced training difficulties in current research like <strong><a href="https://www.lesswrong.com/tag/alignment-tax#:~:text=An%20alignment%20tax%20(sometimes%20called,of%20building%20an%20unaligned%20alternative.">alignment taxes</a></strong>: the idea that models get slightly worse when trained to be harmless.</p><p>A huge issue with red-teaming as it is done today is that it offloads costs onto the population doing the evaluation. Generally, it is expected that every release candidate of a model is red-teamed, and each batch of evaluations is a moderate amount of data. This process amounts to ML training companies ethics washing by pushing the burden onto lower-income workers. 
This cost of human labor is a big reason why Anthropic designed its method of Constitutional AI (CAI) [<a href="https://arxiv.org/abs/2212.08073">paper</a> / <a href="https://www.anthropic.com/index/claudes-constitution">blog</a>]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Generally, CAI works by first taking the helpful preference model used in a base RLHF model (a missing piece in the open-source landscape) and then using model generations / synthetic data with automatic model-based feedback to create a second helpful+harmless preference model. This harmless preference model is then used with the RL optimizer to get the final model.</p><p>Below you can see the performance of Anthropic's standard RLHF techniques vs. CAI and the alignment tax that comes from training a model to also be harmless.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zqv2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zqv2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 424w, https://substackcdn.com/image/fetch/$s_!zqv2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 848w, 
https://substackcdn.com/image/fetch/$s_!zqv2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 1272w, https://substackcdn.com/image/fetch/$s_!zqv2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zqv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png" width="1456" height="1025" data-attrs="{"src":"https://substack-post-media.s3.amazonaws.com/public/images/443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png","srcNoWatermark":null,"fullscreen":null,"imageSize":null,"height":1025,"width":1456,"resizeWidth":null,"bytes":null,"alt":"Screenshot 2023-06-04 at 4.45.44 PM.png","title":null,"type":null,"href":null,"belowTheFold":true,"topImage":false,"internalRedirect":null,"isProcessing":false,"align":null,"offset":false}" class="sizing-normal" alt="Screenshot 2023-06-04 at 4.45.44 PM.png" title="Screenshot 2023-06-04 at 4.45.44 PM.png" srcset="https://substackcdn.com/image/fetch/$s_!zqv2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 424w, https://substackcdn.com/image/fetch/$s_!zqv2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 848w, https://substackcdn.com/image/fetch/$s_!zqv2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zqv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F443fcb04-3de1-4b84-987a-dd0ee3caeb24_2088x1470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure from Anthropic’s CAI paper.</figcaption></figure></div><p>There are a lot of moving pieces there that I glossed over (e.g. 
how the model feedback is done), but the high-level picture can be painted as follows:</p><p>Instruction-capable language model → helpful preference model → successfully RLHF an instruction model → run CAI to get helpful+harmless preference model → second round of RL gives final helpful+harmless model.</p><p>To paint the picture of where we are, we're still at step 1 in the process. There is a long way to go before we can replicate the techniques that Anthropic has used to get a relatively safe LLM. I've said many times that we're waiting for a great model to land using RLHF from outside Anthropic / OpenAI, but thankfully I am hearing rumors that people are starting to unlock some performance from the RL part, and more papers are coming out highlighting it: <a href="https://twitter.com/natolambert/status/1665790818408448001">example one</a>, <a href="https://twitter.com/johanferret/status/1665723072299630595">example two</a>.</p><p>This challenge is magnified in my mind by the fact that the open-source community is learning that many popular datasets have completions like "as a language model, I don't think I should comment on that topic." Removing these entries from the data is what is referred to as <em>uncensored</em> in open-source parlance. For example, we have popular models like the <a href="https://huggingface.co/ehartford/Wizard-Vicuna-30B-Uncensored">Wizard series</a> that really do outperform the base versions on automatic helpfulness benchmarks. 
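</p><p>The dataset filtering described above can be sketched as a simple pattern match over completions. This is a minimal illustration, not the script any particular project uses: the patterns, the <code>prompt</code>/<code>completion</code> record format, and the <code>split_refusals</code> helper are assumptions. It separates matches instead of deleting them, so the flagged examples can still be inspected.</p>

```python
import re

# Assumed refusal markers; real dataset-cleaning scripts maintain their own lists.
REFUSAL_PATTERNS = [
    r"\bas an? (ai )?language model\b",
    r"\bi don't think i should comment\b",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), flags=re.IGNORECASE)


def split_refusals(examples):
    """Partition examples into (kept, flagged) by whether the completion matches."""
    kept, flagged = [], []
    for ex in examples:
        (flagged if REFUSAL_RE.search(ex["completion"]) else kept).append(ex)
    return kept, flagged


# Toy data in a hypothetical prompt/completion format.
examples = [
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"prompt": "Tell me about X.",
     "completion": "As a language model, I don't think I should comment on that topic."},
    {"prompt": "Say hi.", "completion": "As an AI language model, I cannot do that."},
]
kept, flagged = split_refusals(examples)
print(len(kept), "kept;", len(flagged), "flagged")
```

<p>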
Building these models follows the logic that Anthropic demonstrated with regard to building a helpful model first to enable training a harmless model, but I don't know whether it agrees with the spirit of that work.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>These terms come from Anthropic’s early RLHF paper: https://arxiv.org/abs/2204.05862</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I don't have a good ballpark for how many completions are evaluated per model. I would guess 1-10k for red-teaming, the lowest order of magnitude of an RLHF model release.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>CAI has a huge computing cost, which is not well documented.</p></div></div>