Explainer of Ludwig, Mullainathan and Rambachan's 2026 Econometrics of LLM Paper

# Explainer of Ludwig, Mullainathan and Rambachan's 2026 Econometrics of LLM Paper 

scott cunningham · Mar 25
  
This is not technically a Claude Code post as it is an econometrics post. It's an econometrics post about LLMs from a new paper by Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan entitled "Large Language Models: An Applied Econometric Framework", forthcoming in _Annual Review of Economics._ You can find the abstract below.

[Image: abstract of the paper]
  
As this is an econometrics explainer, and not so much a Claude Code one (even though it will be partly based on earlier posts I've done using Claude Code to analyze texts), I flipped a coin three times to decide on paywalling. Best two out of three wins, and in this case that was heads, which means it goes behind the paywall.

So in this post, I'm going to walk through a new forthcoming paper by that powerhouse team of coauthors. Rambachan, many of you will know, is coauthor on the "credible parallel trends" paper with Jon Roth in _Restud_ from a few years ago. He has a fascinating research agenda, and was actually a guest on my podcast back in the day.

## S3E21: Ashesh Rambachan, Predictive Algorithms and Causal Inference, MIT

scott cunningham · June 11, 2024

Greetings listeners! It is a pleasure to introduce this week's guest on the podcast, Ashesh Rambachan, an assistant professor of economics at MIT. I wanted to talk to Ashesh for two main reasons. First, because I wanted to, and second, because I was aware of some of his recent work in econometrics. His recent article on...
  
The paper is about using LLMs for either prediction or estimation tasks. Much of it concerns using LLMs to automate text classification in economics research, specifically replacing expensive human annotation with cheap LLM labels. The manuscript is a deep discussion of the measurement error problems that arise when you do.

The key theoretical result, which I'll try to break down carefully, is that high accuracy alone doesn't protect your regression estimates, because errors can correlate with your covariates in ways that destroy inference. Their solution is a small human-coded validation sample used not to replace the LLM but rather to debias its labels. 
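To make that concrete for myself, here is a minimal simulation sketch. It is my own toy design, not anything from the paper: the variable names, the 10% error rate, and the way the errors depend on the covariate are all assumptions chosen for illustration. The true label is unrelated to the covariate by construction, the "LLM" label is right about 90% of the time, and yet the regression on the LLM labels finds a sizable slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical binary covariate (e.g., some speaker characteristic).
x = rng.binomial(1, 0.5, size=n)

# True binary label, drawn independently of x, so the true slope is 0.
y = rng.binomial(1, 0.5, size=n)

# Simulated LLM label (an assumption, not a real model): errors depend on x.
# When x == 1 the labeler only flips true 0s to 1; when x == 0 it only
# flips true 1s to 0. Either way, 10% of labels are wrong overall.
flip_prob = np.where(x == 1, 0.2 * (y == 0), 0.2 * (y == 1))
y_llm = np.where(rng.random(n) < flip_prob, 1 - y, y)


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


accuracy = float(np.mean(y_llm == y))
slope_true = ols_slope(x, y)      # uses the true labels
slope_llm = ols_slope(x, y_llm)   # uses the LLM labels instead

print(f"LLM accuracy:          {accuracy:.3f}")
print(f"slope with true label: {slope_true:.3f}")
print(f"slope with LLM label:  {slope_llm:.3f}")
```

By construction the accuracy is close to 0.90, the slope on the true labels is close to zero, and the slope on the LLM labels is close to 0.20. Nothing about the 90% accuracy warned you that would happen.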

I came upon this as I was trying to write up the work I did on here with Claude Code to re-analyze a paper from PNAS that classified 305,000 Congressional speeches from the late 19th century to the present with regard to the speaker's sentiment about immigration. Here's the first substack about it, though I did three in total. I decided over spring break to figure out a strategy for how to write this up, and I'm reading the Ludwig, et al. (2026) paper now to see if this might be the angle.

## Claude Code Part 14: I Asked Claude to Replicate a PNAS Paper Using OpenAI's Batch API. Here's What Happened (Part 1)

scott cunningham · Feb 5

I've been experimenting with Claude Code for months now, using it for everything from writing lecture slides to debugging R scripts to managing my chaotic academic life. But most of those tasks, if I'm being honest, are things I could do myself. They're just faster with Claude.
  
* * *

**The Paper That Caught My Eye**

So, let me back up. There is a cottage industry right now in writing papers about AI. It reminds me of Covid to some degree where at first there were only a few papers about Covid, then there were ten, then there were a hundred, then a thousand, then a hundred thousand, then it was a blizzard and I couldn't keep up with anything and so just stuck with my normal research agenda rather than make any effort at a contribution.

I'm not saying AI is like that now, but it's definitely pushing that way. I consider myself lucky that I actually find this interesting. I developed a class on the economics of AI at Baylor in the spring of 2025, and have been a fairly intense power user of gen AI ever since March 2023. I have thought about its practical use for economics, both thinking theoretically about work and aggregate output, but also thinking about how it could be a tool for me to do research. I have written (in admittedly weird papers) about using it for prediction, and I now use Claude Code intensively for my own research projects, as well as to start different types of research projects that I otherwise never would have begun.

This reclassification of 305,000 speeches from the late 19th century to the present is an example of a project that I would've never started had it not been for Claude Code. One thing led to the next until I felt like I had a set of findings, and then I needed to better understand what econometricians had been saying about LLMs to see if there was something beyond the rote "replication" exercise I had been doing.

And that was when this paper caught my eye. It was while I started reviewing what economic historians had been doing with large language models that I somehow found their paper, and now this substack is me taking a stab at explaining it to myself, as well as to you.

* * *

**What the Paper Actually Does**

Here's what I understand it to be, and I want to be honest about the fact that I'm working this out as I write.

Ludwig, Mullainathan, and Rambachan make a clean distinction between two ways economists use LLM outputs. The first is prediction problems and the second is estimation problems. A prediction problem is one where you use an LLM to forecast some outcome, like Van Pham and I did in our paper. Or like Jared Black, Coco Sun and I did when we predicted the Harris-Trump election outcome for 100 days using an extension of the method that Van and I used (and failed miserably).

But others do this too. You often see people using LLMs to forecast some outcome. Stock returns from financial headlines, for instance. Even before Claude Code, that was itself a growing cottage industry of applied work by academics and industry folks. Can LLMs _predict_, and if so how can we know? And how will we use it? And which decisions matter and which ones don't? Because Van and I found that even seemingly relevant information fed to ChatGPT caused prediction errors to paradoxically _rise_.

But the second is, like I said, about using LLMs for estimation problems. That is where you'd use the LLM to automate the measurement of some economic concept so that you can use that measurement again downstream. Maybe in a regression. 

  
These two problems sound similar but they require completely different disciplines. Just as with prediction and causal inference, where we often use the same tool (e.g., regression) for very different tasks, it's the same here.

* * *

**No Such Thing As A Free Label**

Labeling texts is expensive, or can be. You can use Mechanical Turk, but reports have been saying that the quality of MTurk has been deteriorating over the last decade. You could pay students, but that's expensive as well. But if you had the labels, then you might want to estimate some population estimand, omega, with an estimator like the sample average or a regression.

  
The problem is that it's expensive, as I said, so the researcher substitutes LLM labels for the "true labels" and runs the regression on those instead.

The question then is obviously about bias. When can I believe that the cheap estimate is reliable?
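One way I find it helpful to see where the bias comes from, in my own notation rather than the paper's: let $Y$ be the true label, $\hat{Y}$ the LLM label, and $X$ a single covariate. The slope you get from regressing the LLM label on $X$ splits into the slope you wanted plus a second term:

$$
\frac{\operatorname{Cov}(X, \hat{Y})}{\operatorname{Var}(X)}
= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
+ \frac{\operatorname{Cov}(X, \hat{Y} - Y)}{\operatorname{Var}(X)}
$$

The second term is the bias. It is zero only when the labeling error $\hat{Y} - Y$ is uncorrelated with $X$, and that condition says nothing about how often the LLM is right. This is just the linearity of covariance applied to a bivariate regression, so treat it as a simplified illustration and not the general result in the paper.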

* * *

**LLMs For Prediction versus Estimation**

For prediction, the key requirement is what they call "no training leakage". It is close in spirit to careful research design, or to splitting your data into training and test samples. "No training leakage" means that the LLM's training data can't overlap with your evaluation sample. This sounds obvious. Whatever you think it might mean, you'd probably agree "leakage" doesn't sound like a great thing. But in this case, what it means is that prompt engineering, like telling the model to "ignore information past this date" or whatever instructions you give it, doesn't actually prevent the overlap. They tested whether prompts can create the necessary moat, such that GPT or whichever model behaves as if it never saw the training data, and they can't. Leakage regularly occurs. GPT-4o, for instance, could complete 344 out of 10,000 Congressional bill descriptions verbatim from the first half alone, which means it had memorized them.
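For scale, 344 out of 10,000 is a verbatim completion rate of about 3.4%. Below is a minimal sketch of how a check like that could be scored. It is my own illustration, not the paper's code: `verbatim_completion_rate` and `toy_model` are hypothetical names, the three bill-like texts are made up, and the "model" is a stand-in function where a real LLM call would go.

```python
from typing import Callable, Sequence


def verbatim_completion_rate(
    texts: Sequence[str], complete: Callable[[str], str]
) -> float:
    """Share of texts whose second half is reproduced exactly from the first half."""
    hits = 0
    for text in texts:
        mid = len(text) // 2
        prefix, suffix = text[:mid], text[mid:]
        # Count a hit only when the completion matches the held-back half exactly.
        if complete(prefix).strip() == suffix.strip():
            hits += 1
    return hits / len(texts)


# Made-up texts for illustration only.
texts = [
    "A bill to amend the Internal Revenue Code.",
    "A bill to establish a national park.",
    "A resolution honoring local teachers.",
]

# Toy stand-in for an LLM that has "memorized" exactly one of the texts.
memorized = {texts[0]}


def toy_model(prefix: str) -> str:
    for t in memorized:
        if t.startswith(prefix):
            return t[len(prefix):]
    return ""  # no memorized continuation available


rate = verbatim_completion_rate(texts, toy_model)
print(f"verbatim completion rate: {rate:.3f}")
```

With a real model you would swap `toy_model` for a function that sends the prefix to the API and returns the completion. A nonzero rate on your evaluation texts is evidence that they were in the training data.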

So prompt engineering does not guarantee "no training leakage". It's stronger than that: it simply doesn't work, so you can't use it to satisfy the condition.

For estimation, which is what I'm doing in this PNAS replication idea I've been working on here for a month or so, the key requirement is a _validation sample_. This is different from the "no training leakage" concept, so put that out of your mind for now. With a _validation sample_, you collect your measurement (e.g., some classified sentiment) on a small random subsample using the expensive, careful, human-coded method. Then you use that subsample to debias the LLM's labels.
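Here is a minimal sketch of what "use that subsample to debias" could look like. I want to be clear that this is a simple additive correction I wrote down to understand the idea, under my own simulated setup, and I am not claiming it is the paper's exact estimator. It runs the regression on the LLM labels in the full sample, then uses the random validation subsample to estimate how the labeling error co-moves with the covariate, and adds that back.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_val = 200_000, 5_000

# Simulated data (all assumptions): the true slope of y on x is 0.
x = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 0.5, size=n)

# Simulated LLM labels: about 90% accurate, with errors that depend on x.
flip_prob = np.where(x == 1, 0.2 * (y == 0), 0.2 * (y == 1))
y_llm = np.where(rng.random(n) < flip_prob, 1 - y, y)


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


# Random validation subsample: the only rows where we pretend the
# expensive human label y is observed. y is never used outside `val`.
val = rng.choice(n, size=n_val, replace=False)

# Step 1: cheap plug-in estimate using LLM labels on everything.
plug_in = ols_slope(x, y_llm)

# Step 2: on the validation rows, regress the labeling error on x.
correction = ols_slope(x[val], y[val] - y_llm[val])

# Step 3: add the estimated error slope back to the plug-in estimate.
debiased = plug_in + correction

print(f"plug-in slope:  {plug_in:.3f}")
print(f"correction:     {correction:.3f}")
print(f"debiased slope: {debiased:.3f}")
```

In this setup the plug-in slope sits near 0.20 while the debiased slope sits near the true value of zero, using human labels on only 5,000 of the 200,000 rows. The price is extra noise from the small validation sample, and the correction only works because that sample is random.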

There are some things in this that I feel like I hear echoes of, but I don't yet know enough to be sure that's what I'm hearing. It does sound like the kind of sample-splitting econometrics I associate with the team that gave us double/debiased machine learning. And I also sort of feel like I hear echoes of the Abadie and Imbens (2011) bias correction method for matching, as well as the Ben-Michael, et al. paper on augmented synthetic control. But like I said, that could just be echoes, and as I learn more, I'll try to figure out those connections a bit better and determine if they're useful for my brain.

(C) 2026 scott cunningham  