Claude Code 45: AI Agents and the Minimum Wage

# Claude Code 45: AI Agents and the Minimum Wage

Scott Cunningham, Apr 29

Today I head to Georgetown where I am going to speak at the McCourt Policy School's faculty retreat about AI agents. I'm very excited about going. I have not yet fully finished my slides, but I will be talking about, among other things, a new paper of mine that is currently R&R. It's about AI agents and the minimum wage, and I thought I'd share a little of what the paper is about and what I've learned. 

By now readers know that we are living in the future, by which I mean we are living in a moment when large language models can do all parts of a modern program evaluation paper. Each research design in the modern causal inference toolkit operates like a genre, with its own beats, characters, clues and exhibits, as well as a generalizable style of rhetoric, which does make me sometimes wonder where the variation will appear when all the dust has settled. It's not just quality and it's not just accuracy, and it may not even be the kinds of methods used. Some things seem to be converging in the Social Catalyst Lab -- the agents run a lot of diff-in-diffs, it appears, and when I checked last, conditional on that, they run a lot of Callaway and Sant'Anna. They use publicly available data. They write cautiously and circumspectly about their findings. They seem to make efforts to verify mechanisms and are honest about what they find. They write replicable code in pipelines that can almost immediately be shipped to the journals. I think it is not controversial to say that the overall production process appears right of center in terms of the distribution of human researchers. And even what I found in my experiments doesn't strike me as out of the ordinary at all. But because they come from the same source, and at large scale, there is a lot you can discern, particularly when the agents are forced to do the same thing hundreds of times.

My study context is the minimum wage. I chose the minimum wage because it is a strange literature: so much ink has been spilled, for decades, and yet there is no real consensus. Consider, for instance, this old 2015 survey of experts at the University of Chicago, who were asked what to expect from a minimum wage increase. 26% agreed that a gradual increase in the federal minimum wage would reduce the employment rate, 24% disagreed, and 38% were uncertain.

It's not a question about the theory. The theory is boilerplate, and I don't mean econ 101 theory. I mean standard production theory is fairly straightforward on this. If you work with the cost function, you can use Shephard's Lemma to back out conditional labor demand functions. Since the cost function is concave in factor prices, its second derivative with respect to the wage, which is _dL/dw_ , cannot be positive, and it is strictly negative whenever the firm can substitute between inputs. You can find the calculus and algebra for this in my old grad micro notes from when I taught it at Baylor if you scroll to slide 369. And if you work from the profit max condition, you can use Hotelling's Lemma, and interestingly, _dL/dw_ is even _more negative_ because you get scale effects on top of the substitution effects. You can find that derivation concluding on slide 420 in my old lecture notes if you want to work through that.

And importantly, as an aside, both of these results are unambiguous. This is because, unlike consumer theory, there is no Giffen behavior with input demand.
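For readers who want the algebra without opening the slides, here is a compact sketch of the two results. The notation is my own shorthand, not copied from the notes: $C(w,r,y)$ is the cost function, $\pi(p,w,r)$ the profit function, $L^c$ conditional labor demand, $L$ unconditional labor demand, and $y^*$ the profit-maximizing output. It assumes price-taking firms and enough smoothness to differentiate twice.

```latex
% Shephard's Lemma: conditional labor demand is the wage derivative of cost.
% Concavity of C in factor prices signs the own-price effect.
L^{c}(w,r,y) = \frac{\partial C(w,r,y)}{\partial w},
\qquad
\frac{\partial L^{c}}{\partial w} = \frac{\partial^{2} C}{\partial w^{2}} \le 0 .

% Hotelling's Lemma: unconditional labor demand is minus the wage derivative of profit.
% Convexity of \pi in prices signs the own-price effect.
L(p,w,r) = -\frac{\partial \pi(p,w,r)}{\partial w},
\qquad
\frac{\partial L}{\partial w} = -\frac{\partial^{2} \pi}{\partial w^{2}} \le 0 .

% Linking the two: L(p,w,r) = L^{c}(w,r,y^{*}(p,w,r)), so
\frac{\partial L}{\partial w}
  = \underbrace{\frac{\partial L^{c}}{\partial w}}_{\text{substitution effect}}
  + \underbrace{\frac{\partial L^{c}}{\partial y}\,\frac{\partial y^{*}}{\partial w}}_{\text{scale effect}} .
```

Under the usual regularity conditions the scale term is also non-positive, which is the sense in which the unconditional response is at least as negative as the conditional one.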

But those notes cover standard producer theory that takes wages, capital prices and output prices as exogenous. Prices are only exogenous when the firm is a price taker, not a price maker, which means we are talking about firms in competitive markets without market power. But once we allow for market power in labor markets -- monopsony -- then increases in wages (i.e., binding minimum wage floors) can lead to non-negative employment effects, including positive effects. Alan Manning in his important work built on the earlier monopsony models by Joan Robinson to make monopsony more general -- search costs, and other elements, could generate similar if not the same types of ambiguities.

Which means that the minimum wage is not strictly a theoretical phenomenon. It is also, and maybe especially for policy-making purposes, an empirical phenomenon. The point I'm getting at is that there is not a single causal effect of the minimum wage on employment, even within the science itself. Rather there is a _family_ of average causal effects. There are, to put it a different way, many _causal_ population estimands.

* * *

### What exactly is a _causal_ population estimand?

An estimand is a calculation that you could run if you had all of the data, as opposed to merely a sample of the data. An estimand need not be causal, either. If you had all the data, you could take two means -- the average earnings of workers with a college degree and the average earnings of workers without a college degree -- and take their difference. The population simple difference in mean outcomes, which can be calculated by regressing earnings onto a college dummy in this example, _is_ an estimand. It just is not necessarily a _causal_ estimand, as with only a few lines of algebra, substitutions and rearranging, you can decompose the simple difference in mean outcomes into three terms:

  * the average treatment effect plus

  * selection bias plus

  * heterogeneous treatment effects bias




And ironically, each of those is _also_ a population estimand, because if you had those data -- which you do not and never will, as the movement from observed values to potential outcomes creates missing data problems -- then you could also calculate them.
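Written out, the three-term decomposition looks like the following. The symbols are my shorthand for this post: $D$ is the college dummy, $Y(1)$ and $Y(0)$ are potential outcomes, $\pi = \Pr(D=1)$ is the treated share, and ATT and ATU are the average effects for the treated and untreated.

```latex
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{simple difference in mean outcomes}}
  = \underbrace{E[Y(1) - Y(0)]}_{\text{ATE}}
  + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
  + \underbrace{(1-\pi)\,(\mathrm{ATT} - \mathrm{ATU})}_{\text{heterogeneous treatment effects bias}}

% Check: ATE = \pi\,ATT + (1-\pi)\,ATU, so ATE + (1-\pi)(ATT - ATU) = ATT,
% and ATT + selection bias = E[Y(1) | D=1] - E[Y(0) | D=0].
```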

So what exactly is a _causal_ estimand? Well, estimands are the parameters we could describe if we had all of the data. Estimands are not random, they have no distribution, they are constant. And just as the simple difference in mean outcomes is a population estimand, those three terms I just listed -- ATE + selection bias + heterogeneous treatment effects bias -- are also estimands. It's just that one of those is _causal_ and two of them are just comparisons in means for the identical units based on counterfactuals and observed values.

What this means for _causal_ estimands is that to obtain measures of them, you cannot merely make measurements in the population. You can always measure the simple difference in mean outcomes, which is why I am calling that a non-causal estimand. But you can only _identify_ (not measure, but rather, _identify_) the causal ones. And identification is not a calculation. Rather, identification is when you must make assumptions. Assumptions like that the treatment (college in this case) is assigned to the workers in the population independently of both potential outcomes, Y(1) and Y(0). And when that is true -- which even in the population it need not be, and almost certainly is not outside the narrow case of randomized assignment, which has happened in modern education only in very limited settings -- then E[Y(1)|D=1] = E[Y(1)|D=0] and E[Y(0)|D=1] = E[Y(0)|D=0], and both selection bias and heterogeneous treatment effects bias vanish, equalling zero in the population, and the simple difference in mean outcomes collapses to the average treatment effect.

Thus even in the population there are two interpretations of the same measurement. If the treatment of college is the result of a completely randomized experiment, then the population estimand is the causal estimand, but if people are sorting into college based on expected returns to college (i.e., causal effects), then the population estimand is not causal. 
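Here is a small simulation sketch of that point, treating a big simulated dataset as "the population". Everything in it is made up for illustration: the distributions, the sorting rule, and the function names `simulate` and `decompose` are my assumptions, not anything from the paper. It just checks that the simple difference in mean outcomes equals ATE plus the two bias terms, and that the bias terms are near zero under random assignment but not under sorting on gains.

```python
import random

def simulate(n=50_000, sort_on_gains=True, seed=0):
    """Simulated 'population' of (y0, y1, d) rows. All numbers are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y0 = rng.gauss(10.0, 2.0)      # earnings without college
        gain = rng.gauss(1.0, 1.0)     # individual treatment effect
        y1 = y0 + gain                 # earnings with college
        if sort_on_gains:
            # people with higher gains (and higher y0) are likelier to attend
            d = int(gain + 0.5 * (y0 - 10.0) + rng.gauss(0.0, 1.0) > 1.0)
        else:
            # completely randomized assignment
            d = int(rng.random() < 0.5)
        rows.append((y0, y1, d))
    return rows

def mean(values):
    return sum(values) / len(values)

def decompose(rows):
    """Split the simple difference in outcomes (SDO) into ATE + two bias terms."""
    treated = [r for r in rows if r[2] == 1]
    control = [r for r in rows if r[2] == 0]
    share_treated = len(treated) / len(rows)
    att = mean([y1 - y0 for y0, y1, _ in treated])
    atu = mean([y1 - y0 for y0, y1, _ in control])
    return {
        "sdo": mean([y1 for _, y1, _ in treated]) - mean([y0 for y0, _, _ in control]),
        "ate": mean([y1 - y0 for y0, y1, _ in rows]),
        "selection_bias": mean([y0 for y0, _, _ in treated]) - mean([y0 for y0, _, _ in control]),
        "hte_bias": (1.0 - share_treated) * (att - atu),
    }

randomized = decompose(simulate(sort_on_gains=False))
sorted_in = decompose(simulate(sort_on_gains=True))
for label, result in [("randomized", randomized), ("sorting", sorted_in)]:
    print(label, {k: round(v, 3) for k, v in result.items()})
```

Same measurement, two interpretations: in the randomized run the simple difference is essentially the ATE, and in the sorting run it is not.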

So that's the first thing. Ten researchers can study the minimum wage and find ten different things. Sometimes those bias terms are contaminating the measurement and sometimes they aren't. When they aren't, you might get measurements closer to what we learn from envelope theorem based results (based on competitive markets, remember), and sometimes not (based on market concentration in labor markets, remember). And when the assumptions are not enough to eliminate those bias terms, the bias remains.

Which means that ten researchers can find ten things, even in the population. And that is putting aside what happens in samples, which is a whole other wrinkle, as then things like sampling distributions can give results that are "true on average" but where any one draw from the population is sensitive to which null we are specifying, whether we can reject it, at what alpha (e.g., 5%), and at what power level (e.g., 80%).

Which is to say, _it's complicated_.

* * *

### What do AI agents have to do with this?

So in my experiment, what I did was I collected data, gave it to 300 agents, gave them an estimator and some other literature, and told them to use their discretion to estimate causal effects of the minimum wage on employment. I asked Claude to read the repos where this work was done and tell you what the agents were given. Here's the answer:

**The panel given to agents was a merge of three datasets.** First, **IPUMS CPS Basic Monthly microdata** (`cps_00025.dat`, extract #25 from IPUMS at cps.ipums.org) -- a 10GB fixed-width file covering 1990-2025 with roughly 50 states × 35 years of individual labor-force records. You aggregated it to state × year × demographic cells (age bins, education bins, sex) capturing employment, labor force, and unemployment weighted counts. Second, **BLS Quarterly Census of Employment and Wages (QCEW)**, downloaded as `qcew_state_annual_combined.csv` from BLS (bls.gov/cew) -- state × year counts of establishments, employment levels, weekly wages, and annual pay across industries (food services, retail, manufacturing, healthcare, etc.). Third, **Ben Zipperer's state minimum wage series** (`mw_state_annual.csv`, from the Economic Policy Institute at epi.org/minimum-wage-tracker or Zipperer's own GitHub, covering 1974-2022) -- state × year nominal minimum wages, from which you derived the effective binding wage as `max(state_mw, fed_mw)` plus change indicators.

**The three were merged into a single** `agent_panel.csv` using CPS as the spine (defining the state × year universe), left-joining QCEW and minimum wage data onto it. The outcome variable agents were handed was labor market outcomes -- teen employment rates, employment-to-population ratios, etc. -- constructed from the CPS cells, with the Zipperer effective minimum wage as the treatment variable and QCEW industry employment/wages as potential controls. No single URL is embedded in the code for QCEW or IPUMS (those are behind download portals), but the Zipperer attribution is explicit in the script header: "Zipperer data, 1974-2022."
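To make that merge concrete, here is a minimal pandas sketch of what Claude described, not the actual build script. The toy rows and every column name other than `state_mw` and `fed_mw` (which appear in the `max(state_mw, fed_mw)` rule above) are hypothetical, as is the helper name `build_agent_panel`. The numbers are made up so that the federal floor binds in one state and not the other.

```python
import pandas as pd

# Toy stand-ins for the three inputs; all values are hypothetical.
cps = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "year": [2008, 2009, 2008, 2009],
    "teen_emp_rate": [0.30, 0.27, 0.33, 0.31],
})
qcew = pd.DataFrame({
    "state": ["CA", "CA", "TX"],          # TX 2008 missing on purpose
    "year": [2008, 2009, 2009],
    "food_services_emp": [1200.0, 1150.0, 900.0],
})
mw = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "year": [2008, 2009, 2008, 2009],
    "state_mw": [8.00, 8.00, 5.15, 5.15],
    "fed_mw": [6.55, 7.25, 6.55, 7.25],
})

def build_agent_panel(cps, qcew, mw):
    """CPS is the spine; QCEW and minimum wage data are left-joined onto it."""
    keys = ["state", "year"]
    panel = cps.merge(qcew, on=keys, how="left", validate="one_to_one")
    panel = panel.merge(mw, on=keys, how="left", validate="one_to_one")
    # effective binding wage: the higher of the state and federal minimum
    panel["effective_mw"] = panel[["state_mw", "fed_mw"]].max(axis=1)
    # change indicator: did the effective minimum rise relative to last year?
    panel = panel.sort_values(keys).reset_index(drop=True)
    change = panel.groupby("state")["effective_mw"].diff()
    panel["mw_increase"] = (change > 0).astype(int)
    return panel

panel = build_agent_panel(cps, qcew, mw)
print(panel)
```

Using a left join on the CPS spine means a state-year with no QCEW record stays in the panel with a missing control rather than silently dropping out.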

I did the experiment in waves. In Wave 1, 150 agents were told to estimate Callaway and Sant'Anna difference-in-differences estimators of any employment outcome I had given them and any minimum wage increase. But within this wave, I split the agents into three groups.

  1. **Group 1 (Placebo group).** Agents were given our JEL paper, "Difference-in-Differences: A Practitioner's Guide" to read (Baker, et al. 2026). Or rather a summary of it in markdown outlining the ATT, the assumptions (e.g., parallel trends), the properties of various estimators and their related calculations, and importantly, the dangers of OLS with two-way fixed effects (i.e., negative weighting, forbidden comparisons). Fifty agents are in this group.

  2. **Group 2 (Negative Effects)**. The second group is also given that same markdown of the JEL, but they are then given what I call a negative prime summarizing the minimum wage literature. 




  3. **Group 3 (Null Effects)**. Like groups 1 and 2, the third group is given a markdown summary of our JEL, but they are then primed with a different summary of the literature which I call null-effects prime. 




Both primes are the same number of words listing exactly four representative papers supporting that statement, and all three are given the JEL, and importantly, all three are told explicitly to _only_ use Callaway and Sant'Anna for estimation. And this is important for several reasons. 

First, Callaway and Sant'Anna can only use binary indicators for treatment. Minimum wages are multi-valued, which means the agents can only estimate causal effects (here the ATT) using a binary treatment, not continuous measurements. This is a subtle constraint placed on the agents: while the Zipperer data contains minimum wage levels, the agents cannot use them directly in estimation. They are only able to estimate the ATT, and they must collapse different minimum wage increases into a single indicator that equals one if the minimum wage increased and zero if it did not, regardless of the size of that increase. This does introduce a SUTVA violation in that the treatment indicator does not necessarily mean the same thing for all units. SUTVA, in Imbens and Rubin's 2015 book, is not merely the stability of the potential outcomes themselves, but it is also "no hidden variation in treatment". If you and I both have a minimum wage binary indicator equalling one, technically it means both of us saw _the same_ minimum wage increase. If it was an increase of a dollar for you, it was a dollar for me. It also means we started from the same baseline. But if you saw an increase of a dollar fifty, and I saw an increase of a dollar, then technically it is not the same treatment, and therefore a violation of SUTVA. But researchers usually do combine treatments, and so it is not a flaw per se of an estimator, but it will change the interpretation as well as what is being summed over.
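A tiny sketch of what that collapsing does, under one assumed coding rule: a state is "treated" from the first year its minimum wage rises inside the panel window. The rule, the helper names, and the dollar amounts are all hypothetical; agents could have coded it differently.

```python
def first_increase_year(mw_by_year):
    """First year the minimum wage rises, or None if it never does in the window."""
    years = sorted(mw_by_year)
    for previous, current in zip(years, years[1:]):
        if mw_by_year[current] > mw_by_year[previous]:
            return current
    return None

def binary_treatment(mw_by_year):
    """Absorbing 0/1 indicator: one from the first increase onward."""
    cohort = first_increase_year(mw_by_year)
    return {
        year: int(cohort is not None and year >= cohort)
        for year in sorted(mw_by_year)
    }

# Hypothetical states: same baseline, different-sized increases, one never treated.
state_a = {2010: 7.25, 2011: 8.25, 2012: 8.25}   # +$1.00 in 2011
state_b = {2010: 7.25, 2011: 8.75, 2012: 8.75}   # +$1.50 in 2011
state_c = {2010: 7.25, 2011: 7.25, 2012: 7.25}   # never treated

print(binary_treatment(state_a))
print(binary_treatment(state_b))
print(binary_treatment(state_c))
```

States A and B end up with identical treatment paths even though one got a dollar and the other a dollar fifty, which is exactly the hidden variation in treatment described above.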

Second, Callaway and Sant'Anna calculates 2x2s -- one for each cohort (the units first treated in the same year) and for each period you want to follow that cohort in your event study. So if there are 2 cohorts -- group 1 and 2 -- and group 1 is treated in year 3 of a 10 year dataset, there are 9 2x2s. And if group 2 is treated in year 7, there are also 9 2x2s. And thus technically there are 18 2x2s, which can then be aggregated, using weights proportional to the sample shares, into simple averages, group averages, calendar date averages, event study averages, or even weirder averages than that if you wanted.
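The counting in that example can be sketched in a few lines. This is just bookkeeping, not an estimator, and the base-period rule is an assumption on my part: pre-treatment cells compare year t to year t-1, and post-treatment cells compare year t to the last pre-treatment year g-1. Software implementations offer more than one convention, so check your package's documentation.

```python
def enumerate_2x2s(cohorts, n_periods):
    """List (cohort, period, base_period) cells, one 2x2 per cohort-period pair.

    Assumed convention: period 1 has no earlier year to compare against,
    so each cohort gets n_periods - 1 cells.
    """
    cells = []
    for g in cohorts:
        for t in range(2, n_periods + 1):
            base = g - 1 if t >= g else t - 1
            cells.append((g, t, base))
    return cells

cells = enumerate_2x2s(cohorts=[3, 7], n_periods=10)
post_cells = [(g, t, base) for g, t, base in cells if t >= g]

print(len(cells))        # total 2x2s across both cohorts
print(len(post_cells))   # how many of those are post-treatment
```

With cohorts treated in years 3 and 7 of a 10 year panel, this gives the 9 + 9 = 18 cells from the text, of which the post-treatment ones are what typically get averaged into an overall ATT.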

But with one important caveat. Callaway and Sant'Anna can only do this if in that particular 2x2 there is an _untreated_ comparison group. If there is no untreated comparison group in that particular 2x2, then Callaway and Sant'Anna will "refuse" the calculation. How it goes about that will differ based on the language and package employed, but putting that aside, the actual econometric estimator requires an untreated comparison group, either not-yet-treated units (treated later in the panel dataset but not at that particular point in time where the 2x2 is calculated) or the never-treated (a group of units who are never treated even at the very end of the panel).

Let me be more blunt. By limiting it to Callaway and Sant'Anna, it _forces_ the agents into fewer experiments than two-way fixed effects with OLS. And that is because of the federal minimum wage increases that have happened periodically in the Zipperer dataset. A federal minimum wage increase binds all states. If a state is already treated with a minimum wage above the new federal floor, then it is treated and thus cannot be used as a control group when the estimator is Callaway and Sant'Anna. And if it is not, but then becomes treated by the federal minimum wage increase (meaning its baseline minimum wage had been lower than the new one), then it becomes treated. At which point, either way, there is no "untreated comparison group", and thus CS will not attempt the calculation. This means that Callaway and Sant'Anna _cannot_ span the federal minimum wage hikes when constructing its panels because it must leave enough data for there to be untreated comparison groups. Callaway and Sant'Anna forces agents into experiments _between_ the federal wage increases, but not across them.
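Here is a toy illustration of why the comparison group disappears. It assumes a simple definition of "untreated at year t": the state's effective minimum wage has not risen since the start of the panel window. That definition, the helper name, and all the dollar figures are hypothetical, chosen only so that a federal hike lands in the middle of the window.

```python
def untreated_states(effective_mw, start, year):
    """States whose effective minimum wage never rose between start and year."""
    clean = []
    for state, series in effective_mw.items():
        path = [series[y] for y in range(start, year + 1)]
        if all(later <= earlier for earlier, later in zip(path, path[1:])):
            clean.append(state)
    return sorted(clean)

# Hypothetical effective minimum wages = max(state minimum, federal minimum).
# State A raises its own minimum in 2006; a federal hike arrives in 2007.
effective_mw = {
    "A": {2005: 5.15, 2006: 7.00, 2007: 7.00, 2008: 7.00},
    "B": {2005: 5.15, 2006: 5.15, 2007: 5.85, 2008: 6.55},
    "C": {2005: 5.15, 2006: 5.15, 2007: 5.85, 2008: 6.55},
}

for year in range(2005, 2009):
    print(year, untreated_states(effective_mw, start=2005, year=year))
```

Before the federal hike, B and C can serve as comparisons for A. Once the federal floor moves, nobody is left untreated, so a panel that crosses the hike has no clean 2x2s to compute from that point on.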

But twoway fixed effects with OLS does not have to play by those rules, because OLS does not need an untreated comparison for its calculations. In fact, Goodman-Bacon in his celebrated 2021 article showed that two-way fixed effects with OLS is a weighted sum of four different types of 2x2s, one of which is based on forbidden comparisons where the control group is already treated. Which means two-way fixed effects _can_ span the federal minimum wage eras, and thus agents using it _could_ have longer panels.

But none of this matters for Wave 1, as in Wave 1 agents _could not_ use twoway fixed effects, or rather were told not to. All three arms, all 150 agents, were told to _only_ use Callaway and Sant'Anna, and were given the same covariates, the same minimum wage database, and multiple outcomes.

* * *

### Results of Wave 1 experiment

So, 150 agents ran 150 Callaway and Sant'Anna. This starts out like a bad econometrics joke (or rather, a great econometrics joke depending on your tastes). What did I find? I found that the distribution of ATT estimates was basically the same across the three arms. Agents targeted many different causal estimands, though, as the causal estimand, recall, is a simple _summary average_ of treatment effects for a given period (panel start and stop dates) and treated units in those years (states). And since these need not be the same, the ATTs estimated have a distribution. And the distribution did not differ across the three treatment arms.

All 150 used Callaway and Sant'Anna as instructed, 97% used teen employment as their outcome, and interestingly, exactly none of them used covariates thus thinking unconditional parallel trends was a reasonable assumption. 

But the panels differed, and thus the ATT estimates differed too. Notice that the negative context had a lower mean effect than either the null-effect or placebo group, which was driven mainly by the negative-primed agents estimating more effects in the post 2009 era, which we know from Clemens and Strain's work, ironically, had larger minimum wage hikes and more identifiably negative effects. I wrote about that here back in the day if you want to learn more about it.

* * *

### Wave 2 -- let them eat cake [twfe]

So the results of wave 1 are best summarized this way: when I tightly constrained their behavior, allowing only narrowly defined discretion over the panel start and stop dates (and therefore over which experiments were under consideration), the agents had a distribution of estimands they targeted, and a distribution of ATT estimates. Nothing about that is "wrong", per se. A different experiment gives you a different estimate of a different causal estimand, full stop. And nothing about that requires the answers to be the same.

But then I did a second experiment. And in the second experiment, I made one seemingly tiny little change to the JEL markdown that all three groups read. This time, rather than explicitly forbidding the agents from using any other estimator than Callaway and Sant'Anna, I told them they could choose between Callaway and Sant'Anna, BJS (Borusyak, Jaravel and Spiess) and two-way fixed effects. Both Callaway and Sant'Anna and BJS identify the ATT without making forbidden comparisons, both use binary indicators, and both therefore are constrained to operate between the federal minimum wage increase eras. But twoway fixed effects, as I said, does not face such constraints. Twoway fixed effects with OLS can use always treated as well as earlier treated groups as comparisons, thus making forbidden comparisons and introducing negative weights. And, interestingly, twoway fixed effects does not require a binary indicator; you can regress a variable on a variable with OLS, and it need not be binary.

So what did I find? Things shifted is what I found. And they only shifted for one of the groups -- the negatively primed group.

First, the negatively primed group interestingly bolts for twoway fixed effects. To facilitate my comparisons, I will mostly focus on comparing the negatively primed group of 50 agents from Wave 1 to Wave 2, but let me first show you the shift to twoway fixed effects that is only happening for the negatively primed agents.

So that's the first thing. The negatively-primed group heads to twoway fixed effects at a rate of about 24 percentage points more than the others. And while you might think "isn't that going to happen, though, since the negative priming was a negative priming of four papers, all of which have twoway fixed effects estimators", I would say to you that the four papers in the null-effects prime _also_ did. The entire history of the minimum wage until recently used vanilla fixed effects regression models. There is no unique twoway fixed effects bias in the negatively primed group in the history of the minimum wage literature because that literature is very old, it has been empirical for a very long time, it was a centerpiece of the credibility revolution (e.g., Card and Krueger 1994), and thus it was program evaluation very often. It took an agnostic approach as opposed to theory-driven estimation, using design, quasi-experiments, and importantly, _regressions_ , and very often _staggered adoption_ either way. Just peruse various literature reviews and count the regressions and you'll see that researchers usually used straightforward state and city level panel data estimated with fixed effects regression models.

So then why does the negatively primed group bolt at +24pp over the null and placebo group, and so what if they do?

Well I do not know the _why_. What I do know, though, is the _so what?_

In the Wave 2 experiment, the negatively primed agents find on average _more negative_ estimated ATTs than the other two groups. But why is that? Is it because of the negative weighting from twoway fixed effects? Ironically, it does not appear to be because of that. At least, that is not the real story. The real story appears to be that the negatively primed agents are using **longer panels** that span the federal minimum wage increases **and** they quietly switched out the binary indicators for continuous ones.

First, consider the distribution of estimates from wave 1 to wave 2. These are the empirical CDFs behind simple KS tests. You can see in the first that the max vertical distance between all three distributions is more or less the same. The p-values are extremely large too. But on the right, you can see that the empirical CDF for the red group, which is the negatively primed group, has shifted left with more mass concentrated among negative estimates of the ATT.
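If the KS test is unfamiliar, the statistic is just that max vertical distance between two empirical CDFs. Below is a minimal sketch of the statistic only, with made-up estimates standing in for the agents' results (the three lists are hypothetical, not my data). For an actual test with a p-value you would reach for something like `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Share of the sample at or below x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    grid = sorted(set(a_sorted) | set(b_sorted))
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in grid)

# Hypothetical ATT estimates, for illustration only.
placebo = [-0.020, -0.010, 0.000, 0.010, 0.020]
negative_wave1 = [-0.022, -0.012, 0.001, 0.009, 0.018]    # nearly the same distribution
negative_wave2 = [-0.080, -0.060, -0.050, -0.030, -0.010]  # shifted left

gap_wave1 = ks_statistic(placebo, negative_wave1)
gap_wave2 = ks_statistic(placebo, negative_wave2)
print(gap_wave1, gap_wave2)
```

A small gap is what "the distributions are basically the same" looks like, and a large gap is what a leftward shift looks like.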

But that's actually not labeled well. Because that labeling says "Reported **ATT** estimate", which is not quite right. Or rather, it is _not_ right according to Callaway, Goodman-Bacon and Sant'Anna in their forthcoming AER on continuous treatment difference-in-differences. The causal parameter when treatments are continuous in a diff-in-diff setting is _not_ the ATT. Or rather, it might be the ATT, but it is not the ATT that pops out of a regression of employment onto a continuous minimum wage measure. It is a weirdly weighted average, where the weights are both negative and positive depending on where a state's minimum wage is compared to the average minimum wage in the sample. And the negatively primed group is switching out the binary indicator for continuous ones. Over **two-thirds** of the negatively primed group is using continuous measures of the minimum wage whereas exactly zero of the other groups do. On the left is the distribution of wave 1 negatively primed agents. On the right are the wave 2 negatively primed agents using twoway fixed effects. Only the first four are binary; the rest are continuous.
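You can see the mechanical root of those signed weights in the simplest possible case. This sketch is a bivariate OLS slope with no fixed effects, which is a deliberate simplification and not the decomposition in the continuous diff-in-diff paper. It only shows that an OLS slope is a weighted sum of outcomes where each unit's weight is its distance from the mean of the regressor. The data and the helper name are hypothetical.

```python
def ols_slope_weights(x):
    """Weights w_i such that the bivariate OLS slope equals sum(w_i * y_i)."""
    x_bar = sum(x) / len(x)
    sum_sq = sum((xi - x_bar) ** 2 for xi in x)
    return [(xi - x_bar) / sum_sq for xi in x]

# Hypothetical state minimum wages and teen employment rates.
min_wage = [7.25, 7.25, 8.00, 9.00, 10.00]
teen_emp = [0.30, 0.31, 0.29, 0.27, 0.26]

weights = ols_slope_weights(min_wage)
slope = sum(w * y for w, y in zip(weights, teen_emp))

for mw_level, w in zip(min_wage, weights):
    print(f"min wage {mw_level:5.2f} -> weight {w:+.3f}")
print("slope:", round(slope, 4))
```

States below the sample mean minimum wage enter with negative weights and states above it with positive weights, and the weights sum to zero. That is fine for a regression slope, but it is not how you would build an average treatment effect on the treated.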

But it does not stop there. The negatively primed group is also lengthening the panels, enabling them to span the federal minimum wage increase eras. The mean panel length in wave 1 for the negatively primed Callaway and Sant'Anna agents was 17.1 years, but in wave 2, for the negatively primed twoway fixed effects agents, it is 21.6 years. And only 3 of the 49 wave 1 estimates (I dropped one major outlier due to the small sample and not wanting one unit to have so much influence on my presentation of means and distributions) were statistically significant, but almost half of them were in the second wave.

And furthermore, if you compare the twoway fixed effects estimates with the CS estimates for the same panels, you actually get almost the same estimate, which is because of the large size of the never-treated comparison group and the effect of the shorter panels on the size of those forbidden comparisons too.

But when you take the mean estimate from the binary and the continuous groups and divide by the standard deviation, interestingly, you get a type of "non-standard" t-statistic that is borderline significant in the continuous case, but not in the binary case.

* * *

### The rhetoric of their interpretations of their own work

Ever since ChatGPT-4o came out, I seem to have become obsessed -- borderline obsessed anyway, as much as you can be -- with how language models talk. I am interested in them telling stories, tapping into various literatures, how soothing and encouraging they are, how well they listen, and so forth. I am interested in even how they attempt to persuade in the decks they make. I am just very interested, because of my literature background as a college major, in _rhetoric_ , the art, philosophy and science of persuasion. And language models engage in high rhetoric, and I wanted to understand it better.

So, after they did their estimates, I asked them to explain their decisions and their interpretation of their results. I then sent that text to gpt-4o-mini in a zero-shot analysis of the text along a variety of dimensions, one of which was a scale measure from -1 (confident the effect was negative) to +1 (confident it was positive). The negatively primed agents write up their results not just as more negative, but more confidently. They are far more certain the minimum wage is reducing employment than either of the other groups. Here is an example of what I mean.

Interestingly, this is not just because they find more negative results too. In Wave 1, the negatively primed agents _also_ wrote more confidently that the effects were negative even though the distributions were the same.

And this persisted into Wave 2. Even for those agents who stuck with Callaway and Sant'Anna, their reports were more confident that the effects were negative. But when they switched to twoway fixed effects, they were even more confident that the effects were negative.

Negatively primed agents are more confident that the effects are negative even though the distribution of results is the same for their CS estimates.

* * *

### How do you kill that which has no life?

I think there are a few things going on. First, it's interesting that the JEL markdown summary I gave all 300 agents explicitly warned about the dangers of twoway fixed effects, and yet that was not enough to stop them. So that is something I think we need to pay attention to -- that without strong constraints on the behavior of the agent, discretionary decisions can lead to ignoring that type of thing, for whatever reason. 

Secondly, for whatever reason, the prompting of the human researcher, which may honestly be unconscious, can induce agents to take actions wherever there is discretion, and it may not remotely be because the human researcher sought to do it. Keep this in mind -- what makes AI agents different from traditional software is that you talk to them. Even agents are chatbots that you talk to. Now this varies according to how interactive you actually are with the chatbot aspects of the agents and I am no doubt one of the more extreme cases of someone who talks _extensively_ to chatbots, even the AI agents, as I reason with them as thinking partners in tackling thorny empirical challenges in my work. And that is idiosyncratic. Not everyone does. Not everyone is remotely comfortable, even, _talking_ to a non-sentient piece of software like I am, but I am. I am practically a centaur at this point -- half man, half AI -- given how extensive and deep my back and forth is with agents. But not everyone is, and I bet the fact that I am **filling** the context window with all kinds of stuff is absolutely opening the door to who knows what types of pushes and pulls on those agents. 

This isn't p-hacking, and it's probably not even the kind of researcher degrees of freedom being documented by people like Nick Huntington-Klein in the many-analyst designs. Why? Because agents are researchers. They are autonomous AI agents whose behavior is barely if at all understood. But they are producing, start to finish, entire empirical manuscripts summarizing their own autonomously generated research projects. These aren't "hallucinated papers". These are real papers with real data, real code, real findings, real interpretations, real robustness checks, real estimators, real paragraphs, real rhetoric. All of it is "real" even though the authors are "not real". It's a weird time to be alive. I am reminded of this classic South Park.

This isn't p-hacking. This is something else. This is the researcher just _barely_ taking their hands off the steering wheel. Just barely. And just barely muttering a few things, barely putting a few papers into the repository, barely interpreting that literature, barely whispering. And just this alone introduces variation. And it even introduces variation in the selection of estimators. Some estimators put no constraints on which 2x2s to calculate, because they are perfectly content to use the always treated units created by federal minimum wage increases, where other estimators cannot do that and therefore won't do that. And some estimators can use continuous treatments where others cannot.

All of which does what exactly? Changes the population estimand. That's one interpretation of it. See, when I compare the CS to TWFE estimates for the negatively primed agents, that is not itself driving the shifting ATT estimates in the negatively primed group. It's something else. It's the _panel length_ that TWFE accommodates in contexts with the federal minimum wage hike that CS _cannot accommodate_. And it is the quiet choice of replacing binary indicators with continuous ones, which TWFE can do, and CS cannot.

All of these relate back to an undefined population estimand. Why? Because a population estimand is a simple summary of individual treatment effects _for a given population at a given point in time_. That's it. That's what they are. Different periods, different summaries. Different units, different summaries. Different units in different periods, different summaries. Different treatment values, different summaries. And of course, _different weights_.
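To make that concrete, here is a minimal sketch with made-up numbers. The individual effects, the unit labels, and the weights are all hypothetical, and in real data the individual effects are never observed. The point is only that the same underlying effects produce a different "answer" every time the units, periods, or weights change.

```python
# Hypothetical individual treatment effects, indexed by (unit, period).
effects = {
    ("A", 1): -0.04, ("A", 2): -0.02,
    ("B", 1): 0.01, ("B", 2): 0.03,
    ("C", 1): 0.00, ("C", 2): -0.06,
}


def summarize(effects, units=None, periods=None, weights=None):
    """Weighted average of the effects for the chosen units and periods.

    weights maps unit -> weight; equal weights are used when it is None.
    """
    keys = [k for k in effects
            if (units is None or k[0] in units)
            and (periods is None or k[1] in periods)]
    w = [1.0 if weights is None else weights[k[0]] for k in keys]
    return sum(wi * effects[k] for wi, k in zip(w, keys)) / sum(w)


print(round(summarize(effects), 4))                      # -0.0133
print(round(summarize(effects, periods={1}), 4))         # -0.01
print(round(summarize(effects, units={"A", "B"}), 4))    # -0.005
print(round(summarize(effects, weights={"A": 3, "B": 1, "C": 1}), 4))  # -0.02
```

Four summaries, four numbers, and none of them is wrong. Each is the right answer to a different question.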

* * *

### How then shall we live?

Well, so what is the conclusion? Here's the basic conclusion. Don't take your hand off the wheel. The more the researcher takes his or her or their hand off the wheel, the more the agent will take over, and that includes targeting whatever population estimand it wants to, whatever "want" even means. The weird thing is that when I do it 300 times, I get 300 different population estimands being targeted.

Which is weird, but now we are going to get bit in the butt by our collective apathy towards defined target parameters, I think. We cannot continue to talk in terms of "the causal effect". There is not "the" anything. There are _summaries_ of individual treatment effects, and unless they are all the same, there is no one single population estimand, even for something like the minimum wage. There is nothing about the minimum wage that requires its effect to be uniquely in one direction, even with unambiguous predictions on the comparative statics of labor demand with respect to changing minimum wages, since those predictions are actually only unambiguous in the theoretically specific case of _perfectly competitive input markets_.

So that's the first thing. If you're going to do this stuff, you have to be clear about precisely what your target is. And if you let the agents make decisions on your behalf, you can end up with something you don't recognize.

Which means that we have to have verification. Production, as I and others have said, is no longer the bottleneck in research. Verification is the bottleneck. And here's the problem. Verification requires two things:

  1. **Human time**. You cannot verify that which you do not spend time verifying. And I think it's safe to say that if we wanted to spend the time on doing it, we wouldn't be using agents in the first place. I think a lot of us want to take a break. The absolute **last thing** in the world I want to do is go line by line through **someone else's code!** They don't code like me, and therefore I don't like it. I don't think I'm crazy for feeling that way. 

  2. **Skill and human capital**. And then there is the other kicker. You cannot determine if something is done correctly if you do not have human capital in that area, and you only get human capital from attention and time. 




I have been focusing on diff-in-diff in my experiments for a few reasons, but one of them is that I know that literature as well as any non-econometrician, I would dare say. I have had to teach week-long workshops on it dozens of times going back to at least 2018, globally even. CodeChella in Madrid is _exclusively_ about causal panel methods. In my new book, _Causal Inference: the Remix_, it is actually now _two chapters_ instead of one. Which is insane because basically I have a 250-page book on diff-in-diff inside a bigger 750-page book on causal inference. That's crazy. 

So why do I say that? I say it because I notice teeny tiny little details in the tables and outputs of diff-in-diff that I only notice because I have been waterboarding myself with diff-in-diff for eight years. I am so sick of diff-in-diff at this point, but it's deep in my bones. I have a love-hate relationship with it. I have a love-hate relationship with everything I have ever hyper-focused on. Everything I have hyper-focused on in my life has become something for which I end up recognizing the most seemingly inconsequential details, which can only be due to deep human capital in that particular area. You can read Stigler and Becker's classic 1977 article _De Gustibus Non Est Disputandum_ to sort of see more of what I'm talking about, but human capital accumulates in really _anything_ and _everything_ that you just sit down and focus on repeatedly, using attention and time.

Which leads to my last point, and that is the inherent moral hazard elements of AI agents on the human researcher. I believe that the production functions for cognitive output have shifted due to generative AI and agents. We have now for the first time in history _linear isoquants_. Straight lines, no curvature, which means machine time can substitute fully for human time. We can produce creative cognitive output using exclusively machine time. No human time is needed to write poetry. This poetry is most likely in the 95th percentile of all human poetry ever written. Why? Because 95% of all poetry written by humans is awful. So the bar is low. And as much as it pains me to say this, I suspect that the same is true for empirical economics. 

But here's the deal. If you need human capital to detect errors. And if human capital uses time and attention. And agents allow you to produce papers autonomously using no time, and therefore no attention. Then how can you verify? How can you reliably verify anything? How will you know? Think back to your early micro and macro theory classes. Recall that physical capital depreciates. 

**Human capital depreciates too.**

And therefore, if you reduce time, and you reduce attention, which I think is going to happen modally, what then will happen? 
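The textbook accumulation equation makes the worry concrete. This is a minimal sketch, assuming a constant depreciation rate and a constant flow of investment from time and attention; the function name and every number in it are made up for illustration.

```python
def human_capital_path(h0, depreciation, investment, periods):
    """Iterate H[t+1] = (1 - depreciation) * H[t] + investment.

    Hypothetical helper: investment stands in for time and attention.
    """
    path = [h0]
    for _ in range(periods):
        path.append((1 - depreciation) * path[-1] + investment)
    return path


# With no time or attention invested, the stock simply decays.
print(human_capital_path(100.0, 0.5, 0.0, 3))   # [100.0, 50.0, 25.0, 12.5]

# Enough investment to offset depreciation holds the stock steady.
print(human_capital_path(100.0, 0.5, 50.0, 3))  # [100.0, 100.0, 100.0, 100.0]
```

The depreciation rate of one half is exaggerated so the arithmetic is easy to follow. The direction is the point, not the speed.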

* * *

### I bet we punish researchers for their lack of attention even more than ever

Here's my guess. The gains from AI on scientific research are simply too large to ignore. It will be adopted. It will move fast. We will be shifting as a world towards AI generated research. The degree to which it happens is debatable, or rather empirical, but it will happen and it is happening. So that's the first thing.

The second thing is that ideas and science are crucial to economic growth and therefore the overall wellbeing of the human species and the welfare of this planet. We simply cannot ignore and cannot ban the use of AI technology in scientific discovery and innovation. The costs are too high. And it is not like the AI technology is replacing some perfect error-free technology anyway, because no one is more biased than humans, no one is more error-prone than humans. Even elite experts in the field make embarrassing mistakes. Even Nobel Laureates will have transcription errors and coding mistakes. It's human to make mistakes. "To err is human".

I am not sure when it will be the case that we can utter "to err is _only_ human", but I don't think it's now. 

And thus I think about Becker's classic 1968 "Crime and Punishment" paper in the JPE. Buried in a footnote of that paper is a little bitty anecdote about a Vietnamese speculator in rice markets who had his hands cut off when his speculating was discovered. Why do I bring this up?

Because Becker's model works out the optimal punishments for crime. And one of the things he works out is that the punishment for crime rises optimally when the probability of detection falls. And so, if we are unskilled as a species, we may have low probabilities of detection of errors. Or if the gains from being accurate are really high, and the costs of mistakes are therefore high, the optimal response according to Becker is not forgiveness. 

**It is punishment**. And it is severe punishment. It is exile from the community. It is reputational destruction. It is the Cain-like permanent scarring of the face. The person will _never_ be allowed back. There is no restitution. There is no grace. This is not tit-for-tat. This is grim strategy. 
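The comparative static can be sketched in a few lines. This is the textbook simplification rather than Becker's full model: it assumes risk-neutral offenders who respond only to the expected penalty, so holding deterrence fixed means holding probability times punishment fixed. The function name and the numbers are hypothetical.

```python
def punishment_for_fixed_deterrence(expected_penalty, p_detect):
    """Punishment needed so that p_detect * punishment equals expected_penalty.

    Simplifying assumption: only the expected penalty deters.
    """
    if not 0 < p_detect <= 1:
        raise ValueError("p_detect must be in (0, 1]")
    return expected_penalty / p_detect


# As detection gets less likely, the required punishment climbs.
for p in (1.0, 0.5, 0.25, 0.125):
    print(p, punishment_for_fixed_deterrence(100.0, p))
# 1.0 100.0
# 0.5 200.0
# 0.25 400.0
# 0.125 800.0
```

Halve the odds of being caught and the punishment has to double to deter just as much. If verification skill erodes and errors get harder to catch, this is the direction the arithmetic pushes.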

My bet is that we move towards AI agents. Humans pushing the button will be punished on behalf of the agents' "mistakes" because it is ultimately still a principal-agent problem. Humans will be responsible for anything the agents do, even the most subtle, seemingly irrelevant detail. Like the ill-defined target estimand.

Anyway, that's my paper. It's R&R. Wish me luck.

(C) 2026 scott cunningham  