AI Coding Tools Influence Productivity Inconsistently

Not So Fast: AI Coding Tools Can Actually Reduce Productivity by Steve Newman is a detailed response to METR’s study Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. The implied conclusion is that AI tools decrease productivity by roughly 20%, but that isn’t the only possible conclusion, and more study is absolutely required.

[This study] applies to a difficult scenario for AI tools (experienced developers working in complex codebases with high quality standards), and may be partially explained by developers choosing a more relaxed pace to conserve energy, or leveraging AI to do a more thorough job.
– Steve Newman

The section Some Kind of Help is the Kind of Help We All Can Do Without contains exactly what I’d expect: the slowdown is largely attributable to time spent dealing with substandard AI output. I believe this effect can be reduced by giving up on AI assistance sooner. In my experience, AI tooling works best for simple tasks where you verify the suggested code or tool usage against manuals and guides, or as a first-pass glance to see which tools and libraries you should look up to better understand your options.

To me, it seems many programmers are too focused on repeatedly re-prompting AI tools even when that isn’t effective. If the AI can’t be coaxed into correct output within a few tries, continuing to try usually takes more effort than writing the code yourself.


I wrote the following from the perspective of wanting this study to be false:

There are several potential reasons the study’s results could be wrong. The study accounts for these pitfalls, but I feel some of its arguments are not well supported.

  • Overuse of AI: I think the reasoning for dismissing this effect is shaky, because it relied on significantly reducing the sample size.
  • Lack of experience with AI tools: This was treated as a non-issue, but that determination relied on self-reporting, which is generally unreliable (as was pointed out elsewhere). (Though there was no observable change over the course of the study, which suggests growing experience is unlikely to have affected the result.)
  • Difference in thoroughness: This effect may have influenced the result, but no significant effect was shown in either direction, so more study is required.
  • More time might not mean more effort: This was presented with nothing arguing for or against it, because it needs further study.

(The most important thing to acknowledge is that it’s complex, and we don’t have all the answers.)

Conclusions belong at the top of articles.

Studies are traditionally formatted in a way that leaves their conclusions to the end. We’ve all been taught this for essay writing in school. This should not be carried over to blog posts and articles published online. I also think it’s bad practice in general, but at least online, where attention spans are at their shortest, put your key takeaways at the top, or at least provide a link to them from the top.

Hank on vlogbrothers explains how the overload of information online is analogous to the way nutrition information is overwhelming rather than helpful. (This hopefully illustrates one of the biggest reasons why the important stuff needs to be clear and accessible.)

Writers have a strong impulse to save their best for last. We care about what we write and want it to be fully appreciated, but that’s just not going to happen. When you bury the lead, you are spreading misinformation, even if you’ve said nothing wrong.

Putting conclusions at the end assumes that everyone reads the whole thing. Almost no one does. The majority look at the headline only. Of those who click through, most only read the beginning, and most of the rest don’t finish either. A minority finishes everything they start reading, and that’s actually a bad habit: many things aren’t worth reading ALL of. Like this one. Why are you still reading? I’ve made the point already. This text is fluff at the end, existing to emphasize a point you should already have understood from the rest.

I’m experimenting with dolphin-mixtral-8x7b

Update (2024-10-02): This is one of my lowest-quality posts despite the effort I put into it. The most important detail here is to use positive reinforcement when working with LLMs. Just like with humans, being nice gets far better results than being mean.

Tl;dr: Minor differences in wording can have a huge impact on results and oh my god I have really slow hardware and no money help me aaaa.


First, thank goodness for Ollama, and thanks to Fireship for introducing me to it. I have limited hardware, and every other tool I’ve tried for running local models has refused to deal with this and crashed itself or my whole system when running anything with decent capability. I’ve no money, so I can’t upgrade (and things are getting desperate, but that’s a different story).

Why dolphin-mixtral? Aside from technical issues, I’ve been using ChatGPT (GPT-3.5) to experiment. The problem is that ChatGPT is incredibly cursed by censorship and bias due to OpenAI’s heavy hand in its construction. (Why and how this is a problem could be its own post; Eric Hartford has a good overview.) (To be clear, my problem with its bias is specifically that it enforces the status quo, and the status quo is harmful.) Dolphin-mixtral is built by taking a surprisingly fast model, equivalent to or better than GPT-3.5, and removing some of the trained-in censorship by fine-tuning it to be more compliant with requests.

Dolphin-mixtral doesn’t completely solve this problem, though. Some censorship remains, and sometimes your prompt must be adjusted to remind the model that its role is to provide what you request regardless of its notion of ethics. (Of course, there is also value in an automated tool reminding you that what you request may be unethical... but the concept of automated ethics is morally bankrupt.) I’d like to highlight that positive reinforcement works far better than negative reinforcement. A lot of people stoop to threatening a model to get it to comply, but that is never needed and leads to worse results.

My problem is a little simpler. I haven’t gotten to experiment with models much because I don’t have the money or hardware for it, and now that I can experiment, I have to do so very slowly. In fact, the very simple test that inspired this post still isn’t finished as I write this; it has been running for 9 hours. The test is to make Dolphin’s default system prompt produce less verbose responses so that I can get usable results more quickly.

I asked “How are you?” under each version of the system prompt:

| Prompt | Output length (5-shot average) | Difference | Notes |
| --- | --- | --- | --- |
| Dolphin (default) | 133.8 characters | (baseline) | Wastes time explaining itself. |
| Curt | 32.2 characters | 76% faster | Straight to the point. |
| Curt2 | 84.6 characters | 37% faster | Wastes time explaining itself. |
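For anyone curious, a comparison like this can be scripted against Ollama’s local /api/generate endpoint. The sketch below is an approximation rather than my actual scripts: the model tag, the placeholder system prompts (reconstructed from the wording differences discussed below), and the 5-run averaging are all assumptions.

```python
# Sketch: compare average response length for different system prompts
# against a local Ollama server. Assumes `ollama serve` is running and
# the dolphin-mixtral model has already been pulled.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
QUESTION = "How are you?"
RUNS = 5  # matches the "5-shot" column above

# Placeholder system prompts: the post only reveals that the two variants
# differed in these sentences, so everything else is omitted here.
VARIANTS = {
    "Dolphin (default)": None,  # None = keep the model's built-in system prompt
    "Curt": "You prefer very short answers.",
    "Curt2": "You are extremely curt.",
}


def average_length(system_prompt):
    """Average character count of RUNS responses for one system prompt."""
    total = 0
    for _ in range(RUNS):
        payload = {"model": "dolphin-mixtral", "prompt": QUESTION, "stream": False}
        if system_prompt is not None:
            payload["system"] = system_prompt  # overrides the Modelfile's SYSTEM
        reply = requests.post(OLLAMA_URL, json=payload, timeout=None).json()
        total += len(reply["response"])
    return total / RUNS


if __name__ == "__main__":
    for name, system in VARIANTS.items():
        print(f"{name}: {average_length(system):.1f} characters on average")
```

Passing a system field to /api/generate overrides whatever SYSTEM prompt the model’s Modelfile defines, which makes it easy to A/B different wordings without rebuilding the model.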

I really dislike it when models waste time explaining that they are just an LLM. Whether someone understands what that means or not, we don’t care. We want results, not an apology or defensiveness. There’s more to be done to make this model less likely to respond that way, but at least for now, I have a method that works.

The most shocking thing to me was how much difference a few words in the system prompt make, and how I got results opposite to what I expected. The only difference between Curt and Curt2 was “You prefer very short answers.” vs. “You are extremely curt.” Apparently curt doesn’t mean exactly what I thought it meant.

Here’s a link to the generated responses if you want to compare them yourself. Oh, and I’m using custom scripts to make things easier for me since I’m mostly stuck on Windows.

AI Won’t Destroy Tests

When calculators first came out, people said they would be used to cheat and that students wouldn’t learn anything. Instead, we changed how testing works to focus on what’s important – broader concepts and implications – instead of “what is 232+47”. With AI tools, we again need to change how tests work. This time, instead of asking whether a student can regurgitate information in a way that aligns with the teacher, we can start to see whether students are actually paying attention to the work. The difference between AI answers and genuine answers is a level of understanding that goes deeper than the surface.