Language that t̶e̶s̶t̶s̶ writes Code

Last week, there was a bit of drama on YouTube, Reddit (by way of TikTok), and across the web about Devin.ai, a paid LLM that is purportedly a full-fledged developer and was able to complete tasks posted by real people on Upwork. It turns out it can't do that, but I'm not here to dogpile on this situation.

Several people have asked me what I think about AI while I've been raising money for my new company, Kickplan (which helps people monetize AI in SaaS but isn't an AI company per se). I also had a great email exchange with my friend Ward Cunningham about an old technology that got me thinking about all this again, and I think there's a lot we can learn about "LLM developers" from it.

Cucumber

Between roughly 2008 and 2012, a testing methodology based on a library called Cucumber became popular in the Ruby on Rails community (and later beyond). Cucumber was a riff on an older concept that Ward Cunningham invented, popularized in the Java world as FIT, the "Framework for Integrated Test." The idea was that non-engineers could specify how software behaved by describing its starting and ending behavior, sometimes in natural language and sometimes with spreadsheets or other artifacts. You wrote these documents first, and you didn't need to be as specific as the actual Ruby (or other) code required to implement the behavior. Once the code was written, those same documents could generate test code that would automatically prove your software did what it was supposed to do.

Here's what that looks like:

Feature: Change password

  Scenario: Change my password
    Given I am signed in
    When I go to the edit user page
    And I fill out change password section with my password and "newsecret" and "newsecret"
    And I press "Change password"
    Then I should see "Password changed"
    Then I should be on the new user session page
    When I sign in with password "newsecret"
    Then I should be on the stream page

Is it clear what you want the software to do? It's not bad, but unfortunately, you still have to write all the actual application code, as well as the underlying code that turns these natural language steps into actual tests.
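To give a sense of what that underlying code looked like, here's a rough sketch of a few step definitions in Ruby, using Cucumber with Capybara-style helpers. This is illustrative, not from a real project: the form labels and the create_user/sign_in_as helpers are hypothetical, and real codebases accumulated hundreds of these.

# features/step_definitions/password_steps.rb
# Each definition binds a natural-language phrase (via a regex) to
# real test code. The feature file's wording must match exactly.

Given(/^I am signed in$/) do
  @user = create_user(password: "secret")  # hypothetical test helper
  sign_in_as(@user)                        # hypothetical test helper
end

When(/^I fill out change password section with my password and "([^"]*)" and "([^"]*)"$/) do |new_password, confirmation|
  fill_in "Current password", with: "secret"   # Capybara form helpers;
  fill_in "New password", with: new_password   # labels are hypothetical
  fill_in "Password confirmation", with: confirmation
end

Then(/^I should see "([^"]*)"$/) do |text|
  expect(page).to have_content(text)           # Capybara + RSpec matcher
end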

In theory, I absolutely loved the idea of collaborating with stakeholders to decide the behavior of software in a way they could understand. When I put on an engineering hat, there's so much cognitive load that I don't want to be the one making all the engineering choices and figuring out the optimal behavior. It's too much. If we can reduce the ceremony by specifying how software should behave and verifying it behaves that way with the same code, you kill more than two birds with one stone.

However, in 2024, I'd be shocked if you know of ANY company that uses Cucumber or any FIT-style testing at scale. And if they do, I'd be SHOCKED if they started using it in the last four years. There are a bunch of reasons why, but the biggest is the specificity problem.

Specificity and Process

To control software's behavior, you have to be very specific. In fact, you need to be so specific that you always end up creating a formal language, so that words have precise meanings and you control the vocabulary your team uses. You'll need to craft and refine this language and collaborate with others on it. You'll need some formalized way to merge multiple work streams together. Finally, once it's been merged, you'll need a process to check the correctness of that merged work against what the machines understand.

Congratulations, you've created a programming language and now follow a software development lifecycle.
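To make the vocabulary point concrete with a hypothetical example: Cucumber only understands phrasings that match some step definition's pattern, so a synonym any stakeholder would read as equivalent is, in practice, a syntax error.

  # This step runs, because a definition's regex matches it:
  Given I am signed in

  # This fails with "Undefined step", even though a human reads
  # it as identical:
  Given I am logged in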

LLM-based Software Engineering

The problem with thinking that an LLM can become a software engineer is twofold:

  1. The hard problem of software engineering is designing systems that result in behaviors that create sufficient value for users.
  2. To control a code generation machine with sufficient specificity to achieve the desired outcome, you ultimately have to be really, really specific.

Most of the outrage pointed at Devin.ai has revolved around problem #1. The code it generated was not correct, and it didn't actually create value for its users. Furthermore, it's so simplistic that it can't actually design a system. It can only do one thing a programmer does: generate words chosen from a controlled vocabulary in a specific grammar, prompted by natural language. And while that may seem complicated to non-programmers, it's not the hard part of programming computers.

But we're missing the point of what we learned with FIT and Cucumber. When it comes to using an "AI" to generate code, no matter how advanced the "AI" eventually becomes, specificity is still going to be the issue. To get repeatable, behaviorally correct software systems out of an LLM, of sufficient complexity to sell as a product or build a business around, you have to become a "prompt engineer." And if you have to be a "prompt engineer" to the extent that you learn a GPT-4-specific controlled vocabulary at a high level, I think you'll find you've learned a new programming language. If someone was unwilling or unable to learn Python, and non-trivial software systems now require GitHub repos with thousands or millions of lines of "natural" language prompts, painstakingly crafted so that Devin or ChatGPT or Llama or whatever can generate something of value instead of Ruby or Python, then what have we gained?

AGI Will Tell Us How To Fix This

Or, I guess, maybe we'll just let the machines decide what software should do without our input at all. What could go wrong?