The Thing We All Obviously Want
Over the past year, AI-assisted programming has developed at an astounding pace. Even five years ago, fully-automated program synthesis of large-scale production systems would have seemed unthinkable. Today, this is not an ambition; it is a reality, at least by some measure. To some computer scientists, natural-language-driven program synthesis was the endgame. On the other hand, the software I use day-to-day doesn’t seem to be getting appreciably better overall. Systems are still broken, apps are still unresponsive (even on well-resourced hardware), crashes are still common, and interfaces are generally as clunky as before. Personally, I believe we will eventually see many systems adapted by AI-assisted refactoring tools, but I also recognize there are human barriers to deploying those things at full thrust in the short (even medium) term.
In any case, my position is that AI-assisted programming, giving us real-time, on-demand generation of any app, is the thing that we all obviously want. There are a few tensions with this reality: (a) it meaningfully changes the value proposition of what “code” is, (b) there are externalities: wasted computation and energy spent replicating junk, and (c) it challenges the role of humans in the knowledge-generation process.
Note: In the rest of this essay I will use the term “LLM(s).” In general, when I say this, I mean a state-of-the-art integration of a frontier model alongside relatively simple tools (e.g., Claude Code, Codex, etc.). There is some nuance in building these tools, but given that the innovation is the model, I will casually refer to the whole agentic process as the “LLM.”
Program Synthesis: Did it Fail?
Traditional program synthesis (by which I generally mean SMT-based, search-based, or similar) leveraged rigorous, formal enumeration or proof to produce a synthesized program–potentially with a certificate of its correctness–driven by a rigorous specification. Like many academic fields, the goal of program synthesis was not only to produce fully-automated programming tools: it was also to advance the frontier of understanding in semantics, verification, specification, etc. These were the challenging problems, especially given that traditional search (on the CPU) was so slow.
LLMs allow rigorous concepts to gracefully degrade by using text. The underlying model has such a deep understanding of language that fuzzy, hazily-posed descriptions often still give some sensible interpretation. The obvious issue is hallucination: when you push the embedding space into some inconsistency, won’t it just generate junk? And of course, this is absolutely an issue–but when the error rate is low enough that it’s practically useful, many people will not care.
My position is that LLM-guided software engineering was so wildly successful not just because it nailed the generation part, but also because LLMs ended up practically solving the problem of specification. Humans are simply used to the failure modes of underspecification: even from a young age we’re trained to expect disappointment if we miscommunicate our expectations, and so having the LLM fail doesn’t sting as badly as you might expect.
Granularly-Evolving Formal Specs
One potential issue I foresee with current-generation AI tools is that they focus the process on a text-only workflow. In practice, smart humans do want to read something that looks like code most of the time–the issue is that they want to be able to focus their limited mental attention rather than sifting through thousands of lines of code. Most anybody who ever worked on a large codebase (that they did not write entirely themselves) never had more than an LLM-level understanding of parts of the codebase anyway. Instead, we embarked upon code understanding efforts whenever we faced tricky bugs, needed to add new features, etc. We codified this in our own mental model (memory, notes, etc.), but also (sometimes) in documentation, bug reports, etc. Hilariously, this is now the kind of thing that the LLM loves to ingest.
As we build software, we want to be able to start with a hazy specification (probably in English, but maybe in a big document) and be able to begin building an application. At key decision-making points we want to be able to solicit input and, finally, be dropped into an exploratory state where we may make our thoughts more granularly precise. For many reasons, I still believe that this should be formal, executable code, not English prose.
The issue is that no single language is perfect. English is great for laypeople: any arbitrarily complex topic can be compressed into an arbitrarily-simple soundbite. Unfortunately, English is imprecise and even lies to you via the embedding. On the other end of the spectrum, we might have Lean in a loop with the LLM. The LLM is speaking Lean and there is some amount of grammar- and semantics-constrained generation occurring. The issue here is one of information overload: Lean is fully-rigorous, but complex propositions and definitions take serious mental bandwidth to unpack as a nonexpert. Sure, you can ask the LLM to explain, but it is not easy if you do not have a background with advanced topics. We need something in between.
There is obviously a market for programming environments which enable granularly-precise constraint-guided software development. Many people will attempt this first with English integration layers on top of preexisting languages (e.g., Lean, C++, etc.). I wonder, however, if there is room for families of languages which are focused on abstracting around granularly-more-correct specifications of behavior, with the understanding that (e.g.,) the LLM (or maybe a human) is filling in the gaps.
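To make "granularly-more-correct specifications" a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the property names, the `satisfies` helper, and the two candidate implementations, which stand in for code an LLM might generate. The idea is only that a spec can start hazy (one property) and be tightened later (a second property), with the candidate re-checked each time.

```python
from collections import Counter

# Each property is a predicate over (input, output). A "spec" is just a
# list of properties, so refining the spec means appending to the list.
def ordered(xs, ys):
    return all(a <= b for a, b in zip(ys, ys[1:]))

def permutation(xs, ys):
    return Counter(xs) == Counter(ys)

def satisfies(candidate, spec, samples):
    """Check a candidate against every property on every sample input."""
    return all(prop(xs, candidate(list(xs))) for prop in spec for xs in samples)

# Two stand-ins for generated code (assumptions, not real LLM output).
def lossy_sort(xs):
    return sorted(set(xs))  # ordered, but silently drops duplicates

def full_sort(xs):
    return sorted(xs)

samples = [[3, 1, 2], [2, 2, 1], []]

hazy_spec = [ordered]                  # "the result should be sorted"
refined_spec = [ordered, permutation]  # "...and nothing may be lost"

print(satisfies(lossy_sort, hazy_spec, samples))
print(satisfies(lossy_sort, refined_spec, samples))
print(satisfies(full_sort, refined_spec, samples))
```

The hazy spec accepts the lossy candidate; only after the human adds the second property does the gap become visible. Sample-based checking is of course far weaker than a proof, which is part of why something between English and Lean seems worth exploring.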
Searchable, Visualized, Reactive Knowledge Bases
Many people are now using LLMs to build knowledge bases. The baseline is simple: loose collections of textual notes are spat out by the LLM and digested by a human, who iteratively interacts with a tool (Claude Code) to build the knowledge base. Text is very useful, but it is not always the best way to visualize information–I expect we will see a growing corpus of applications which leverage LLMs to integrate into some higher-level, application-specific knowledge base. As an example, consider CAD components: we will often want some semi-programmatic control over the development of a scene (e.g., laying out an office), but we probably want to reuse previously-designed components (e.g., cabinets with a countertop) with slight tweaks–we want to ensure that crucial constraints are met (e.g., always ensuring a uniform toe kick) while enabling the AI to adapt a previously-built model in our knowledge base. At some point it might be helpful to break apart an item, say “base cabinet” versus “floor cabinet.” I expect there will be a demand for gradually evolving things from unstructured, generative objects into symbolic, computed representations.
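As a rough illustration of the cabinet example, the sketch below treats a stored component as a small symbolic record and checks one crucial constraint (a uniform toe kick) whenever an adapted copy is added to a scene. The `Cabinet` fields, the millimetre units, and the `adapt` / `try_add` helpers are all assumptions made up for this example; `adapt` stands in for whatever tweak the AI proposes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cabinet:
    name: str
    width_mm: int
    height_mm: int
    toe_kick_mm: int

def uniform_toe_kick(scene):
    """The crucial constraint: every cabinet shares one toe kick height."""
    return len({c.toe_kick_mm for c in scene}) <= 1

def adapt(component, **tweaks):
    """Stand-in for an AI-proposed tweak to a previously-built component."""
    return replace(component, **tweaks)

def try_add(scene, candidate):
    """Return a new scene with the candidate, or reject it outright."""
    proposed = scene + [candidate]
    if not uniform_toe_kick(proposed):
        raise ValueError(f"{candidate.name}: toe kick is not uniform")
    return proposed

base = Cabinet("base cabinet", width_mm=600, height_mm=870, toe_kick_mm=100)
scene = try_add([], base)
scene = try_add(scene, adapt(base, name="wide base cabinet", width_mm=900))

try:
    try_add(scene, adapt(base, name="odd cabinet", toe_kick_mm=120))
except ValueError as err:
    print("rejected:", err)
```

The generative part is free to change anything it likes; the symbolic part only has to be precise about the properties we actually care about.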
Self-Evolving Software
I believe self-evolving software is already possible; perhaps it is already here. Self-evolving software is software which optimizes and evolves itself over time to suit a user’s needs. Right now, we prompt the LLM with directions to build the app we desire. However, we can easily imagine a future where software is optimized and evolved on demand. For example, we might imagine being able to give gentle nudges: “I don’t quite like the theme of the app,” which is then taken as an instruction to evolve the application in the background–whenever a change is tested, we can respond with a thumbs up / down (or some more nuanced feedback signal). Since humans are already doing this in production, it seems reasonable to assume that very quickly this process will be abstracted, and humans will begin to codify the ways in which software is allowed to evolve.
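One way to picture "codifying the ways in which software is allowed to evolve" is the toy loop below. It is a sketch under heavy assumptions: the app is reduced to a dictionary of settings, `ALLOWED_TO_EVOLVE` is a hypothetical allow-list written by a human, and `approve` stands in for the thumbs up / down signal.

```python
# Hypothetical allow-list: the only parts of the app that may change
# in the background without an explicit human request.
ALLOWED_TO_EVOLVE = {"theme", "font_size"}

def evolve(app, proposal, approve):
    """Apply a proposed change if it is permitted and the user approves it."""
    blocked = set(proposal) - ALLOWED_TO_EVOLVE
    if blocked:
        raise PermissionError(f"not allowed to evolve: {sorted(blocked)}")
    candidate = {**app, **proposal}
    # Thumbs up keeps the candidate; thumbs down keeps the old version.
    return candidate if approve(candidate) else app

app = {"theme": "light", "font_size": 12, "billing_logic": "v1"}

app = evolve(app, {"theme": "dark"}, approve=lambda c: True)   # thumbs up
app = evolve(app, {"font_size": 30}, approve=lambda c: False)  # thumbs down
print(app)
```

In this toy version the boundary is a set of keys; in a real system the interesting question is what the equivalent of that set looks like for behavior rather than settings.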
This is partially an issue of modularity–we want to be able to modularly decompose our concerns so that we can make rigorous precisely what concerns us while leaving the rest up to the LLM. Still, we seem to lack a good way to rigorously express the constraints guiding a software system while leaving other parts opaque. The issue seems to be that implementation details often form leaky abstractions in practice, and right now English is the common denominator: once you go beyond English formally, its ambiguity starts leaking in.
Generation, Replication, and Libraries
For a time, humans pressed tab and enter. They would look at the app, observe the feature, give input–it was beautiful. Through rose-colored glasses, one might say that the skilled engineer could spend 2-5 hours reading the code’s structure and architecting it correctly, while the agent did the boring, tedious work. Through a cynic’s perspective, there’s absolutely no sense in generating code that you aren’t at least reading. We reconcile the two perspectives by recognizing that the boundary constitutes the “library.”
Thought experiment: imagine a world where the right library always exists. In this world, you would only ever have to write 10kloc apps; everything beyond 10kloc is pushed into a library. Some systems are not really like this, and they do just take a ton of code. But practically, nearly everything you could learn about in a university class could be written in 10kloc or less, relatively efficiently, with the right libraries.
We are close (if not already there) to this vision with modern LLMs, but we are not there in an important way: every time you want the library, you regenerate it. This is obviously extremely wasteful: how many to-do list apps do we really need to waste our energy generating? In our thought experiment world where all libraries exist, imagine you had an oracle that told you the right libraries you needed: could you use a much cheaper model?
In a way, this is just semantic caching. There will always be queries made to the LLM where it has seen something “pretty close” before. In an ideal world, this would be cached and we could do only the incremental work necessary to generate the new response, optimally balancing replication, generation, bandwidth, etc. costs.
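A minimal sketch of the semantic-caching idea, assuming a deliberately crude notion of "pretty close": prompts are compared by cosine similarity over word counts, and anything above a threshold is served from the cache. The `SemanticCache` class, the `embed` function, and the 0.8 threshold are all illustrative assumptions; a real system would presumably use learned embeddings and would reuse the cached artifact as a starting point rather than returning it verbatim.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedded prompt, generated artifact)

    def put(self, prompt, artifact):
        self.entries.append((embed(prompt), artifact))

    def get(self, prompt):
        """Return the closest cached artifact, or None on a cache miss."""
        query = embed(prompt)
        best_score, best_artifact = 0.0, None
        for vector, artifact in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_artifact = score, artifact
        return best_artifact if best_score >= self.threshold else None

cache = SemanticCache()
cache.put("build a to-do list app", "todo_app_v1")

print(cache.get("build me a to-do list app"))  # close enough: reuse
print(cache.get("render a 3d office scene"))   # too different: regenerate
```

The threshold is where the balancing act lives: set it too low and you reuse the wrong thing, too high and you regenerate yet another to-do list app.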
Security, Trust, OSS, Etc.
What is the point of well-maintained, agreed-upon code? In a world where generating code is cheap, there are both upsides and downsides. The clearest upside is this: there are many components (e.g., container libraries, layout engines, etc.) for which having stable, predictable, repeatable behavior will generally be desirable. While many of us might want our own custom business CRM (hopefully out of documented components), we still want to use Chrome, too. On the other hand, in an age where LLMs have surpassed humans in exploit identification, running the same software as everyone else leaves us open to potential vulnerabilities.
Another concern is the waste incurred by pervasively using neural models where symbolic models suffice. Due to the current subsidies of frontier AI, we will inevitably see waste when LLMs are used as general-purpose reasoners where more specific reasoners would have sufficed. To some degree, this is just an engineering problem: “LLMs” are complex agentic workflows which build tools on-demand when necessary, and it seems natural to predict that they will do so when valuable. Maybe the question is: what is the use of reusable, OSS components to an LLM?
Human-Human and Human-LLM Teaming
We used to team up to work on software by having a shared artifact: the codebase. When we got hired, we would read the codebase, step through it in a debugger, and write a test that forced us to read an execution path. But now that LLMs can do it just as well as we can–at least with respect to the functional constraints (the AI uses tests for its reward, so this is to be expected)–it is leaving lots of the industry questioning: why read the code?
Unfortunately though, while individual contributors have moved on to mostly using English text to actually write the code, development teams still generally interact by simply pushing the code. Other engineers then use their LLM to help them read and maintain the code. The issue here seems obvious: the substrate is still code tracked in Git; if I’m working with someone who’s also using an LLM, I don’t really just want the code. On the other hand, I personally do like code; I don’t really just want to read a big sequential dump of English either. What I want is to be able to rapidly interact with someone in a way that lets us focus on just as much of the formal specification as we want, leaving the rest underspecified.
AI Acknowledgement
I did not use AI to write this post. However, I did use Claude for two things: (a) spellchecking, and (b) checks for copyediting / consistency / etc.