
What's Worth Knowing In AI Right Now? (with Henry Garner)

Developer Voices

AI is changing the way we all build software — that much seems clear. But the landscape is moving so fast that even the people paid to keep up are struggling. MCP or skills? Fine-tune or just prompt? LangChain or let a thousand agents loose? With almost 70 competing technologies and a shelf life of maybe six months on any advice, how do you figure out what's actually worth your time?

Henry Garner is CTO of JUXT, a consultancy with about 150 senior engineers working at the coalface of AI-assisted development, including building AI platforms for tier-one banks. JUXT publishes a quarterly AI Radar — 68 technologies rated and reviewed — and Henry's been watching his own team go through the full adoption arc, from "spicy autocomplete" skepticism through to building Byzantine-fault-tolerant distributed systems over a weekend with Claude. Along the way we cover MCP vs skills, Conway's Law for LLMs, neurosymbolic AI and the unexpected return of Prolog, the "Ralph Wiggum loop" for getting agents to converge on correct implementations, and Allium — a new behavioral specification language Henry's co-authored that sits between human prose and TLA+, aiming to give LLMs just enough structure to pin down what a system should do without falling into waterfall thinking.

If you're trying to make sense of the AI tooling landscape, or you've hit that wall where your agents keep drifting away from what you actually wanted, Henry's thesis — velocity through clarity of intent — might well help you out.

--

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@DeveloperVoices/join


JUXT: https://www.juxt.pro/

JUXT AI Radar: https://www.juxt.pro/ai-radar/

Allium on GitHub: https://github.com/juxt/allium

Allium Documentation: https://juxt.github.io/allium/

Composition at a Distance (Henry's blog post): https://www.juxt.pro/blog/composition-at-a-distance/

A New Vocabulary for an Old Problem (Henry's blog post): https://www.juxt.pro/blog/new-vocabulary-for-an-old-problem/

Model Context Protocol (MCP): https://modelcontextprotocol.io/

LangChain: https://www.langchain.com/

LangGraph: https://www.langchain.com/langgraph

Gas Town (Steve Yegge): https://github.com/steveyegge/gastown

Kiro (spec-driven AI IDE): https://kiro.dev/

Phoenix (LLM observability): https://github.com/Arize-ai/phoenix

Temporal: https://temporal.io/

Taalas (LLM-on-a-chip): https://taalas.com/


Kris on Bluesky: https://bsky.app/profile/krisajenkins.bsky.social

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/


Transcript

Introduction to AI Hype vs. Reality

00:00:00
Speaker
What a time to be a programmer, huh? At the start of 2025, AI just seemed to me like the latest entry in a long line of Silicon Valley hype cycles. It seemed like where the blockchain fallout was going to head next.
00:00:14
Speaker
By the end of 2025, it kind of seemed undeniable that somehow this was going to change the way we all work, and fast. Really fast. I mean, the World Wide Web changed the way everyone works, but it took a couple of decades to diffuse through society. AI is speedrunning it. I think it's going to change things massively within a year or two.
00:00:37
Speaker
So how do we keep up? Specifically, us techie types, because personally I think we will all still have jobs in 12 months, but only if we adapt, because they won't be exactly the same jobs. I'm sure of that.

AI Tech Radar Overview

00:00:51
Speaker
How do we keep up?
00:00:53
Speaker
That's a question that's been on my mind a lot lately, and it was particularly on my mind when an old colleague of mine announced that he'd published an AI Tech Radar,
00:01:04
Speaker
68 technologies in a PDF, all from the AI landscape, and it's an analysis of what's hot and what's not, what's worth using, what's worth experimenting with, what should probably be dropped now.
00:01:17
Speaker
Specifically for all of those of us who are trying to get stuff built amid this time of change. So this week, Henry Garner joins me to discuss not all 68 technologies, it's not that big an episode, but some big swathes of it. We managed to go through things like the impact on jobs and the impact on people paying for those jobs, the people paying our salaries.
00:01:40
Speaker
The current debate at the moment, which seems to be MCP versus skills, and is MCP dead or not? Fine-tuning, is that still worthwhile? Frontier models, can you compete? Neurosymbolic AI, what is it?
00:01:55
Speaker
And why does it possibly mean the comeback of Prolog? And how that all leads into a new language that Henry's co-authored called Allium. Now, I have to confess something to you, and to you, Henry, if you're listening to this intro.
00:02:09
Speaker
I didn't get Allium at first. At first, it sounded a little bit like a way to generate more documentation that no one's actually going to read. But I was wrong. The penny drops during this conversation and I get it now. And I'm going to talk a bit more about it at the end because I think he's really onto something.

Impact of AI on Job Market

00:02:28
Speaker
But that's towards the end of the conversation. We'd best get there. We'd best get on the map and fire up the AI radar. So I'm your host, Kris Jenkins. This is Developer Voices. And today's voice is Henry Garner.
00:02:53
Speaker
Joining me today is an old friend of mine, Henry Garner. Henry, how are you? I'm very well, thank you, Kris. Thank you for having me. Oh, thank you for coming. I've possibly got you on the most difficult topic to pin down at the moment, but we must try.
00:03:10
Speaker
Yes, you've invited me on to talk about this niche interest that I've got, this niche field that no one else is talking about: AI. I think I'm going to rush this episode to production because it's moving so fast, right?
00:03:26
Speaker
Yeah, I mean, it's dizzying. The pace of change, even when, as I am, you're basically paid to try and stay on top of these things, is dizzying and overwhelming. The scope as well is absolutely massive. For us in software engineering, it's changing the whole way we work. But of course, it's reaching out into basically every corner of our lives as well.
00:03:50
Speaker
Yeah, yeah. The thing that stunned me is my wife, who is not a technical person, is using AI every day for her job. It's like, okay, this is a bit like the internet in that it's changing the lives of non-technical people too.
00:04:07
Speaker
But it's going so much faster than the internet dispersed through society, right? Yes, far faster. I sit in the park, people walking past, I can hear them talking about Claude, and I know exactly what they're talking about. My wife as well is using it for all sorts of personal correspondence, but also creative pursuits. It's a fantastically versatile tool, no matter what you do, really.
00:04:36
Speaker
Yeah, yeah. We are going to have to talk about the tech of the tool and specifically advanced techniques. But before we get to that, I'm going to have to ask you, because one of the big questions at the moment is how is it going to affect our jobs, right?

Engineers Embrace AI

00:04:51
Speaker
And you work for, I'll paraphrase the business model here, you work for a company that is basically coders for hire, right? Yeah. Essentially, yes. Coders with a lot of domain expertise. Excellent coders, sometimes including terrific coders who later go on to host a podcast.
00:05:12
Speaker
Lovely agency. I've worked with you in the past. But you must be seeing the blunt edge of this, right? If companies are adopting AI and if jobs are at risk, I would have thought an agency, coders for hire, is the first place that sees the pain of the job market.
00:05:31
Speaker
Is that actually your experience? Yeah, I mean, what an exciting time to be doing what we're doing. And I feel the responsibility keenly as the CTO of JUXT.
00:05:42
Speaker
We see it from a few different perspectives. We've got about 150 engineers, skewing towards senior engineers who've been doing the job for a while.
00:05:53
Speaker
And they have built up considerable expertise in, if you like, the old way of doing things: hands on keyboard, writing code.
00:06:04
Speaker
And so to see close up what effect this new wave of AI tools, particularly the ones which have landed since late 2025, is having is fascinating.
00:06:18
Speaker
And this is quite a predictable journey that our engineers are going on, from skepticism and a little bit of resentment. We've all seen hype cycles play out, and plenty of us for some length of time have been saying, basically, this will all blow over, or it's spicy autocomplete.
00:06:39
Speaker
Yeah, I was saying exactly that a year ago. Yeah. And the scales fell from my eyes. Yeah, it's a well-trodden path through to, oh my goodness. When you actually get your hands dirty and you start trying out something like Claude Code, for example, which is where I've lately been advising people to begin, suddenly you realize the power.
00:07:04
Speaker
And I've heard that some people have a bit of a wobble at that point and think, oh my goodness, these skills that I've spent decades building up, what are they worth now when I can just ask a machine to do the job for me? Yeah.
00:07:19
Speaker
That may take a week or two to play out. I think it was exactly two for me. There you go. Through to the other side, where you realize that...
00:07:31
Speaker
It's not taking away your engineering judgment, it's not replacing the things about the role that you enjoyed. It might in fact be allowing you to delegate some of the boring drudge work that you used to tolerate, because what you really enjoyed doing was solving big gnarly problems or thinking through nice solutions. And those solutions have changed shape now. We might not be specifying the precise implementation if we're delegating a lot of that to agents, but you've still got to design what your software needs to do. And you get the payoff of seeing that it's solving real problems and enabling people to do things that they couldn't do before.
00:08:14
Speaker
It's just massively accelerated that path to done. Yeah, I totally agree with that. And I've had great fun. All those tiny side projects that I'd never get around to in the past, but would like to build, are things I now tick off of an evening.
00:08:32
Speaker
and I've built so many new small projects that are just scratching personal itches. That is a joy. If you got into this to build things, it's a joy.
00:08:43
Speaker
What I wonder is, let me ask you two things about this. So one, have you found like massive resistance from any corners? I would have thought there are some people who jump into this with enthusiasm and some people in your team who are holding out almost to the point of being left behind.
00:09:03
Speaker
Well, I mean, there's definitely a spectrum. There's a kind of bell curve. And part of my objective over the last year or so has been to ensure that I shift the mean in our organization towards adoption, because the writing's on the wall as far as I'm concerned. This is the new way that software is going to be built, and the era of humans writing code for most projects is over, so you've got to get on board.
00:09:36
Speaker
Of course, there are contrarians who are going to point out all the things that AI can't do. And the story's been the same since the beginning, really, where you point out some ceiling of AI capabilities, and for the next model that's the floor of their capabilities, and so on. So those people are a vastly diminishing group.
00:10:06
Speaker
But there are some areas where they are right.

Limitations and Opportunities of AI

00:10:11
Speaker
And it's to do with clear specification of intent. And this is really one of the skills that, as engineers, we...
00:10:20
Speaker
It's one of the core capabilities that we really always had to have: understanding what you're trying to achieve and making sure you're communicating it clearly. I think it's even more important now with agents doing a lot of the implementation. But contrarians might point out that contemporary models, very smart as they are, are still kind of constrained by the fact that they are
00:10:50
Speaker
autoregressive, one-token-at-a-time kind of engines that aren't doing real reasoning. They are right to a point, and it's worth bearing that in mind, because it does inform why they do some really wacky things sometimes. It's important not to be completely blindsided by that kind of behavior, where you've got a machine that is, for the most part, highly capable and writing, to be honest, better code than I ever did. But there are going to be occasions where
00:11:28
Speaker
something really strange happens. And so those contrarian voices have something. I just don't think it's that interesting to dwell on right now as the primary takeaway.
00:11:42
Speaker
But from a security and reliability point of view, I think we've got to remember we can't be naive and too breathlessly enthusiastic. It's important that we remain engaged in the process and able to validate that what we're getting out of these agents is indeed what we want. And if we're taking that code through to production, we remain accountable for what happens.
00:12:12
Speaker
Yeah, I think this is the difference between vibe coding, where you almost wantonly take your hands off the wheel, and what we as professional programmers should be doing, which is more, I tend to use the phrase, delegated coding, where we're still expecting to check the final architecture and output. Maybe, maybe not, I mean, but what do you think? Maybe not reviewing individual lines of code anymore, but certainly reviewing processes and architecture.
00:12:42
Speaker
Yes. I mean, if we're quite honest with ourselves, were we always reviewing every single line of code that went through our PR review cycle?
00:12:52
Speaker
Maybe not. What you developed was a sense of when you needed to lean in and pay close attention, because code was maybe touching something
00:13:04
Speaker
very sensitive. Maybe there was some security angle, or maybe, you know, we do a lot of work for banking and capital markets clients, so there's a regulatory angle. You want to pay close attention in those kinds of situations, but you develop an intuition for when the blast radius of a potential issue is small, maybe the impact's going to be minor, and you might not be stepping through line by line.
00:13:31
Speaker
You mentioned vibe coding, and it's not a term I particularly like, because it implies a sort of recklessness which I think is unwarranted. I think it's possible still to introduce a larger distance, this is the metaphor that I like to use, between the code and yourself.
00:13:54
Speaker
And as long as you are exercising some good judgment about when it's okay to have a large distance between the code and your original prompt, you are not vibe coding. You're simply making good use of your finite attention.
00:14:15
Speaker
And as long as, when something is critical, you've got small code distance, then I think you are remaining responsible. Yeah, yeah. There's an analog there, I think, for unit tests.
00:14:30
Speaker
I've never been a person that aims for 100% coverage. I aim for 100% coverage on the stuff that really matters, the critical stuff. And maybe it's something like that. The distance between my tests and my code sort of matches the distance between me and the code written for me.
00:14:49
Speaker
Yes, I think that's a great analogy. No check is ever perfect. Our code contains bugs. We might have static analysis that will catch some things, but not all. Then you've got unit tests, and you've got integration tests, which will test the boundaries that the unit tests missed.
00:15:07
Speaker
By stacking up these layers, there's a sort of Swiss cheese metaphor, where what you want to ensure is that your holes don't line up and a bug can't make its way all the way through to production.
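(To make the Swiss cheese point concrete, a toy calculation: assuming an invented 10% escape rate per check and roughly independent checks, stacking imperfect gates shrinks the chance of a bug getting all the way through multiplicatively.)

```python
# Toy illustration of the Swiss cheese metaphor: each imperfect, roughly
# independent check lets some bugs through, but stacking them multiplies
# the escape probabilities. The 10% per-gate figure is invented.
escape_per_gate = 0.10

for gates in range(1, 5):
    escape = escape_per_gate ** gates
    print(f"{gates} gate(s) -> {escape:.4%} of bugs escape")
# 1 gate -> 10%, 2 -> 1%, 3 -> 0.1%, 4 -> 0.01%
```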
00:15:25
Speaker
Yeah, and the more imperfect, probabilistic gates that you have, the less likely that is to happen. Yeah, I like that image. I've not heard that one before. Let me ask you this then. I mean, I'm not a static versus dynamic typing zealot.
00:15:47
Speaker
But I do tend to lean towards statically typed languages, all things being equal, which they never are, but it's a preference. And I think that leaning has become more pronounced in the age of these machines that love automatic verification, that love the instant feedback.
00:16:08
Speaker
Are you feeling any sort of lean towards certain

Programming Languages for AI

00:16:11
Speaker
languages for the age of AI coding? Do you think there are certain ones that fit really well?
00:16:19
Speaker
It's a really interesting area at the moment. And obviously it's true to say that Python and TypeScript are becoming the languages of AI development. They're the default: if you're running your LLMs somewhat on default mode, those are the languages that you're going to tend to get. And so we're just going to see this virtuous circle. If you don't care, the LLMs will tend to recommend those ones, right? Yes, that's right. Self-reinforcing. It's not the only consideration, of course. Yeah.
00:16:57
Speaker
Both you and I have a background in Clojure, for example, and with a Clojure MCP, with something that links an LLM up to a REPL where it can evaluate the code that it's generating in real time, Clojure is a very terse language and therefore it's very token efficient.
00:17:19
Speaker
And so there's an argument to say that just because Python and TypeScript are the default languages of LLM development doesn't necessarily mean that they've won and those are the only languages that we should be using.
00:17:34
Speaker
Yeah, I definitely think there was a very brief phase where LLMs performed significantly better on the popular languages. I don't think it lasted that long.
00:17:45
Speaker
I mean, I've been using a bit of Gleam recently, which is a very obscure language in the grand scheme of things. But I wrote a skill for it. I got it to download the tutorials and create some notes from that. And it writes Gleam as well as it writes TypeScript, as far as I can tell.
00:18:01
Speaker
Well, one of the things I'd like to talk about today is the behavioral specification language that we've built at JUXT. And that's a brand new language that LLMs have never seen before, one that's described entirely by a skill.
00:18:17
Speaker
It's a skill that describes the entire rules of the syntax of this language. And it works really well. It benefits from having a command-line checker which can symbolically check that the syntax is well-formed. It's not perfect, but it's certainly good enough.
00:18:39
Speaker
So it's phenomenal. The current wave of LLMs, particularly Opus 4.6, are really strong. And so I think any argument about what's well represented in the training data, those kinds of arguments haven't aged so well.
00:18:59
Speaker
Yeah. It's funny that you can have an argument in the AI world that probably has a two-month shelf life. And we are going to get into your specification language. We've got so much to get through.
00:19:11
Speaker
Speaking of shelf lives though, you, as part of JUXT, have released a kind of AI radar, what's hot and what's not in AI, which gives us a whole bevy of techniques and tools to talk about.
00:19:26
Speaker
But the first thing we have to say is, to release a radar, you have to put a footprint in the sand and say, these things are important right now, those things aren't. And almost as soon as you publish that PDF, it's aging very fast. How do you give any advice that's going to last long enough to justify the effort of creating a PDF and publishing it?
00:19:50
Speaker
Well, it's a quarterly radar for a start. I think when we first published it, there were some voices saying, should we make this annual? And, I mean, the half-life of the advice in it is probably six months or so.
00:20:07
Speaker
Very hard to say, but yeah, it's an ambitious task. And thankfully, we've got AI to assist with research and distilling our thoughts and the things that we're observing on client sites. We're a relatively small team producing it, and in a pre-AI world it probably wouldn't have been feasible to do so. But it really does exist. It's on our website, but also, look, an actual printed copy. And a real printed copy, yeah. An actual dead tree.
00:20:37
Speaker
Brilliant. The latest version is 60 pages. I think we've got almost 70 blips. Last quarter we added, I think, about 15 or so. As I say, the pace of change is dizzying. Yeah, and keeping up with what's worth paying attention to, and what will still be worth paying attention to three months from now, is really hard, which is why I was drawn to this. It's like, well, at least someone's trying to pin a few things down. We're trying. Yes, that's right. We enter into it in the spirit of sharing what we currently think to be true,
00:21:13
Speaker
but accepting that as the industry evolves, inevitably new techniques will come in and displace the old, and we'll simply report those things as we observe them. And hope there's some use in that.
00:21:28
Speaker
Okay, let me pin you down on one specific one, which already I think is aging pretty quickly. I heard someone saying yesterday that MCP is dead and skills are the new hotness, essentially. And I don't think that's what you've got on your radar as it was published a few weeks ago.
00:21:51
Speaker
End of January, so two months already. What's your take on MCPs? Is MCP dead? Are skills the new hotness?
00:22:02
Speaker
When would you use which one? So, MCPs, just for those who might not know: you can think of them as a bit like a USB interface between an LLM and the outside world. They're a kind of generic connector that allows you to plug into all kinds of systems, Confluence and JIRA and all that kind of stuff, but also the web, databases, you name it. If it's got an API,
00:22:33
Speaker
you could write an MCP to interact with that API. A skill, which is a more recent invention, I think it's six months or so ago that skills became a thing. It's really just
00:22:46
Speaker
a markdown file or a set of markdown files describing an approach that you want the LLM to take. It's condensing some kind of knowledge or process or point of view.
00:22:59
Speaker
And it's like, I suppose, an extended prompt that you can just pull into your context on demand. And it's true, I mean, you could have a skill that communicates how to use a much more generic interface, like the command line or HTTP, to use an API, rather than having a bespoke interface,
00:23:27
Speaker
an MCP specifically for that API. So there's definitely a sense in which skills can displace some of what MCPs were used for.
00:23:38
Speaker
But there is one thing that the MCP server provides that skills don't, which is that they're an external process. Often they're written in something like Node, so they can hold some persistent state.
00:23:51
Speaker
If you're talking to a database, they can hold that database connection. Whereas if you're using a skill, your context window has to contain the skill itself.
00:24:09
Speaker
Then of course your multiple turns, what you say, what the LLM says back to you, this all fills up context. MCPs can use that context more effectively. You don't have to pollute your global context with the traffic going on with the MCP. So it's a judgment call, as with all these things, where it might be that you benefit from a generic MCP and a bunch of specific skills, or not.
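(As a rough sketch of that stateful-MCP point: a minimal server that holds a database connection in its own process, so the model's context only carries the tool call and its result. This assumes the MCP Python SDK's FastMCP helper; the orders database and the single query tool are invented for the example.)

```python
# Sketch of an MCP server holding persistent state between tool calls.
# Assumes the official MCP Python SDK (pip install mcp); the database file
# and the "query" tool are hypothetical.
import sqlite3

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders-db")

# The connection lives in the server process, not in the model's context.
conn = sqlite3.connect("orders.db", check_same_thread=False)

@mcp.tool()
def query(sql: str) -> list:
    """Run a read-only SQL query against the orders database."""
    return conn.execute(sql).fetchall()

if __name__ == "__main__":
    mcp.run()
```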
00:24:41
Speaker
Your mileage may vary. We're still figuring out what the sweet spots are, possibly. Exactly. As with all these things, what level of abstraction is the right one? Are we going to have a few very generic MCPs? People got very enthusiastic a year or so ago when MCPs came in, and suddenly there were tens of thousands of them. I think that initial enthusiasm has ebbed slightly, and we're realizing that maybe one MCP for every single API in the world is probably not necessary.
00:25:18
Speaker
Yeah, yeah. There is one I discovered last week, which I like and which gave me a thought about this, which is the JIRA MCP server.
00:25:31
Speaker
Firstly, great, because, I'm sorry if anyone loves JIRA, but I was glad to realize I wouldn't have to touch JIRA in order to use it. But also, I realized that with the MCP, I will never have to be responsible for maintaining a JIRA skill.
00:25:46
Speaker
Because the burden of updates is on the MCP provider. And that seemed quite useful. That's true, isn't it? You don't need to keep your skill and command-line tool in sync.
00:26:01
Speaker
There's an inverse to that. I'm sure I've seen you, Kris, say things like, I had this idea, but I'm not going to turn it into a project, because now you've got the idea and you can make it happen for yourself.
00:26:14
Speaker
Yeah, I wrote an MCP server for Kafka and it took me 20 minutes, and I realized, well, what's the point of maintaining something that takes 20 minutes to reproduce? And if it takes five minutes to update, maybe that ongoing maintenance burden is not so much of a concern.
00:26:33
Speaker
Do you find with client work that they're building in-house more because it's easier? Well, that's definitely one of the things that clients are thinking about right now: if we're seeing that a single engineer or a group of three engineers can make a whole SaaS tool in a week, why do we need to pay vendors for several months to achieve the same outcome? Of course,
00:27:07
Speaker
there's a lot of noise and turbulence in the industry right now as people are trying to figure out how this is going to work. It's one of the reasons that at JUXT we are really keen to be embracing these tools and using them to their absolute best effect. We're operating as much as we can on the frontier, so that our pitch can be: it's true, AI is a massive accelerator.
00:27:37
Speaker
But that works in both directions. We know how to use these tools to achieve the goals that we need, and to exercise good judgment about making sure that we are not vibe coding into production. We're building high-quality code, but very fast, using everything that we know about how to constrain the behavior of LLMs productively.
00:28:03
Speaker
Your internal teams might be exploring these things for the first time. And there's a lot to learn. And an accelerator doesn't steer.
00:28:14
Speaker
It could accelerate you into a wall. And so I think what we will start to see over the next six months to a year or so is a bunch of organizations realizing there's actually quite a steep learning curve associated with using these things well. And so I'm betting that there will continue to be a place for consultancies like JUXT.
00:28:39
Speaker
Right. There are a lot of headlines about companies adopting AI, and about the speed and enthusiasm coming from management. But you give me the impression that the companies you're talking to are more in an assessing mindset
00:28:59
Speaker
than diving straight in. Is that fair? Is that because you tend to work with banks or insurance-type companies? It's interesting. So banks and capital markets organizations, those in financial services, are obviously constrained a great deal by compliance. And so they are risk-averse. They are also massive development shops with tens of thousands of engineers.
00:29:30
Speaker
And so the risks are clear and present, but the rewards are massive too. And so we're actually seeing quite substantial uptake. The reason we're seeing it, of course, is that we're involved in these transformations, and we're building the AI platforms that tier-one banks are using to exercise safe interactions with LLMs. For example, checking that they're not inadvertently leaking PII, defending as best we can against things like prompt

AI in Financial Services

00:30:08
Speaker
injection.
00:30:08
Speaker
So I think the assessing comes in figuring out how internal teams can start to leverage AI in a safe way, and how to navigate the costs. If you've got tens of thousands of engineers, for lots of reasons, you don't suddenly want to give them all access to the latest and greatest models. So figuring out that kind of thing is, I think, what's being assessed right now. But there is definitely a lot of ambition for realizing the benefits of AI. And luckily, we're in a position where we're helping to build the platforms that make that happen.
00:30:52
Speaker
Yeah. Okay. Then I wanted to get onto platforms, so let's go there. Because...
00:31:00
Speaker
There are things like Claude, Codex, Cursor. Cortex is one that the company I'm working for has recently released, right? There are these individual programmer accelerator tools. And then there are these larger, more coordinating pieces. I'm thinking of LangChain, but I'm also thinking of Gastown. Give me your read on the larger orchestration picture.
00:31:30
Speaker
Okay, so some of the trade-offs to think through are the amount of runtime flexibility you're going to give to your agents. I'm thinking here, if you've got some kind of production system you've stood up, do you want to have provided tramlines for everything, to have thought through exactly what the workflow needs to be? Something like LangChain or LangGraph will specify the full workflow very precisely. And that is time-consuming, but of course carries the enormous benefit that
00:32:11
Speaker
you've got a lot more confidence that you know what's going to be happening when you hit go, when you submit some kind of request into the public API.
00:32:22
Speaker
Something like Gastown, all bets are off. The process is expressed through this kind of mad management hierarchy. And Steve Yegge's great fun. He's an excellent writer. His blog posts are a good read. And he's invented this slightly Heath Robinson contraption, and I think he would agree.
00:32:54
Speaker
I feel like I have to say, for the American market, Heath Robinson is our version of Rube Goldberg. It's like an elaborate, overly complex machine. Yes, absolutely.
00:33:07
Speaker
You know, he's very open about the fact he's vibe coding this thing, and he's reaching into popular culture for the names of his various components.
00:33:19
Speaker
And I'm not sure that Gastown is the solution to what this kind of open-ended orchestration framework is going to be. For those that haven't dived into it, can you break down what it actually is and then what you'd do differently? Ah, an easy question. Well, what it is, it's a multi-layer agent orchestration framework, which is aiming to give you, as the user, access to a small army of workers.
00:34:01
Speaker
And your entry point is to talk to something he calls the mayor. You talk to the mayor about what it is that you're trying to achieve. And then the mayor is going to delegate to a whole bunch of sub-agents, who in turn are going to delegate to sub-agents themselves.
00:34:20
Speaker
And because agents are kind of trained to be, at least currently, helpful assistants, they often sit there waiting for user input.
00:34:34
Speaker
They'll sort of ask a question and then stop. And so a lot of the effort in Gastown goes into prodding these helpful assistants and trying to turn them into huskies pulling your sled along, so they're just going to be focused on a goal and keep iterating to get there. I think there's a deacon, which is one of the agents, which is pinging dogs and polecats, and I get a bit lost in the details, to be quite honest with you. But the objective fundamentally is just to keep the machine running. And it's a slightly unwieldy beast, because you're fighting against these LLMs' innate preference to do some work and then come back to you.
00:35:28
Speaker
What would I do differently? Oh boy. I mean, the way that I prefer to work... I think multiple agents running in parallel is enormously powerful.
00:35:43
Speaker
I am a little bit uneasy about simply submitting short requests to a mayor and then trusting the code to be written, tested, merged, and delivered without any involvement from me. I'm not quite there yet, personally. What I prefer to do is spend some time working on what a solution should look like, which oftentimes involves me developing my own understanding and helping to reason through what my requirements actually are. I don't necessarily know them
00:36:24
Speaker
ahead of time, and I use AI to scaffold my own understanding through a kind of Socratic process, coming up with something that's a well-defined specification. Not for the whole system, because I don't want us to return to waterfall, where we're trying to design everything up front, but a specification that is a clear description of my intent for some component.
00:36:52
Speaker
And then we can fan out to multiple agents in parallel. And there, Claude Code's agent teams, which is only about a month or six weeks old at this point, I think it's still in preview, is really useful. And I find that a very robust way of
00:37:10
Speaker
handling these long-horizon pieces of work where you might have quite a substantial specification. It might take hours for agents to correctly implement and test everything that you've asked for, but you can probably decompose that problem into a few parallel units of work.
00:37:27
Speaker
You've got multiple agents to do that for you. That's my preferred way of working. It's much more involved in the process. Then I've got a choice about, do I go away and do something else for a bit, or do I stay looking over their shoulder, perhaps while doing something else on my machine?
00:37:48
Speaker
But keeping an eye on what's going on. At the moment, I like to do that where I can, because it allows you to see whether, in spite of your best intentions, there was a misunderstanding or misapprehension, and you're starting to see files being touched that you didn't anticipate, and so on. And then you can jump in and course-correct.
00:38:10
Speaker
Yeah, yeah. I don't know if I will agree with myself six months from now, but at the moment, I find I want to be fairly hands-on with architecture.
00:38:24
Speaker
I'm really quite hands-off with code. And do you think there's a sweet spot for how involved you get? And do you think it's going to stay that way, or has it got like two months before the sweet spot moves again?
00:38:40
Speaker
Well, so we talked about code distance earlier. Design distance, I guess, is another spectrum. And I'm with you. I prefer to have very small design distance. I want to know how the system is constructed.
00:38:58
Speaker
And I think that's still the place for good engineering judgment: good architecture, good design. We've seen a bunch of examples recently, for example the Cursor fast-render experiment, where they had thousands of agents working for a week on building a browser by reverse engineering the specification. I didn't see that one. Well, it was a complete shambles. There was no central organizing force. You had a bunch of,
00:39:34
Speaker
as I say, thousands of agents off doing their own thing with, as far as I can see, almost no coordination. And the results are somewhat predictable: there were multiple independent implementations of certain key components.

Design and Coordination in AI Systems

00:39:49
Speaker
And I think only some small proportion of the builds actually compiled.
00:39:58
Speaker
And I think that's a sort of illustration of the risks where you don't have a clear design. Velocity doesn't really mean anything if what you're building is this sort of fractured and entangled mess. And there's a kind of Conway's Law analogy. We've long known with human teams how they tend to build software in their own image.
00:40:29
Speaker
And if you've got one team doing something and one team doing something else, that's going to end up being reflected in the architecture of whatever system they collaborate on. Their team boundaries will become service boundaries and API boundaries.
00:40:44
Speaker
It's one of those things about how design works in groups. And LLMs are kind of subject to the same thing. If you don't constrain them with some kind of overarching design, they are polysemantic, fractured and entangled machines that tend to produce code along those same lines.
00:41:03
Speaker
Yeah, yeah, I can see that actually being much worse than the classic Conway's Law examples. Because what was it you said? Something like, if you have two teams, you'll end up with a two-pass compiler.
00:41:16
Speaker
But with LLMs, you could easily end up with a thousand agents, which is a thousand teams if you do it wrong. And then... Exactly. Yes. Okay, so this makes me think of pinning these things down a bit more. And there are various techniques for this. I'm going to get onto your big one, I promise. But there's a particularly spicy term in your radar that relates to this, which is neurosymbolic AI. What on earth is that?

Neurosymbolic AI Exploration

00:41:47
Speaker
It's a fun one, isn't it? Yeah.
00:41:49
Speaker
so
00:41:53
Speaker
Although contemporary frontier LLMs are extremely powerful, and they're called reasoning models, as we've said, it's kind of pseudo-reasoning, but boy does it get you a long way.
00:42:09
Speaker
The researchers analysing these models show that, well, for example, there was the Apple paper from, I don't know, six or nine months ago or something, that was getting various different kinds of LLMs to solve problems like the Tower of Hanoi type problem. It showed that there was a sweet spot for reasoning models, which was sort of moderate complexity.
00:42:33
Speaker
But in a kind of mad, counterintuitive way, reasoning models actually performed worse than the non-reasoning models on really simple problems, because they kind of overthink themselves into a bad solution. So there's no silver bullet. And when people like Anthropic look at what's going on inside these models, it's really interesting research.
00:42:57
Speaker
It's the kind of research that shows that they're polysemantic, that neurons represent multiple concepts, and you've got to look at clusters of neurons in the LLM in order to understand what the LLM is thinking. And, what was it, about 18 months or so ago, Anthropic built Golden Gate Claude, where they kind of proved that they'd identified the cluster of neurons that represented the Golden Gate Bridge, and they were able to dial those up
00:43:26
Speaker
so that for basically every question you asked of Claude, it would find a way to reference the Golden Gate Bridge. So if you asked it how you should spend $20, it would advise paying the Golden Gate Bridge toll, and that kind of thing. So we can interact with these models. But they're not doing symbolic reasoning. They're doing kind of vibes-based pattern matching. They're kind of metaphor-based.
00:43:53
Speaker
They're looking at things that are like something else. We might be too, though. That's not necessarily a criticism. Well, we've got two modes, haven't we? We've got the kind of System 1 and System 2 type thinking. And the analogy is that the LLMs are this kind of fast, intuitive
00:44:11
Speaker
kind of way of thinking. Even when you're forcing them to reason, and it's really cool that it works, there's something that researchers call motivated reasoning, where if you ask Claude, or indeed any reasoning model, a maths question, but you imply that the answer is something that it's not,
00:44:35
Speaker
you're often going to get the model finding a way to make you right. It'll say, oh yes, the answer is that, and this is how you get to it, and it'll construct some kind of plausible chain of thought to get there. But it's just not true. And
00:44:51
Speaker
this is a problem that LLMs have, and it's a fundamental architectural limitation. But you can pair them with some kind of symbolic reasoning, which is doing proper consequence analysis. Maybe it's something like Prolog that's solving a well-formed problem and providing you with exactly the steps to get there. You know that it's correct.
00:45:18
Speaker
Take me through how that would work, because the return of Prolog to the conversation is not something I could have possibly predicted. Well, isn't life funny? And it doesn't have to be Prolog, of course. It could be giving access to a calculator, for example. Give the LLM a tool that does execute symbolic reasoning in a way that LLMs will try, but potentially fail, to do. And then the combination of the two can give you much more confidence that you're going to get good results. It opens up the possibility for applications in places like regulated environments, where you need to show exactly the steps that were taken. But perhaps you want to have some kind of fuzziness to your inputs; maybe plugging Prolog directly into whatever upstream systems you have isn't viable. But you can use the LLMs for what they're good for there, which is translation,
00:46:16
Speaker
and then ensure that the steps that need to be correct are indeed correct. I'm trying to picture this. Are you saying that we're going to write some kind of Prolog model of the system and then get the LLM to say, here are my facts, are my conclusions correct?
00:46:39
Speaker
Well, it might not even be attempting to come up with its own conclusions. In creating a hybrid neurosymbolic system, the idea is, and of course LLMs can write code, so this is another thing, it doesn't necessarily need to be that you've written your system in Prolog. And again, it doesn't have to be Prolog. I just think it's really funny that this language from the 70s is suddenly finding a potential application. Again, still very early days.
00:47:13
Speaker
But it could write Python, for example. And so you can have your LLM writing some code that can then provide an answer, with a kind of audit trail for how it got there, and then take that answer and run with it.
00:47:33
Speaker
So yeah, it's just a sort of hybrid, with each part providing what it's good at. Maybe I'm getting confused. Are we saying it will write programs to assist it in the act of writing programs? Or is it more that when you ask it a non-programming question,
00:47:52
Speaker
it might lean on a programming language to do some of the rigorous parts? Exactly that, yes. That's right. So it's tempting to try and use an LLM as the solution to every problem, and bless them, they will try. It's rare, but you start to see it these days. And I guess there's a lot of work going on in the post-training to guide LLMs to be transparent about when they can't answer a problem. I've only had it a handful of times, but I might get a response that says, ah, actually, I can't help with that.
00:48:31
Speaker
Which is useful. It's better than the alternative, where you get some kind of motivated reasoning or hallucination or confabulation and don't realize it. But better still would be if the LLM goes, oh, well, I can't do that, because I am a semantic next-token-predictor machine. But what I can do is write a little helper function that's going to provide the right answer for you, so I'll go away and do that.
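(A toy sketch of that division of labour, with an invented settlement_fee helper standing in for whatever deterministic rule the model delegates to: the symbolic part returns both the answer and an audit trail, rather than relying on the model's own arithmetic.)

```python
# Toy illustration of the neurosymbolic split: the LLM translates a fuzzy
# request into structured arguments, and a deterministic helper does the part
# that must be exactly right. The names and the fee rules are invented.

def settlement_fee(notional: float, is_cross_border: bool) -> tuple[float, list[str]]:
    """Apply explicit fee rules and return the result plus an audit trail."""
    steps = []
    fee = notional * 0.001                      # base rate: 10 basis points
    steps.append(f"base fee = {notional} * 0.001 = {fee}")
    if is_cross_border:
        fee += 25.0                             # flat cross-border surcharge
        steps.append("cross-border surcharge: +25.0")
    fee = min(fee, 500.0)                       # regulatory cap
    steps.append(f"capped at 500.0 -> {fee}")
    return fee, steps

# The model's job is only the fuzzy translation ("a 2 million cross-border
# trade") into these arguments; every step in the answer is checkable.
fee, audit = settlement_fee(2_000_000, is_cross_border=True)
print(fee)
print("\n".join(audit))
```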
00:48:59
Speaker
Right. Is there a particular field this is being applied to? Or is this all just experimental? It's still really early. It's a sort of emerging architecture. But it's a very general principle. So although I can't think of a particular field where this is becoming prominent,
00:49:21
Speaker
anywhere you've got this mix of requirements and you need some kind of auditability, you want to bring some kind of symbolic reasoning into play.
00:49:34
Speaker
It's another tool in the toolbox, and it'll be really interesting over the next year to see whether this takes off, or whether we find ever more ingenious ways to juice the ability of LLMs to the extent that people get away without it. Yeah, I have been feeling recently that even if the sophistication of LLMs themselves plateaus right now, there's still an enormous amount to be discovered in the way we use them and the way we orchestrate them with other tools.
00:50:11
Speaker
This is actually one reason why I invited you particularly to talk about this topic, because as well as using off-the-shelf LLMs, as many of us are, you've done lots of training of your own models in the past.
00:50:31
Speaker
And I want to get your read on: to what degree is it now worth training a specialized model from scratch, or taking an existing model and tweaking it, like post-training new weights? To what degree do we distill large models into small models?
00:50:52
Speaker
Is there space for all these techniques? Because it seems like, apart from the cost of tokens, you might as well just throw a large model at everything and be done with it. Mm-hmm. There's a lot of truth to that. Trying to compete with the speed of development of the frontier model providers is almost certainly a losing game.
00:51:20
Speaker
So you need to have quite a high threshold, I think, now before you decide to compete and do your own model.
00:51:32
Speaker
But as you say, this has been something that I've done in the past. And where it is still useful, I think, is if you've got a lot of... So one of the use cases, for example, is classification.
00:51:53
Speaker
In a prior role, I worked in RegTech, and what we needed to do was classify regulatory text.
00:52:04
Speaker
And there's a rich and storied history of text classification, and we reached for some of the really good classical ML techniques early on, things like support vector machines, for classifying regulatory text.
00:52:27
Speaker
This was around the time that LLMs were coming to the fore. And so we just decided to see how well an LLM might perform.
00:52:39
Speaker
And it outperformed the classifier, as long as we fine-tuned the LLM on our training data and provided it with the hundreds of thousands of training examples that we'd collected over years.
00:52:57
Speaker
That gave it a kind of intuitive understanding of how legal text related to the classes that we wanted to apply. And so that fine-tuning provided this sort of,
00:53:13
Speaker
intuition is probably the best word, actually. It meant that we didn't have to have massively long prompts that described exactly how you should classify each piece of legal text. It had learned by example.
00:53:28
Speaker
Right. If we had tried to put all of that into a prompt, we'd start to hit the kind of lost-in-the-middle challenges that happen when you've got very long prompts, where LLMs tend to be much more sensitive to what's at the beginning or end of the prompt and forget what's in between.
00:53:46
Speaker
Or we'd hit context window limitations and that kind of thing. So if you've got this kind of repository of your own information, valuable business knowledge, that you want to give the LLM a kind of intuitive knowledge of, fine-tuning still has a place.
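(For a sense of what learning by example looks like in practice, a minimal sketch of supervised fine-tuning data for a classifier; the chat-style JSONL layout is one common format, details vary by provider, and the clauses and labels are invented.)

```python
# Sketch of supervised fine-tuning data for a text classifier: instead of a
# huge prompt full of classification rules, the model sees many labelled
# examples. The JSONL chat format below is one common layout (details vary by
# provider); the regulatory clauses and labels are made up.
import json

examples = [
    {"text": "Firms must report suspicious transactions within 24 hours.",
     "label": "transaction-reporting"},
    {"text": "Client assets must be segregated from the firm's own accounts.",
     "label": "client-money"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": "Classify the regulatory clause."},
            {"role": "user", "content": ex["text"]},
            {"role": "assistant", "content": ex["label"]},
        ]}
        f.write(json.dumps(record) + "\n")
```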
00:54:05
Speaker
It's more common, though, that you plug them into some kind of RAG pipeline. And if what you're actually doing is information retrieval, rather than classification or some other kind of machine-learning-y thing, if what you're asking for is, fetch me the relevant information,
00:54:19
Speaker
you don't really have to fine-tune a model. You vectorize your data, you provide a means of looking up, based on a query, the potentially relevant data, and then you can provide that to an LLM.
00:54:33
Speaker
And as long as you're not finding hundreds or thousands of responses, you're not going to exceed the context window of the LLM, which can sift through that and come up with a good response for you.
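(A bare-bones sketch of that retrieve-then-generate shape; embed() and ask_llm() are placeholders for whatever embedding model and chat API you actually use, and the in-memory cosine-similarity lookup stands in for a vector database.)

```python
# Bare-bones RAG sketch: vectorize documents, retrieve the closest few for a
# query, and hand them to the model as context. embed() and ask_llm() are
# placeholders for your embedding model and chat completion call.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embedding model here")

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your chat model here")

docs = ["Refunds are processed within 14 days.",
        "Premium accounts include priority support."]
doc_vecs = [embed(d) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity against every stored vector; fine at small scale,
    # replaced by a vector database once the corpus grows.
    sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    best = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in best]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```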
00:54:45
Speaker
So, yeah, modulo cost, you really are leaning towards just frontier models with the right techniques and orchestration around them.
00:54:56
Speaker
Yes, for most people. There's a maintenance burden. As I say, trying to keep up with frontier models is probably a losing game right now. But also, you need a bunch of embedded expertise in your team in order to do things like fine-tuning well.
00:55:17
Speaker
You need to be hosting your own models. So that carries a cost and an overhead that's not insignificant. Fair enough.
00:55:29
Speaker
That reminds me of something I was going to ask you. Hosting your own models. As a company, that's a big ask, right? You probably want plenty of servers with 300 gigabytes of video RAM or whatever.
00:55:43
Speaker
But there's also hosting your own locally. What's your experience here? Because my experience is, it's impressive how much LLM I can run on my 32 gig MacBook, but it's still kind of brain-dead compared to what I would get if I go to the cloud.
00:56:03
Speaker
Is that your current experience? And do you think that's likely to change anytime soon?
00:56:10
Speaker
Well, that is indeed probably a billion-dollar question. Yeah, many large companies' fortunes are riding on it being a no, you'll always need the cloud. Yeah, exactly.
00:56:21
Speaker
And my hunch is that the opportunity of being able to run something locally on commodity hardware is too great for us not to find a workable solution in the next year or two.
00:56:37
Speaker
And yet, at the moment, you're right, we're kind of spoiled when we can go to the cloud and get this really emotionally intelligent AI coming back to us with questions that we didn't even think to ask.
00:56:59
Speaker
And then you go to a local version and it's much more pedestrian and slower. And so that introduces friction, and it means that most of us aren't doing that right now.
00:57:14
Speaker
But just imagine if we find a way to make these models much more efficient and provide, at least for certain domains, enough capability that you don't feel that kind of friction anymore.
00:57:33
Speaker
Of course, that's what we would want to do. Our data stays local. We don't have to worry about inadvertently leaking it. Costs are much more predictable.
00:57:44
Speaker
At the moment, if you really want to do it, you do need a lot of RAM, more than 32 gigs. I mean, even more than 128 gigs, really, which is what I've got. Ideally, 500 gigs would be great. You could run Kimi locally, and you'd get a pretty comparable experience to running Claude Opus. But yeah, that's going to be outside of most people's budgets right now.
00:58:14
Speaker
Yeah. And sourcing availability. Indeed, it's hard to get hold of memory at the moment. But anyway, there is something I'm following with very close interest right now, and I'm sure it will make its way into the next radar: I think it's a Finnish company called Taalas, which has built a chip which has the weights of, I think it's one of Meta's models, one of the Llama models, encoded in it.
00:58:48
Speaker
So it's a sort of LLM as a piece of hardware. And reportedly, I think it's 17,000 tokens a second, so it's basically instant. Wow. And current numbers are something like what, 20 to 50? Yes, it depends. So it's not inconceivable that, for those with enough budget, you could have Claude on a dongle before too long. So maybe that's another solution.
00:59:21
Speaker
Yeah, they're going to really struggle to keep them up to date. But again, maybe that won't matter. Maybe if the LLM power plateaus and it becomes about orchestration, then you find that really what you want is enough USB ports. Plug in all your Claudes.
00:59:38
Speaker
I've only got three, and one of them is being used to attach my camera, so that's no good. Okay. Let's go back to production then and stop speculating. The other topic I really wanted to talk about, which is very much at the coalface, is LLM observability. Mm-hmm.
00:59:59
Speaker
We know it goes well, we know it goes badly, and I know you've put tools for keeping an eye on them on the radar. Is this like monitoring pipelines into production, or auditing whether your developers are using it right? What's the state of the art there?
01:00:18
Speaker
Yes. I mean, it used to be that we were much more concerned about training our models and evaluating them during development, and then you'd push them to production and they just kind of worked. If you were very diligent, you might be looking for data drift and things like that over time, where you'd be checking that the distributions that you trained on were still correct in production.
01:00:44
Speaker
Now, of course, if you've got agent orchestration going on, you need to be paying very close attention to what your system is doing. You need to be keeping an eye on the tool use that agents are doing, how they're talking to each other, whether they're stalling or crashing.
01:01:06
Speaker
And so, yeah, there's a whole emerging set of tools which will provide you that kind of observability.
01:01:17
Speaker
And I mean, Phoenix is quite a good one. It's very easy to get started with. You get to see traces, to see what's going on, and it provides nice metrics and graphs and so on, so you can start to see if something is not going how you anticipated. And it's kind of paired with workflow tools as well, like Temporal,

Ensuring Stability in AI Workflows

01:01:45
Speaker
for example.
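Before getting to Temporal: for a concrete sense of how easy Phoenix is to get started with, here's a minimal sketch using Arize Phoenix's Python packages. The calls below are based on the current arize-phoenix and OpenInference releases and these APIs move quickly, so treat the details as approximate and check the Phoenix docs; the project name is just a placeholder.

```python
# A rough sketch of local LLM observability with Arize Phoenix.
# The exact APIs may differ between versions; see the Phoenix docs.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch the local Phoenix UI (traces, latency, token counts, graphs).
session = px.launch_app()

# Route OpenTelemetry traces from this process into Phoenix.
tracer_provider = register(project_name="agent-orchestration")

# Auto-instrument the LLM client so every call shows up as a trace span.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, run your agents as normal and watch tool calls, stalls,
# and crashes appear in the Phoenix UI.
print(f"Phoenix running at {session.url}")
```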
01:01:48
Speaker
Because if you're talking to the SaaS model providers at scale, you're going to start to become really aware of some of the stability issues that they are struggling with as the world tries to use their API.
01:02:10
Speaker
And so you need to get really robust around when the service is unavailable: back off and retry. If your agent crashes in the middle of a long multi-step workflow, you don't want to start back at the beginning.
01:02:27
Speaker
You want to pick up where you left off. Temporal is in our radar as well for that reason. It provides the ability to save your game in your workflow.
01:02:38
Speaker
How does that work? What is Temporal in detail? So if you think about Kubernetes, which is orchestrating machines, Temporal is orchestrating your agent workflows. It's providing a way of keeping durable state, so that if your machines crash, if your agents crash, things will always be able to pick up where they left off. And some of these things are very long-running.
01:03:07
Speaker
And so it might be that human approval steps are part of your workflow, in which case perhaps there are days where something is sitting in a particular state, waiting for some new thing to happen.
01:03:21
Speaker
Temporal provides a framework for specifying that behavior in your own programming language; I think there's Python, Java, TypeScript, and so on. So it's a nice way, with quite good ergonomics, of writing these robust workflows.
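As a rough illustration of those ergonomics, here's what a durable agent workflow might look like with Temporal's Python SDK. The workflow, activity, and signal names (AgentPipeline, run_agent_step, approve) are hypothetical stand-ins rather than anything from the radar; only the Temporal calls themselves (activities, signals, wait_condition) are the standard API.

```python
# A minimal sketch of a durable agent workflow using temporalio (Python SDK).
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def run_agent_step(prompt: str) -> str:
    # Call your LLM or agent here. If this step fails or the worker crashes,
    # Temporal retries the activity rather than restarting the whole workflow.
    return f"[agent output for: {prompt}]"


@workflow.defn
class AgentPipeline:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    def approve(self) -> None:
        # A human (or another system) signals approval, possibly days later.
        self.approved = True

    @workflow.run
    async def run(self, task: str) -> str:
        # Each completed step is recorded in Temporal's event history (the
        # "save game"), so a crash resumes here, not back at the beginning.
        plan = await workflow.execute_activity(
            run_agent_step,
            f"Draft a plan for: {task}",
            start_to_close_timeout=timedelta(minutes=30),
        )
        # Wait durably for the human approval step.
        await workflow.wait_condition(lambda: self.approved)
        return await workflow.execute_activity(
            run_agent_step,
            f"Implement the approved plan:\n{plan}",
            start_to_close_timeout=timedelta(hours=4),
        )
```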
01:03:42
Speaker
Is that fundamentally as simple as persisting all the context windows from the various agents? Because the nice thing about an agent is... You make it sound so simple. I do. And I'm not trying to dismiss what they're doing. I'm just trying to get to the hub of the architecture.
01:04:02
Speaker
The nice thing about an agent is fundamentally it has state, but its state is just a big long text transcript, right? Well, yeah, from the outside. I think there's all kinds of mad science going on for efficiency.
01:04:20
Speaker
But yes, from the outside, that's right: a big long context string. Okay, so managing those at scale... and doing a simple task at scale is always harder than it sounds.
01:04:33
Speaker
Yes, that's right. You've got essentially an arbitrary number of long-running processes that you want to pick up based on triggers in your workflow, other events and things.
01:04:45
Speaker
And just for my local version of that, because I often find... companies have orchestration problems, but we're finding developers have orchestration problems too, where I've got three or four instances of Claude Code running and I'm just waiting for the next one to ask for a prompt.
01:05:02
Speaker
Right. Yeah, we're moving into management, aren't we? We thought we were ICs, but it turns out now we're responsible for a team. And the analogy I quite like is that we're going from being pilots to air traffic controllers.
01:05:18
Speaker
A completely different way of working. Yeah, I can see that. That's quite a good way of thinking about it. And it sounds more glamorous than going from contributors to engineering managers, doesn't it? That's right. I imagine air traffic controllers would say it's not that glamorous. But are there good local options? I mean, would you use Temporal or Phoenix locally? Or are there local management things you would recommend instead?
01:05:49
Speaker
That's a really interesting question. I mean, as I say, my preferred workflow is basically an agent team, fanning out and then back together again. So what would be the circumstances in which I've got a genuinely long-running process?
01:06:04
Speaker
At the moment, four to six hours is probably the longest I've ever left agents running unattended. And that will have required quite a lot of setup to make sure that I've got some idea of what's actually supposed to be happening. So a lot of upfront specification. And then a Ralph loop, I suppose, is the dumb answer, where you've got some big, hairy, audacious goal.
01:06:37
Speaker
For example, you've spent, as I have, a day with an AI iterating in a lot of detail on the specification for a full distributed system, spread across 10 files. It's like 3,000 lines of specification: dense, code-like specification, not human prose.
01:07:01
Speaker
Then you're asking a lot of even an agent team of LLMs to go away and implement that. That's going to be a lot of code. And there's this annoying, but also quite fun, whimsical quality of LLMs of being like a sort of genie that will often find a creative way of telling you that they've done what you asked, but it was never quite what you intended. Maybe it's the letter rather than the spirit of your request. The classic one is: I've got the tests working by disabling all the tests. Yeah, exactly. But a Ralph Wiggum loop is a very simple way of combating that, which is: if your prompt is something like,
01:07:53
Speaker
"I've got a load of specifications sitting in the specs directory, read them and implement the code that they describe." You give that to an LLM and it will go off, it might run for several hours, and it will come back and say:
01:08:08
Speaker
"Yep, I've done that." And on closer inspection, you'll find that maybe it went really deep on a particular component, and that is great, that's been implemented, but there are whole other areas that have just been completely ignored. Or maybe it will have taken a very broad-brush approach and implemented everything a bit, but missed lots of the detail. But the genius insight is that you can submit exactly the same prompt again and say: take a look at the specs directory, implement the system that they describe. And when the new LLM wakes up with a fresh session, it's a brand new contributor,
01:08:46
Speaker
it will take a look at the code you've already got and say: oh, that's not quite right, this doesn't match what the user has asked for, so I'll go away and I'll start on that. And that next LLM will take you a bit closer. And of course, you can just loop as many times as you need to. And this maybe is where the new engineering judgment comes in: knowing how to specify that prompt, how many times to iterate, how much upfront work to do.
01:09:10
Speaker
It will get you eventually, after some number of goes, towards broadly what you wanted. And this has been my experience, even with very sophisticated distributed systems: you do eventually get what you want.
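In code, the loop itself really is that dumb. Here's a minimal sketch: run_coding_agent is a hypothetical stand-in for however you invoke your agent non-interactively (for instance, shelling out to your CLI agent of choice), and the NOTHING_LEFT_TO_DO sentinel is just one way to detect the consecutive "nothing more to do" responses described later in the conversation.

```python
# A minimal sketch of a Ralph Wiggum loop. run_coding_agent() is a
# hypothetical callable that starts a fresh agent session with the given
# prompt and returns its final summary as text.
PROMPT = (
    "Read the specifications in the specs/ directory and implement "
    "the system they describe. Reply with NOTHING_LEFT_TO_DO if the "
    "code already satisfies every spec."
)

def ralph_loop(run_coding_agent, max_iterations: int = 20, required_clean: int = 2) -> None:
    consecutive_clean = 0
    for i in range(max_iterations):
        # Every iteration is a brand-new session: same prompt, fresh context,
        # so the agent re-reads the specs and the current state of the code.
        summary = run_coding_agent(PROMPT)
        if "NOTHING_LEFT_TO_DO" in summary:
            consecutive_clean += 1
            # Don't trust the first "all done"; wait for a couple in a row.
            if consecutive_clean >= required_clean:
                print(f"Converged after {i + 1} iterations")
                return
        else:
            consecutive_clean = 0
    print("Hit the iteration limit without converging")
```

How many passes to allow, and how strict to make the stopping condition, is exactly the new engineering judgment described above.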
01:09:24
Speaker
So you've tried this with something meaty and it actually works? It does work, yes. Yeah, absolutely. I've written about it: a distributed, strongly consistent, very high-performance system with Byzantine fault tolerance.
01:09:42
Speaker
It took a weekend of elapsed time and probably about 20 hours of LLM time to implement, but it worked. And of course, if you specify the intent that you have for your system,
01:10:05
Speaker
but not the implementation, you give Claude or your LLM of choice a lot of latitude to solve the problem, whilst being constrained by your invariants, the things that must be true, like the fact it needs to be fault-tolerant and the kinds of error conditions that we need to defend against.
01:10:27
Speaker
And I had the experience of Claude realizing that there were bugs that only happened under certain crash scenarios, and repeatedly setting things up, killing them, and looking to see what happened, and then maybe adding instrumentation, little bits of code that would do some logging,
01:10:53
Speaker
and doing things again. For me, that would have been quite tedious drudge work: trying to reason through complex race conditions in a distributed system.
01:11:06
Speaker
But if you've given the LLM enough context about the goal, and it's a goal that can be verified, then my experience has been that you have a fantastic tool for just patiently, diligently working its way through a solution. The same is true for latency. You know, I set a really ambitious latency goal of
01:11:34
Speaker
our P99 latency being under 100 milliseconds at 10,000 requests a second. And again, it had access to the load test rig. It obviously had access to the code.
01:11:46
Speaker
And on our first attempt, our P99 latency was 30 seconds. I think it was 31 seconds, in fact. Then, over the course of some hours,
01:11:57
Speaker
we got it down to something like 25 milliseconds. And part of that was Claude noticing at some point that the P99 was stuck on 208 milliseconds for ages, in spite of what it did to the code. Nothing seemed to shift that persistent tail latency.
01:12:17
Speaker
It added some instrumentation to the code, separate from our load test rig, that confirmed that actually nearly all the requests were happening under 100 milliseconds. It realized that the latency was being introduced by our Docker networking bridge, and decided all on its own: okay, I'm going to move the load test rig inside Docker.
01:12:39
Speaker
And then suddenly the latency dropped. And that was all completely hands-off. I just stated the overarching goal that I wanted to achieve and left it. I mean, that's insane for building large systems. I have to ask, before I completely react to that: how much do you now trust that system that's been built?
01:13:01
Speaker
Well, I have a lot of tests. It is better tested than most systems that I've been involved in. And I think that's the fascinating thing about AI-assisted engineering. We always knew that comprehensive testing of at least the important, critical parts is important.
01:13:29
Speaker
And yet we were always constrained by what we could reasonably achieve in the time that we had available. The calculus has shifted with AI, and suddenly it's feasible to implement tests for every conceivable scenario that you think you want to defend against.
01:13:49
Speaker
Or, because that's quite reactive, maybe, writing tests about code that already exists, you could spend the equivalent amount of time making sure that you absolutely understand how the system should behave in all kinds of adversarial scenarios, and make sure that behavior is really well specified, in a way that theoretically you could have done before. But maybe you'd have been working on a whiteboard, which has finite space, and then somebody comes along and rubs a bit out. Now you're working with an LLM on a behavioral specification file that's durable,
01:14:22
Speaker
and you can be interacting conversationally, so you can be asking "what happens if" kinds of questions that previously might have been quite hard to reason about. Now you've got the LLM there for signposting and assistance, which means you can be much more rigorous.
01:14:36
Speaker
And so in a funny sort of way, we've all, I think not unreasonably, been worried about vibe coding and the spaghettification of code with LLMs. But in diligent hands, you've now got the opportunity to do the things that we always should have been doing, much more rigorously. And perhaps we're going to see certain types of software that have been built along those sorts of lines being much more resilient than they ever were before.
01:15:06
Speaker
Yeah, I can see that. I genuinely can, with the extra capacity and diligence. I have to ask you the halting problem: how did you know when to stop looping?
01:15:20
Speaker
Well, at some point you're going to get a few consecutive responses that say: yep, there's nothing more to do, everything looks fine to me.
01:15:31
Speaker
Eventually, yes. I mean, it might take some very large number of loops, and I wouldn't ever trust the first one. But once you've seen that a couple of times, probabilistically, on the balance of likelihood, you go: okay, I think we're there.
01:15:48
Speaker
Okay. Then I have to ask you: for building a system of this kind of size, you talk about pinning down the specification and the constraints. That leads me to your contribution to this field,
01:16:00
Speaker
because you've been working on a system for pinning down constraints for the sake of LLMs, right? And for the sake of agents. Tell me about that. What's your novel addition to the world?
01:16:13
Speaker
So it's called Allium. It's a language for behavioral specification. It's an LLM-native language.
01:16:24
Speaker
What does that mean? It means that there's no runtime except for the LLM, there's no compiler except for the LLM. It exists as a specification of what the language is,
01:16:38
Speaker
and it has an accompanying skill which teaches your LLM how to read and write files in this format.
01:16:52
Speaker
And the language guidance provides a lot of information about how a behavioural specification should be elicited from a stakeholder. There's also another mode you can run in where it looks at an existing code base and distills that code into a behavioral specification. So it looks at the how of the code that actually exists, and then attempts to reason about the what, what it was trying to achieve, and to encode that as a behavioral spec.
01:17:22
Speaker
And the thesis behind this is that we're seeing a rise in spec-driven development. You've got things like SpecKit and Kiro and Claude's plan mode. And generally the solution is markdown files.
01:17:39
Speaker
You're going to write a long human prose description of what we're trying to achieve. There's definitely value in that. It's certainly better than nothing at all.
01:17:50
Speaker
But human language just naturally hides contradictions, and there may be ambiguities there. In a long specification, a person might miss the fact that you're a bit ambiguous around the kind of authentication model your system needs to have, or whether guests are allowed to do certain things. And it's not just about security, of
01:18:21
Speaker
course. It's also about any aspect of the behaviour of your system. And so the thesis is that providing a formal way of expressing your behavioural intent gives much less opportunity for that to happen, because you can start to leverage the LLM's ability to spot bugs in code if you're expressing your behavioural intent a bit like code.
01:18:46
Speaker
Right. Give me a concrete example. What does it look like when I write some Allium? So it's not designed to be written by
01:18:58
Speaker
humans. It can be read by them, because it's designed to look a bit Python-y, TypeScript-y. In fact, we're trying to leverage syntax that's well represented in the training data, just to give the LLM a leg up, given that we're teaching it about a brand new language it's literally never seen before.
01:19:21
Speaker
But it arises through conversation. And so you would start by saying something like: I want to create an Allium specification for my system, I want it to do X, Y, and Z. And the skill around
01:19:37
Speaker
the specification language will be the one that teaches the LLM to say: hmm, interesting. When you said customer, what did you mean by that? Do customers map to accounts? What can they do? And through this kind of conversational, iterative process, a lot like one a product owner might have with a senior engineer, you get this rich and quite rigorous specification of what the system needs to be, completely independent of implementation. It's got nothing to say about
01:20:21
Speaker
the database that you're using or the web server that you're using. It's pure behavior. And so this is how we can avoid getting tangled up in implementation concerns.
01:20:37
Speaker
And it's really powerful, because LLMs kind of snap to the code. If all you've got is the code to look at, that is both the description of how the system should work and what it's trying to do, condensed into the same substrate.
01:20:55
Speaker
And yet there's no reason those two things have to be expressed in exactly the same way, because the how has all kinds of other things to grapple with: implementation constraints, libraries, and maybe there are expedient workarounds or bugs lurking in the code.
01:21:16
Speaker
Having a completely separate way of specifying what the system should do isn't duplication, to my mind. It's providing the kind of resilience that you get from also having tests, which express what the code should do in a different way.
01:21:35
Speaker
The trouble is that because LLMs tend to update both the code and the tests at the same time, they're both kind of evolving simultaneously. You've got no fixed point around which to check that those changes are correct. Whereas if you've got a behavioral specification as well,
01:21:51
Speaker
then if you're only ever changing two of those at once, you've got that fixed point. It's a bit like the three legs of a stool. Right. So are you saying the workflow here is: I have a chat with my agent to produce Allium files as an asset, which are then fixed, and you would then go on to have it implement those, but you would...
01:22:16
Speaker
You'd expect to get a change which changes a lot of source code files, but doesn't change the spec files. Exactly. You will tend to be changing one or the other at any given time.
01:22:29
Speaker
And you can do as much upfront specification as you want to do. Part of the reason for wanting to build this language is to have it built entirely for that way of working with LLMs, where you want to start by speccing out a prototype.
01:22:52
Speaker
But you want to put that prototype under some kind of behavioural specification, because that allows you to grow it more easily in a rigorous way, where you're not going to end up in the situation where you try to change one bit of the code and another bit breaks, and then you try to fix that bit and the first bit goes back to how it was. It can be a bit like playing whack-a-mole sometimes. So it's designed for this kind of iterative way of working. You start with a conversation, as I say, unless you've got code that you want to distill into a specification already.
01:23:28
Speaker
And you do as much as you need to. And the specification has provisions for open questions and deferred specifications and other bits which are explicitly acknowledged to be open
01:23:41
Speaker
at that point in time. And so when the LLM, a different LLM probably, is coming along to actually implement that behavior, it can see: oh, this bit has been well specified, this is the bit I will write good code for, and that bit is currently undefined, so I won't do that. And so, when you hear spec-driven development, often people think: oh, we're back to big design up front and waterfall thinking. That's exactly not the approach that I advocate for. This is definitely an incremental and iterative process. It's just providing you with that fixed point that means that you've always got
01:24:19
Speaker
some stability in your iterations with your LLM that allows you to keep walking forwards rather than get caught in some kind of doom loop, which is probably an experience that lots of us have had with LLMs, where we reach a wall and there's a massive drop-off in our velocity.

Clarity and Testing in AI Development

01:24:42
Speaker
The thesis of Allium is that we can achieve velocity through clarity of intent. And you're also saying that there's a gap in the design space between the flexibility of human language and the detail of a programming language, then.
01:25:02
Speaker
That's a nice way of putting it. And is it between human language and a programming language? I'd say what it's between is human-language programming
01:25:15
Speaker
and perhaps the most rigorous specification language, TLA+, for example, which is going to provide you with formal guarantees of correctness.
01:25:26
Speaker
Allium sits in between those two kinds of things. We can't offer formal guarantees; we're living inside the LLM. We're therefore subject to that kind of semantic, metaphorical similarity rather than symbolic correctness. We do have command-line tools and other things that will check the validity of syntax. But even so, we're in between those two extremes. We're more rigorous than prose, but we can't offer the formal guarantees of TLA+. However, it is a programming language in the sense that it's got very well-defined semantics.
01:26:06
Speaker
You are describing exactly what you want to have happen. And so in that respect, it's not inconceivable that in the future there could be an actual Allium runtime. And one of the directions that we're moving towards is to make sure that we capture enough information to be able to implement property-based testing, for example.
01:26:31
Speaker
And so we're moving in that direction. And in spite of the fact that we are LLM-native, it's important to me that it is a very well-defined language with clear rules about its use, even though we can't guarantee the LLMs will always follow them to the letter.
01:26:53
Speaker
But you can't guarantee that about human programmers either.
01:26:58
Speaker
So what kind of things can I specify, concretely? Can I say: this project should use OAuth and support Google? This project should save all data persistently to Postgres, a durable database? Can I say, like your example, that the P99 response time should be less than 100 milliseconds? Yeah.
01:27:19
Speaker
You can specify all of those things, because there's an implementation guidance section: every rule in the specification provides the ability to use human language as well.
01:27:36
Speaker
And so there's nothing that you can't put into your specification. You can add non-functionals. You can say: I'd like to use Google's OAuth. You can say: I'd like to use Postgres. That's not the core objective of a behavioural specification, though. What we instead want to talk about are the domain entities that we're operating on and what their attributes are, for example. So we're not going to implement the schema
01:28:02
Speaker
that ends up in the database, but we are going to say: we've got customers and line items and quotes and so on. These are the things that our system knows about.
01:28:14
Speaker
You're going to have rules that describe how transitions through your system work. There are going to be preconditions, postconditions, triggers. Surfaces are something that gets specified as well. And those are intentionally not talking about a web page or a CLI or an API.
01:28:39
Speaker
It's simply a Surface: a way that the outside world can reach your domain. And so Surfaces expose the things that the outside world can read, and the things that your code requires those actors who are interfacing with it to provide.
01:29:00
Speaker
But it's intentionally expressed in this quite abstract way, because we don't know: is this an API or is it a rich web front end? We don't really want to know, because the same behavioral specification could be used to drive an implementation of both of those things. And so when you're asking your LLM to implement, ideally you're providing some kind of additional context which is going to talk about your technology choices and those non-functionals and other things like that. Otherwise, you're going to get some kind of default choice. As we said, it's probably going to end up in Python or TypeScript. It's going to be TypeScript for the API, right? Yeah, that's right.
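To make those ideas concrete: the following is not Allium syntax (for that, see the juxt/allium repo and its documentation); it's purely a hypothetical, Python-flavoured sketch of the kinds of things the conversation says a behavioural spec pins down, namely domain entities and their attributes, rules with preconditions, triggers and postconditions, abstract Surfaces, and explicitly open questions.

```python
# NOT actual Allium syntax: a hypothetical illustration only.
# The entity, rule, and surface names below are invented for this sketch.
spec = {
    # Domain entities: what the system knows about, not a database schema.
    "entities": {
        "Customer": ["name", "account"],
        "Quote": ["customer", "line_items", "status"],
    },
    # Rules: transitions with preconditions, triggers and postconditions,
    # with no implementation detail.
    "rules": [
        {
            "name": "accept_quote",
            "precondition": "quote.status is 'draft' and quote has line_items",
            "trigger": "the customer accepts the quote",
            "postcondition": "quote.status becomes 'accepted'",
        }
    ],
    # Surfaces: how the outside world reaches the domain, deliberately not
    # saying whether that's a web page, a CLI, or an API.
    "surfaces": [
        {
            "name": "quoting",
            "exposes": ["Quote.status", "Quote.line_items"],
            "requires": ["an authenticated customer identity"],
        }
    ],
    # Explicitly open questions the implementing LLM should not guess at.
    "open_questions": ["Can guests request quotes, or only account holders?"],
}
```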
01:29:47
Speaker
This sounds like you're describing a language for business logic then.
01:29:54
Speaker
Well, in a sense, yes. State machine transitions are a key part of it, for sure.
01:30:09
Speaker
I mean, that's a lot of what systems need to do, isn't it? Manage interactions with data and ensure that they're correct.
01:30:20
Speaker
Absolutely, yeah. I'm just wondering... after we finish this discussion, I'm going to go and run it on one or two of my existing projects, and I'm wondering what will be distilled out of it.
01:30:34
Speaker
Like, will it boil out some of my hard external requirements, or is it going to boil out what my core business logic is?
01:30:45
Speaker
Well, it's fascinating, isn't it? There's inherently a judgment to be made about the level of abstraction. Because if the specification ends up being too close to the code, it's not really adding anything.
01:30:58
Speaker
If it's much too high-level, of course, every website's the same: I go and make a request and I get some data. That's not useful. So part of the skill, and I mean this in the sense of a markdown skill, is around the way we guide the LLM to make a sound judgment, based on senior-engineer-type expertise, about what the right level of abstraction is.
01:31:30
Speaker
Of course, you can always guide that in the prompt that you provide along with your distill request. But our experience is that the default is really good, and we've used it to find bugs in existing code, because in producing a behavioral specification it's attempting to communicate the what of your system as well as the how, in a kind of compressed way, because a spec is normally
01:32:01
Speaker
somewhere between 5% and 50% of the size of the code, depending on how complicated your semantics are or how verbose your programming language is. So it's always introducing some kind of compression.
01:32:17
Speaker
And then by comparing the specification against your code, you can start to go: oh, that's funny, that code's quite complicated for the specification that describes it. I wonder why that is? Perhaps the code's doing more than it should. Perhaps there's some kind of tangled intent going on there.
01:32:37
Speaker
And so although you can never say for certain that the code is wrong, or the spec is wrong, without getting your hands dirty and taking a really close look, it provides this kind of topology of your code that allows you to go and look at the areas where you might find something that's wrong.
01:33:00
Speaker
And as I say, we've used this to discover bugs that people weren't aware of in code they were using very happily. So it really does add value, and it's part of this story about how I think LLMs can introduce more rigor to software engineering than we've had to date.
01:33:20
Speaker
Yeah. I wonder if you're going to... So I've done a thing with a few projects where I've run an agent over the code base and said: well, draw me some Mermaid diagrams, right?
01:33:31
Speaker
So that I can try and visualize what's going on. It does quite a good job. And sometimes I've looked at these diagrams and thought: well, that's not right.

Re-architecting Systems with Allium

01:33:44
Speaker
That doesn't seem like the architecture I'd want, and I've tried prompting it to re-architect things the way the diagram should have come out.
01:33:55
Speaker
Are you finding that with Allium? Are you going to be doing that, where you distill out spec files for a system, look at them, expect a human to read through them and find contradictions, change the spec, and have the LLM re-project it back into the code base?
01:34:12
Speaker
I mean, the shape of that way of working is correct. But again, with the specification files themselves, you can look at them to check individual rules, but I don't think anyone will look at a whole file and attempt to know whether that file describes the whole behavior that they want from their system. That's a lot to try and extract out of a potentially long file.
01:34:38
Speaker
But for taking a look and checking that, oh yes, the right attributes are being exposed for my domain entities here, or yes, that rule has the right preconditions and it triggers the right thing: those kinds of things, for sure, you could have a domain expert look at. But I think for the kind of workflow that you're talking about there,
01:35:00
Speaker
the conversational one that we're all getting used to with our prompts, that is where people are right now. And so Allium is designed to work with that.
01:35:11
Speaker
What it tends to create is this dynamic where the LLM is asking you really sharp questions about: what did you mean when you said that? Or: I've noticed that this part of the specification and that part of the specification are slightly inconsistent; what do you want to do about it?
01:35:33
Speaker
So there's a sort of Sapir-Whorf quality, that theory that the language you use shapes your reality. I think teaching the LLM about
01:35:46
Speaker
this structured language of behavioural specification just puts it in that frame of mind where those are the kinds of things it's asking about. And it's really powerful.
01:35:59
Speaker
And I am constantly surprised by the quality of the conversation that we have, and how quickly you can get to a really good description of the problem space, having been told about the potential ambiguities and conflicts and edge cases that you might just have shrugged off before, because you weren't really aware of them, or maybe you sensed they might pop up but there were more important things to worry about right now. You know, the Allium
01:36:34
Speaker
skill, teaching an LLM to patiently, in a kind of Socratic way, work through these things with you, makes it almost pleasant. Because again, the speed that you can realize means that it can be quite intense, being asked in quick succession lots of very detailed questions about how your system should behave.
01:36:59
Speaker
But an hour or two later, suddenly those problems are solved.

Allium's Role in AI Conversations

01:37:03
Speaker
And maybe an hour or two after that, you can see the system actually running. So it's very rewarding. Right. So I've got the wrong end of the stick with this language: I thought it was a language for us to use to drive the LLM.
01:37:16
Speaker
But really, you're saying it's a language to help the LLM pin down its own thoughts, ready for when it talks to us. I think that's a great way of phrasing it, yes. Right.
01:37:27
Speaker
Okay, so I've got to go and try this and see what the results are; that's the only way I'm going to know. How do I get started? Give me the beginner's guide. So it's open source. It's on GitHub; you will find it at juxt/allium.
01:37:46
Speaker
And it's very easy to install. It is just a set of markdown files. There's a Claude marketplace interface, so even within Claude you can install it:
01:37:59
Speaker
do /plugins, add the juxt marketplace, and grab it that way. Or, if you're using anything else, there's an npx skills installation as well. It's all described in the readme. Once you've got it available in your LLM of choice, just type "allium".
01:38:20
Speaker
And it will take a look at your code and probably ask you: do you want to add some features? Do you want to distill a specification? And away you go. You're immediately in that conversational interface, informed by the behavioral specification language.
01:38:38
Speaker
Okay.

Future Testing and Sharing of Allium

01:38:39
Speaker
I've got a couple of projects: one that's been going very well, and I'm curious, and one that's been going a little bit badly, where maybe I could use some assistance. So I'm going to go and try both.
01:38:51
Speaker
I'll put the links in the show notes once I've done it. Henry, thank you very much. This is so interesting: pinning down an LLM's thinking.
01:39:02
Speaker
I like this. I really like this. I'm going to give it a go. Henry Garner, thank you very much for joining me. Thank you. My pleasure. Cheers. Thank you, Henry. So I did try Allium on one of my projects, by which I really mean I downloaded the skill files and said to Claude: go on then.
01:39:20
Speaker
And it did actually work. It found a whole bunch of corner cases in the design, like genuine problems that I hadn't thought to think about. It found unknown unknowns.
01:39:32
Speaker
And then I tried it on a couple of other repos and it did it again. And then I tried it on a friend's repo and it worked for that. So yeah, genuinely, I think he might be onto something. I haven't tried it for a long run. I haven't tried it over six months, but there is something here. So give it a whirl if you're interested. You'll find the links as ever in the show notes. And while your agent is processing them in the background, please take a moment to like, rate and share this episode and make sure you're subscribed because whatever the future holds, we'll be back soon with another interesting mind from the world of software.
01:40:07
Speaker
Until then, I've been your host, Chris Jenkins. This has been Developer Voices with Henry Garner. Thanks for listening.