
From Unit Tests to Whole Universe Tests (with Will Wilson)

Developer Voices

How confident are you when your test suite goes green? If you're honest, probably not 100% confident - because most bugs come from scenarios we never thought to test. Traditional testing only catches the problems we anticipate, but the 3am pager alerts? Those come from the unexpected interactions, timing issues, and edge cases we never imagined.

In this episode, Will Wilson from Antithesis takes us deep into the world of autonomous testing. They've built a deterministic hypervisor that can simulate entire distributed systems - complete with fake AWS services - and intelligently explore millions of possible states to find bugs before production. Think property-based testing, but for your entire infrastructure stack. The approach is so thorough they've even used it to find glitches in Super Mario Brothers (seriously).

We explore how deterministic simulation works at the hypervisor level, why traditional integration tests are fundamentally limited, and how you can write maintainable tests that actually find the bugs that matter. If you've ever wished you could test "what happens when everything that can go wrong does go wrong," this conversation shows you how that's finally becoming possible.

---

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@DeveloperVoices/join


Antithesis: https://antithesis.com/

Antithesis testing with Super Mario: https://antithesis.com/blog/sdtalk/

...and with Metroid: https://antithesis.com/blog/2025/metroid/

MongoDB: https://www.mongodb.com/

etcd (Linux Foundation): https://etcd.io/

Facebook Hermit: https://github.com/facebookexperimental/hermit

RR (Record-Replay Debugger): https://rr-project.org/

T-SAN (Thread Sanitizer): https://clang.llvm.org/docs/ThreadSanitizer.html

Toby Bell's Strange Loop Talk on JPL Testing: https://www.youtube.com/results?search_query=toby+bell+strange+loop+jpl

Andy Weir - Project Hail Mary: https://www.goodreads.com/book/show/54493401-project-hail-mary

Andy Weir - The Martian: https://www.goodreads.com/book/show/18007564-the-martian

Antithesis Blog (Nintendo Games Testing): https://antithesis.com/blog/


Kris on Bluesky: https://bsky.app/profile/krisajenkins.bsky.social

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Transcript

The Limits of Software Testing

00:00:00
Speaker
It's worth testing your software, right? Let's just make sure we agree on that. But how do you do it? How do you make sure that software is correct? I think it's a lot harder than we'd like to admit, because there's always something you didn't think of.
00:00:13
Speaker
You could go full-on TDD. You could have a complete suite of unit tests with 100% code coverage, and your software can still have bugs, largely because unit tests only test components in isolation, and a lot of the complexity in a system comes from the way things interact.
00:00:32
Speaker
But we know that, so we write some integration tests, but not too many, because they tend to be slower and more brittle. We emphasize unit tests not because they're perfect, but because they're practical.
00:00:45
Speaker
But unfortunately, neither of those are sufficient. Both unit tests and integration tests suffer from a big blind spot. You only write tests for the things you think are going to fail. And a lot of bugs come as a complete surprise.
00:01:01
Speaker
So what can you do? I think the most promising technique I know of is property testing. We've covered it before on this podcast. And the essence of it is: every function has some rules that it should obey no matter what arguments you call it with.
00:01:15
Speaker
Things that should always be true. So let's throw random arguments at a function and see if it ever breaks its own rules. There's a lot more to that technique, and if you haven't looked into it, I strongly recommend it.
00:01:28
Speaker
It's a great way to test more cases with less code, but it's mostly a unit testing technique. Is there a whole-system, integration-testing equivalent?
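A quick aside for anyone who hasn't tried it: here is a minimal property-based test in Python using the Hypothesis library. The property itself is an illustrative example of mine, not one from the episode.

```python
# Property: reversing a list twice gives back the original list,
# no matter what list Hypothesis generates.
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```

Hypothesis generates many random inputs per run and shrinks any failing input to a minimal counterexample, which is the "more cases with less code" trade described above.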
00:01:39
Speaker
Is there a way to test all the possible things that could happen in the world affecting a system we've created and check the entire system behaves itself? Can you ask a question as innocent as, no matter what happens, does the database stay up?
00:01:56
Speaker
Can you test it? I don't think we can. I think there are a lot of things that make that kind of question impractical. Apparently, Facebook put a lot of effort into building a system that could try and answer that sort of question, and they seem to have given up.

Introducing Will Wilson and Antithesis

00:02:11
Speaker
But joining me this week is someone who thinks he's cracked it, testing the properties of entire systems and simulating entire possible worlds. He's gone further down the testing rabbit hole than anyone I've ever met.
00:02:25
Speaker
He's Will Wilson. He's the co-founder of Antithesis. And they go all the way down, below the operating system, to build entire virtual worlds: hypervisors of controlled chaos.
00:02:38
Speaker
It's an absolutely huge project. I love its ambition. I love the problems they've faced and the solutions they've come up with. They're absolutely fascinating. And I love the fact that in order to get this working, they had to play a lot of Nintendo games.
00:02:53
Speaker
Let's find out why. I'm your host, Kris Jenkins. This is Developer Voices. And today's voice is Will Wilson.
00:03:12
Speaker
I'm joined this week by Will Wilson. Will, how are you? Doing great, thank you. I'm glad we've got you here, because... we've had episodes on testing before, but I think you may take the biscuit as the most hardcore testing episode we've done so far.
00:03:28
Speaker
You know, that's really funny. I don't really think of what we do as especially hardcore. I think we're just trying to mirror what happens in real life, which is what a test should do. That's why it's hardcore, because real life's dreadful and goes wrong all the time. That is super true.
00:03:43
Speaker
Okay, give us the overview and we will dive deep into the details. What kind of testing are we talking about, and why is what we've got today not enough? Sure. So a sort of cheeky question that I like to ask people sometimes is: let's say you make a change to your code.
00:04:00
Speaker
You modify it. It's some feature that your boss has been asking for for a while. And you know it's a pretty big change, right? And you run your test suite on it, and the test suite comes back all green.
00:04:12
Speaker
Are you now 100% confident to roll this out to all of your customers right now? I'm going to say no. If it's a large change, no. Not just going by the test suite. And so why is that?
00:04:25
Speaker
Well, I think for most people, the reason is that actually our test suites don't really cover the full spectrum of behavior that we could see in production. And the reasons for that are actually a little bit deep, almost philosophical.
00:04:39
Speaker
It's that if you are writing a test to go into your test suite, that's some test case, some situation that you've anticipated in your own head. And by the way, if you've anticipated it in your own head sufficiently to write a test case for it, there's a very good chance you've also written the code correctly.
00:04:55
Speaker
But the things that happen in production, the lovely things that set our pagers ringing at 3 AM and get us angry emails from customers and so on, those are the things that we didn't think of, usually.
00:05:09
Speaker
And so almost by definition, if you are ahead of the game enough to write the test, it's not actually the thing that's going to cause you a problem. You need something that is able to find the things that you would not have thought to test for.

Innovations in Property Testing

00:05:25
Speaker
And that can be environmental situations. It can be weird thread interleavings, weird timing situations, weird user behavior, like weird resource usage patterns.
00:05:37
Speaker
Just all the stuff that it never even occurred to you could actually happen. Now, this feels like you're pushing me in the direction of property testing, which I really like. That's exactly what we're doing, yes. Exactly. This is all motivation for property testing. Right. What we want to do is not write a test.
00:05:57
Speaker
What we want to do is write a test generator, or like a fake client, or something that we can just run. And the longer we run it, the more different situations it'll come up with and throw at our system, until we eventually actually hit the crazy thing that some user is going to think to do. And you find that in your tests, long before it rolls out to production and long before your pager goes off.
00:06:26
Speaker
And so that's the basic idea of what we're trying to do. What I think is novel about Antithesis, and different from normal property testing, is that we're able to take that mindset and that approach and extend it from small data structures, small pieces or modules of programs, to entire large, real-world, interacting distributed systems.
00:06:55
Speaker
I think one of the things that makes doing property testing difficult is coming up with good properties. And how do you... that's the first thing I want to know.
00:07:07
Speaker
How would I specify something that was useful and interesting and non-trivial about a system that large? Right, so there are a few things that are working in your favor here.
00:07:18
Speaker
The first one is you can actually catch a vast amount of bad behavior without specifying every single property, because most bugs will cause multiple properties to fail.
00:07:29
Speaker
Right? So for example, let's say that I've got some C++ binary and it's got some memory corruption in it, which goodness knows I've done once or twice. So that could result in, like, a wrong answer.
00:07:45
Speaker
It could result in a security bug. It could result in a user seeing some nonsensical garbage memory in their response. Or it could result in the system crashing. And so basically, if I have a property that's checking for any of those things, it could eventually find that bug, because that bug can manifest in many, many ways.
00:08:06
Speaker
And so what we find is that actually getting a very, very small number of basic properties in place is often enough to immediately begin finding really high severity, really interesting and important bugs.
00:08:21
Speaker
And so with our customers, we usually just start with properties like: it should never crash. I shouldn't see random uncaught runtime exceptions. It shouldn't time out and never come back to me again. Pretty basic stuff like that. The server shouldn't be giving me 500 errors, whatever.
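As a concrete sketch, those blanket properties might look like a fake client hammering a service in a loop. Everything here (the base URL, the endpoint, the helper) is a hypothetical example, not anyone's real API:

```python
# Hammer an HTTP service with random requests and assert only blanket
# invariants: it answers within a deadline, and it never returns a 5xx.
import random
import string
import requests

BASE_URL = "http://localhost:8080"  # assumption: service under test runs here

def random_key(rng: random.Random) -> str:
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8)))

def check_basic_properties(iterations: int = 10_000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(iterations):
        key = random_key(rng)
        # Property 1: the server answers within a generous deadline
        # (requests raises an exception if the timeout is blown).
        resp = requests.get(f"{BASE_URL}/items/{key}", timeout=5.0)
        # Property 2: whatever else happens, never a server error.
        assert resp.status_code < 500, f"server error for key {key!r}"
```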
00:08:40
Speaker
And often we already get to some interesting stuff. And then the next level deeper that you go is: if you're building large real-world systems, usually these systems are being built by distributed teams, or they're being built by multiple teams of people in the same building, but who are still parts of separate organizations.
00:08:58
Speaker
And so there's often a lot of documentation, and a lot of either formal API guarantees or written-in-a-Google-Doc-somewhere guarantees. If you can find it, yeah.
00:09:09
Speaker
You know, about what the system is meant to do and what the different components are meant to do. And so you can convert some of that stuff into properties as well. You can be like: look, the foobar service will always return you an integer within 300 milliseconds, assuming it's under this amount of load. And if it loses two replicas, it'll eventually come back once you... That stuff is often written down somewhere, in terms of an SLA or in terms of a promise that one part of your company has made to another part.
00:09:40
Speaker
And so, you know, turning that into properties in code is a little bit of work, but it's not that much work. Yeah. You're making me think of things like Selenium, right? Where you do browser testing, where you're trying to say: people are crazy and the system under test is complicated.
00:09:59
Speaker
And I find that the work to set up a good test and maintain it and actually test all that crazy behavior is a big thing. So my next question is, how do I specify a test worth running on a distributed system without going insane?
00:10:15
Speaker
Yeah, so this is a great question. And I know I'm speaking to engineers, so I don't want to sound too positive.
00:10:27
Speaker
Because if I sound too positive, we will trigger everybody's bullshit detector. But I'm going to take a risk here. So it turns out that the same thing that makes property-based testing or fuzzing or autonomous testing or whatever you want to call this class of techniques really powerful, which is that you've relaxed your specification of what the test is going to do, also makes it more maintainable.
00:10:52
Speaker
Let me give you a really, really concrete example. Please. Let's imagine we're testing Super Mario Brothers. You've got that in your head, right? Yeah, absolutely.
00:11:06
Speaker
The traditional way to write a test with Selenium, or with whatever your preferred test framework du jour is, or just with nothing, right, just with your programming language of choice, is to basically describe in detail a set of steps, and then describe the expected outcome after that set of steps.
00:11:28
Speaker
Right. So in the case of Mario, it's going to be something like: hey, if I hold down the right button for 2.3 seconds, and then I hit the jump button, and I hold that down for 1.4 seconds, and then I do something else, I'll wind up in this place, and I'll have bounced on this Goomba, and I'll be at this location.
00:11:48
Speaker
And I think that makes it really clear like why that's so awful and unmaintainable, right? It's like, as soon as I change not just important things about the game, but completely incidental things about the game, my test will break.
00:12:02
Speaker
I have had that experience with browser testing often, and I hate it. Exactly. And it's really bad in distributed systems testing, because the timing of network requests and when different tasks execute can be really flaky from run to run. And so even if you don't change your software, the test can break.
00:12:21
Speaker
So the cool thing about property-based testing is that you're not making that kind of claim. You're basically saying things more like: hey, here are the buttons you can press in Mario. Here's the things you can do.
00:12:35
Speaker
Mario should never wind up inside a wall, right? The game should never crash. The score counter should never overflow and become negative. Whatever. But what that means is, now if I change the game, if I change the level layout, my test actually does just keep working.
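A sketch of that style of test, with a hypothetical emulator API standing in for the real harness: press random buttons forever, and re-check the same invariants after every frame.

```python
# Random inputs plus invariants: no scripted path through the game,
# so level-layout changes don't break the test.
import random

BUTTONS = ["left", "right", "jump", "run", "none"]

def fuzz_mario(game, frames: int = 1_000_000, seed: int = 42) -> None:
    rng = random.Random(seed)
    for _ in range(frames):
        game.press(rng.choice(BUTTONS))  # hypothetical emulator API
        game.step_frame()
        # Properties stated once, checked in every reachable state:
        assert not game.crashed(), "game crashed"
        assert not game.mario_inside_wall(), "Mario clipped into a wall"
        assert game.score() >= 0, "score counter went negative"
```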
00:12:50
Speaker
And it just keeps testing my game in the same way. And so I've simultaneously made my test more powerful, because I'm testing many more different possibilities, not just that one path through the game.
00:13:02
Speaker
But I've also made it more maintainable and less fragile. And so to now take this to a more distributed-systemsy kind of context: instead of saying, hey, if I send this request, and then I wait exactly this amount of time, and then the server does this, and then this packet gets dropped, then I expect this retry to occur, and blah, blah, blah.
00:13:21
Speaker
If I instead say: hey, you can send requests, you can retry things, the network is spotty, but eventually you will get this response after some number of retries. And by the way, the server will never crash while you're doing this. And by the way, if there's two clients that are both trying to do this, they're never going to deadlock if they're running the same process, or whatever.
00:13:41
Speaker
You make it so that incidental changes to your program don't just automatically break all your tests, and you don't have to spend as much time maintaining them.
00:13:52
Speaker
Okay, I can see how that would work. And I'll accept your claim at face value that this does actually find bugs in the wild. I can believe that. But to use your Super Mario example: how old is that game now? What, 40 years?
00:14:08
Speaker
Yeah, it's pretty old. Speedrunners are still finding glitches and bugs. Yes. Yes. So when I write this simple property test, how do you not end up spending 40 years going through the random search space to actually find me a useful case where things crash?
00:14:24
Speaker
Ah, that is a good question. And that is a lot of what makes this hard. Okay, so I'll tell you a few things first.
00:14:37
Speaker
Number one, the thing about Mario is that people aren't trying really hard to make it bulletproof today, right? Nobody's gone and patched Mario to not have all these glitches and bugs.
00:14:51
Speaker
And nobody, besides us, has thrown a fuzzer at it to try to systematically find all the bugs. Wait, are you saying you've actually done that, or just hypothetically? Oh, yeah, yeah, yeah. No, you can go on our website and you can see Antithesis playing Mario.
00:15:07
Speaker
And a lot of other Nintendo games, too. OK, this is good mad science. Yeah, yeah. So we have actually done this with a whole bunch of different Nintendo games. But the thing is, in the real world, number one, you don't actually need to find every single last bug in your program.
00:15:26
Speaker
You need to find the ones that are really likely to cause you a significant problem at some point. And number two, if you start early in your development process,
00:15:37
Speaker
finding bugs quickly and reliably, the moment that they're introduced, it's actually vastly less effort to fix them because you don't have to do some big complicated root cause analysis.
00:15:49
Speaker
You can just be like, My tests were green before this commit. And afterwards, this weird issue, this weird timeout started happening. I guess it was that commit.
00:16:00
Speaker
You can look at that commit and scrutinize it or you can revert it. And you know that actually is much, much faster than going back to some old legacy system and trying to patch everything up.
00:16:11
Speaker
Yeah, okay, I buy that. I'm going to challenge you on that, though, because you started with the example of if I make a major change and the tests say everything's fine, am I confident?
00:16:23
Speaker
One of the reasons I'm not confident is in a sufficiently large system, I haven't written all the tests. I haven't reviewed all the tests. I don't know how much of our footprint we're testing and how thorough we've been.
00:16:37
Speaker
So what you're saying makes sense when the code base is young, but every decent code base grows to be old and large. And then you've still got that big search space problem we're talking about.
00:16:50
Speaker
That's true. So I will say there's two separate questions here, right? And I think maybe I've confused my answers to the two. One question is:
00:17:02
Speaker
I'm making a lot of changes. I have a very complicated system. Like, how can I possibly keep the quality bar high assuming I have something that can find all the bugs quickly?
00:17:14
Speaker
And that's the question that I just answered. The second question, which you are asking, and it's also a very good question, is: this is a tremendously complex system.
00:17:26
Speaker
How can you possibly find all the bugs quickly and reliably? And the answer to that question gets a little bit to the magic of Antithesis.

Deterministic Testing with Antithesis

00:17:36
Speaker
Because what we're doing is not just exploring the state space blindly. Doing that would take longer than the lifetime of the universe.
00:17:47
Speaker
It would take a trillion years to find a winning route through Mario that way, just randomly, blindly trying combinations. Amazon's infinite number of monkeys, coming in 2027.
00:18:00
Speaker
That's right. But what you want to do is use guidance and feedback from the system under test to optimize the search.
00:18:11
Speaker
And notice when interesting things have happened, things that aren't necessarily bugs, but that are rare behavior or special behavior or unusual behavior.
00:18:22
Speaker
And so the test system can see that something interesting has happened and follow up opportunistically on that discovery. And that gives you a massive lift in the speed of finding issues.
00:18:36
Speaker
And the way that we're able to do that is with this sort of magical hypervisor that we've developed, which allows us to deterministically and perfectly recreate any past system state.
00:18:50
Speaker
So people generally think of the value of that hypervisor as: any issue we find is reproducible, no "works on my machine." If we find it once, we can repro it for you ad infinitum.
00:19:03
Speaker
But the real benefit of it is that we can do that while we're searching for bugs. So we need to step back a second.
00:19:13
Speaker
Yeah. Because we haven't actually touched on the whole hypervisor. The way you do this, underneath the spec I write, is you're simulating a whole operating system.
00:19:26
Speaker
Yes. You need to explain that a bit first before we get into the magic of it. Yes. So basically, what we're doing is taking your real system, like you would deploy it to production, and deploying it into a fake copy of production.
00:19:45
Speaker
And this fake copy all runs within a single guest operating system environment, which is usually Linux.
00:19:55
Speaker
And if you have multiple services or multiple nodes, and you've got a database and a load balancer and a web server and whatever, these all run in different containers, interacting locally in that operating system.
00:20:09
Speaker
Right. And if you have dependencies like AWS, let's say, those also run inside there. We have a whole fake AWS that we can deploy in there with you. And we have lots and lots of other well-known cloud dependencies that we can just sort of put in there with you.
00:20:25
Speaker
Right. And so then we have this sort of magical hypervisor, which runs all of this just like a normal VM would. But the difference is that our hypervisor, if it does the same thing twice, it will actually, for real, do the same thing twice.
00:20:44
Speaker
All of the very low-level decisions about when threads get scheduled, or how long particular operations take, or exactly how long a packet takes to get from node A to node B, will reproduce 100% perfectly from run to run.
00:21:01
Speaker
I can kind of make a guess at how that's working, but I'm going to make you spell it out for me. Because how are you doing that without diving into the networking stack on Linux, and then the clock processing stack on Linux, and all these different subparts of the kernel?
00:21:19
Speaker
Yeah, it's actually very tricky. Though I must say, the thing that you just described is actually how Facebook once tried to do the same thing.
00:21:31
Speaker
They built a system called Hermit, which basically is like a deterministic version of Linux, which tries to get this same property by modifying the kernel, the networking stack, et cetera, et cetera, the file system, making them all deterministic.
00:21:48
Speaker
And that project... I don't want to talk trash about it... but basically, it's abandoned. It's not being continued, because it's very, very hard and because those things don't present stable interfaces. The kernel is constantly changing.
00:22:04
Speaker
File systems are constantly changing. It was just a moving target, and a very hard and very big one. Even for Facebook, that's a massive project. Yeah. So we decided to completely bypass all of those issues by going in at a much deeper level.
00:22:21
Speaker
So what we did instead was we run a completely unmodified Linux, completely unmodified user space. What we've done is we've gotten underneath all that.
00:22:32
Speaker
And in the hypervisor, we've implemented a deterministic machine model. So from the point of view of Linux, it's just running on a computer. This is just a very strange computer where operations always take the same amount of time if you do them again.
00:22:51
Speaker
And if you happen to ask for a random number at some particular moment, you will get the exact same random number the next time you go and do this. And so...
00:23:02
Speaker
That means that we can test any software in this style without modifying it or changing it or imposing a bunch of requirements on it, which makes it all a much easier lift.
00:23:14
Speaker
So are you saying you've got a thing that will say, OK, so the CPU has returned this value and then the clock ticked and then the networking bus transferred this data and then the clock ticked?
00:23:27
Speaker
That's right, that's right. And the nice thing is, because we've made the actual execution deterministic, we're not recording or writing down anywhere that we did exactly this and exactly that. There are other systems that work that way, like RR, if you're familiar with that.
00:23:46
Speaker
Record-replay. It's a system that basically records the result of every system call your program does and lets you replay them all again later, which is very handy. But it only works for a single process.
00:23:59
Speaker
But more problematically, it also just generates this huge volume of data. Yeah. Because you've got to record every single thing the system ever did. But the nice thing about making it actually be deterministic is you don't have to record anything.
00:24:14
Speaker
The only thing you have to record is: what were the starting conditions? And then, at what point did I reach in and inject entropy, or change the future in some way?
00:24:27
Speaker
And then everything else just evolves, like those 18th-century guys thought the universe evolved, right? Pure determinism governed by one equation. That's all that happens and all that can happen.
00:24:40
Speaker
If you could know the position of every particle at the start, you would know who's having breakfast right now. Exactly. Exactly. That's how it actually works. And it means it's tremendously more scalable, because you don't have to store all this vast amount of data.
00:24:57
Speaker
I mean, this feels like a facetious question, but I do have to ask: does that actually work? It does. Yeah, it does. And we weren't sure it would work, right? When we tried it, we had no idea if it would work.
00:25:12
Speaker
And we had some dark moments. We had some really dark moments while building this. We had a few times where we thought, yeah, we've got it, it works. And then we would discover that one in every trillion CPU instructions, something is a little different.
00:25:32
Speaker
And that drove us insane for a while. But we eventually got it. So no, it's really deterministic. And this is not vaporware, right? We've got very large customers who are using this every single day, and it works.
00:25:47
Speaker
Have you ended up building, basically, an Intel chip with all the event buses in software? No, because that would be too slow.
00:25:59
Speaker
So that's another way that you could try and do this. And that would actually give you some benefits over what we've done, right? If you did it that way, with a cycle-accurate CPU simulator, you could...
00:26:17
Speaker
find all kinds of weird bugs that require true multi-core parallelism, or weird atomic memory operations, stuff like that.
00:26:28
Speaker
We are not trying to find those bugs, because 99.999% of developers can never even think about those bugs, right? We're trying to find more everyday type stuff.
00:26:43
Speaker
And so our approach sidesteps a lot of that. What it does architecturally, fundamentally, is we say: look, most operations that run on a CPU can just run normally, at full speed, on the host CPU, because the result is deterministic.
00:27:04
Speaker
How long they take is not deterministic. And so we do have to completely fake the clock, and ensure that when I add two numbers on the CPU, that always takes the exact same number of clock cycles, regardless of where things are in the memory hierarchy or whatever.
00:27:21
Speaker
But leaving that issue aside, it's deterministic and can execute at full speed. And then if we detect that you are trying to do something non-deterministic, like read the timestamp or get a hardware random number or something like that, we can trap that instruction and return a deterministic answer.
00:27:39
Speaker
But what that means is 99% of your CPU instructions are just executing on the host CPU, and it's very fast. And so that means there's not much performance overhead at all to doing this, which is, I think, really important to making it actually practical.
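A toy model of that split, with invented names, just to illustrate the architecture: deterministic instructions run natively at full speed, and only the non-deterministic ones (timestamps, hardware randomness) get trapped and answered from seeded state.

```python
import random

class DeterministicMachine:
    """Illustrative sketch, not Antithesis's hypervisor."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # stands in for hardware randomness
        self.cycles = 0                 # fake clock, advanced by fixed costs

    def trap_rdtsc(self) -> int:
        # Trapped timestamp read: answered from the fake cycle counter.
        return self.cycles

    def trap_rdrand(self) -> int:
        # Trapped hardware random number: answered from the seeded PRNG.
        return self.rng.getrandbits(64)

    def execute(self, instruction) -> None:
        # Every instruction charges a fixed, deterministic cycle cost,
        # regardless of caches or memory hierarchy; the deterministic
        # work itself runs "natively" at full speed.
        self.cycles += instruction.fixed_cost  # hypothetical attribute
        instruction.run()                      # hypothetical method
```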
00:27:55
Speaker
Yeah, because we're going to talk about your optimizations, but you're still going to have a large search space.

Optimizing Testing Scenarios

00:28:00
Speaker
You can't afford for the CPU to run significantly slower than usual. That's right. And there's ah even a there's even a neat trick here, which is um we can actually run the CPU a bit faster than real life sometimes.
00:28:14
Speaker
Because if the guest CPU is idle, we can just advance time faster than real time, to the next time it would do something.
00:28:25
Speaker
Right. So any time that the CPU is halted, we just fast-forward to when the next interrupt would occur. And so what that means is, you can think of it as running the host CPU always at 100%. And then if the guest CPU is running at less than 100%, it ends up running faster than real time.
00:28:42
Speaker
An analogy here is, let's go back to Mario. If you're trying to emulate a Nintendo game on a modern computer, you could actually run that emulator much faster than real time, because the host CPU is much faster than the emulated CPU.
00:28:55
Speaker
And so we can do something very similar if your system is idle. And that's really very powerful, because distributed systems, client-server systems, are very frequently idle, because they rely so much on waiting for network communication or waiting for events to occur.
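A sketch of that fast-forward in miniature, as it might look inside a simulator's event loop (illustrative structure only, not Antithesis's code):

```python
import heapq
import itertools

class VirtualClock:
    def __init__(self):
        self.now = 0                   # virtual nanoseconds
        self._timers = []              # min-heap of (deadline, seq, callback)
        self._seq = itertools.count()  # tie-breaker so the heap never compares callbacks

    def schedule(self, deadline: int, callback) -> None:
        heapq.heappush(self._timers, (deadline, next(self._seq), callback))

    def on_guest_halt(self) -> None:
        # The guest CPU is idle: instead of waiting in real time, jump
        # the virtual clock straight to the next pending event and fire it.
        if self._timers:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = max(self.now, deadline)
            callback()
```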
00:29:13
Speaker
Yeah, yeah, I can totally see that. Tell me then about how you intercept the networking stack and make that deterministic. Because it sounds like you've got a problem where...
00:29:28
Speaker
You've got to simulate the entire internet if I access the internet, and stub that in? Yeah. So there's two levels on which I can answer this question. One is the actual networking stack itself, in terms of: I have put this data in this buffer and now I'm calling send on it.
00:29:48
Speaker
We make that deterministic just the same way we make everything else your computer does deterministic. The timers, the timing, everything is just going to be the same. The problem is exactly as you say: if my system goes and tries to call out to some thing on the internet, that thing could return a different answer the next time I call out to it.
00:30:10
Speaker
And so that's why we just ban that. And that is a limitation and a friction point for using this, and it's why we had to do things like give you a fake AWS, and give you a fake everything else that you need.
00:30:26
Speaker
So what it means is that you need to take your application and all of its dependencies, your database server, your whatever...
00:30:38
Speaker
whatever things need to actually be able to run to provide the service, and provide them all inside of the hypervisor. We have some thought that in the future we may be able to relax that limitation by recording just what happens at the network boundary, so that we can go back and replay it exactly.
00:31:04
Speaker
Imagine that you call out to, I don't know, the ChatGPT API, right? And you get some particular answer coming back. We can just note the moment. All the events inside the VM are already known, because they're deterministic.
00:31:21
Speaker
So when we replay it, you'll make that call at the exact same moment in the simulation. And then we just need to record when we delivered the answer back to you, and what the answer was. And that's a very small amount of data, relatively speaking, at least for a lot of applications.
00:31:37
Speaker
And so we may be able to relax that at some point, but that's not currently a feature we have. Presumably you'd have to save that for the next time you run with the same seed, right? Yes, exactly. It would basically be a form of input, right?
00:31:53
Speaker
Yeah, this is absolutely ringing my functional programming bells. The hard part is always side effects, right? That's exactly right. That's exactly right. And so what we're trying to do is just say: there are no side effects, period.
00:32:05
Speaker
And, you know, that actually can get you quite far. If you've done the legwork to build an entire fake AWS. Yeah, that's right.
00:32:16
Speaker
That's right. Are you doing that for other things? The problem with AWS is you don't have the source code, right? But are you doing, like, a fake Postgres, or is it enough to just run Postgres in the hypervisor?
00:32:29
Speaker
You just run it in the same hypervisor. You just chuck it in there. It's in another container. Yeah, we have tons of customers who do this. That's very easy. Anything that's open source, piece of cake, right? Because it's just running on Linux. It's just running. That's right.
00:32:45
Speaker
The thing that's hard is things like AWS. But it turns out that 99% of people just use AWS. So if you get that one, you actually get a tremendous amount of the market satisfied.
00:32:58
Speaker
There is a long tail of third-party things. The good news here is that we can actually leverage the open source community a bit, because a lot of people have written emulators for these things to use for their own local testing, right?
00:33:13
Speaker
There's a pretty good Snowflake emulator out there. There's a pretty good BigQuery emulator. There's a lot of these local emulators. And so often those are a drop-in replacement for the service.
00:33:26
Speaker
And then we can just kind of go, and we're off to the races. You'd be no worse off than your existing test suite. That's right. Yeah, okay. Okay, so you're simulating an entire operating system.
00:33:40
Speaker
I see that you're overclocking it. Let's say my normal process runs at 10% CPU usage, so you can run ten times faster. That's still not enough. Take me through how you optimize the search space.
00:33:54
Speaker
Yes, great. So the really key insight is that the determinism gives us the ability to travel through time, right?
00:34:07
Speaker
Because if I get to somewhere interesting in my program and then I make a mistake and I lose that thread, I can just be like, ah let's go back.
00:34:19
Speaker
And I don't have to save anything. I don't have to snapshot anything. I don't have to. I can just start the simulation again and do the same things that got me to the interesting place in the first place.
00:34:31
Speaker
If you have the same pseudorandom seed, you get exactly the same series of operations. Yeah. So far, that's not that exciting. Who cares? But the reason it's really exciting is that there's a lot of bugs which sort of look like: I need really rare event A to happen, and then I need really rare event B to happen, and then the bug happens.
00:34:57
Speaker
And so let's say that event A and event B are each one in 1,000, right? Well, the probability of getting A then B is one in a million, and that's a lot less likely.
00:35:09
Speaker
And if I have event C that also needs to happen, now we're at one in a billion. Wow, that's really unlikely. I'm never going to see that in my tests. I'll just always see it in production, right? Because I've got millions of users. So that's kind of why testing is hard, in some sense. Yeah.
00:35:26
Speaker
But the crazy thing is once I have a time machine, once I have a hypervisor, I can run until I make event A happen. And then if I notice that event A has happened, I can say, this is interesting.
00:35:41
Speaker
I want to now just focus on worlds where event A has happened. I don't need to re-find event A every single time. I can just lock it in. You know exactly how to get there. Yeah, yeah. It's like if you play computer games, it's save scumming, right? It's like I can just save my state when I got the boss down to half health, and now always reload from that point.
00:36:05
Speaker
And so it takes me 1,000 trials to get event A to happen and now just another 1,000 to get B to happen instead of it taking a million trials if I always have to start from the start.
00:36:16
Speaker
Yeah, yeah. And then I can get to C after 3,000 trials instead of a billion. And you can see how the more such incremental intermediate steps I've got, the bigger the speed-up I get. And in general, it's like an nth-root asymptotic speed-up, which is very significant.
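The back-of-envelope arithmetic behind that:

```python
# Three independent events, each 1-in-1,000, all needed to trigger the bug.
p = 1 / 1000

trials_from_scratch = 1 / p**3   # ~1e9: all three must line up in one run
trials_with_checkpoints = 3 / p  # ~3e3: checkpoint after each rare event

print(f"{trials_from_scratch:,.0f} vs {trials_with_checkpoints:,.0f}")
# 1,000,000,000 vs 3,000: roughly the cube root, hence the "nth root" speed-up
```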
00:36:31
Speaker
Yeah, yeah. And so the key then is: can I reliably recognize that something interesting has happened? Yes. And what is interesting?
00:36:42
Speaker
Right. You've anticipated my next question. Right. And the good news is, there's all kinds of clues. There's so many clues. One thing is, we're running this all in our hypervisor, and so we can see which lines of code are running.
00:36:57
Speaker
And so if some new line of code in some new function runs for the very first time, that's probably interesting. We should probably try following up on that a bit. Hang on, how are you noticing that from down there, very close to the CPU level?
00:37:12
Speaker
Well, OK, so we actually can just see call stacks from the hypervisor. And that is an option that we have in our back pocket.
00:37:23
Speaker
But the way that we, practically speaking, usually do this, because it is a more delightful overall product experience, is we actually have our customers instrument their code with callbacks that tell our platform that we've gotten to particular places.
00:37:38
Speaker
And that's a better overall product experience, because it means we can also report stuff about coverage to them, and give them insights like: hey, this bug only happens when this particular line of code runs.
00:37:50
Speaker
We can associate to actual function names and so on. And we have good support for a large number of languages. And for some languages, it's literally entirely automatic, and the customer doesn't even have to do anything.
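The shape of that instrumentation might look something like the following. To be clear, this is an invented sketch of the idea, not the actual Antithesis SDK:

```python
# Instrumented code reports "I reached this location" to the test platform;
# a never-before-seen location is a strong hint that something interesting
# just happened and is worth exploring further.
seen_locations: set[str] = set()

def report_new_coverage(location: str) -> None:
    print(f"new code location reached: {location}")  # stand-in for the platform hook

def reached(location: str) -> None:
    if location not in seen_locations:
        seen_locations.add(location)
        report_new_coverage(location)

def transfer(amount: int, balance: int) -> int:
    reached("billing.transfer:entry")
    if amount > balance:
        reached("billing.transfer:overdraft")  # first time here is a big clue
        raise ValueError("insufficient funds")
    return balance - amount
```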
00:38:02
Speaker
So basically, which lines of code have run and which new functions have run is one very good clue. Another one is, developers often put log messages when something interesting has happened.
00:38:16
Speaker
And sometimes the log message is, "this should never happen," or something like that. And so if our system sees the log message "this should never happen,"
00:38:28
Speaker
Yeah, that might represent a test failure, or it might not, but it probably means you've gotten to something that the developer thought wouldn't happen. And so now we can focus our exploration on what happens next, and can we get something really interesting to happen?
00:38:50
Speaker
There are other, more speculative ideas that we have. One thing that we can do, for instance, is look at any time the code branches,
00:39:04
Speaker
like when we're going around a loop, or when we're taking an if statement. Suppose there's a loop and... actually, let's take an if statement, it's an easier example.
00:39:22
Speaker
Suppose there's an if statement, and we hit it 1,000 times; 500 of those times we take the if, and 500 of those times we take the else.
00:39:29
Speaker
That's probably a pretty boring if statement, from the perspective of trying to find deeper behavior. But suppose we have a different if statement in the program, and we hit it 1,000 times, and it takes the if every single time we've reached it.
00:39:45
Speaker
Yeah. Well, now, if we do get the else one time, that's a really interesting situation. That's unexpected.
00:39:57
Speaker
And so our platform is smart enough to be like: oh, we got the else that time. Well, let's just start from there. And let's really hammer this else case and see what else we can make happen down that path.
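A sketch of that branch-rarity heuristic (invented names, arbitrary threshold):

```python
from collections import defaultdict

# site -> [times_condition_true, times_condition_false]
branch_counts = defaultdict(lambda: [0, 0])

def record_branch(site: str, taken: bool) -> bool:
    """Record one branch outcome; return True if it was surprisingly rare."""
    counts = branch_counts[site]
    total_before = sum(counts)
    outcome_index = 0 if taken else 1
    # A 50/50 branch is boring; an outcome never seen in 1,000 visits
    # marks a state worth checkpointing and hammering.
    rare = total_before >= 1000 and counts[outcome_index] == 0
    counts[outcome_index] += 1
    return rare
```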
00:40:12
Speaker
Okay, that gives me a bunch of questions. The first one is just to make sure we're understanding this. Surely you found a seed that gets me to this one-in-a-thousand else case.
00:40:26
Speaker
Yep. But now I want to fork off in a thousand new directions. Yep. And you can't be storing the state of the machine from that seed and then introducing a new seed?
00:40:42
Speaker
I don't see how you take your original seed and modify it to be slightly different. Yes. So what we are doing, actually, is periodically reaching into the computer and reseeding everything.
00:40:58
Speaker
Oh, you're snapshotting the state. Well, we don't even have to snapshot when we do this. We're just reaching in and changing one register. We're just being like: hey, the next time anything in the whole computer asks for a random number, instead of giving it a one, give it a two.
00:41:15
Speaker
And that does get recorded, right? At instruction number 37 trillion, we decided that the next time they asked for a random number, it would be this. And so when we're trying to explore more and see what new things we could find from some given point, we just go back to the last time we did that, and we change the decision we made there.
00:41:40
Speaker
Oh, okay. Yeah. And what's really cool about that, by the way, is it has this completely separate side benefit, which is: when we actually do find a bug, we can then go back and ask, when did the bug become inevitable?
00:41:59
Speaker
This is all kind of crazy. How? We can go back to the previous time that we reached in and changed the future. And we can try changing it to, like, 100 different things and see if they all still hit the bug.
00:42:14
Speaker
And if they do, it means the bug was already baked in. And then we can go back to the next one before that and do the same thing. Yeah. And we can sort of bisect backwards. And then we can find the exact moment when the bug went from really unlikely to really likely.
00:42:29
Speaker
And then we can do things like look at which lines of code were running then, and what log messages were being printed then. And often that is actually enough to root-cause the bug, too.
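A sketch of that backwards bisection. Here simulate() is a hypothetical callable that replays a prefix of the recorded entropy injections and then randomizes everything after it:

```python
import random

def bug_probability(simulate, injections, prefix_len, samples=100):
    """Replay the first prefix_len injections, randomize the rest,
    and measure how often the bug still occurs."""
    hits = 0
    for trial in range(samples):
        fresh = random.Random(trial)  # new decisions after the prefix
        if simulate(replay=injections[:prefix_len], then=fresh).hit_bug:
            hits += 1
    return hits / samples

def find_inevitable_point(simulate, injections):
    """Earliest injection after which the bug (almost) always happens."""
    lo, hi = 0, len(injections)
    while lo < hi:
        mid = (lo + hi) // 2
        if bug_probability(simulate, injections, mid) > 0.99:
            hi = mid      # bug already baked in with only `mid` replayed
        else:
            lo = mid + 1  # still avoidable; the fateful decision is later
    return lo
```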
00:42:42
Speaker
Yeah. How are you reconstructing that for user space? Are you drawing me a picture of the evaluation path the code actually took?
00:42:55
Speaker
Yeah. So, basically, everything. We've built a special shim that makes it so that every attempt by any software, whether it's Linux, whether it's the user's own software, whether it's our software, which is injecting faults and changing the scheduling and the order of stuff, anytime any of those things tries to get a random number, it goes to this special random source that we control.
00:43:30
Speaker
Okay. Does that make sense? Yeah. And so by modifying what answer we give there, we can really quickly change the trajectory of what the system does.
00:43:41
Speaker
What operations are getting played against it, what order the threads are running in, whatever. And so that's a way for us to very quickly explore different possible futures from some point.
00:43:56
Speaker
There are ways that a user can shoot themselves in the foot. If you, at your program startup, ask for 100 random numbers and then cache all of those and look them up later to decide what decisions to make, you will have effectively defeated our intelligent state-space search. And so we tell users not to do that.
00:44:21
Speaker
But assuming that you're just doing the sensible thing, and going and asking /dev/urandom for a random number before you decide what to do, we will have total control over the actions of your software and can go back and change them.
00:44:37
Speaker
Yeah, I can see how you can get to that point now. But how do you reconstruct that into something that's useful to me trying to debug the bug? Because it seems like you've got this history that says: OK, the operating system started up, and then the software started up. And I need to know the path that went through my code and hit that point.
00:44:58
Speaker
How do you reconstruct that report? Yes. OK, great. Yeah.
00:45:03
Speaker
Right. So there's basically two kinds of histories here. That's, I think, the key. One is the history of what did we do, right? And that's totally meaningless to you.
00:45:17
Speaker
And it's, by design, very, very small. It's like: at instruction 37 trillion, I changed this random number. At instruction 42 trillion, I made this thing pause for a second. It doesn't mean a thing.
00:45:33
Speaker
Right. Then there's the history of what your system actually did. And that's a story that is mostly told in your existing log files, which you probably know how to interpret.
00:45:45
Speaker
And so while we're testing your system, we are streaming out all of the logs, exactly the way that in production you would stream them to Datadog or Splunk or something.
00:45:57
Speaker
We're streaming them all out and storing them in this massive database. And when we hit a bug down one of these particular paths, we can then stitch back together the logs of your system to form a unified view, a unified history of what it did down that path until it got to the bug.
00:46:19
Speaker
Okay. Yeah, but why... That makes me wonder, why are you bothering to do that? Why not just ignore all the logging until you find a bug, and then replicate it and get the logging there? Wouldn't that be less storage space?
00:46:32
Speaker
Yes, and we're about to add that feature.
00:46:37
Speaker
Yeah, the answer is because we're lazy. Oh, okay. Fair enough. Well, number one, because we're lazy. Number two, there is actually some benefit to doing it, which will be lost for users who enable this optimization that's coming, which is: when you store all the logs across all the many worlds, both the ones that saw the bugs and the ones that didn't, you can then...
00:47:05
Speaker
post hoc, go and analyze the logs as if they were production logs, and go look for trouble that you hadn't defined as a test property ahead of time.
00:47:17
Speaker
So this goes all the way back to the start of our conversation, which was: how do you pick good test properties? This is yet another way you can do it. Suppose that you've got your Antithesis tests all set up, and we've run with you for a while and found a bunch of bugs, and now it's all green. Hooray, you're happy.
00:47:35
Speaker
And then one day you hit some horrible crash in production, or some other horrible thing, and you're angry. You're like: man, Will, why didn't Antithesis find this problem?
00:47:46
Speaker
And I ask you, well, OK, what does it look like when the bug happens? And you say, well, it looks like my system returns this particular error code. And I'm like, OK, did we have a test property that says your system should never return that error code? We go look, and it doesn't have that test property. Well, that's too bad.
00:48:06
Speaker
We can add that test property now, and we will now be able to find that bug in the future, which is good. But it's still a bit sad, right? You're probably in the middle of a production emergency, right?
00:48:22
Speaker
You'd like to know if your tests have ever seen this problem before. Well, because we've stored all the logs in this enormous database, you can just go back and query and say: hey, if I had this test property, would I have had any failures?
00:48:39
Speaker
Oh, okay. And then we can say: why, actually, yes. It turns out that on June 17th, we saw that exact failure. Now we can give you a very, very detailed log of what led up to that failure, without you having to run any new test or do anything.
00:48:58
Speaker
But we can also give you a live reproduction of that issue. Because one other feature I haven't told you about yet is: we've got this crazy hypervisor, but you can also use it interactively, not just in test mode.

Interactive Debugging and Industry Impact

00:49:12
Speaker
So once we've found a particular situation, you can ask us to replay up to that situation, and then we just give you like a bash terminal inside the hypervisor at that moment, frozen.
00:49:25
Speaker
Yeah. And so now you can go poke around your system and be like: okay, well, what is this endpoint returning? What's in my database here? What does this file have in it? And just sort of pore through it and debug it, without having to worry about debugging in production and all that that entails.
00:49:43
Speaker
Yeah. So are you saying that's like stepping me back five minutes before the crash? Yeah. Or can I actually say: okay, I'm five minutes before the crash, and now I would like to simulate this thing happening on the network, and step through it?
00:49:59
Speaker
You can do any of these things. The default is we deliver you there at the nanosecond that the crash happened. And then you can poke. Well, actually, it's not that accurate yet. It's actually to within a few microseconds.
00:50:16
Speaker
But you can then poke around or whatever. And a very natural thing to then say is: OK, hang on. I want to go back a few seconds before the bug. And I want to go see what this thing was doing then. And I want to take a core dump of this process before the bug. And maybe I want to go kill this process and see, does the bug still happen?
00:50:35
Speaker
Or like I'll try and restart the database. And does that fix the bug? Does that unstick us? And you can sort of ask all these counterfactual questions. And you can do it without the fear that by doing that, you are going to lose the repro, lose the bug.
00:50:50
Speaker
Because you can always hit a button and just be right back there. Yeah, yeah, yeah. So you can maybe answer the age old question of, is it better to regularly restart servers or not? Which I've argued a lot in the past.
00:51:03
Speaker
Yes, yes. So yeah, that's why we store all the data, even though we don't have to. Okay, yeah. So are you doing that on a kind of logging level, or are you recreating the operating system at that exact point, or both?
00:51:19
Speaker
Right. So the input data we save is exact instructions for getting back to any situation. The output data we save is everything your program did leading up to some point.
00:51:31
Speaker
And the data model for that latter thing is very, very efficient. Because if you think about it, all of the histories and all the timelines that we're exploring have a natural tree structure to them.
00:51:46
Speaker
Yeah. Yeah, yeah. And so we can deduplicate output that has the same common prefix, right? If it's like: I tested your system for 10 seconds, and it did a whole bunch of stuff and output a whole bunch of stuff, and then we got to an if statement, and then we tried both branches, and we saw what happened down both paths.
00:52:05
Speaker
We can just store everything leading up to that if statement one time. Yeah, OK. I'm getting visions of git trees in my head. Yeah, very similar in some ways.
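A toy version of that tree-shaped storage, a trie of log chunks keyed by fork decisions (not Antithesis's actual database):

```python
class TimelineNode:
    def __init__(self, log_chunk: bytes):
        self.log_chunk = log_chunk  # output produced since the last fork
        self.children = {}          # fork decision -> TimelineNode

    def fork(self, decision, log_chunk: bytes) -> "TimelineNode":
        # Each branch stores only what happened *after* the fork point;
        # the shared prefix lives once, in the ancestors.
        return self.children.setdefault(decision, TimelineNode(log_chunk))

def full_history(path_from_root) -> bytes:
    """Reassemble one timeline's logs by walking root to leaf."""
    return b"".join(node.log_chunk for node in path_from_root)
```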
00:52:16
Speaker
We're actually currently in the process of migrating to a custom database that we wrote, that's designed to just store data in this format, because nothing we could find commercially really did the right thing here.
00:52:33
Speaker
I'm sure someone would sell you a blockchain for that. Yeah, probably. Okay. The one thing I'm really not clear on is: you've still got a huge heuristic space, right? If you're tracking every if statement in my code base to say, okay, here are the ones where we've done this a thousand times, so we need to keep an eye on that...
00:52:56
Speaker
Are you still not getting a massive amount of search space that you can't deal with? It's still very big. This is where sort of the dark art of testing comes in, right? There's a lot that you can do to make this search even more efficient.
00:53:16
Speaker
Everything I've discussed so far has been almost on abstract first principles, knowing nothing about computer programs or how people write them, just using math, right?
00:53:31
Speaker
But then on top of that, there's a whole bag of tricks. I can give you some examples. Please. A really easy way to find bugs in software is to construct asymmetric partitions between nodes.
00:53:48
Speaker
If you make it so that node A can talk to node B, but node B cannot respond, that frequently finds bugs. Just empirically, it does. Yeah, I can totally believe that.
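In a deterministic simulation, that kind of asymmetric partition can be just a drop rule in the packet router. A toy sketch:

```python
class SimulatedNetwork:
    def __init__(self):
        self.drop_rules = set()  # (src, dst) pairs to drop silently
        self.inboxes = {}        # node -> list of (src, payload)

    def partition_one_way(self, src: str, dst: str) -> None:
        """Drop everything src sends to dst, without touching the reverse path."""
        self.drop_rules.add((src, dst))

    def send(self, src: str, dst: str, payload: bytes) -> None:
        if (src, dst) in self.drop_rules:
            return  # silently dropped: no error, no reply
        self.inboxes.setdefault(dst, []).append((src, payload))

net = SimulatedNetwork()
net.partition_one_way("nodeB", "nodeA")  # A can talk to B; B can't answer
net.send("nodeA", "nodeB", b"request")   # delivered
net.send("nodeB", "nodeA", b"response")  # vanishes; now watch A's retry logic
```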
00:53:59
Speaker
And so we do it a lot, right? You monsters. Yeah. Another thing that we do a lot is pause particular threads, right?
00:54:15
Speaker
And let other threads race ahead, and see if we can create data races or concurrency bugs that way. Right, yeah, makes sense. And that's a very realistic thing to happen in production. Anytime you have a GC pause in a Java server, other servers are maybe doing things.
00:54:35
Speaker
So that's a thing that we do. But then there's a question: how long do you pause it for? And it turns out there's actually a pretty good answer to that question. You basically want to use a kind of fat-tailed power-law distribution.
00:54:49
Speaker
The reasons for that are a long story that we could go into, but it's true. And exactly how you fit the parameters to that distribution varies from program to program.
00:55:02
Speaker
But since we know that's a pretty good distribution to use, and we know some parameters that work pretty well for a lot of programs, we can start there with any given new program.
00:55:13
Speaker
And that immediately makes us find more bugs faster than if we just started with a completely uninformed uniform random distribution. Right? Yeah, yeah. And so part of the thesis here is also that as we test more and more people's software, we're just going to learn more and more such things.
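For a sense of what that looks like, here's a small Go sketch that samples pause lengths from a Pareto distribution, one standard fat-tailed power law. The xmin and alpha values are illustrative guesses, not Antithesis's tuned parameters.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// paretoPause draws a pause length by inverse-transform sampling:
// if U ~ Uniform(0,1], then xmin / U^(1/alpha) is Pareto(alpha, xmin).
// A small alpha gives a fatter tail: mostly short pauses, with rare,
// very long ones — the GC-pause-like events that shake out races.
func paretoPause(r *rand.Rand, xminMs, alpha float64) float64 {
	u := 1 - r.Float64() // in (0, 1], avoiding division by zero
	return xminMs / math.Pow(u, 1/alpha)
}

func main() {
	// A fixed seed keeps the fault injection itself deterministic.
	r := rand.New(rand.NewSource(1))
	for i := 0; i < 5; i++ {
		fmt.Printf("pause thread for %.2f ms\n", paretoPause(r, 1.0, 1.1))
	}
}
```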
00:55:33
Speaker
And it's going to make the search more efficient. But I think you're right to dig into this, because in some sense this is the part of the story that's most unbelievable, right? The thing we're talking about is basically Turing's halting problem. You know there's no perfect solution here.
00:55:52
Speaker
So we know from the beginning that this is going to be very hard. And a bit of our belief is that, because it's very hard, somebody should get really, really good at it, or multiple people should get really, really good at it.
00:56:11
Speaker
But not every single developer should worry about: I'm going to pause this thread; what's the optimal distribution of pause lengths that I should use? Yeah.
00:56:22
Speaker
But do you find then, it seems to me, that you've got a risk that I come to you with my software and you find a bug, and really that's a Java bug?
00:56:34
Speaker
It's like, that feels a bit unfair. Yeah. Yeah, we have done that before. You know, I think that in such situations, people are usually still happy to know, because that Java bug could happen in production and you might be very sad if it did. And perhaps you should... Yeah, it's still... That's right. Perhaps you don't want to use that Java feature. Look, I will say another thing: something we try to emphasize throughout our product is that we're not judging you. Everybody's got different quality goals, everybody's got different amounts of time for this stuff on their roadmap, and everybody's got different needs.
00:57:18
Speaker
Some people really do want to find every single last bug in their program, because they're designing a pacemaker or an airplane or something. And then other people are a startup making a pet food website, and they're like, you know, it's okay if it's down occasionally.
00:57:35
Speaker
And we want to support both kinds of developers, right? And so we do try to offer good facilities for you to say: thank you for finding that bug.
00:57:46
Speaker
I do not intend to fix this. Please don't tell me about it anymore. And we will. But there are still things that could be interesting about that bug, even if you tell us that you're not going to fix it.
00:58:02
Speaker
For example, if we suddenly stop finding it, that might be interesting to you. Maybe it means you fixed it by accident, or maybe it means your test got weaker somehow and we can't find it anymore. Or the JVM finally fixed that bug and we happened to upgrade, right? Yeah, that's exactly right.
00:58:17
Speaker
And so, you know, I'm very sensitive to the fact that we could easily, by accident, create something where every time you run our platform, it feels like a homework assignment.
00:58:34
Speaker
It's like: oh no, what have they done now? And we really don't want it to feel like that. We want to feel like your cybernetic best friend, helping you do your job better. And whether your goal is to eliminate every single last bug, or to find only the very, very worst bugs and let the rest go, or just to know about them so that when you hit one in production, you can reproduce it and fix it really fast...
00:59:04
Speaker
All of these are valid goals, and we want to support you in each of them. Okay. Yeah, yeah. There's a natural question that follows on from that, but I have to ask the cheeky one first. Do you find that there are certain languages that are better or worse at things? Does Go always cause garbage collector bugs? Does Rust always cause asynchronous bugs?
00:59:28
Speaker
Oh, I see what you're saying. Man, no comment. Do you want to go on record with answering this? Yeah, I...
00:59:40
Speaker
I will say... you know, I think the things that we have discovered are the things that everybody kind of already knows. For example, very large Java projects have vast numbers of uncaught runtime exceptions.
00:59:56
Speaker
I assume you knew that from your past experience. And I did too. But I do think I now have statistical confirmation of it: they all do. Go is an interesting one.
01:00:10
Speaker
We do find lots and lots of concurrency bugs in Go. But it's not because Go programmers are bad. It's because Go has excellent tooling for finding concurrency bugs. So it's really easy in Go to compile your binary with T-SAN, the thread sanitizer.
01:00:28
Speaker
Oh, yeah. It's an LLVM sanitizer, which basically detects data races in your program automatically.
01:00:39
Speaker
dynamically at runtime when they happen. Oh, nice. You have to provoke the race. But if you provoke the race, you get a little printout that's like, hey, your program didn't crash this time, but there is a race here.
01:00:50
Speaker
These two threads access the same variable without any kind of mutex or whatever. Oh. And so it's actually tremendously helpful. We have this really good test property that we can use with Go programs as a result.
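For anyone who hasn't seen the race detector in action, here's a minimal Go example it will flag; Go's -race mode is built on the ThreadSanitizer runtime. Run it with: go run -race main.go.

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})

	go func() {
		counter++ // unsynchronized write from a second goroutine
		close(done)
	}()

	counter++ // concurrent unsynchronized write from main: a data race
	<-done

	// Without -race this usually prints 2 and exits happily; with -race
	// you get a "WARNING: DATA RACE" report naming both writes.
	fmt.Println(counter)
}
```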
01:01:05
Speaker
You know, so that's cool. And C programs do have lots of memory safety issues. Yeah, I can believe that. I absolutely can. So is it that you're not going to say certain languages are better or worse at some things, because you might be more likely to detect certain bugs in some places?
01:01:31
Speaker
That's right. I will say certain languages have made it easy to find certain kinds of bugs, which I appreciate. I think that's a good thing. And, you know, when it comes to the language wars, I'm really just not a partisan.
01:01:44
Speaker
Maybe it's just that my view is that they're all bad. It's just: what's the right tool for this particular job? Yeah. From a certain point of view, they're lovably terrible.
01:01:58
Speaker
That's right. And they're all made by human beings who are trying their best. Yeah, absolutely. Okay, so it's a good answer. It's not the controversial answer I need for clickbait, but it's a good answer. So the naturally following question I was going to ask you is: who actually does use this? I mean, what are people using it for? Do you get operating system vendors contacting you?
01:02:24
Speaker
So you can go on our website and see a long list of our customers. I would say that the initial success we've stumbled into has been with three main categories, which we've somewhat surprisingly gotten a ton of.
01:02:41
Speaker
One of them is any kind of database or data infrastructure or data streaming thing or distributed computational framework. Confluent is a big user.
01:02:52
Speaker
MongoDB is a big user. Various Kafka re-implementations are all using this to test their stuff. There's a lot of that.
01:03:04
Speaker
And we've actually got a fair number of open source projects there too. We recently started testing etcd, working together with the Linux Foundation on that one. That's been good. So, a lot of that kind of stuff.
01:03:17
Speaker
Then I'd say the second big category is blockchains, cryptocurrencies, which... Doesn't surprise me. It does not? Yeah, it surprised me initially. And then once I thought about it, I was like, oh, that makes sense, right?
01:03:29
Speaker
They're big distributed systems with billions of dollars at stake. They probably need some testing. And not always the most rigorously engineered, rocket-science projects in the world, dare I say.
01:03:45
Speaker
I will say it's very bimodal. Oh, yeah. I've seen a few that I've been really, really impressed with.
01:03:56
Speaker
They really take the engineering seriously. And then I've seen a few that are a little bit flying by the seat of their pants. But they want to do better, which is why they're talking to us, which is great.
01:04:11
Speaker
And then the third big category is sort of adjacent to that one: fintech. We've gotten a bunch of fintech customers. Yeah, yeah. Where the cost of a bug can be catastrophic.
01:04:24
Speaker
That's exactly right. The one that surprised me the most, that we haven't gotten yet, is gaming companies. I figured that the game industry would be all over this, because bugs for them are very costly as well. And they spend so much money on QA right now, and it's relatively primitive.
01:04:43
Speaker
But for whatever reason, we do not yet have Valve or Epic or anybody like that as a customer. So if any of them are listening, give me a call. Could you do that? Could you support that, though? Because wouldn't you then need to, perhaps you already do, start simulating a GPU?
01:05:00
Speaker
Right. So I was initially thinking that we would just test their engine code, and especially test their netcode, which is very tricky stuff and quite hard to get right. We could also actually test the game logic.
01:05:13
Speaker
And as I mentioned, if you go to our website, you'll see many, many demos of using this to test Nintendo games. And I think a complicated interactive computer game is one of the most complex software projects that people make on a regular basis. Yeah. And it sort of exercises every part of computing at some point. Right. Yeah, that's exactly right.
01:05:38
Speaker
And I think you could get around the GPU thing. Look, I think we actually do want to start emulating GPUs. But in the meantime, you could get around it by just giving us a headless build of your game. We then wouldn't be testing the rendering engine, but that's probably OK.
01:05:53
Speaker
OK. So there's a question I really want to ask you. Let me just check I understand something. When you're testing Mario, you are not emulating an old 8-bit Nintendo system. No, we are.
01:06:06
Speaker
Are you? Yeah. Ah, so you've done a hypervisor for the NES. Yeah. Well, the good news is other people have already done that. So the history here is actually kind of fun.
01:06:19
Speaker
We were working on the hypervisor, but it wasn't done yet, and we wanted to parallelize our efforts. So we also wanted to start working on the bit of our platform that actually searches for the bugs: the bit that connects to the hypervisor and says, no, go back to that point; change that random number; no, do this differently; no, go around that loop one more time, right? That's a very complicated component as well.
01:06:43
Speaker
And we wanted to test that, but we didn't have the hypervisor ready yet. And so we were like: well, is there anything else in the world which is deterministic, where we can take cheap snapshots of all the memory, and we can replay things perfectly?
01:07:00
Speaker
And there's a large set of interesting programs that we can test it on. And then we were like: yes, it's called a Nintendo emulator. And Nintendo emulators are by and large deterministic, because the guys making TASs, tool-assisted speedruns, are making use of that property.
01:07:20
Speaker
And then they have very cheap snapshotting, because the Nintendo only has 512 kilobytes of memory or something. And so you can just snapshot the whole thing. And go back in a time machine to the eighties and say: in the future, people will mock you for how small your memory is.
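To see why tiny, deterministic machine state makes this trick cheap, here's a toy Go sketch of snapshot-and-branch exploration. The Console type is a stand-in invented for this example, not a real NES emulator interface.

```go
package main

import "fmt"

// Console is a toy deterministic machine whose entire state is small
// enough to copy by value — the property that made NES emulators such
// a good stand-in for the real hypervisor.
type Console struct {
	RAM [2048]byte // NES-scale working RAM
	PC  uint16     // stand-in for the rest of the CPU/PPU state
}

// Step advances the machine one frame for a given controller input.
// Same state + same input = same result, every time.
func (c *Console) Step(input byte) {
	c.RAM[int(c.PC)%len(c.RAM)] ^= input // placeholder for real emulation
	c.PC++
}

func main() {
	c := &Console{}
	c.Step(0x01) // shared prefix of the run

	snap := *c // snapshot: a plain value copy of the whole machine

	c.Step(0x02) // explore timeline A
	fmt.Println("timeline A PC:", c.PC)

	*c = snap    // rewind to the branch point
	c.Step(0x80) // explore timeline B from the same state
	fmt.Println("timeline B PC:", c.PC)
}
```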
01:07:35
Speaker
Yeah, well, and there's a giant, diverse playground of very different computer games, which actually are incredibly challenging to test thoroughly and require different techniques to get into.
01:07:52
Speaker
And so this was just this unbelievable scene of rich, rich research problems, and inspiration for different testing techniques.
01:08:04
Speaker
We honestly learned an important lesson from every single new game we tried. Really? Yeah, yeah, yeah. I'd just assumed this was a very good marketing tactic, but actually it's serious engineering.
01:08:17
Speaker
Yeah, it started as engineering, and then we were like: wait a minute, this is great marketing.

Lessons from Gaming Software Testing

01:08:21
Speaker
And so then we made the blog posts. But yeah, you can go and look: my colleague Harrison did a really good post about Zelda, which was one of the most complex games that we beat.
01:08:33
Speaker
And we found lots of bugs in that one. You know, I'm actually about to finish a post on Metroid that's going to go up in a week or two. It'll probably be up by the time this airs.
01:08:44
Speaker
Yeah, no, there's some fun stuff. Okay, I'm going to link to those in the show notes. The reason I ask is I have to wonder... I'm thinking of software that needs to be super reliable.
01:08:55
Speaker
Maybe this has happened: if NASA approached you and said, we need someone to simulate the Mars rover's hardware, could you do it? Would you do it? Yeah. So there's a guy named Toby Bell who gave a really good talk at Strange Loop, I think in the last year that conference ran, precisely about applying deterministic simulation techniques to a really advanced
01:09:26
Speaker
mission at JPL, where basically they were trying to build something in orbit; I forget if it was an interferometer or some kind of pretty complex camera.
01:09:38
Speaker
But the way it worked is, you had two satellites that had to perfectly synchronize their movements. And they wanted to test in advance that the flight control software would work correctly in every imaginable situation. And obviously these things need to communicate over a network with highly variable delays, because the satellites are at different distances from each other at different points.
01:09:59
Speaker
And so they actually used very similar techniques to test it. And we have had a few emails with the guys at NASA who are designing the proposed new interplanetary internet.
01:10:16
Speaker
So they've got this project to develop a new networking protocol for highly delay-tolerant networking. Oh, for when you're sending messages to Mars. Yeah, exactly. And that stuff needs better testing.
01:10:29
Speaker
And so we've been talking to them. And I'm actually a big space fan, so I would love to work with them. That is mind-blowing. I was on a plane a few days ago and I was rewatching The Martian, which is a terrific film. And that's totally making me think... Yeah, yeah, yeah. I'm going to have to go and watch that talk. I think we have to stop there, because you've got me hungry for listening to NASA people talking about space.
01:10:55
Speaker
Yeah, this was really fun. You should definitely check out that talk, though. I totally will. And then I'll probably go and have a Nintendo game nostalgia festival. So thank you for giving me an evening.
01:11:07
Speaker
Yeah, thank you. Thank you so much for having me. Will, that was brilliant. Cheers. Awesome. Have a good one. Thank you, Will. Briefly, while we're talking about The Martian, I'm going to have to recommend a different book by Andy Weir: Project Hail Mary. I really enjoyed it.
01:11:21
Speaker
Apparently there's a film version coming out next year with Ryan Gosling. I hope they do it justice. So if you want to wait for the film, other options for good things to read can be found in the show notes.
01:11:34
Speaker
Antithesis have a few posts about Nintendo games, so I've linked to those. I'm sure you can find the rest. I'm also sure you can find the like button if you've liked this episode. Please do take a moment to click it.
01:11:46
Speaker
If you're not already subscribed, please make sure you're subscribed for future episodes, because we've got some interesting people lined up. And with that, I'm going to say goodbye. I've been your host, Kris Jenkins.
01:11:57
Speaker
This has been Developer Voices with Will Wilson. Thanks for listening.