
The Twitter Code Migration Disaster That Nearly BROKE IT

Tern Stories

When a critical code migration goes wrong, the stakes can be nothing short of catastrophic.

In this episode, we dive into the code migration that almost brought one of the world’s largest social platforms to its knees.

You’ll hear the inside story of how engineers raced against time to prevent total failure, the technical hurdles that made this code migration uniquely challenging, and the high-pressure decisions that determined success or disaster.

We break down the lessons learned, the strategies that worked, and how these insights can prepare you for your own future code migration.

If you’ve ever wondered what happens when code migration becomes a matter of survival, this is the episode you can’t afford to miss.

Get Tern Stories in your inbox: https://tern.sh/youtube

Transcript

Twitter's 2009 ID Crisis

00:00:00
Speaker
They called it the Twitpocalypse. We had less than six weeks before we ran out of IDs. We would basically not be able to accept any new tweets. This was an existential problem for us. When we tested it, it was going to take us like four or five hours to run this migration.
00:00:11
Speaker
It still really felt like we were making progress eight months later. The next one was harder. There were databases and programming languages that didn't even have support for unsigned 64-bit integers. Support for 64 bits wasn't as universal as you'd expect. We were doing it in the mission-critical, company-will-fail-if-we-don't-do-it context, not in a high-risk, high-reward context. We were doing it in a high-risk, no-reward context.
00:00:36
Speaker
Today, we have Ryan King on Tern Stories. Ryan has been an engineer at Twitter, he was the head of engineering at Bolt.com, he was the principal engineer and chief architect at Color, and now he's a principal engineer at Cruise.
00:00:52
Speaker
But we're going to talk about a story and a migration from a little while ago that changed the industry. So let's get right into it. Ryan, what was the moment that you realized that Twitter was going to run out of tweet IDs?
00:01:07
Speaker
Yeah, so it was late 2009. I'd been at the company for about six months, working on things related to storing tweets. And a member of our operations team, John Adams, did some very basic math and discovered that we had less than six weeks before we ran out of IDs. And for context, at this time, 2009...
00:01:28
Speaker
Twitter was mostly a website, a Rails website with a single database and some web servers running in a colo, no real cloud services other than S3.
00:01:44
Speaker
Not the level of agility that we have today. But we had about six weeks the first time we were about to run out.

API Challenges and Collaboration

00:01:50
Speaker
And what happens after six weeks? So, due to the standards of the time, the database table that held all of our tweets had an ID number, which was auto-incrementing, and only 31 bits.
00:02:06
Speaker
So 31 bits gives you about 2 billion ID numbers. It's a signed 4-byte integer. And that was just the default for a lot of things. I think it was the Rails schema default at the time.
00:02:18
Speaker
And it might have been the MySQL default at the time. It had taken us about three years of company history to get that many tweets, but due to superlinear growth, the limit was coming at us very fast.
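To make the arithmetic concrete, here are the two ceilings being described, as a quick Python sketch (illustrative only):

```python
# The two ID ceilings behind Twitpocalypse 1 and 2. A signed 4-byte integer
# keeps one bit for the sign, leaving 31 bits for positive IDs.
signed_max = 2**31 - 1     # 2,147,483,647 -- the first wall
unsigned_max = 2**32 - 1   # 4,294,967,295 -- the second wall, months later
print(f"signed 32-bit max:   {signed_max:,}")
print(f"unsigned 32-bit max: {unsigned_max:,}")
```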
00:02:35
Speaker
That's a little terrifying. So you run out of tweet IDs, and what happens? The site stops? Yeah, so we would basically not be able to accept any new tweets, right? Because we would not be able to add the next incremental tweet ID.
00:02:51
Speaker
And we were a small company at the time, less than 100 people, but growing fast. This was an existential problem for us. And the technical solution in the database was not particularly onerous, right? We could run a database migration and change the ID type.
00:03:11
Speaker
Except that running a database migration on a MySQL table in 2009 required locking the table and not allowing any writes to it. And when we tested it, it was going to take us like four or five hours to run this migration.
00:03:26
Speaker
So we were able to solve that problem with some database replication trickery. Mike Lyman, our DBA at the time, was able to do some tricky stuff to make that possible.
00:03:37
Speaker
But we quickly found that due to Twitter's API, we had a lot of API consumers that were going to break. And while we could have solved this problem ourselves in a week, once we started talking to our API developers, it became clear that a lot of them had to do work in order to be prepared for this.
00:03:58
Speaker
And so our six-week window...

Database Migration Tactics

00:04:00
Speaker
We took like four and a half weeks of it to get it done, so that we didn't break any major API consumers. Pretty stressful.
00:04:09
Speaker
The actual technical work was not onerous. It was not a mountain to climb. But for a company whose whole job was to take tweets and post them and show them to people, it
00:04:20
Speaker
was pretty important to do. So what was going on with the API consumers? Why did you need to take them into consideration?
00:04:33
Speaker
That's a good question. So if we go back in our time machine to 2009, Twitter ran a website and an API and no other consumer products. All of the mobile apps and other services were built by third parties. In fact, what became Twitter for iOS was originally a third-party app called Tweetie that Twitter acquired later.
00:04:59
Speaker
A major part of the consumer usage was third-party tools. And so if we were to break them, we'd break Twitter for a lot of people.
00:05:09
Speaker
Got it. And changing the IDs, I mean, if you're running out of IDs, obviously that would have been a problem, but... Yeah, so the challenge was, and this first time we were about to run out of IDs was not actually the hardest one.
00:05:22
Speaker
The next one was harder. People made assumptions, right? They saw a bunch of IDs coming through that fit in a 32-bit signed integer, and so they wrote their code with 32-bit signed integers in it.
00:05:35
Speaker
And you just kind of assume it's fine, right? Billions of tweets is a lot. Who's going to run out of those anytime soon? And this is still the era where Twitter was nowhere near the cultural phenomenon it became later. People had only just started using it for politics and campaigns, celebrities had only just started using it. The biggest followings were still sub one million followers. So it was a very different era in that regard.
00:06:04
Speaker
Got it. Okay. So it's six weeks, four and a half weeks in. You notified the API developers. Well, we notified the API developers immediately, but it took about four and a half weeks to get everyone, everyone that we explicitly cared about, to give us a thumbs up.
00:06:19
Speaker
So we're ready to go. Right on. And did you end up taking that downtime? We were able to make the downtime shorter by, again, doing some MySQL replication tricks, where we could take a replica and upgrade it and then promote it to the master, to the lead database.
00:06:37
Speaker
Because you could update the replica separately from the master database. Unfortunately, that's what it was still called in MySQL at the time. And so our downtime was less than an hour as a result, rather than, I believe, the four or six hours, if my memory serves correctly.
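For readers who haven't done this dance, the replica trick works roughly like the sketch below. The statements and table name are assumptions for illustration, not the actual 2009 runbook:

```python
# A hedged sketch of shrinking downtime via replica promotion
# (MySQL-era statements shown as strings; names and details illustrative).
replica_steps = [
    "STOP SLAVE;",                                    # pause replication on a replica
    "ALTER TABLE tweets MODIFY id BIGINT UNSIGNED;",  # the hours-long part, off the live path
    "START SLAVE;",                                   # let the replica catch back up
]
# Then, during a brief write freeze, promote the altered replica to be the
# master and repoint the app at it -- turning four-plus hours of table-locked
# downtime into under an hour.
```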

Avoiding the Twitpocalypse

00:06:53
Speaker
We were able to get it done, and it was a breath of fresh air. Interestingly, because Twitter was such an open product and people were building on the APIs, many API developers cared about this and even coined a term for the problem. They called it the Twitpocalypse.
00:07:18
Speaker
And there was even someone who used the API and built a countdown website, a single-serving website that was just a countdown to when this was going to happen, using some basic math on the IDs being issued.
00:07:30
Speaker
So we survived Twitpocalypse 1. But it turns out the nature of exponential growth is that things keep coming at you faster. It took us three years to get the first two billion tweets, and two to the 31st is roughly two billion.
00:07:48
Speaker
But to overflow 32 bits is only 4 billion. And that second two billion tweets came in four to five months, right? So it took us three years the first time and four to five months the second time, because usage was growing exponentially. Every incremental billion tweets arrived in a fraction of the previous time period.
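As a toy model of that shrinking runway (the formula is just the integral of an exponentially growing rate; all inputs are made up, not Twitter's real traffic):

```python
import math

# If the tweet rate doubles every d months starting at r0 tweets/month,
# cumulative tweets after t months = r0 * d/ln(2) * (2**(t/d) - 1).
# Inverting gives the time to reach a given total.
def months_to_reach(total, r0, d):
    return d * math.log2(1 + total * math.log(2) / (r0 * d))

first_2b = months_to_reach(2e9, r0=30e6, d=4)             # time to the signed wall
next_2b = months_to_reach(4e9, r0=30e6, d=4) - first_2b   # time to the unsigned wall
print(f"first 2B tweets: {first_2b:.0f} months; next 2B: {next_2b:.0f} months")
```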
00:08:12
Speaker
And so it wasn't long after we solved Twitpocalypse 1 that the second one was already upon us. We had a few months to prepare this time, but we were dealing with a bigger problem, which was moving from 32 to 64 bits.
00:08:28
Speaker
Young people in the industry these days may not understand: many of our computers were still 32-bit computers. Most server-grade stuff wasn't, but consumer-grade stuff was in many cases.
00:08:46
Speaker
And so support for 64 bits wasn't as universal as you'd expect. Java did not have an unsigned 64-bit type. They may still not have it. Only a signed 64-bit type.
00:09:00
Speaker
The Ruby MySQL libraries we had didn't properly support unsigned 64-bit integers, only signed. Once we started digging in, we realized... it's a little crazy to think that 64-bit ID numbers were actually pushing the boundaries.
00:09:18
Speaker
In some cases. And we had more time for the second one, but it was actually more of a disruption for people, because a lot more code had to change. A lot more assumptions were broken.
00:09:33
Speaker
But it came at us really fast. We survived the first one and almost immediately had to start thinking about the second one, because it was only months away.

Birth of Snowflake

00:09:40
Speaker
Wow. Yeah. Four to five months is a lot longer than six weeks, but not...
00:09:45
Speaker
It is a whole different hill to climb. Yeah. Especially if... I mean, even with modern tools, that's not a trivial change to me. So walk me through your timeline. You had a few battle scars from the first one.
00:09:58
Speaker
What was your approach? Yeah. I mean, thankfully, our company had been growing. The first time around, it was basically one and a half people working on it. Even just six months later, we had more help, both on the engineering side and, more importantly, on the API developer relations side.
00:10:15
Speaker
We had been building a really good team that was able to build those bridges and help us communicate. And so it was less of a surprise to people. But still, the number of groups and teams, and the number of challenges with the underlying technology, were significant. Like I said, there were databases and programming languages that didn't even have support for unsigned 64-bit integers.
00:10:43
Speaker
Each of the problems was surmountable on its own. But when you have a large ecosystem of developers, each having to do their own work, and internally you're also dealing with multiple programming languages, multiple databases, multiple technologies for data analysis, and ensuring that they can all support this,
00:11:07
Speaker
it meant that what could have just been an ALTER TABLE statement took a good three and a half, four months, if I remember correctly, to basically just run an ALTER TABLE on one database.
00:11:21
Speaker
Oh, wow. Yeah. But you mentioned, it's not that running the command took four months, it's getting everyone ready to be able to handle the data. Yeah. Okay. That makes more sense. And it's...
00:11:35
Speaker
It's interesting that, it sounds like the shape of Twitpocalypse 2 was fundamentally different than the first one, in that you ended up having a lot more people involved, internally and externally.
00:11:48
Speaker
Did you take a different approach to how you thought about making sure you were confident in running that flip? Because you can't put that horse back in the barn.
00:12:04
Speaker
Yeah, one of the challenges with these migrations, this one and all the ones we're going to talk about after this, is that they're very hard to reverse, right? Once you've generated ID numbers that are larger than would fit, you can't go back. You can't change your data back.
00:12:20
Speaker
And it meant that we weren't able to do any sort of incremental rollout for this work. To answer your question, I don't know that we thought about it fundamentally differently, but I think in many ways we were more aggressive about communication and timelines, because we knew enough to be a little bit pessimistic about how long things would take.
00:12:46
Speaker
And therefore we needed to start earlier and be a little more assertive and aggressive about what was going on. And we had the benefit of having just gone through it. Everyone saw what happened, they learned a little bit, and we were a company learning to manage change a little better.
00:13:03
Speaker
We had some difficult changes back to back, and the iteration was somewhat beneficial for us. That's interesting. The observation that being able to do these changes is a muscle.
00:13:18
Speaker
I think that feels true based on the organizations I've worked in that have developed that muscle to different degrees. Yeah, it's a muscle. And like real muscles, they grow based on stress.
00:13:32
Speaker
You have to work them. You have to challenge them a little bit, and then hopefully they grow back a little bit stronger. Right on. So, Twitpocalypse 2.0, relatively straightforward.
00:13:45
Speaker
A lot of little changes. Some of the changes were more fundamental because the underlying technology needed more change, but in the end, less drama, I think, than the first one.
00:13:57
Speaker
Cool. Right on. So after that, did you have to go to 128 bits? No. No, there are no more...
00:14:10
Speaker
I left Twitter in about 2015, and last I heard, the ID number generator that we'll talk about later is still in use. And they're talking about expanding to 128 bits.
00:14:22
Speaker
But that's partially because, from what I understand, they use it for everything now, not just tweets. But no, it turns out we had kind of solved the ID number problem for the time being, at least the size of the ID number.
00:14:35
Speaker
But what became obvious was that our strategy of storing everything in a single database was not going to last. When I joined in 2009, we literally had a single MySQL database with all the tables for all of Twitter in it.
00:14:51
Speaker
Tweets, users, direct messages, the social graph, et cetera. And these things are not all the same size. So sometime in 2009, it was clear the social graph was way too big.
00:15:03
Speaker
And a team that I was familiar with but didn't work on built a separate service for that. They moved it out to a partitioned database called Flock, also based on MySQL,
00:15:18
Speaker
that would partition out so we could grow linearly with the size of the social graph. The next biggest thing after the social graph was tweets, right? Tweets just keep coming: every user who joins starts adding tweets, and the corpus keeps getting bigger and bigger.
00:15:35
Speaker
And the first thing we did was move those tables to a separate database. So there was still a single database for tweets, but it was by itself, and it didn't have to share space with the rest of them.
00:15:46
Speaker
But it wasn't long after Twitpocalypse 2 that we realized even that had a very short shelf life. And we did the math and figured out what the largest hard drives we could buy were.
00:15:59
Speaker
And at the time, that was only about a terabyte, maybe two terabytes, of storage for a single server with the hardware profiles we had. And that was not much. If I remember correctly, it was about a terabyte per 10 billion tweets.
00:16:18
Speaker
I'm digging deep into my memory here, or maybe 5 billion tweets. And with exponential growth, by the time we did the math, we found out we only had six weeks left before the first tweet-only database server would be full.
00:16:37
Speaker
The six-week number feels like magic here. Maybe I'm misremembering and it's plus or minus a few, but again, we were in a situation where we had less than two months of runway before a major technical limitation would hit.
00:16:55
Speaker
And again, we would not be able to store any tweets.
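The back-of-envelope runway math has the shape of the sketch below; every input here is hypothetical, not Twitter's real figures:

```python
disk_bytes = 1e12                  # ~1 TB, the biggest practical drive then
bytes_per_tweet = 1e12 / 5e9       # ~200 bytes each, if 1 TB holds ~5B tweets
already_used = 0.85 * disk_bytes   # suppose the box is 85% full
tweets_per_day = 40e6              # assumed daily write volume

days_left = (disk_bytes - already_used) / (tweets_per_day * bytes_per_tweet)
print(f"runway: about {days_left:.0f} days")  # different inputs give "six weeks"
```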
00:16:58
Speaker
All right. So what was the strategy this time? People had a lot of ideas. I was the engineer responsible for this project, and we had a lot of ideas, and we had to immediately reject almost all of them because they would take too long. We could have migrated to a distributed database, of which there weren't a lot of good options at the time.
00:17:24
Speaker
We could have partitioned the tweets the same way we were doing the social graph, basically dividing them up by user. But we had a very short timeline. Six weeks was our estimate of how long we had, but if usage increased or something weird happened, six weeks could actually be four or five weeks, right?
00:17:51
Speaker
And we wanted to be fairly risk-averse, so we really only gave ourselves three weeks to do the project. And when you have that short a project, there are very few things you can do.
00:18:03
Speaker
You have to be very conservative about your approach. Again, spending time with our DBA, Mike Lyman, who, sidebar, unfortunately is no longer with us due to COVID, which makes me mad every time I think about it.
00:18:18
Speaker
We came up with what is maybe the hackiest bit of technology I've ever developed. It is something that when every new engineer at Twitter in that era saw it, they were relatively disgusted.
00:18:35
Speaker
And so we'd explain the circumstances, which were: we had six weeks, which was really like three weeks. We had to find a way to both store new tweets by adding incremental hardware and also be able to retrieve all of them in a reasonably performant way.
00:18:54
Speaker
And one of the insights we found, which is very obvious, is that there's a lot of temporal locality to tweets. People only care about the most recent ones. This is also in the era of the chronological timeline. There was no ranked timeline.
00:19:07
Speaker
So people were mostly only looking at tweets from the last few days. And that insight led us to partition the tweets by time. Basically, all new tweets would go into the most recent shard, a slice of the data.
00:19:23
Speaker
And we could just add shards as we went. When you want to write a new tweet, you always write it to the most recent database. And if you want to read one, you just walk backwards in time until you find the one you're looking for.
00:19:39
Speaker
And almost all the time, it would be in the most recent partition of the data. Every once in a while, you might have to step back a few. And so we decided this was the stopgap. We thought we needed like six months to get to the next solution.
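The read path he's describing fits in a few lines. A minimal sketch, not Twitter's actual Rails code:

```python
from typing import Optional

# shards is ordered newest-first; each element is anything with a
# get(tweet_id) method, e.g. a client for one time-slice MySQL server.
def find_tweet(shards, tweet_id) -> Optional[dict]:
    for shard in shards:              # almost always satisfied by shards[0]
        tweet = shard.get(tweet_id)
        if tweet is not None:
            return tweet
    return None                       # walked all the way back without a hit
```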
00:19:56
Speaker
It was not six months.

Implementing Snowflake IDs

00:20:00
Speaker
We decided that this simple solution would be good enough, because almost all of the time you're only reading from one database, and you always write to one database.
00:20:10
Speaker
It solved the problem of corpus size, which is what we were going to hit: literally, how many tweets can you fit in one of the biggest servers we could buy. And so we decided to go with that.
00:20:28
Speaker
The place where it got really janky was: how do we actually roll this out? And how do we roll out new partitions? Because at the time we didn't have any tools for distributed coordination, like a ZooKeeper or etcd, that allow you to coordinate in a distributed system on, hey, what partitions do we have and which one do you write to, et cetera.
00:20:53
Speaker
And we were working in a colo where we actually had to call someone to rack servers for us. So we didn't have the agility of a lot of cloud automation.
00:21:05
Speaker
Also, our configuration management for databases was YAML files in our Rails software. When we added new replicas or made changes, we had to update YAML files and deploy the Rails app in order to tell each web server which database to talk to for which table, and stuff like that.
00:21:31
Speaker
Would it have been great if we could have paused and built a configuration management system or a coordination system? Yeah, but three weeks. It just wasn't possible. Three weeks.
00:21:42
Speaker
Yeah. So leveraging all the stuff we had, what we decided to do was: for each time partition, the DBA would stand up new servers, the master and then replicas.
00:21:57
Speaker
And we would implement the algorithm of a sort of linear search back through time for reads, and writes to the most recent one. But the problem of
00:22:09
Speaker
switching from one to the next needed to be relatively atomic. And the deploys to our web servers were not atomic. They were rolling deploys. Right? Sure.
00:22:21
Speaker
And the ID number was still generated by the server doing the writing, because it was still a MySQL auto-increment ID. So basically what we did is we wrote the write algorithm to do the same walk back in time.
00:22:36
Speaker
It tried to write to the most recent one, and if that failed, it would write to the next most recent one, and so on. So when we deployed a new partition, we would deploy it broken, with the database table either absent or misnamed, so that when you wrote to it, it would fail and the writer would just try the next one.
00:22:57
Speaker
And then to turn it on, we would run an ALTER TABLE statement on an empty table on the newest one to swap it into place. And immediately, within a second or so, everyone that was writing would start writing to it.
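Put together, the write side reads something like this sketch (hedged: the table names and the exact activation statement are assumptions):

```python
# Write to the newest shard; a freshly deployed shard is intentionally
# "broken" (its table absent or misnamed), so inserts fail over to the
# previous partition until the new one is switched on.
def store_tweet(shards, tweet):
    for shard in shards:                # newest-first, same order as reads
        try:
            return shard.insert(tweet)  # returns the MySQL auto-increment ID
        except Exception:
            continue                    # table missing: fall back one partition
    raise RuntimeError("no writable shard")

# Activation, run once on the newest shard (illustrative SQL):
#   ALTER TABLE tweets_pending RENAME TO tweets;
# Within about a second, every web server's first write attempt starts
# succeeding there, with no coordinated redeploy.
```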
00:23:13
Speaker
We got it done in like three weeks. We rolled it out. No one saw anything. It was like the least

Cassandra Migration Issues

00:23:20
Speaker
dramatic of these stories so far, externally. No API developers complained.
00:23:26
Speaker
We didn't break anything that wasn't easily fixable. I think some of our data analytics broke, and we were able to fix that pretty straightforwardly.
00:23:37
Speaker
But it was both a major win and also probably the jankiest technology I've ever worked on. That's amazing.
00:23:48
Speaker
So, you fixed the problem, you had three weeks. How long did that system end up living in production? So, we thought we would need, I don't know, like six or eight months, and maybe five or six time partitions.
00:24:06
Speaker
I'm pretty sure it lived for approaching two years and had 14 time partitions. And at the end, because we needed to create a new time partition based on the number of tweets being posted,
00:24:23
Speaker
we were rolling a new one every two to four weeks. Again, because we were still on the exponential growth curve. So the first one was, you know, three and a half years.
00:24:38
Speaker
And then the next one was probably four to six months, then probably two to three months, and then like one and a half to two months. Don't quote these numbers exactly, but that was the sort of curve we were on, where in the end our DBA was basically constantly shuffling hardware to the head time partition and setting up new partitions.
00:25:01
Speaker
And eventually we got past it, right? Eventually we were able to move away from it, but it took a lot longer than we expected. But thankfully,
00:25:12
Speaker
it existed without significant modification for that entire time.
00:25:19
Speaker
You know, given the constraints, that is an amazing piece of engineering. And trading for the kind of toil where it's coming faster and faster obviously has some discomfort to it, but it did work. And the fact that this one didn't require external changes and didn't require a ton of coordination, there's a lot to like about that.
00:25:41
Speaker
Yeah. So how did you get out of that situation? So this is where the story changes a little bit. After this... I was on a team called Infrastructure. Our tech lead was Evan Weaver, who later went on to found FaunaDB, partially as a result of these experiences.
00:26:06
Speaker
We had built this time-based partitioning for tweets. We'd built a user-based partitioning thing for the social graph. And we were suddenly at a point where we could take a breath and ask, what do we want to do next?
00:26:22
Speaker
Because we could tackle some of these problems without this insane time crunch. And so we decided that we wanted to invest in some database technology that would allow us to not worry about these things nearly as much as we had. If we could just have a distributed database that we could scale linearly or incrementally in an economic way, without a DBA messing with stuff all the time, and without really expensive hardware too. The database hardware was the fastest drives in a RAID configuration, with as many CPUs and as much RAM as we could shove into one machine.
00:27:03
Speaker
We wanted to move to commodity hardware for more incremental capacity. So we stepped back and did our research, and long story short, we decided to invest in what was at the time a new open source project called Cassandra, which had been released from Facebook not long before that.
00:27:27
Speaker
We've talked about Cassandra on the show before and how the early days were a bit rough. How early in the adoption of Cassandra was Twitter? Yeah, I think we first started testing version 0.7, maybe late 2010.
00:27:43
Speaker
Yeah, late 2010, or maybe earlier. And some of this actually started happening concurrently with the previous stories. We're telling this in order, but there was some overlap, because we could sometimes see the next things coming.
00:27:59
Speaker
And I mean, I don't want to take too much credit, but I do think Twitter's involvement actually boosted Cassandra's profile in some ways. We were one of the first significant startups or tech companies outside of Facebook to start using it, or at least start talking about it and contributing to the community.
00:28:22
Speaker
And so we dug in, learned about it, started doing our benchmarking, and decided that we wanted to try it for storing tweets. And we built a small team around it, two or three of us, to start investing in it. But we realized that we had a project we had to do before we got to that.
00:28:47
Speaker
I don't know if you talk about yak shaves on this channel, but a yak-shave project is a thing you have to do before you can do the thing you have to do. And sometimes it doesn't look like the first thing.
00:28:59
Speaker
We realized that if we were going to move to a distributed database, we were going to need a different way of generating ID numbers. Because so far we'd been relying on MySQL to generate sequential, increasing ID numbers.
00:29:12
Speaker
And so we went to the drawing board and tried to figure out what we'd want to do to generate ID numbers in a way that's separate from the database, whatever database we were using, because we wanted some independence here.
00:29:24
Speaker
After some design discussions, and one particular afternoon where there were like eight or ten of us with a whiteboard, we came up with a design for what became the service Snowflake.
00:29:37
Speaker
Snowflake was a service that we built and later open sourced that would generate roughly sequential ID numbers in a not-very-coordinated way, such that we could scale it pretty far and not have to worry about coordination problems,

Lessons in Project Feasibility

00:30:00
Speaker
not have to worry about scaling ID number generation for a long time, and that would still fit in the 64-bit schemas we were using.
00:30:08
Speaker
128 bits was scary, and we had enough scars that we didn't want to go to 128 bits. And also, the ID numbers are in the URLs, and aesthetically we didn't want to switch them to be a UUID or some other 128-bit number.
00:30:27
Speaker
Makes sense. So Snowflake is widely adopted in industry at this point. I even remember, shortly after it was launched, I was far away from Twitter in San Francisco and ended up using Cassandra and Snowflake as a result, to store distributed traces in a production-scale database.
00:30:47
Speaker
Walk me through the approach. How did Snowflake work? Yeah. So Snowflake, the algorithm, is very simple. It's so simple that after that design meeting, even though it was my project, one of the other engineers actually wrote the first version of it before I even got back to my desk.
00:31:06
Speaker
I think it was Mark McBride, who I think you've worked with at Slack as well. Indeed, and worked with him at his startup. Yeah, the basic approach is: 64 bits is enough bits to waste.
00:31:19
Speaker
Because you want them sorted, roughly sorted, which we later learned the term for is k-sorted, where k is the amount of out-of-orderness you can tolerate,
00:31:31
Speaker
you want the upper bits, the most significant ones, to be time. So we used a timestamp, a millisecond-precision timestamp, I believe. And then we used the lower bits to guarantee uniqueness.
00:31:43
Speaker
The next group of bits, which I think originally was about 10 bits, would be an identifier for the copy of the service that was generating the ID number. And then the last bits would be a sequence number, which is basically the count of IDs that that service had generated in that time period.
00:32:02
Speaker
And so if you had your clocks working okay and synchronized okay, and if you assigned the identifiers for the copies of Snowflake appropriately, like you didn't duplicate them, and each one correctly avoided giving you a duplicate sequence number at the end,
00:32:19
Speaker
then you get unique ID numbers that are mostly in order, where the "mostly" is about the clock differences between servers. And for our purposes, it was good enough, right? You just want them to show up in a timeline. Again, this is still the chronological-timeline era.
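A minimal sketch of that layout follows. The 41/10/12 bit split and the epoch constant are the widely published ones from the open source release; the transcript itself only says "about 10 bits," so treat the specifics as illustrative:

```python
import threading
import time

TWEPOCH_MS = 1288834974657   # the widely published open source epoch; illustrative here
WORKER_BITS = 10             # identifies the copy of the service
SEQUENCE_BITS = 12           # per-millisecond counter within that copy
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class Snowflake:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < (1 << WORKER_BITS)
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def _now_ms(self) -> int:
        return int(time.time() * 1000) - TWEPOCH_MS

    def next_id(self) -> int:
        with self.lock:
            now = self._now_ms()
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:        # counter exhausted this millisecond
                    while now <= self.last_ms:
                        now = self._now_ms()  # spin until the clock ticks over
            else:
                self.sequence = 0
            self.last_ms = now
            # Upper bits: time, which keeps IDs k-sorted across machines.
            # Lower bits: worker ID + sequence, which guarantee uniqueness.
            # (A real implementation also guards against clocks moving backwards.)
            return (
                (now << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )
```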
00:32:36
Speaker
You wanted tweets to show up in the timeline basically in the order they were posted, but a few milliseconds here and there were completely inconsequential. That makes sense. You know, I remember at Slack hearing that referenced in design documents. We actually ended up widening the window, I don't think it was Snowflake in particular, but this approach, roughly speaking, to a second, because a human doesn't care within a second.
00:33:03
Speaker
It makes sense as a tunable parameter. And that was a design decision that traced directly back to Snowflake's approach here. Yeah, and I think we were very lucky that we were a small company at the time with a very tight connection to our product team, and we were all heavy users of the product.
00:33:22
Speaker
So I think that allowed us to reason about the actual implications of some of our work. From a pure engineering perspective, there are deficiencies in the design, and the k-ordering approach has weaknesses, right?
00:33:39
Speaker
It's only when you think about the product and the end user that you realize the weaknesses don't matter, and that you can actually get a much simpler design and a simpler implementation by applying or dropping the right set of constraints.
00:33:56
Speaker
Yeah, that makes sense. It works for tweets. It may not work for sequential function calls across a distributed system. Absolutely. Yeah, absolutely. So now you're armed with an ID generator for Cassandra, an early version of Cassandra.
00:34:12
Speaker
Yes. Well, first we needed to roll out the ID generator on top of our MySQL implementation. We didn't want to roll these all out at the same time. So the ID generator is written. Let's roll it out, because we can generate those IDs and just insert them into MySQL and not worry about the auto-increment anymore.
00:34:31
Speaker
And so we decided to roll it out. Thankfully, we'd been through worse migrations before. This one had some major hiccups, but we were able to deal with it organizationally.
00:34:47
Speaker
The first is, oddly, a bunch of API consumers cared about the sequential nature of the IDs. And many of those were analytics or data-mining kinds of companies. And we had to just decide that we didn't care about them knowing how many tweets we had per day.
00:35:04
Speaker
That was kind of none of their business anyway. And that was straightforward. The second significant problem we ran into is that no one believed it was going to work correctly.
00:35:15
Speaker
And this was true both externally and internally. The design is simple. We open sourced the code as early as possible just so that everyone could see it.
00:35:29
Speaker
We didn't care if anyone used it. We just wanted to show people: this is exactly what's going to happen. And many of the ecosystem developers still had worries and questions. I think some of them, as mentioned, were
00:35:44
Speaker
cases where they were using the sequential nature of the IDs, which we were just going to break, and we were okay with that. Then I think there was just a lot of uncertainty. People didn't know how to think about it. It was a new thing, right? It was not something people were familiar with. And so we spent a bunch of time: I gave a few tech talks at conferences, talked about it, published the code, et cetera.
00:36:08
Speaker
Also internally, there were similar problems. We were by that point an internal ecosystem where I didn't even know all the teams using this stuff. And there were teams counting ID numbers in order to compute things like tweets per day, right?
00:36:22
Speaker
There were teams depending on ID numbers being densely distributed for how they stored them in the data warehouse, right?
00:36:33
Speaker
And now we were going to sparse ID numbers. There are a lot of numbers that get skipped in this process, because

Engineering Reflections

00:36:40
Speaker
64 bits is more than enough for that. And so we had to work through each of those. And there were some migrations to do.
00:36:46
Speaker
Again, the first version of Snowflake was written in under an hour. It took us another three months or so to actually turn it on, because there was communication to do, trust to build, and some ancillary engineering projects to get ready for it.
00:37:01
Speaker
And then also, I felt the pressure to prove to everyone that these would be unique ID numbers. So we ran the largest, longest load test I've ever run, where I basically ran a set of Snowflake servers at maximum throughput for like a month,
00:37:23
Speaker
meaning we would overload them to the point where they would be at risk of overflowing the counters, et cetera. And we did it for a month, trying to force a duplicate ID number.
00:37:39
Speaker
And I don't know how many trillions of ID numbers we generated, but we did not generate a single duplicate. And so finally everyone was like, okay, that's cool.
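A toy version of that uniqueness check, reusing the Snowflake sketch above (the real test ran at full throughput for a month; this just shows the shape of it):

```python
gens = [Snowflake(worker_id=w) for w in range(4)]  # distinct worker IDs matter
seen = set()
for _ in range(250_000):
    for g in gens:
        new_id = g.next_id()
        assert new_id not in seen, f"duplicate ID {new_id}"
        seen.add(new_id)
print(f"generated {len(seen):,} IDs, zero duplicates")
```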
00:37:52
Speaker
Until... oh, one more thing. So the timestamp that we were using was a standard Unix millisecond timestamp, which is basically a counter since 1970.
00:38:07
Speaker
One of the things we realized is, we actually don't care about all that time range back there. So we can just start the counter more recently, because that saves us a bunch of number space, right?
00:38:21
Speaker
We don't need those first 40 years, right? We can fast-forward. And I thought the perfect choice of when to start that timer was when the first tweet was posted.
00:38:39
Speaker
In 2006, right? So I looked it up in the database, because it's still there. The first surviving tweet is tweet ID number 20, from Jack, that says "just setting up my twttr."
00:38:52
Speaker
I looked up the timestamp, and I went and put that in as a constant in the Snowflake code. And we were like, this is great. It saves us 36 years of timestamps. Let's ship it.
00:39:04
Speaker
We talked about it internally. We got ready to launch, and maybe a week before launch, an unnamed team decided to actually test it for the first time.
00:39:18
Speaker
And they realized, were reminded, learned for the first time, that JavaScript number types are not integers.
00:39:30
Speaker
The number type in JavaScript is a 64-bit floating point number. And the reason this matters is that a floating point number, or technically a double here because it's 8 bytes, can sometimes precisely represent integers, but not always.
00:39:53
Speaker
And the cutoff for the IEEE 754 floating-point type is 2 to the 53rd. Integers up to 2 to the 53 can be precisely represented, not approximately represented, by an 8-byte floating point number.
00:40:11
Speaker
But over that, you only have an approximation, meaning if you get a number over 2 to the 53 and you parse it in JavaScript into a number type, you may or may not get the same number back.
00:40:30
Speaker
You may get some number close to it.
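Python floats are the same IEEE 754 doubles as JavaScript's number type, so the failure takes two lines to reproduce:

```python
big = 2**53                          # 9,007,199,254,740,992
print(float(big) == float(big + 1))  # True: two distinct IDs collapse into one
print(int(float(big + 1)))           # ...992, not ...993 -- the ID came back wrong
```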
00:40:34
Speaker
And...
00:40:37
Speaker
This was extremely frustrating because we were ready to launch. It was also frustrating because, why would you do that to a programming language? Who thought this was a good idea? Aren't integers a good idea?
00:40:52
Speaker
After we got over the frustration, we developed a plan. Number one: it turns out that strings are probably a better idea anyway, because you don't actually care about it being a number.
00:41:07
Speaker
This is common practice now, but it wasn't at the time. I like to think our experience may have informed people a little bit on that. But in the API, and everywhere in our code, we had to add, alongside the numeric type, a string version of it, which is literally just the number as a string in base 10.
00:41:29
Speaker
We had to upgrade everything that was JavaScript in our whole ecosystem and in the third-party ecosystem. We had to explain this to everyone: hey, if you are using JavaScript, you've got to use the string version, and you've got to do it quickly, because we need to roll this out soon.
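In payload terms the fix looks like this; the id_str field name is the one Twitter's API ended up shipping, and the values are made up:

```python
tweet = {
    "id": 1234567890123456789,        # fine for consumers with real 64-bit integers
    "id_str": "1234567890123456789",  # what JavaScript clients must use instead
}
```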
00:41:46
Speaker
I don't remember how long it set us back, at least a month of work. And we also decided, unfortunately, that we needed to buy ourselves more time.
00:41:58
Speaker
And the way to buy ourselves more time was to change when the timer started. So previously I'd started the timer in 2006. Because we're talking about log-phase growth, even just moving it up by five years meant that we were using far fewer bits of the actual number.
00:42:19
Speaker
And so I had to change the time at which the timer started, from when the first tweet was posted to basically the time I was doing the project. So, basically, set the counter to zero when I was shipping it. I really disliked that. It was very frustrating, because I thought it was such a neat solution.
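The arithmetic behind "buying time" is short, assuming the 41-bit millisecond field from the sketch above:

```python
# With 41 bits of milliseconds, the ID space lasts ~69.7 years from the epoch,
# so a later epoch both pushes out the overflow date and keeps current IDs smaller.
years = (2**41) / (1000 * 60 * 60 * 24 * 365.25)
print(f"{years:.1f} years of timestamp headroom")
```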
00:42:38
Speaker
And the constant in the code was called the TWEPOCH, like the Twitter epoch. I think it's still called that, or it was when we released it, but I was very sad about that. It's a beautiful little easter egg thoroughly ruined by reality.
00:42:56
Speaker
So, rolling it out. I figured rolling out new numbers would be easy, but of course JavaScript made it hard. Yeah. But with the new numbers rolled out and Cassandra at hand... So we had the numbers rolled out, and now we're like, okay, now we can really do Cassandra.
00:43:15
Speaker
And at this point, all the pressure's off, right? We've built

AI's Role in Problem-Solving

00:43:20
Speaker
so much runway. We have a bigger team. We can invest in things. We have more infrastructure.
00:43:26
Speaker
We have more physical capacity. And we think we can really solve this database problem once and for all for Twitter. Yeah. So we build our load tests.
00:43:38
Speaker
We build a new Ruby-based framework, like an ORM framework, for Cassandra. We build one for Scala as well, because some of our servers run Scala. We start testing. It seems reasonable.
00:43:51
Speaker
And the Twitter approach was always, well, let's try it in production. So we integrated it into the code base and started sending a little bit of traffic to it.
00:44:02
Speaker
And we had feature flags and incremental rollouts and automatic dark mode, which was, I think, pretty advanced for the time. And we started using all of that. We would turn up the traffic, we'd find a problem, we'd turn the traffic down, we'd fix it, and wash, rinse, repeat.
00:44:21
Speaker
And it really felt like we were making progress. And it still really felt like we were making progress eight months later. We were still having to roll out the time-based shards for MySQL.
00:44:39
Speaker
The Cassandra performance was not what we wanted, so we had to add more hardware. We really felt...
00:44:52
Speaker
Every moment, we thought success was just around the corner. But every time we went around the corner, there was twice as much work to do. And sometimes work we didn't yet really understand how to do.
00:45:05
Speaker
Was there a pattern to this work? Because I've got to imagine it was frustrating both to you and to the people waiting for the new solution.
00:45:16
Speaker
Unfortunately, there was not a pattern to the work. I think we were pushing new technology to an extreme to which it had not been pushed before.
00:45:30
Speaker
And I think, retrospectively, what we came to understand was: it's fine to do that in the right context, but we were doing it in the wrong context. We were doing it in the mission-critical, the-company-will-fail-if-we-don't-do-it context, not in a high-risk, high-reward context. We were doing it in a high-risk, no-reward context.
00:45:53
Speaker
There was actually no reward to improving tweet storage at the time. There was only downside. We could only make it worse. And we didn't recognize that for a long time.
00:46:05
Speaker
That's interesting. So was there a moment where you realized, this hill seems maybe too tall to climb? Yeah, the moment I realized that was when my boss, our VP of engineering, canceled the project.
00:46:21
Speaker
He was like, we can't do this anymore, we're going to take a more conservative approach. And in retrospect, he was absolutely right. And we should have done that six months earlier. Investigating Cassandra made sense.
00:46:35
Speaker
Doing our evaluation made sense. It was just the wrong place to start. Starting with our most important database table was the wrong place to start with a new technology. And I was super burned out.
00:46:49
Speaker
He basically canceled the project. I told him, I need a break, I'm going to go on vacation. I literally left right after that meeting, walked out of the office, and came back like two weeks later. Because I hadn't taken a break for like two years, and I was just like, I need a break.
00:47:04
Speaker
And that break was by far the best thing I did in this whole project, because I actually gave myself time to think about what was going on, and really think deeply about what had just happened and how we got into the pickle we were in.
00:47:20
Speaker
Interesting. So what did you learn? So, I spent this time thinking, and I came back. One of the great things about Twitter at the time, maybe four or five hundred people at the company at this point. Over the span of these stories I'm telling you, we went from about 30 people to like 400, 500 people.
00:47:40
Speaker
One of the traditions we held on to for a long time was a thing called Tea Time, which was a Friday afternoon, later a Thursday afternoon, all-hands meeting, which was originally envisioned as time for people to sit around drinking tea, but everyone drank beer and wine instead.
00:47:57
Speaker
And it was a time, especially early on, for sharing company news, socializing afterwards, the serious news of finances and business, but also an opportunity to share.
00:48:16
Speaker
And almost every week, a different team on rotation would come and talk about a project they had worked on or were working on. And at each of the steps in this process, I'd given a presentation at Tea Time about what was going on, right? Twitpocalypse 1 and 2, and the time-based sharding that we called temporal sharding.
00:48:36
Speaker
And I volunteered to give another talk after this, actually not quite knowing what I was going to talk about, but then developed, maybe the day of or the day before, an explanation for where we were with this project.
00:48:50
Speaker
And the framework I developed, which then became a framework we used for the rest of my time at Twitter, was basically a two-by-two matrix for evaluating projects.
00:49:06
Speaker
And the basic idea is that when you're doing a project, you care about two states: the state of the world before you do the project and the state of the world after you do the project.
00:49:18
Speaker
And if you simplify it down to just three states, bad, okay, or awesome, you can classify projects by where you're going.
00:49:29
Speaker
Was it bad before and bad after? Probably not a good idea. Was it awesome before and awesome after? Maybe okay, I don't know. There are incremental improvements.
00:49:40
Speaker
But the in-between structure is very interesting. Going from bad to okay: a lot of interesting projects there, right? All the previous projects I talked about, the two Twitpocalypses, we were in a bad state.
00:49:52
Speaker
We got to just an okay state. Not always something to be proud of, but we're stable. We're fine.
00:49:59
Speaker
States where you're in an okay position and you try to move to an awesome position: it's like, okay, we're stable, things are going well, let's try to take this and make it a big leverage point or a big growth point. Let's go from okay to awesome.
00:50:17
Speaker
The danger is trying to go from bad to awesome. When you're in a bad state, things are failing, things are falling down. You have an imminent cliff of scalability or cost or performance.
00:50:30
Speaker
Taking it from bad to okay, in baseball terms, is hitting a single, right? And you can just hit singles all day and probably win. If instead you swing for a home run, the chance of failure goes up significantly, and so does the cost of failure. Because if you're trying to go from bad to awesome and you miss, you're still in a bad state.
00:50:56
Speaker
If you're going from okay to awesome and you miss, you're still in an okay state. And again, this is a very simplified model, right? There aren't only three states in the real world. It's a continuum, and you can break a project into smaller projects. There's a lot more to it.
00:51:08
Speaker
But this simple mental model explained to me why this big swing for a home run with Cassandra went bad: we were still in a bad state. We had unstable infrastructure that cost too much.
00:51:22
Speaker
We could not sustain it going forward. And we were trying to solve a much bigger and more audacious problem than we actually needed to. And as a result, we took a big swing, which meant a big miss.
00:51:36
Speaker
And the alternative strategy of taking a bunch of smaller swings would have been preferable.
00:51:44
Speaker
I love that. And I think in particular, it's easy to take away from a canceled project like that, especially against something so core, the lesson that because it didn't work, we shouldn't have done it.
00:51:55
Speaker
But what you're saying is that any project may not work, and you should consider up front what happens when it doesn't work. And if you can set it up so the failures are little failures, that's okay.
00:52:09
Speaker
Because it was probably not acceptable to fail at that level of project in that particular area at that particular time. And that makes a ton of sense.
00:52:21
Speaker
Yeah. And I think the history after that bore this out, in that for storing tweets we took a more incremental approach. We used the same MySQL partitioning that we'd used for the social graph.
00:52:34
Speaker
It required a bunch of work, but we knew the fundamentals. It was still MySQL, so it was a smaller migration. And we kept using Cassandra, but for very different use cases.
00:52:45
Speaker
The first thing we used it for was actually monitoring and all of our telemetry. So, yeah, the first successful use case for Cassandra was our monitoring stack, which we used for monitoring our services, storing traces, et cetera.
00:53:03
Speaker
And it was a much better initial use case for Cassandra because the cost of failure was different. It was kind of net new, so we could experiment more. It was a much better use case on both sides.
00:53:18
Speaker
That makes sense. When you're trying to spin up something entirely new, the cost of failure there is, well, you still don't have it, and presumably you're surviving without it.
00:53:29
Speaker
Yeah, absolutely. And taking the tried-and-true approach that you'd proven with Flock makes sense for the tweets. They're fundamentally different problems, and understanding the context drives the decision in a pretty meaningful way.
00:53:44
Speaker
Absolutely. That's awesome. I think our failure at the time was this: we had been forced to take the very incremental approach in those early projects.
00:53:58
Speaker
And when we no longer had the hard constraints of time and space, we failed to recognize that we actually still needed the time and space constraints in order to do good work.
00:54:09
Speaker
Those short-timeline projects in the early days were actually a gift that allowed us to make very good incremental progress. And we needed to have the discipline to continue doing that, even when we didn't have to anymore.
00:54:24
Speaker
That makes sense. It's been a while since these changes at Twitter, and you've done a lot of really impressive stuff since then. I would love it if you could walk through an example or two of how you've taken those lessons and applied them in subsequent roles, in totally different companies, totally different cultures.
00:54:46
Speaker
Yeah, I do find that the bad-okay-awesome framework has come up a number of times. I think most recently, working at Cruise, there have been a number of cases where we had services or code that was really not doing well. It had been around for a while, wasn't scaling the way we needed it to, or wasn't performing the way we needed it to.
00:55:12
Speaker
And sometimes what you would get is a proposal to take this struggling service and rewrite it from scratch, right? And one of the things I've tried to help people think through is, again, what state are we in? Are we in a total failure state, bad?
00:55:31
Speaker
Do we need to triage and just kind of get our head above water? Or are we in a state where we could actually make a big improvement and get a big value out of it?
00:55:42
Speaker
In other words, are we in a bad state or an okay state? And where can we reasonably go from there? And you can go from bad to awesome; you just have to do it in multiple steps. You have to get from unstable, underwater, to stable, head above water, sustainably. And then you can reevaluate: hey, do we want to go take this service that is now stable and do the order-of-magnitude improvements to make it more performant, more scalable, give it a richer feature set, et cetera?
00:56:18
Speaker
So I've found that theme come up a lot. And it's often people doing what I was trying to do, which is trying to go from bad to awesome. That makes sense.
00:56:29
Speaker
You know, I've heard this several times, because I worked with Glenn, who worked with you, and sort of heard this story filter down from a couple of places. We've even talked about it a little bit on the show.
00:56:41
Speaker
But it's the first time I thought: it actually not only applies to projects, but also applies to teams a little bit. There's a whole section in Will Larson's book on improving teams. And he points out that you really can't take a struggling team and turn them straight into a high-performing team.
00:57:02
Speaker
You need to get them out of the woods first, and you need to get them to the point where they're feeling OK. And only then can you really drive them towards innovation. Because if you go from a team that's just underwater,
00:57:14
Speaker
you can't triple their headcount and expect them to suddenly produce novel software and great products. You need to do it incrementally. And that's exactly what you're describing. Yeah, and often, as a team transitions through bad, okay, awesome, you need not only to change the mental frameworks that people use, right?
00:57:37
Speaker
If you're in the bad state, you've got to really triage, prioritize heavily, and be more than willing to write janky code just to move the ball forward a little bit.
00:57:49
Speaker
And when you're not in that state, you have to think very differently and act very differently, with different prioritization: thinking about maintainability and about the long term in ways you don't in the earlier stages.
00:58:05
Speaker
And I think one of the implications for teams is that sometimes you need totally different people at the different stages. Because there are people who are very good at the
00:58:16
Speaker
crisis management, the let's-hack-our-way-through-this-the-best-we-can kind of engineering creativity. And sometimes those are not the same people who can take you to the next level.
00:58:29
Speaker
And I think what I found in my experience at Twitter was that I was in the former group. I could do the crisis management and develop the very clever, or maybe slightly overly clever, solutions that are a bit off-putting to think about.
00:58:47
Speaker
But I was not always the right person to dream about the five-year vision of how we do something more complicated and more general-purpose. And I came to accept that. That's just not my forte.
00:59:03
Speaker
Yeah, absolutely. It makes a lot of sense. One of the things that's come up a couple of times as we've talked is this idea of problems that Twitter had to solve whose solutions have since become common in the industry. Things like using strings for long numeric IDs, since JavaScript can't handle 64-bit integers.
00:59:23
Speaker
Or distributed ID generation with a Snowflake-style approach. There was a lot that came out of Twitter, and I think that's why it's so prominent in engineering culture: these ideas turned into industry standards, so you wouldn't have to solve these problems today the way you did then.
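To make the string-ID point concrete, here's a minimal TypeScript sketch. The Tweet shape and the sample ID value are made up for illustration, but the underlying facts are real: JavaScript numbers lose integer precision above 2^53 - 1, which is why Twitter's API ended up serving a string id_str alongside the numeric id.

```typescript
// JavaScript numbers are IEEE 754 doubles: integers are exact only up to
// Number.MAX_SAFE_INTEGER (2^53 - 1), well short of the 64-bit ID space.
const maxSafe = Number.MAX_SAFE_INTEGER; // 9007199254740991
console.log(maxSafe + 1 === maxSafe + 2); // true -- precision is already gone

// The workaround: carry the ID as a string and never do math on it.
// (Hypothetical shape; Twitter's API pairs a numeric `id` with `id_str`.)
interface Tweet {
  id_str: string; // the full 64-bit value survives as a string
  text: string;
}

const tweet: Tweet = { id_str: "1541815603606036480", text: "hello" };

// If you do need comparison or arithmetic, modern runtimes have BigInt:
const id = BigInt(tweet.id_str);
console.log((id + 1n).toString()); // "1541815603606036481"
```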
00:59:43
Speaker
And that's to say nothing of having AWS. Is there anything you saw at that time that you wish had become more of a standard within the industry?
00:59:56
Speaker
Hmm.
01:00:01
Speaker
That's a good question. I mean, my first reaction, when you say these things have become standards, is that
01:00:14
Speaker
I think that's both true and false. I sometimes notice things like Snowflake: there are so many clones of it now, and people use it, but people don't always use it when they need to.
01:00:26
Speaker
And people still write APIs where the ID is a plain integer. And they'll probably be fine with it for a while, like we were. So I actually worry more that people don't take these lessons and make them standards as much as they should.
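Since Snowflake keeps coming up, here's a minimal sketch of the idea in TypeScript, assuming the classic bit split (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence). The class and method names are ours, not from any particular library, and a production version would also guard against the clock moving backwards; the epoch constant is the one Twitter published.

```typescript
const EPOCH = 1288834974657n; // Twitter's Snowflake epoch, in ms (Nov 2010)

class SnowflakeGenerator {
  private lastMs = -1n;
  private sequence = 0n;

  constructor(private readonly workerId: bigint) {
    // 10 bits of worker ID: up to 1024 generators with no shared counter.
    if (workerId < 0n || workerId > 1023n) {
      throw new Error("workerId must fit in 10 bits");
    }
  }

  next(): bigint {
    let now = BigInt(Date.now());
    if (now === this.lastMs) {
      // Same millisecond: bump the 12-bit sequence (4096 IDs per ms per
      // worker); if it wraps, spin until the next millisecond.
      this.sequence = (this.sequence + 1n) & 0xfffn;
      if (this.sequence === 0n) {
        while (now <= this.lastMs) now = BigInt(Date.now());
      }
    } else {
      this.sequence = 0n;
    }
    this.lastMs = now;
    // 41 + 10 + 12 = 63 bits: fits a signed 64-bit integer and sorts
    // roughly by creation time, with no coordination between workers.
    return ((now - EPOCH) << 22n) | (this.workerId << 12n) | this.sequence;
  }
}

const gen = new SnowflakeGenerator(7n);
console.log(gen.next().toString()); // hand it to JS clients as a string
```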
01:00:44
Speaker
And there are many cases where people have to relearn the same lessons that we learned the hard way. Absolutely. You even see this within individual companies: there's one right way to do it,
01:00:58
Speaker
and not everyone does it. Not all the new code follows the right way. Yeah, absolutely. One last question before we wrap up for today.
01:01:10
Speaker
It's 2025, and AI is changing the way we work. A lot of what you've seen, both at Twitter and over the course of your career, has been these sorts of changes that required a lot of work. How do you see that changing? What do you see going away, and what do you see staying durable in the role of engineers doing engineering?
01:01:40
Speaker
That's a good question.
01:01:45
Speaker
The question I have about AI and engineering is still: what happens when things go wrong? And there are certainly many cases where AI can actually help you debug and triage problems.
01:01:58
Speaker
But all of the stories we just told are about what happens when things fail, or reach their limits, or hit some saturation point or overload.
01:02:09
Speaker
Right. And I think that is an area where we need to keep improving the AI tools so they don't think just about
01:02:20
Speaker
what the best practice or the common practice is, but also about the limits of the various practices we'd want to apply. And it seems possible. That's just something I'm consistently disappointed with in the current crop of tools: how far ahead they can think or consider for various bits of work.
01:02:45
Speaker
Absolutely, that's really interesting. There's a flavor of that which is: I cannot imagine current AI suggesting, and then rejecting, the idea of using Cassandra because it's too bold a choice for something so core and has the wrong concepts.
01:03:01
Speaker
AI might suggest that, but it would never play it out. And I also can't imagine ChatGPT coming up with the suggestion to do temporal sharding of tweets as a pragmatic solution. It just doesn't have the context, in so many ways.
01:03:16
Speaker
Yeah, and I think context is one of the key ideas here. AI tools for coding are extremely valuable because they have enormous context.
01:03:31
Speaker
They've been trained on an enormous corpus of code and can regurgitate and resynthesize patterns and approaches in a way no human could, because you didn't read 100 million lines of code.
01:03:46
Speaker
At the same time, they don't understand your context, right? They don't understand your business trade-offs, your product trade-offs, your financial trade-offs, the people on your team, all of that. And for the foreseeable future, you still need some humans to do that.
01:04:00
Speaker
Maybe those constraints will diminish over time and the tools will get better. But it seems like one of the interesting gaps where human engineers can still be really valuable.
01:04:14
Speaker
That makes a ton of sense. All right, well, thank you so much for coming on. One final question, and it's a two-part final question. Where can people find you on the internet?
01:04:25
Speaker
And is there anything our viewers can do to be helpful to you? Sure. My website is theryanking.com; I write a blog there very intermittently. I also post on LinkedIn, at linkedin.com/in/ryanking, where I share my random work thoughts, mostly.
01:04:50
Speaker
And it would be useful: I want more people to think about these hard, messy engineering problems, like you all are doing at Tern, and how to work through them.
01:05:03
Speaker
And, you know, write your blog posts or your podcasts and share your experiences, because this is how we learn about our craft. I love it.
01:05:15
Speaker
All right. Thank you so much. It's been an absolute pleasure.