Wrong Tool, Right Choice: PagerDuty's Cassandra Queue | Ep. 12

39 Plays · 1 year ago

Back in 2011, PagerDuty had a problem that would make any infrastructure engineer wince: they were using MySQL as a queue.

In this episode, we dive into the fascinating journey of using Cassandra as a queue, the challenges faced, and the lessons learned along the way.

Arup Chakrabarti, former Senior Director of Engineering at PagerDuty, shares his insights on the early days of the company, the technical decisions made, and how they navigated the complexities of using a NoSQL database in unconventional ways.

Check out Arup's work! ➡️ arupchak.com

Get Tern Stories in your inbox: ➡️ https://tern.sh/youtube

Transcript

Introduction and Speaker Background

00:00:00

Speaker

Everything you look at, not just for Cassandra but for any NoSQL-based database, says don't do this. Yeah, there are better technologies out there in 2025. It was not 2025 then. Especially for startups, you don't have time to figure out "right." You have time to figure out good enough most of the time. The maintainers are always very clear: do not build a queue on top of Cassandra. Do not run Cassandra across multiple regions.

00:00:26

Speaker

Do not run Cassandra in environments with high network latency and variability. If we gave you a 200, you knew that data was replicated across at least two regions. We were actually always going for three.

Career Journey of Arup Chakrabarti

00:00:40

Speaker

I am so excited to do this episode. On tonight's episode of Tern Stories, we have Arup Chakrabarti. He is the former Senior Director of Engineering at PagerDuty. He joined them as the 15th employee and stayed with them all the way through IPO. Before that, he was at Netflix and at Amazon.

00:00:58

Speaker

And we have known each other for, at this point, 30 years.

00:01:05

Speaker

I am excited to hear a story that I have not heard before, which is weird, but welcome. Well, thanks. Thanks for having me, T.R. Just to give a little more context on my technical background: I worked at Amazon, Netflix, and PagerDuty, and always worked on what I call the behind-the-scenes team. These were always the backend APIs, infrastructure security, all the stuff that no one notices when things run smoothly and everyone notices when it breaks.

00:01:33

Speaker

So that was my hack.

00:01:37

Speaker

PagerDuty was a good place for you to end up. PagerDuty literally was the perfect encapsulation of my career in one product and one company. All right.

00:01:49

Speaker

So what

Cassandra as a Queue and MySQL Constraints

00:01:50

Speaker

we're going to talk about today is that PagerDuty early on used Cassandra, and they used it in a bit of an... interesting way. So set the scene a little bit for me. When was this?

00:02:00

Speaker

Yeah, I'll talk a little bit about what even inspired this kind of outrageous use case for Cassandra. The use case was using Cassandra as a queue, and everything you look at, not just for Cassandra but for any NoSQL-based database, says don't do this. Like, do not do this.

00:02:23

Speaker

And the reason that we went down this route was basically that, like most companies started in the late 2000s and early 2010s, PagerDuty was a Rails stack built on top of MySQL, and we had a little bit of memcache in there.

00:02:40

Speaker

But we were building a lot of the queuing stuff on top of MySQL, which, again, don't build queues on top of MySQL. That's another thing that you can see in Flash and Bradet. And so we had decided to build a queue on top of Cassandra because of some of the tunable consistency guarantees that we needed. We were running workloads across disparate AWS regions, and we needed the ability to write to a queue but also be able to guarantee that we had written this data across multiple regions before we'd send back a 200 to the customer.
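
The guarantee described here (no 200 until the write is durable in more than one region) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PagerDuty's implementation: the `Ack` type, the `status_for_write` function, the region names, and the choice of 503 for the failure case are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    """One replica's acknowledgement of a write (hypothetical shape)."""
    node: str    # illustrative node name
    region: str  # region the acknowledging node lives in


def status_for_write(acks, min_regions=2):
    """Return 200 only if acks came from at least `min_regions` distinct regions.

    The 503 for the failure case is an assumption made for this sketch.
    """
    regions = {ack.region for ack in acks}
    return 200 if len(regions) >= min_regions else 503


# Two acks from the same region do not satisfy the guarantee...
same_region = [Ack("node-a", "us-west-1"), Ack("node-b", "us-west-1")]
# ...but one ack from each of two regions does.
two_regions = [Ack("node-a", "us-west-1"), Ack("node-c", "us-east-1")]
```

The point of counting distinct regions rather than acks is that two replicas in the same region fail together when that region goes down.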

00:03:11

Speaker

So that was the original inspiration for why we went down this path. It was a deliberate choice. So wait, I'm not even going to let you get to the story. You were using MySQL as the queue before this?

00:03:26

Speaker

Yeah, basically using Delayed Job, which was the library at the time, and which actually worked fine for a couple of transactions per second.

00:03:38

Speaker

The second we got to even tens of transactions per second, you would just start to see write constraints on the disk. And again, MySQL at the time was not designed for this stuff.

00:03:50

Speaker

And so we were using it in a way where we were getting the consistency guarantees, but if you think about MySQL synchronous replication, and you think about this idea of constantly updating state on a single row, all of a sudden you're going to grind the disk to a halt, which is why you don't want to be doing this. In general, you don't want to be doing synchronous replication, but in this case in particular it was very expensive.
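
To make the write amplification concrete, here is a minimal sketch of the database-as-queue pattern, using Python's built-in `sqlite3` as a stand-in for MySQL. The table layout and function names are hypothetical and are not taken from any real job library or from PagerDuty's schema. Each job costs three row writes (insert, state update, delete), and under synchronous replication every one of those writes has to reach disk on each replica.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical schema: one row per job, with a mutable state column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "state TEXT NOT NULL DEFAULT 'queued')"
    )


def enqueue(conn, payload):
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn):
    # Single-worker sketch: read the oldest queued job, then mark it running.
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE state = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET state = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row


def finish(conn, job_id):
    conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
make_queue(conn)
before = conn.total_changes
job_id = enqueue(conn, "page the on-call")
job = claim_next(conn)
finish(conn, job[0])
row_writes_per_job = conn.total_changes - before  # insert + update + delete
```

A real multi-worker version would also need row locking around the claim step, which adds further contention on the same rows.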

00:04:17

Speaker

Yeah. Got it. So you had your MySQLs, you were doing the queue in there, and not only had you decided that you needed multi-region, had you tortured MySQL into multi-region?

00:04:31

Speaker

Or was that part

PagerDuty's Scale and Reliability Challenges

00:04:33

Speaker

of the transition? I don't think we... No, we were using a DRBD-based replication mechanism. For folks not familiar with DRBD, it's basically a block-based replication mechanism where you're doing your replication at the disk level as opposed to the MySQL level.

00:04:53

Speaker

So another way to think of it is as kind of a fancy RAID device. And again, the nice thing with it is that it actually can simplify a lot of the MySQL operations. But again, if you have crazy high write throughput, all of a sudden you can start pegging the disk and run into those constraints.

00:05:11

Speaker

Got it. Okay. So it didn't seem like it was a good idea to go with MySQL. You were already running into problems with it. You made this choice to go to Cassandra. Tell me a little bit about what the state of PagerDuty was at that time, like user size and...

00:05:26

Speaker

So company-wise, we were around that 10 to 20 mark. The number of customers was probably in the couple of hundreds. You know, PagerDuty has such an interesting usage pattern: it's not very seasonal.

00:05:42

Speaker

It's not like a lot of consumer products. It's very unpredictable and very, very spiky, of course. And so we were able to get away with having a pretty small server footprint for a very long time because our steady state was pretty low. And yeah, we'd have these spikes where we'd be able to scale up reasonably well.

00:06:03

Speaker

And for that reason, we would over-provision on purpose just to make sure customers never felt that. So yeah, a couple hundred customers. But even at that low volume, we were running into these constraints at the database level. And the bigger problem was that we positioned ourselves, and rightfully so, as a company you could trust with your data. We had this thing where if we gave you a 200, we were making a guarantee to customers that that data had been replicated across two AWS regions.

00:06:40

Speaker

And so that was something that I can compare to a lot of other companies. Reliability and consistency were just so important early on in the company's DNA. And I credit our founders, but I also credit our customers for placing that trust and responsibility with us, which we saw as a privilege and obligation to the top.

00:07:02

Speaker

Yeah, absolutely. I mean, that was both uncommon for SaaS products to promise at that time, and it was still reasonably common at that time for Amazon to lose a region. Oh, absolutely. and

Learning from AWS Failures and Cassandra's Role

00:07:12

Speaker

And the thing was, when I think about failure injection and Chaos Monkey... at PagerDuty, we pioneered this thing called Failure Friday because we didn't have the automation sophistication at the time to deploy something like Chaos Monkey.

00:07:29

Speaker

It was all based on that premise: regions do fail, servers fail. And if we fast forward to now, yeah, these things fail, but it's not nearly at the grand level though.

00:07:42

Speaker

Really? Didn't GCP just have a... Yeah, we're recording this the day after everyone didn't work for three hours, so we've got to be careful what we say. So many people, all my friends who work over at GCP right now... but it's true. There were these moments, especially in the mid-2010s, where half the internet was just gone for a few hours. And it was a joke of, all right, folks, take the rest of the day off while the folks at AWS are fixing their stuff, and then we'll come back and reboot everything when we can.

00:08:23

Speaker

Yeah, different constraints a little bit, but the more things change, the more they stay the same.

00:08:31

Speaker

Okay, so Cassandra. Yeah, so Cassandra. For folks not familiar, Cassandra is a NoSQL database based on some of the original ideas from the Dynamo paper that Amazon published, gosh, a long time ago.

00:08:48

Speaker

And it was created by a bunch of different companies and engineers through the open source project. And the big thing with Cassandra at the time was horizontal scaling. That was really where it shined. It beautifully partitioned data.

00:09:04

Speaker

And so you could separate your reads and writes. And if you're going from a relational database like MySQL or Postgres to Cassandra, the big thing that I really had to wrap my head around was this idea that your joins were in your code now.

00:09:22

Speaker

And you weren't offloading a lot of the data retrieval and writing. That complexity shifted into the code and away from the database itself. There are pros and cons: in some areas you create complexity, but you simplify some of the operational stuff.
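
As an illustration of "your joins are in your code," here is a minimal sketch of an application-side join. The record shapes (`incidents`, `services_by_id`) are hypothetical examples, not PagerDuty's data model. In a relational database this would be a single `JOIN`; here the application fetches both sets and stitches them together itself.

```python
def join_incidents_with_services(incidents, services_by_id):
    """Attach each incident's service name by looking it up in application code."""
    joined = []
    for incident in incidents:
        service = services_by_id.get(incident["service_id"])
        joined.append(
            {**incident, "service_name": service["name"] if service else None}
        )
    return joined


# Hypothetical rows, as if fetched by two separate key lookups.
incidents = [
    {"id": 1, "service_id": "svc-1", "summary": "disk full"},
    {"id": 2, "service_id": "svc-9", "summary": "high latency"},
]
services_by_id = {"svc-1": {"name": "billing-api"}}

rows = join_incidents_with_services(incidents, services_by_id)
```

Note that the code also has to decide what a missing match means (here, `None`), which is a decision the database would otherwise make through the join type.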

00:09:40

Speaker

And like I was saying earlier, the Cassandra maintainers are always very clear. Do not build a queue on top of Cassandra. Do not run Cassandra across multiple regions.

00:09:53

Speaker

Do not run Cassandra in environments with high network latency and variability.

Technical Community and Cassandra's Configurability

00:10:00

Speaker

But like I said, we were trying to really create trust with our customers. Again, if we gave you a 200,

00:10:10

Speaker

you knew that data was replicated across at least two regions. We were actually always going for three. But that was the beauty of Cassandra. There was a lot of tunability in the config files. Even early on, we were using it, I think, from 0.8, I want to say. It was definitely pre-1.0. That's a PTSD-inducing version number to put in production.

00:10:35

Speaker

Yeah. Well, that was the funny thing, because with Cassandra, especially early on, a minor rev was not a minor rev. So much changed across each one.

00:10:48

Speaker

And we'll get to the upgrades in a little bit. I don't remember the exact version numbers, but this was technically a minor rev and it was not.

00:10:59

Speaker

It was not a minor revision at all. But the really cool thing about Cassandra back in the day was the configurability. We could put things into the config files like: here's when to give back a proper acknowledgement, here's when you can say the write failed, here's when you can say the read failed.
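
The general arithmetic behind this kind of tunable acknowledgement in Dynamo-style systems is short enough to sketch. With N replicas, a write that waits for W acknowledgements and a read that consults R replicas are guaranteed to overlap on at least one replica when R + W > N. The function names below are hypothetical, and the transcript does not say which specific settings PagerDuty used.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return write_acks + read_acks > replication_factor


# With three replicas, a majority is two.
majority_of_three = quorum(3)
```

For example, with three replicas, quorum writes plus quorum reads (2 + 2 > 3) always overlap, while single-replica writes plus single-replica reads (1 + 1 > 3 is false) can miss the latest value.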

00:11:19

Speaker

And that was just so unique, which was also one of the big reasons we ended up using Cassandra over a lot of other options. And just to set the scene a little bit, things like Kafka were in their infancy.

00:11:32

Speaker

They didn't have consistency guarantees. The horizontal scaling and sharding story still wasn't quite there yet. And the other thing was we just didn't know anyone on the Kafka team at the time, while we knew a couple of the Cassandra maintainers.

00:11:51

Speaker

And this is where, when you think about technology choices, it's always interesting. There's the quote-unquote right decision, but that's such a loaded term, right? Because for us, the right decision came from, like, I had just,

00:12:07

Speaker

I'd just been working at Netflix, which had done some amazing work with Cassandra. So I was able to learn from some brilliant folks, and I knew some maintainers who were working at Netflix. And then I joined PagerDuty, and they had made the decision to use Cassandra before I'd even joined.

00:12:24

Speaker

But what was hilarious, and this became one of the things I love about the open source community, was how, when PagerDuty started to use Cassandra, we were able to reach out to the maintainers and they were wonderful. They were amazing at answering questions.

00:12:40

Speaker

And there was a lot of, like, wait, what are you doing? Why are you doing that? And then we'd explain the constraints and what we were trying to do, and they were like, okay, yeah, let's try to figure this out. There are still open source communities that work that way, but there's still,

00:12:56

Speaker

I don't know, maybe I'm getting a little old and jaded, but there's a lot of "you're holding this wrong." Yeah, absolutely.

Balancing Technical Solutions with Business Needs

00:13:04

Speaker

Which frustrates me with certain communities. And this is that weird intersection of business and technology constraints, and as an engineer working on these things you are right at that intersection, and you've got to figure it out sometimes. And sometimes the quote-unquote right technical solution doesn't actually solve the business problem, and you kind of figure that out. I mean, that makes a ton of sense, that you're just like,

00:13:37

Speaker

yeah, there's better technology. Us sitting here in 2025 are like, yeah, there are better technologies out there in 2025. It was not 2025 then. And this is true of a lot of things today. I'm sure people building in AI today are like, yeah, I get that feeling.

00:13:54

Speaker

Yeah. And the thing is, especially for startups, you don't have time to figure out "right." You only have time to figure out good enough most of the time. And that means, like,

00:14:06

Speaker

hey, you have an engineer that has some experience with the quote-unquote wrong tool, but it's going to be the quickest to deploy and will solve it for maybe one order of magnitude of scale. Great. That is the right solution there.

00:14:19

Speaker

And that just gets you to that next... You know, scaley ball that can be the next funding round. So that's what we should optimize for. And I think as engineers, sometimes we get so caught up in the technical right that we forget what's right for the business. And even when you have these warning signs on the manual that say do not use Cassandra as a queue,

00:14:44

Speaker

it was actually, and again, I will stand by this, the right decision at the time with the limited resources, the limited knowledge, and the time that we had. I love it. Okay. I know this is maybe reaching.

00:15:00

Speaker

I would love it if you could remember any of the specifics: what did you go to the maintainers with where they were like, what? Oh, okay. So I think the big one was the cross-region stuff, because Cassandra at the time was not designed to run across high-latency environments. So when we told them... I forget what the exact config was, but I think it was a timeout value. And I think the standard timeout was in the tens of milliseconds.

00:15:37

Speaker

And we were like, yeah, we want to do a couple of seconds. And they were like, wait, you want to figure out how to use this thing with three orders of magnitude more time?

00:15:50

Speaker

What? And again, we explained it to them and they were very... I think patient is the nicest way to say it.
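
A small simulation shows why a timeout tuned for a single data center breaks down across regions. This is a sketch under assumed numbers: the round-trip times, the 20 ms and 2000 ms timeouts, and the `write_succeeds` function are all hypothetical and are not Cassandra's actual settings or code.

```python
def write_succeeds(replica_rtts_ms, timeout_ms, required_acks):
    """A write succeeds if enough replicas acknowledge within the timeout."""
    on_time = [rtt for rtt in replica_rtts_ms if rtt <= timeout_ms]
    return len(on_time) >= required_acks


# Assumed round trips: one local replica, two replicas in other regions.
rtts_ms = [1, 70, 150]

# A timeout in the tens of milliseconds only ever hears from the local replica.
local_style_timeout = write_succeeds(rtts_ms, timeout_ms=20, required_acks=2)

# A timeout of a couple of seconds leaves room for the cross-region acks.
cross_region_timeout = write_succeeds(rtts_ms, timeout_ms=2000, required_acks=2)
```

Under these assumptions the short timeout fails every write that needs a second region, which is why the timeout had to grow before any cross-region guarantee could hold.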

00:16:00

Speaker

They were inquisitive. They were nonjudgmental, which was pretty cool. That's great. Yeah. They were very curious what the use case was. And that was,

00:16:13

Speaker

like, rare word, it was fun. It was fun to solve that problem with them and to figure out, hey, where's the config file? Because remember, the maintainers obviously know the standard config file in and out, but there are always hidden config variables, right? Right. Yeah. You dive into the source code to figure that out.

00:16:31

Speaker

But when you can work with the maintainers directly... because, you know, you can literally just search in the source code and find the values. But you don't want to set them arbitrarily. And so that was really fun, to work with them on figuring out some of those more obscure config variables that we got to play with.

00:16:51

Speaker

That's fun. So was there anything wrong with setting that value to, like, four seconds? It wasn't designed for it, right? The biggest thing was writes ground to a halt initially when we just set it that way. There were a bunch of other variables we didn't set. I think it wasn't a global timeout value. I think it was just a write timeout value.

00:17:15

Speaker

And

Cassandra Implementation Challenges and Solutions

00:17:16

Speaker

so we set it there, but the cluster, I remember, just ground to a halt, and we were doing this all in test. And it just stopped working, basically. You couldn't write to it anymore. And from what I remember, it had something to do with the way that the replication worked. You could think of the Cassandra replication almost like a ring.

00:17:36

Speaker

And it would basically go in a certain direction at a certain speed. But if the servers on the other side of the ring didn't acknowledge the write quickly enough, every other server was like, okay, this data never existed. Sure. Yeah, that makes sense.
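
The ring picture can be sketched as a simple consistent-hash ring in which a key's replicas are the next few nodes clockwise from the key's position. This is a simplified illustration, not Cassandra's actual partitioner or replication strategy (which can also take data centers and racks into account). The node names and the `replicas_for` helper are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor=3):
    """Pick replicas by walking clockwise from the key's position on the ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is past the key's token, wrapping around the ring.
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("incident-42", nodes)
```

If some of the chosen replicas sit in a distant region, their acknowledgements are the slow ones, which is where the timeout tuning described above comes in.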

00:17:52

Speaker

And so that's where, again, we had to play with a bunch of... It wasn't just this one write timeout value that we had to play with. There was a bunch of other ones. And as far as I remember, there was some like read timeout values we had to play with as well.

00:18:05

Speaker

Which, again, was the nice thing about working with the maintainers. Also, I was lucky that we had a couple of early operations engineers at PagerDuty who were not afraid of diving into source code.

00:18:20

Speaker

I was so lucky, and yeah, it was just really interesting. In fact, one of the more famous things is by a brilliant engineer I worked with, Evan Gilman. He wrote this article on a ZooKeeper upgrade. If you search for PagerDuty ZooKeeper upgrade, there's some really fun stuff that he wrote about it.

00:18:43

Speaker

Cool. That's a different episode. We'll do that in a month or two. We'll do the first multi-person episode. Cool.

00:18:54

Speaker

Yeah, it makes a ton of sense that you have to have bags of variables, bags of tunability, yeah. Bags of config that work together. Yeah. Well, and the scary thing with Cassandra too was that we were using Chef to automate a lot of the config generation, like, here are the IP addresses of the servers you're supposed to talk to, that kind of stuff. So it was there from a service discovery point of view.

00:19:22

Speaker

But what was always so scary about that was there were certain variables that had to be consistent across the entire cluster. And there were other variables that it didn't matter.

00:19:35

Speaker

So in the context of an upgrade, you really had to be careful about not just updating the config file, but specifically which config variables you were updating in which steps.

00:19:48

Speaker

And if you just ran Chef on the server to generate the config file, it just updates the whole thing at once. And the thing was, we were trying so hard to avoid these artisanally configured variable files, but we ended up having to do it.
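
One way to picture the problem is to split a config change into phases instead of applying the whole file at once. The sketch below is only an illustration of that idea: the key names, the `plan_config_rollout` function, and the choice to apply cluster-wide settings first are all assumptions, not how Chef, Ansible, or PagerDuty's tooling worked.

```python
def plan_config_rollout(current, desired, cluster_wide_keys):
    """Split changed settings into phases to be applied in separate steps.

    Phase 1: settings that must match on every node in the cluster.
    Phase 2: settings that can differ per node.
    (The ordering is an assumption made for this sketch.)
    """
    changed = {k: v for k, v in desired.items() if current.get(k) != v}
    cluster_phase = {k: v for k, v in changed.items() if k in cluster_wide_keys}
    node_phase = {k: v for k, v in changed.items() if k not in cluster_wide_keys}
    return [phase for phase in (cluster_phase, node_phase) if phase]


# Hypothetical settings; none of these are real Cassandra option names.
current = {"cluster_timeout_ms": 20, "heap_mb": 4096, "listen_address": "10.0.0.1"}
desired = {"cluster_timeout_ms": 2000, "heap_mb": 8192, "listen_address": "10.0.0.1"}

phases = plan_config_rollout(current, desired, cluster_wide_keys={"cluster_timeout_ms"})
```

A tool that regenerates the entire file in one run collapses these phases into a single step, which is the behavior the speaker is describing as dangerous during an upgrade.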

00:20:08

Speaker

Like we had to SSH to the machine and edit it by hand, and we hated it. We were so annoyed that we had to do it, but we couldn't figure out what a better way would have been. We didn't have time.

00:20:20

Speaker

Right. And you dealt with the SSHing. Right. We dealt with the SSHing. And I remember one of our engineers had introduced, what was it, Ansible into the environment, which made that a little bit better. But the other crazy thing with that kind of an upgrade was that there were certain config options that were actually stored in the database as well.

00:20:47

Speaker

So there's the config file that you deploy when you start Cassandra up. And then there are certain configs in the database itself that you have to change with what is basically a write operation. And so separating those out too, in the whole grand plan of how you upgrade this in production without taking the whole thing down...

00:21:05

Speaker

Absolute nightmare. Because so many things could break and it just felt so brittle. And again, we were doing things where, like, the Cassandra maintainers were great. They were like, well, just treat it like cattle. Shoot that one and bring up a new one.

00:21:21

Speaker

But the problem was, even though we were operating with, how much data are we talking, maybe a hundred gigs, very small amounts, when you're shipping a hundred gigs across the WAN, right,

00:21:34

Speaker

it takes a while, right? And that was the problem, because we weren't working over 10 gigabit links. We were working over the open internet, effectively. And on top of that, we were encrypting everything ourselves with IPsec. So our data transfer was just slow.
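
The arithmetic behind "it takes a while" is easy to check. The sketch below compares a 10 gigabit link with an assumed 100 megabit effective rate over the open internet; the 100 Mbit/s figure and the 50% efficiency factor (standing in for encryption and protocol overhead) are illustrative assumptions, not numbers from the episode.

```python
def transfer_seconds(size_gigabytes, link_megabits_per_second, efficiency=1.0):
    """Time to move `size_gigabytes` (decimal GB) over a link of the given speed."""
    bits = size_gigabytes * 8 * 1000**3
    return bits / (link_megabits_per_second * 1000**2 * efficiency)


# 100 GB over a 10 Gbit/s link: a bit over a minute.
fast_link = transfer_seconds(100, 10_000)

# 100 GB at an assumed 100 Mbit/s: over two hours.
slow_link = transfer_seconds(100, 100)

# Same link at an assumed 50% efficiency: over four hours.
slow_link_with_overhead = transfer_seconds(100, 100, efficiency=0.5)
```

At hours per node rather than seconds, "just replace the node" stops being a cheap operation, which is the point being made about treating servers like cattle.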

00:21:52

Speaker

And so we actually couldn't, at least back then, treat everything like cattle because, again, there was a little bit of artisanal work that was being done on a per-server basis.

00:22:05

Speaker

Yeah.

Managing System Upgrades and Complexity

00:22:06

Speaker

No, it's funny. It's almost easier to think about state as, this thing cannot be deleted. It is a pet. It has data on it, and if a customer loses that data, we're all going to go out of business.

00:22:18

Speaker

And then there's the stateless stuff. But all the hard and interesting problems end up being in the middle there, where it's like, it'd be really annoying if we lost this node. Yeah.

00:22:30

Speaker

And again, maybe it's that the older I get, the answer to most questions is it depends, right? And it's so frustrating because when I was writing web apps very early in my career, one of the big things was never store state on an app server.

00:22:48

Speaker

Makes sense. You lose app servers left and right. But let's say you have some highly ephemeral data that you want to cache, like a local instance of memcache with a single process writing against it.

00:23:01

Speaker

It is hella fast. And there are a lot of scenarios where it makes sense to do that. And again, if you lose it, that's a bummer, but it may be fine because it's highly reproducible data. But to your point, a lot of the interesting problems lie in that in-between zone where it's like, no, no, no, this has to end up on a database. No, no, this is an application server. This strict separation of trick and state gets blurry really quickly. And then you think about when you start scaling up

00:23:32

Speaker

and you're having to maintain certain levels of performance, this is where those compromises start to creep in. And so, bringing it back to a Cassandra upgrade, this is where the quote-unquote manual doesn't work because you introduced so much complexity, and I would argue necessary complexity, because your business is getting more complex. Your customers' uses, desires, and demands are more complex than what it was originally designed for.

00:24:02

Speaker

Yeah. That makes a ton of sense, that at some point you're going to run into some scale problem where you need something that doesn't work with the architecture you designed two years ago.

00:24:14

Speaker

Yeah. And that's okay. And this is where, again, bringing this back to upgrades a little bit, you still have to be willing to do that work though.

00:24:25

Speaker

And I think that's where I definitely got caught early in my career, where we'd introduce all this complexity for good reasons, and then some CVE gets published or there's some new performance improvement or something. And maybe you forked a library, or something equivalent to that in your database, and you can't upgrade.

00:24:50

Speaker

And those are some of the scariest moments, when you start going through it. It's not even about time or money. You just can't, because you made some one-way decision at some point in time.

00:25:05

Speaker

And that was something we actually did a pretty good job of with our databases at PagerDuty. There were some other parts of the code base where we didn't make the best decisions, but at least with the database, we were like, no, stick with the main line. What are they using upstream? Okay, bring that in and then make sure.

00:25:27

Speaker

There were times, and I remember thinking this, and thankfully my engineers were much smarter than me and didn't let me do this, where I wanted to basically set up our own config files and just fork the database and put in our own little logic around that. It was either with replication lag or something. And they were like, no, no, no.

00:25:49

Speaker

That's going to make things so much harder later on. They were 100% right, and good for them. They're smarter than me.

00:25:58

Speaker

I'm glad they're there. I mean, I joke all the time that, you know, every now and then I speak about this stuff at conferences, and the first or second slide was always a giant disclaimer of, like, I'm up here talking about this stuff.

00:26:13

Speaker

All the folks that I work with are so much smarter. And thank God that they let me hang around and prevent me from not only shooting myself in the foot but also managing to shoot them in the foot too. Right.

00:26:28

Speaker

Okay, I want to talk about the actual transition here. So you started on MySQL, you've done a bunch of experiments with Cassandra. How did you actually move over to it?

00:26:39

Speaker

Hmm. So, as far as I remember, we actually did take a small little outage. It was one of those things where we had done the analysis of, like, okay, what if we do a bleed over, and we realized, you know what?

00:26:57

Speaker

This was actually one of the rare scenarios where it was safer to just take, I think it ended up being, we're talking tens of seconds, less than a minute. Yeah. That was the safer option. Obviously, we had practiced it a bunch of times. And it's terrifying, I mean, of course, these things. But that actual cutover was probably 20, 30 seconds where we were still caching everything, but things were not being written to the database. And

Customer Communication and Trust

00:27:29

Speaker

then we flipped it on, it all flushed through, and then everything thankfully worked. So that initial cutover from MySQL to Cassandra was...

00:27:36

Speaker

was still like that. That's interesting. So did you publish to customers that you were taking downtime, or was this just an internal thing? No, it was only an internal thing. And we had very high confidence, and I think rightfully so. We had practiced it so many times that we were feeling good about doing it, and it worked.

00:27:55

Speaker

That's cool. I feel like for those kinds of scenarios, the fact that it went well is the testament that you put in the reps. You don't get lucky the first time. You don't get lucky with these kinds of things. And I credit especially the founders. They cared about the plan

00:28:14

Speaker

a lot, and there was a lot of scrutiny applied to everything from everyone. And yeah, practice, that was the thing. We had two environments, a staging environment and a load test environment, and we just practiced on both constantly to make sure that it was going to work. And that's the thing: databases are hard. Upgrading databases and cutting across is very different from upgrading a code base. And again, obviously upgrading code bases has its own complexities and challenges. I'm not trying to minimize them.

00:28:50

Speaker

But the thing that always scared the hell out of me with upgrading databases was, when you start dropping data, that's it. That's it. Yeah. The data's gone. The data's gone. And look, there were moments throughout PagerDuty's history where that happened occasionally, thankfully not too often, but occasionally.

00:29:09

Speaker

And those are some of the hardest customer conversations I had. Yeah. You know, it was a huge trust violation where I was like, yeah, we gave you a 200 and that wasn't actually true.

00:29:25

Speaker

And I wish we had sent a 500, and there were just some of these weird edge cases where we did that. And I tell technology leaders, especially in the reliability space, all the time: as hard as it is, always be honest with your customers when you screwed up, and they'll be pissed, as they should be.

00:29:48

Speaker

But you don't want to be in the situation where you tell your customers that you effectively lied. And that's that's a much, much worse position to be in. And that's where, you know, the difference between a 200 and a 500 is that.

00:30:01

Speaker

If you've served up a 200 and it was a 500, you lied. Yeah, that's not bad. It's an interesting kind of distinction to make, especially, again, today, 2025, where there's this weird middle world of computers are also agents and they talk to us in English. But at the end of the day, people aren't doing business with a machine or an AI or a person. They're doing business with an entity. And that entity, in every way, needs to be honest with them.

00:30:37

Speaker

And that's both the people they call and yell at and the machines that give them 200s or 500s. All the same. And this is where, again, obviously infrastructure reliability, this stuff is near and dear to my heart, but it comes down to trust, ultimately, between the two entities that are working with each other, doing business with each other, calling APIs from each other.

00:30:58

Speaker

It's trust. And it turns out trust is pretty valuable, and it's hard to earn and easy to lose. And yeah, I was in a lot of customer calls, hat in hand, apologizing, and I could say, well, look, in the last year we had four to five nines, and it didn't matter.

00:31:22

Speaker

That doesn't matter. It doesn't matter. The second you

Team Coordination During Transitions

00:31:25

Speaker

you send a 200 when it was actually a 500, you've lost it, which, you know, later down the line actually led to some pretty interesting design choices on our APIs, where we were actually pretty harsh on ourselves with our 500s, but purposely so.

00:31:39

Speaker

Because we wanted customers to know: hey, if we're not 100% sure this is a 200, we're going to serve a 500. We'd rather duplicate something downstream than say we got it. Interesting. Yeah, that makes sense.
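As a rough illustration of the "serve a 500 unless you are sure" rule described above, here is a minimal Python sketch. It is not PagerDuty's actual API code; `handle_event` and the `durable_write` callback are hypothetical names, and the assumption is that the callback returns `True` only once the write has been durably acknowledged.

```python
from typing import Callable


def handle_event(event: dict, durable_write: Callable[[dict], bool]) -> int:
    """Return an HTTP-style status code for an incoming event.

    Hypothetical sketch: report 200 only when the write is positively
    confirmed. Any doubt (an exception, a missing acknowledgement)
    becomes a 500, so the caller retries instead of trusting a false success.
    """
    try:
        confirmed = durable_write(event)
    except Exception:
        # We cannot tell whether the write landed, so we refuse to claim success.
        return 500
    return 200 if confirmed else 500
```

The trade-off is the one named in the conversation: a caller that retries after a 500 may produce a duplicate downstream, which is treated as cheaper than telling a customer "we got it" when the event was lost.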

00:31:55

Speaker

I want to come back to that. I wonder if you sort of look at the moment of, okay, so you did this, you set up Cassandra, you did a bunch of interrogation of the actual Cassandra ring, set it up, got it working, gave it the characteristics you want, then did this cutover.

00:32:11

Speaker

I imagine y'all are still pretty small at this point, and there's only probably a couple engineers working on it. Oh, yeah. Mostly, yeah. Probably like seven, eight engineers working on it. Okay. Yeah.

00:32:23

Speaker

So, going back to this idea that PagerDuty is a monolith to its customers. Like, that's the whole thing.

00:32:34

Speaker

How did you, did you have to do any work to incentivize like the rest of the company or to inform or or did they have to do any work to deal with that? Or did you manage to keep everything kind of like below the waterline for them as well?

00:32:46

Speaker

Oh, no, no, everyone was impacted. So while there were seven, eight engineers primarily focused on it, every engineering team was involved, because PagerDuty, like most companies, you know, was this Ruby on Rails monolith that we were starting to tease apart. And the first thing we were teasing apart

00:33:06

Speaker

was what we called our notification pipeline. And that was what the Cassandra queue was being used for. But, you know, it was so core to the product, especially at the time.

00:33:20

Speaker

And so every engineer had to do something. And so while there were seven, eight engineers working on it pretty much full time for a couple of weeks, maybe a few months,

00:33:31

Speaker

during that time frame there were these small little things like, oh crap, this other part of the data model is interacting with the notification pipeline, and we have to work with, of course, the one engineer that knows it, because at a startup only one engineer knows it, not me, to fix that. Got it. What was your approach to that? Was it people-based? Was it tooling-based?

00:34:05

Speaker

How did you keep track of that? Because the company isn't huge at this point, but it's meaningful. It was people-based. And, you know, we had the benefit that there were so few of us that we could kind of do that. But, you know, we had some, gosh, what were we using?

00:34:20

Speaker

I think we were using Asana at the time. Or was it AcuNote? Sorry, we went through a lot. We went through a bunch of project management systems in the early days. And we tried to assign out a bunch of tickets and everything. And the problem that at least I ran into was, because I didn't always fully understand the complexity of the work, I, in hindsight, realized I was oversimplifying the work.

00:34:55

Speaker

You know, and again, I was managing the team, but I was still technical enough to be able to look at the code base to kind of figure it out and make some guesstimates. And boy, did I get it wrong, looking at the code base and thinking, okay, here's what I think needs to change. And I would write a ticket based on that scope and everything.

00:35:16

Speaker

And, ah and then the one engineer responsible for that model would come back to me like, what the hell are you talking about? Like, that's one, that's not how this works. But it was this, this, like, on my part, just a lot of frustration with myself, actually, of like,

00:35:31

Speaker

Why is it so hard to coordinate this stuff? And it was because, even at a small company, less than 20 people, doing what, yes, was a very sophisticated, hard upgrade. I don't want to minimize that.

00:35:46

Speaker

The code base wasn't that big. And yet I couldn't reason enough about what this upgrade was going to cost the handful of other engineers that, again, weren't working on this full time. So they weren't in it constantly, but we still needed some work from them. And yeah, I repeatedly got it wrong. And so basically I ended up just...

00:36:08

Speaker

filing tickets with individuals being like, hey, can you please scope this out for me? So I can start to get a sense, as, I guess, pseudo project manager, of what this is going to cost us fully.

00:36:23

Speaker

And again, thankfully, I worked with really smart people that were like, yeah, those will take me a couple hours, maybe a day or two, but that's it. And so it wasn't that expensive. But each time, I thought, oh, something will take, I don't know, 30 minutes to an hour.

00:36:39

Speaker

And again, like I'll give myself a little bit of grace looking back, right? Like, of course, like I wasn't working in that code base all the time. So how the hell am I supposed to know exactly how much an upgrade is going to cost? There's no way I could know that fully. And hell, people working in that code base 100% of the time don't know.

00:36:58

Speaker

And that's what is so frustrating about this kind of work sometimes, where it literally feels like this bottomless pit of work where you just keep digging and digging and digging and finding more work. And, you know, it sometimes feels like this weird n-squared, n-cubed, n-to-the-n problem where it just keeps getting bigger and bigger.

00:37:25

Speaker

Like fractal scoping or something. Yeah. You know, even working at some of the larger companies that I worked at, where you try to come up with some formula, like, oh, here's the size of the upgrade, okay, now multiply this by the number of teams and multiply this by the number of services that each team has. It doesn't work.

00:37:43

Speaker

Yeah. And square it. Well, my joke was always, time-wise, double it and add an order of magnitude. So you think it's going to take two months.

00:37:54

Speaker

It's going to be four years.

00:37:57

Speaker

Yeah. So why do you think that is? You are not the only person to say this. It seems like every week we talk about this: I thought this was going to be easy. And then I talked to someone.

00:38:10

Speaker

I feel like I reference this on half the shows, but there was someone who was like, we were upgrading from Python 2.7 to 3. And it turns out that actually we were upgrading from Python 2.6. And literally nobody knew that.

00:38:24

Speaker

Why do we keep running into this pit of unknowability? All right. Okay, I'll give you my cynical take, which is probably a little funnier, and then I'll give you my more optimistic take. My cynical take is just outright hubris: we think we know our shit, we think we're so smart and we're so clever. I built this thing, I'm its maker, therefore I will tell it how much time it needs for an upgrade, because I'm the one who built it, right?

00:38:57

Speaker

That's sure. And

Software Complexity and Upgrade Efforts

00:38:59

Speaker

it's objectively, you know, obviously wrong. And I'll couple that with my optimistic take on it. Something that I would always remind my engineers about is this idea that complexity increases regardless of what you do.

00:39:13

Speaker

If you freeze a feature set on any product out there, and particularly databases, customers are still using it. And they're going to add more data. They're going to introduce new read and write patterns. They're going to figure out new ways to break your API, whatever that is.

00:39:33

Speaker

And if you think that you can predict all that, if you think you can keep track of all that in a dashboard or something, I would love to meet you and I would love to see where you did this successfully, because I just don't think it's possible for a human, team, or company to be able to track all that. And so when I get to this idea of upgrades taking forever, and we joke about it, I think it comes from this place of, well, the problem got harder without us even realizing it.

00:40:07

Speaker

And that's where I actually love that story about Python 2.6 or 2.7. It's just like, oh my gosh, no one was keeping track of this. And why should you keep track of it? If it's not hurting you, if you're not taking advantage of, I don't know, Python 2.7 features, then yeah, it makes sense that all of a sudden the problem got harder without you realizing it. And that's the thing: just given time, things get more complicated. They get more complex.

00:40:35

Speaker

And there's nothing you can do about it. Then when I think about upgrades and these other things, this is our effort to try to reduce that complexity and, I guess, even reduce some level of entropy in the systems.

00:40:48

Speaker

But the problem is it's short-lived, because the second you finish, guess what? Time keeps going. It's going to get more complex and more complicated regardless of what you do.

00:41:00

Speaker

Yeah, absolutely. That feels right. It's like, code bit rots for sure. And you're always creating more stuff you don't know about. The world is outpacing you in terms of stuff. And that's coming into this little beautiful walled garden you've created, through the open source code you work with or your users' behaviors or whatever it is.

00:41:20

Speaker

Yeah. I buy it. I totally buy it. And again, I don't mean to say give up, obviously. I will never say that. And obviously, upgrades and doing that kind of work are near and dear to my heart. But I do think as technical leaders, we need to acknowledge that there's a lot of baggage when we say, hey, this needs to be upgraded.

00:41:47

Speaker

And I think about, even at PagerDuty, when we were upgrading a version of Cassandra with the explicit mandate of improving our reliability and improving our customer experience, we're literally a company built on the idea of, hey,

00:42:06

Speaker

increasing reliability directly feeds into revenue. And even there, it was hard to justify it sometimes, because of this opportunity cost of, well, we could do this, or we could build out some other features or some other product that our customers are also asking for. I always joke that I had the best job in the world when I was responsible for reliability at PagerDuty, because I cheated so much. When engineers or other leaders would be like, hey, why is reliability important? I'd be like,

00:42:43

Speaker

do you know where you work? And I got to cheat when people were like, hey, why is this upgrade important? And I could say things like, well, here's how it's going to feed into our reliability.

00:42:55

Speaker

And, you know, we had a company culture where literally even our sales reps got it. I was so stupid lucky to work on this stuff. But then over time, as companies, products, everything just gets more complex, I tried very hard to hold myself accountable, that I didn't cheat, that I would say, okay, well, yes, reliability is important, but here's the tangible value that I think it's going to be adding for our customers, and whether that was

00:43:26

Speaker

you know speed, performance, reliability, you know, or things like, hey, like, you know, customers won't have to write into our support team to figure out why they got duplicate notifications, things like that.

00:43:37

Speaker

I tried to hold myself accountable to that, which, even for myself, was hard to do. There were times where I wanted to cheat, but again, I tried not to. And then I think about every other mortal, every other software engineer, every other technology leader in our industry, and they're advocating and arguing for this stuff. And it's hard.

00:44:04

Speaker

It's very hard. A lot of peers will ask me for advice, like, hey, Arup, I want to really push this upgrade through. I'm really getting sick and tired of Python 2.6.

00:44:19

Speaker

Which I'm told we have none of. And I grill them. I'll put them through the wringer, like, okay, no, seriously, justify this to me. Really show me the dollars and cents of why this matters.

00:44:30

Speaker

Because, you know, what as you were talking about earlier, the cost is always going to be higher than you think. And that return, you you better be able to articulate that. Otherwise, know what? If I'm your CTO, I'm not going to support that.

00:44:44

Speaker

No. I mean, despite being, I feel like, the poster boy for "you should upgrade your stuff" these days, my actual advice to most people when they consider an upgrade is, do you have to?

00:44:59

Speaker

And the answer most times is no, until it is impacting something that matters, right? Your feature development, your reliability, your cost, whatever it is. Unless you can peg it to one of those things that matters, what are you doing here, man?

00:45:17

Speaker

No, and, you know, I work with a lot of early stage startups these days. And they'll ask me questions around reliability, or security is another good one. And I love the intent. I do. It's something that I love and is very important to me, and I built an entire career on it.

00:45:31

Speaker

But I have the same pause. Like, what are you talking about? Like, is this is this really the most urgent thing? Like, do you have to do this right now? And I know that sounds very um like almost reckless.

00:45:44

Speaker

Like, I get that. I understand why people think that. And maybe it is, but the truth is that it becomes almost like a siren song to so many engineers. Especially, I find, engineers a little bit earlier in their careers will hear the siren song: well, if I do an upgrade, I've added business value. It's like, no, that's not necessarily always true. I'm sure there are some benefits tucked away somewhere, maybe it's a little bit faster, but

00:46:13

Speaker

you've got to articulate it. You've got to be able to figure out, hey, here's the return we're getting on it. Otherwise, as a technology leader, you can't sign off on these things. Yeah, if you don't know it, then why do it at all?

00:46:25

Speaker

Yeah, I'm curious. I loved your point earlier about everything adding complexity no matter what. And, you know, Cassandra 100% added complexity.

00:46:36

Speaker

So tell me a little bit about... you know You're not done at that point. Now you have Cassandra. You've got to operate it. It's on version 0.006 alpha minus three.

00:46:48

Speaker

How did you think about tooling in that world? Is that something where you could lower the cost of incremental upgrades and of operating it at that point?

00:47:01

Speaker

Or were you just like, you are now in a world of constant cost? I mean, I think the latter, unfortunately. I would have loved to... That would be a good story here. No kidding. Sorry, I take that back. We definitely did create some pretty decent automation based on that first upgrade. And again, we did have this mindset of, okay, we're going to have to do this again at some point. And...

00:47:27

Speaker

and so we did build some tooling, particularly with Chef, which we were using very heavily at the time. But,

Resilient Architecture and Operational Challenges

00:47:35

Speaker

you know, we ran into so many constraints with the automation. And again, those are self-inflicted constraints. I'm not going to complain about it by any stretch. But, again, I was giving you the example about Chef, where it would update the config file across all the servers at the same time, which sometimes could just create havoc on Cassandra.

00:47:57

Speaker

And even the operating guides for Cassandra, a lot of them would literally say, do not update all the configs at the same time. And that's literally what Chef would do. And we tried to build in some, I forget what the plug-in architecture for Chef was called, but we tried to build that in ourselves, and I would say we did a decent job of it. I actually think, you know, we put in a couple safeguards and stuff, like, you know,

00:48:29

Speaker

if in upgrade mode, you know, stagger this. And if not in upgrade mode, again, deploy all of them. So we had some clever tooling and everything there, but it was never to the point where I felt comfortable. I never got to the point where I felt comfortable giving the work to a junior engineer.
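The "stagger in upgrade mode, deploy everywhere otherwise" safeguard mentioned here can be sketched in a few lines of Python. This is an illustrative sketch only, not Chef code and not PagerDuty's tooling; `roll_out_config`, `apply_config`, `batch_size`, and `pause_seconds` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Iterable, List


def roll_out_config(
    nodes: Iterable[str],
    apply_config: Callable[[str], None],
    upgrade_mode: bool,
    batch_size: int = 1,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Apply a config change to every node and return the batches used.

    Hypothetical sketch: in upgrade mode, nodes are touched a few at a
    time with a pause between batches, so the whole ring never changes
    at once. Outside upgrade mode, all nodes go in a single batch.
    """
    node_list = list(nodes)
    if upgrade_mode:
        batches = [
            node_list[i:i + batch_size]
            for i in range(0, len(node_list), batch_size)
        ]
    else:
        batches = [node_list] if node_list else []

    for index, batch in enumerate(batches):
        for node in batch:
            apply_config(node)
        # Pause between batches only, not after the final one.
        if upgrade_mode and index < len(batches) - 1:
            sleep(pause_seconds)
    return batches
```

In practice the pause would more likely be a health gate (wait until the updated nodes report healthy) than a fixed sleep; the fixed pause just keeps the sketch short.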

00:48:51

Speaker

That was always kind of a threshold for me: when the work has been solved, or simplified enough to the point you can automate it a little more easily, where I'd trust a junior engineer with it. There's enough safeguards in place, enough testing, or test plans at least.

00:49:08

Speaker

And we didn't get there. I think we could have, but, you know, I made the business decision. I just didn't think it was worth it, because we weren't doing it often enough.

00:49:20

Speaker

But boy, did that create some brittleness. And that would create a lot of risk. And God help us, again, going back to regions failing and servers failing, every now and then those servers would die.

00:49:34

Speaker

And we'd have to create this artisanally crafted pet Cassandra server to reintroduce into the ring. And we would watch it overnight as the data was replicating across. And it's just...

00:49:48

Speaker

It was painful. It was very painful. And we did it because ultimately I do think we served our customers well. At the end of the day, the beauty of it, and the reason we made this move, was that we could do all this crazy stuff behind the scenes and customers never noticed. And it was great.

00:50:05

Speaker

Like, that was awesome. There was no performance impact. If we lost a server, hell, if we lost a region, we had designed PagerDuty's architecture to withstand full-out region failures, which, you know, at the time was pretty wild, and it wasn't as common as it is now.

00:50:23

Speaker

But that was pretty cool. And it happened, right? There were a couple of region failures and we were watching ourselves like, hey, still working. That's pretty cool. Yeah. Yeah, there's this, what is turning into a recurring theme, that if you have a fundamentally resilient architecture, you can get away with a lot.

00:50:41

Speaker

You can undertool or swallow risk around doing these upgrades, because if they go sideways, it's not customer impacting.

00:50:53

Speaker

It sucks for you, but, you know, that's fine. Maybe I can make the decision today that I'm going to hope to get lucky, and then I'll eat it next week. That's fine. That was my fault.

00:51:06

Speaker

And the thing is, that took me a while. And, you know, I was lucky that I worked at a couple of larger companies early in my career, because it, in a good way, skewed me towards this idea that

00:51:17

Speaker

really good resilient architectures are always in a constant state of failure. It's just a little bit, right? There's always something broken. And, you know, if you're a CFO, that's probably terrifying, but it's true. And I do believe this: when you have these very loose couplings between pieces of software at a company, at a business, and you can afford to have these little breakages here and there, I think you end up creating this amazingly resilient software that actually affords you a lot of risk taking, which, in my opinion, affords you a lot of innovation.

00:51:58

Speaker

And you can take these big swings. If you can, you know, upgrade servers blindly, and if it doesn't go, get rid of it. Okay, great.

00:52:09

Speaker

That's resilience right there, right? You can take these big swings in the context of upgrades. And I think we understand this pretty well for in-house software deployed to, let's say, Kubernetes or some other container scheduler. Like, okay, deploy the new software, spin up a whole new set of containers. Oh, not going? Okay, shut it all down.

00:52:33

Speaker

And, you know, we kind of take for granted that we can take these big swings now, but it's phenomenal that we can do some of these things. And that's because we always assume that, yes, something's broken. Okay, detect it, have the health checks running constantly, take those servers, those containers, those endpoints out, and you can take these big risks, which I think is really cool.
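The "assume something is broken, health-check constantly, take it out" loop can be sketched as a simple filter. This is a minimal illustration rather than any particular scheduler's or load balancer's behavior; `healthy_endpoints` and the `check` callback are hypothetical names.

```python
from typing import Callable, Iterable, List


def healthy_endpoints(
    endpoints: Iterable[str],
    check: Callable[[str], bool],
) -> List[str]:
    """Return only the endpoints that currently pass their health check.

    Hypothetical sketch: a check that raises is treated the same as a
    failed check, so a broken endpoint is pulled from rotation rather
    than taking the whole pass down with it.
    """
    healthy = []
    for endpoint in endpoints:
        try:
            ok = check(endpoint)
        except Exception:
            ok = False
        if ok:
            healthy.append(endpoint)
    return healthy
```

Running a pass like this on a short interval is what makes a blind upgrade tolerable: a bad node simply stops receiving traffic until it passes again.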

00:52:56

Speaker

But I also recognize, you know, it's taken us as an industry decades to get to this point for certain parts of the stack, and databases are hard.

00:53:07

Speaker

I don't think we're there yet for the persistence layer. I think we'll get there at some point. I don't know when. Yeah. But being able to do it for the other 80% or 90% of stuff, yeah. Like, that's amazing. That's amazing.

00:53:20

Speaker

Yeah. Yeah. And the nice thing is, we're thankfully not upgrading databases all the time, right? You don't have to do this. And, you know, it's usually because there are some new features. That was our desire at PagerDuty when we were upgrading Cassandra across minor versions.

00:53:39

Speaker

I put it in quotes; those were not minor versions. I will die on that hill, definitely. Okay, I want a story. What was one of those minor versions? Pre-Cassandra 1.0, I want to say we were going from 0.7 to 0.8, I think.

00:53:58

Speaker

And the entire config file had to be rewritten. I'm like, what the hell? This isn't backwards compatible. This is not complying with semantic versioning.

00:54:09

Speaker

I will give them credit. Once they got to 2.0, that's when I felt like, okay, 2.0, we're good. Yeah, that felt better. But even when Cassandra 1.0 came out years later,

00:54:21

Speaker

again, it was another big rewrite of the config file. And I still think they were a little cheeky with some of the minor revs. It wasn't actually as backwards compatible as I would have hoped. But I will give them full credit. After 2.0, I actually felt they did a really good job.

00:54:37

Speaker

It was 0.7 to 0.8. It was terrifying. And again, this is one of the cool things about working with these very, I don't know what the right word is, but just early communities. This is what I love about open source, when you can attach to a project that had a lot of momentum behind it, of course, and the maintainers and the community were fantastic.

00:55:07

Speaker

They'd be really wonderful to work with. And there were a couple of consultancies that also sprang up around Cassandra that were a delight to work with. There was just so much positive energy that it made the problem, I won't say fun, but I will say more tolerable. There was a little bit of shared suffering. So, you know, I'm complaining about the minor version revving, and everyone kind of giggled when we'd talk about this. Everyone got it.

00:55:46

Speaker

Yeah, I mean, you know this, but I was at a company who was also on Cassandra 0.something horrible. And yeah, it was gruesome to try and debug that thing.

00:56:00

Speaker

And again, going back to it, I will die on this hill: that was absolutely the right decision for PagerDuty at the time, to use it in this way, to use Cassandra as a queue, to...

00:56:12

Speaker

you know, abuse the crap out of these config files and do these upgrades. Why? Because we didn't have much of a choice.

00:56:22

Speaker

Yeah, my company was not so successful. So I think yours is a better proof point of the technology. So,

Transition from Cassandra to Kafka

00:56:29

Speaker

coming up on time: does PagerDuty still use Cassandra today?

00:56:35

Speaker

So there's actually, I think, a happy ending. PagerDuty deprecated, I want to say, all of it shortly before I left. So we're talking about five years ago, and everything moved over to Kafka. Oh, interesting.

00:56:52

Speaker

Kafka was introduced, I want to say, 2018? Yeah, maybe late 2017, 2018. That's right. It was so cool to watch what the folks around LinkedIn were doing with Kafka. And, funnily enough, we had actually wanted to use Kafka instead of Cassandra when we had chosen Cassandra.

00:57:14

Speaker

But we couldn't get the consistency guarantees out of Kafka. It just wasn't tunable enough. And that just changed over time. Kafka and the folks at LinkedIn were just doing amazing work.

00:57:25

Speaker

And so I don't remember the exact version, but very similarly, we had to do this big cutover. Now, the advantage, though, with the way that cutover was architected was we had taken almost like a multi-master, sorry, a multi-primary approach to that project,

00:57:42

Speaker

to that work, where we were writing to both the Cassandra queue and the Kafka queues at the same time. And, you know, effectively we were running parallel workflows on top of both.

00:57:57

Speaker

And so we were eventually just dropping stuff, because we didn't actually want, you know, multiple notifications going out to customers. We weren't doing that. But thankfully, we were doing everything but.
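Editor's sketch: the dual-write approach described here can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not PagerDuty's implementation. `InMemoryQueue`, `dual_write`, and `run_pipeline` are hypothetical names, and the in-memory lists stand in for the real Cassandra and Kafka backends. The point it shows: every event is written to both queues, both pipelines do all of the work, and only the live pipeline performs the final side effect.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InMemoryQueue:
    """Stand-in for a real queue backend (hypothetical; not a Cassandra or Kafka client)."""
    name: str
    items: List[str] = field(default_factory=list)

    def enqueue(self, event: str) -> None:
        self.items.append(event)

    def drain(self) -> List[str]:
        # Return everything queued so far and leave the queue empty.
        drained, self.items = self.items, []
        return drained


def dual_write(event: str, old: InMemoryQueue, new: InMemoryQueue) -> None:
    # Every event goes to both backends so both pipelines see identical input.
    old.enqueue(event)
    new.enqueue(event)


def run_pipeline(queue: InMemoryQueue, send: Callable[[str], None], live: bool) -> List[str]:
    """Process everything on the queue; only the live pipeline actually notifies."""
    processed = []
    for event in queue.drain():
        message = f"notify:{event}"  # the "everything but" work runs on both paths
        processed.append(message)
        if live:
            send(message)  # the customer-facing side effect happens on one path only
    return processed


old_q, new_q = InMemoryQueue("old"), InMemoryQueue("new")
sent: List[str] = []

for e in ["incident-1", "incident-2"]:
    dual_write(e, old_q, new_q)

old_out = run_pipeline(old_q, sent.append, live=True)
new_out = run_pipeline(new_q, sent.append, live=False)  # shadow path: output is dropped
```

Comparing `old_out` and `new_out` is what would build confidence before flipping which path is live. A real system would compare results asynchronously rather than in one process.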

00:58:09

Speaker

Yeah, everything but. And that just gave us a lot of confidence. So the final step was technically a cutover, but it didn't feel like one. It just felt like you're turning one off and turning one on, but we were slowly bleeding stuff over, and we had much more sophisticated tooling, like, we could partition customers. So only certain customers were on the newer system versus the old. And that was another thing that I'd learned as PagerDuty was getting bigger, you know, it was probably close to 10,000 customers, that,

00:58:45

Speaker

While scale increases complexity, like number of customers and requests, all that stuff, it also gives you a lot of optionality with upgrades.

00:58:56

Speaker

And it actually gives you really cool ways where you can partition your customer base. So for example, one of the things that I never fully was able to implement: you know, we had a free tier at PagerDuty.

00:59:12

Speaker

And what I always wanted to be able to do is, like, hey, these would be the customers that would actually get some of the upgrades first. And not because, oh, they're not paying us, screw them. It wasn't that. It was actually that these were, from a pure risk management standpoint, pretty low-risk customers. And it was always an easier conversation with them. Again, if we ever screwed up, I would always be the first one to call these ones, like, hey, really sorry.

00:59:41

Speaker

And then they'd be like, who the hell's calling us from PagerDuty? That's not an automated system. But that's a whole other set of funny stories. We had a vCard that we had sent out, and one of our office phone lines was in that vCard. So obviously most of our customers programmed their phones to, like, have a different alert tone.

00:59:58

Speaker

And so whenever I would reach out to a customer, I would tell them, hey, I'm going to call you at this specific time so you know that...

01:00:08

Speaker

I should have just called from my personal cell phone, but whatever. But what I found was they were very forgiving customers, especially in the context of us releasing new versions, upgrades, or anything, versus, you know, our Fortune 50 customers. They, rightly so, would not be so forgiving, because they can't afford to be. And that was something that, when I thought about upgrades, whether it was our databases, whether it was

01:00:35

Speaker

app versions, whether it was framework or whatever it was, it just gave us a lot more optionality with the way we could approach the upgrade. Versus when you only have 100 customers, you don't really have that as much, because all your customers are kind of equal and you don't really know which ones you can kind of piss off a little bit more versus ones you can apologize to and kind of clean it up.
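Editor's sketch: one way to picture the customer partitioning described above is a deterministic assignment of each customer to the old or new system, driven by a per-tier rollout percentage. This is only an illustration under assumptions. The tier names, the percentages, and the functions `bucket` and `on_new_system` are hypothetical and do not come from the conversation; the hashing uses Python's standard `hashlib`.

```python
import hashlib

# Hypothetical rollout percentages per risk tier: low-risk tiers go first.
ROLLOUT_PERCENT = {"free": 100, "standard": 10, "enterprise": 0}


def bucket(customer_id: str, buckets: int = 100) -> int:
    """Map a customer ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def on_new_system(customer_id: str, tier: str) -> bool:
    """Decide whether this customer is served by the upgraded system."""
    percent = ROLLOUT_PERCENT.get(tier, 0)  # unknown tiers stay on the old system
    return bucket(customer_id) < percent


# The same customer always lands on the same side, so behavior is repeatable.
free_customer = on_new_system("cust-123", "free")
big_customer = on_new_system("cust-123", "enterprise")
```

Raising a tier's percentage moves more of that tier over without reshuffling customers who were already on the new system, because each customer's bucket never changes.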

01:00:58

Speaker

And so that was kind of the fun part of getting more scale. And so, yeah, PagerDuty doesn't use Cassandra anymore. A lot of engineers did just some phenomenal work.

01:01:11

Speaker

Again, people much smarter than me, thankfully, worked on all this stuff and did some great work. We started using something that was a little bit more purpose-built for queues, and queues on top of queues. And the other fun thing with that migration, though, is it was such a relief culturally, because every engineer who ever had to touch the Cassandra queue knew it was brutal, knew that anytime you were going to upgrade it, there was always that magic incantation you felt like you had to do.

01:01:47

Speaker

And it just felt like a wave of relief came over the entire engineering organization when those final Cassandra queues were deprecated. Yeah, that makes a ton of sense. I had this moment where I was like, I should start a company where we measure engineering productivity.

01:02:05

Speaker

And you have to measure that. You have to be able to measure that feeling of: there's not a landmine that I could step on accidentally over here.

01:02:17

Speaker

I have no idea how to

Legacy Software and Innovation

01:02:18

Speaker

measure it, but it's real, like super real. And the thing is, that fear that comes with touching the Cassandra queue, with touching that part of the code base, with touching that service, it's so much friction in the way of innovation.

01:02:42

Speaker

And I actually get really frustrated when people call it legacy code. I've shared this joke with you before, but I was on a panel once, and one of the panelists, I forget the guy's name, well, I had said legacy software and he stopped me and says, no, call it legendary software. Give it the respect

01:03:06

Speaker

that it deserves, recognize that probably a lot of your revenue is coming from that legendary software. And yeah, it's scary to touch. But the thing is, we all have that.

01:03:17

Speaker

Every company, once you get beyond a pretty small size, you have that piece of software that everyone's scared to touch. And I think it just wreaks havoc on productivity, wreaks havoc on innovation.

01:03:27

Speaker

And this is where I do think upgrades are important. You don't want to be stuck constantly engineering around something when it's like, well, maybe you just need to upgrade it, or maybe, you know, in PagerDuty's case, maybe you have to deprecate it.

01:03:40

Speaker

And the upgrade is actually a long-scale, long-term deprecation of a piece of technology. But it's still an upgrade, and it's still something that you have to do.

01:03:52

Speaker

I think we could have kept running Cassandra queues for a while longer. It could have. It clearly worked. It worked for a long time. It truly did. And that was another interesting thing. I remember as we were looking at the upgrade to Kafka,

01:04:09

Speaker

one of the real tangible arguments we had was, hey, the thing is, it is working. Somehow, even though it shouldn't be, it's still working. Is it still worth it? And, you know, another argument for the upgrade, which I totally was for, was, look, Kafka is increasingly becoming a common technology, so we're going to be able to hire engineers that know it more than queues on Cassandra. Cassandra, again, is well-established technology, we can hire people for it, but queues on Cassandra, that's that artisanal stuff that, you know, is a little bit harder to hire for.

01:04:45

Speaker

Yeah. Yeah, absolutely. I heard that in one of the first episodes here, where we talked about upgrading from CoffeeScript, and a non-trivial part of that justification was, like, I can't hire a front-end engineer in 2018 who wants to work in CoffeeScript.

01:05:00

Speaker

Yeah, you can't. I mean, you know, my favorite punching bag for that right now is COBOL, where there are so many critical, critical systems in this country that run on COBOL.

01:05:15

Speaker

And,

AI's Role in Managing Complexity

01:05:18

Speaker

but, you know, there's only like a dozen. Yeah. Really, the expertise is literally dying out. Exactly. And this is where, I know we're supposed to talk about upgrades, but I do get excited about some of the AI stuff in this part of the upgrade story. And again, I know it's still early, but this idea that, hey, I have this COBOL code base.

01:05:48

Speaker

Can you translate all this business logic into insert modern language framework, whatever? And even if you don't end up deploying that thing into production, I do think from like a learning and understanding standpoint for for software engineering, I think there's a lot of value in that.

01:06:05

Speaker

Yes. I think, you know, we're picking on COBOL, but there's a lot like that. Even in my lifetime, I wrote stuff 20-plus years ago in, like, C or Java or, hell, Perl Mason templating that if I were to read that now, I probably wouldn't be able to understand it. So if I could take that code,

01:06:24

Speaker

have some AI agent, bot, whatever, rewrite that into a language that I am just more familiar with right now, I actually would be more comfortable than diving back into that code base in the original language, just because I could build up that context in my head more rapidly.

01:06:41

Speaker

Yeah, I absolutely agree with you. I think this is the complement to what you were saying around the reason upgrades take longer than they should. Really, not just upgrades, right? Any software engineering project gets misestimated.

01:06:56

Speaker

It's because there's complexity there that you didn't realize was there. And even knowing that it's there, we still underestimate how much complexity there is. Yeah. That's the thing. And what's worse, it's like, as you're trying to measure the complexity, it keeps increasing.

01:07:09

Speaker

Yeah. So as you're planning out the upgrade project, it keeps increasing in complexity, literally as you are trying to measure it. So it's like, you know, you're measuring a moving target. Oh, it's so hard. It's so hard. Frustrating. But if you can tool up, and I think AI has a huge opportunity here, if you can tool up your understanding faster, like, we are not going to do this waterfall. You are not going to have a perfect plan on day one.

01:07:38

Speaker

But if you can always close that gap when you find a new pocket of uncertainty and say like, okay, now I know what's going on there. As soon as it it enters your brain of like, oh, there's something over there that we don't know about because we saw an error or someone pointed it out.

01:07:51

Speaker

If you can close that gap really quickly and come up with a plan and update your existing plan and and the existing code that you have, the faster you can close that gap. I think that you really do get into a place where it doesn't feel like it's constantly getting away from you.

01:08:05

Speaker

Yeah. And, you know, to add to that, I also think if you can reduce the cost of the measurement and be able to figure out, okay, where am I, you know, if we think about the moving target, if it costs less and you can measure the moving target faster, I think that makes it easier too.

01:08:24

Speaker

Yeah. This is where I think some of these AI products out there right now are getting pretty darn good at this. And it's exciting to see how, you know, what once was a bit of an art, like, I've always had,

01:08:38

Speaker

at every point of my career, I've had these brilliant engineers that could just very quickly look at a code base and size it up, and were actually pretty accurate on what the upgrade estimates would be.

01:08:51

Speaker

And I don't know, I think we're not that far off from that being more accessible via these AI products. Yeah. Yeah, absolutely.

Automation and Customer Management

01:09:03

Speaker

um All right. I want to wrap up and do a couple of quick fire questions. um So I am going to, I'll ask a question or say something and just give me the first thing that comes to the top of your head.

01:09:14

Speaker

What is one thing during one of these projects that you tried to automate that you couldn't automate, that would have made all the difference? So going back to the Cassandra example, if we had had the ability to automate the staggered rollout of configuration files, that would have made a huge difference. Yeah. But the automation framework we were using didn't support that, and

01:09:52

Speaker

Truthfully, I think we were also just so culturally used to artisanally crafted pet servers. So I don't know if like we even would have used it, but yeah, that would have been it. The ability to automate ah staggered rollout of configuration files.
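Editor's sketch: a staggered rollout of configuration files, as wished for above, might look something like this. It is a rough sketch under assumptions, not the tooling discussed in the episode. `waves`, `staggered_rollout`, and the `apply_config` and `healthy` callbacks are hypothetical placeholders for whatever a real automation framework would provide. The idea shown: push the config to a small wave of hosts, check health, and only then continue to larger waves.

```python
from typing import Callable, Iterable, List, Sequence


def waves(hosts: Sequence[str], sizes: Iterable[int]) -> List[List[str]]:
    """Split hosts into waves of the given sizes; leftover hosts form a final wave."""
    result: List[List[str]] = []
    start = 0
    for size in sizes:
        if start >= len(hosts):
            break
        result.append(list(hosts[start:start + size]))
        start += size
    if start < len(hosts):
        result.append(list(hosts[start:]))
    return result


def staggered_rollout(
    hosts: Sequence[str],
    sizes: Iterable[int],
    apply_config: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Apply config wave by wave; stop after the first wave that is not fully healthy."""
    updated: List[str] = []
    for wave in waves(hosts, sizes):
        for host in wave:
            apply_config(host)
            updated.append(host)
        if not all(healthy(h) for h in wave):
            break  # halt so the remaining hosts keep the old config
    return updated


nodes = [f"node{i}" for i in range(1, 7)]
applied: List[str] = []

# Pretend node3 fails its health check after receiving the new config.
touched = staggered_rollout(nodes, [1, 2], applied.append, lambda h: h != "node3")
```

Here the rollout stops after the second wave, so `node4` through `node6` never receive the change. A real rollout would also want a rollback step for the hosts already touched.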

01:10:08

Speaker

Yeah, that seems like it would have been useful. Oh, well. um What was... Which customer, internal or external, did you piss off the most?

01:10:26

Speaker

I can't. Oh, man. um

01:10:32

Speaker

So this is a bit of inside baseball. So I joined PagerDuty after having worked at Netflix. and so I was a customer of PagerDuty at Netflix.

01:10:46

Speaker

And Netflix was a very early PagerDuty customer. And boy, did we piss them off once I was at PagerDuty. And you would have thought that, you know, I had some relationships there. I knew some of my former colleagues.

01:11:01

Speaker

It did not matter. They still treated us like a vendor, as they should have. And, you know, Netflix was a wonderful customer in so many ways. But when we screwed up, they let us know we screwed up, and it made us better.

01:11:21

Speaker

It absolutely made us better. Netflix definitely helped make PagerDuty better, especially in those early days. But, oh boy, I dreaded some of those calls.

01:11:35

Speaker

I love that you switched sides of the table and were immediately like, wait, wait a minute. Yeah, so on the other side of the table, I'm like, oh my God, why? It's like, you know, there are those negotiation tactics where you sit someone in a lower chair. Like, why is my chair so much lower all of a sudden? Why does it smell kind of funny on this side of the table? Why am I at such a disadvantage all of a sudden?

01:12:02

Speaker

The power differential is real. It's

Arup's Current Focus and Personal Interests

01:12:05

Speaker

just a huge power differential and a different power dynamic from just literally switching sides of the table. I love it. And because we're here, you know, I appreciate you humoring me and telling me about all this stuff. What are you up to these days?

01:12:20

Speaker

These days? Gosh. So I spend my time professionally doing two main things. The first is, you know, I alluded to this earlier, I work with a lot of early-stage startups, angel investing, particularly in the dev tools and security space, which, you know, has become an AI space now.

01:12:40

Speaker

And it's just amazing to work with early-stage companies and founders. What I tell people about angel investing is it's a terrible way to make money. It's a wonderful way to get inspired and meet new people.

01:12:54

Speaker

And it's an expensive way to do it, but that's why I continue to do it, is because I get to meet this next generation of founders and people moving the needle in this area that's very near and dear to my heart.

01:13:11

Speaker

um Professionally as well, I work with a company called Fortlight and we provide management, coaching and consulting to early stage startups. um The way that I pitch Fortlight is when I was first becoming a manager, especially at PagerDuty.

01:13:24

Speaker

This is all the stuff I wish I had known back then. And so, you know, my boss, Amy Riley, she was an early employee at Hotel Tonight and scaled their entire North America operations. And she started this firm because of similar desires. We want managers to do great things, and working as a manager at a startup is such a unique,

01:13:48

Speaker

horrifying, challenging, rewarding problem, because there's no manual. There's no textbook. There's no MBA program that teaches you how to do it. You learn by literally hearing, hey, catch.

01:14:00

Speaker

That's it. Boom. That's learning. And so we're trying to make that a little bit easier at Fort Light. And outside of that, I spend a lot of time on my hobbies. I've got my keyboard right there. I've got a guitar, a banjo right here as well. And I do a lot of community theater around the Bay Area. So, you know, if you're really desperate for something to do, you can always see me on stage singing.

01:14:24

Speaker

Yeah, belting it out. I've seen the posts on LinkedIn, which is, I think, the appropriate way to promote community theater. You know, what's so funny is I remember when I was working full-time and, you know, there'd be icebreakers of, like, oh, tell us something we don't know about you. You know, T.R., you and I have known each other for 30-plus years. You saw me throughout my entire musical theater career, in high school all the way to now.

01:14:52

Speaker

And then a hiatus of 20 years. Yeah. But I'd tell people, like, oh, yeah, I used to do musical theater. And it was always hilarious to see the different reactions from people, because some people,

01:15:07

Speaker

they're like, oh, yeah, I could totally see you doing it and everything, especially folks that had a little bit more of a creative side, or a creative or performing upbringing.

01:15:18

Speaker

And then I'd have so many of my engineers who'd just look at me, just so confused, like, wait, what? You would memorize music? What are you talking about? And it was great. And so I've been very lucky that, you know, since getting back into it the last couple of years, I'll post to LinkedIn that I'm doing a show, and a lot of my former and current colleagues will come see my shows, and they see a very different side of Arup, which is very cool to share. That's fun.

01:15:50

Speaker

That's super fun. All right. Last question. Where can people find you on the internet if they want to ask questions or get in touch with you? So if you go to arupchak.com or arupchakrabarti.com, I own both domains.

01:16:03

Speaker

So that'll get you all my contact information. Otherwise, LinkedIn, just search Arup Chakrabarti. There's not many of us. It's me and one other guy in India with the exact same spelling. So you have a 50-50 chance of getting the right one.

01:16:16

Speaker

I will dig up the link and we'll put it in the video description, to make sure we get that right. Thankfully, my Google SEO is very good, so it's easy to find me online.

01:16:26

Speaker

I beat him.

01:16:29

Speaker

All right. Thanks so much, Arup. My pleasure, T.R. This was an absolute blast, man.