
Upgrading Postgres: 5 Versions Behind, 4 Databases to Merge | Ep. 03

Tern Stories

When Postgres 11 approached end-of-life, a major travel company brought in site reliability engineer Ninad Pundalik to lead the upgrade. The setup: multiple Ruby on Rails apps, with multiple Postgres 11 instances on AWS RDS. No in-house infra team.  

No clear ownership. Some tables still in use, others long forgotten.  

Despite the uncertainty, the migration went smoothly. The team cut over cleanly, avoided downtime, and ended up with a simpler, more reliable setup than they started with.  

Get Tern Stories in your inbox: https://tern.sh/youtube

Transcript

Postgres Migration Overview

00:00:00
Speaker
In today's episode of Tern Stories, I talk with Ninad Pundalik about a Postgres migration that could have been a nightmare, but it wasn't. You might think that upgrading Postgres from 11 to 16 is a series of small, challenging migrations, but that's not what they did.
00:00:13
Speaker
They spun up a new Postgres 16 instance and migrated the data over in a single shot. That strategy gave them a rare opportunity. They mapped out exactly what data was important to the business, deleted all of the legacy and useless data, and tuned the new database for Postgres 16 without carrying over all of the operational cruft of the old system.
00:00:33
Speaker
It's a great strategy. This episode is heavy on details, deep in the stack, and in production.

Introduction of Guest Speaker: Ninad Pundalik

00:00:39
Speaker
Enjoy. Hey, everyone, and welcome back to the Tern Podcast. Today, we've got a great episode with Ninad Pundalik.
00:00:46
Speaker
He's currently a site reliability engineer at LiveKit. He's worked at companies like Dite, Slack, Astro, and Microsoft, across the entire stack. And today we're going to talk about his experience migrating
00:01:02
Speaker
some databases, and maybe if we have time, we'll get into some Kubernetes and containerization

Consulting Role & Company Setup

00:01:10
Speaker
work as well. So welcome, Ninad. I'm so happy to be here. It's always fun to chat about work.
00:01:17
Speaker
So here we are. Right on. OK, so let's dive right into it. We had been chatting earlier about this Postgres upgrade that you had tackled.
00:01:29
Speaker
Before we get into what you did, set the stage for me a little bit. What stage was the company at? What was the background? How were things when you arrived?
00:01:41
Speaker
Yeah, so I participated in this migration as a consultant, and the company is in the travel space.
00:01:52
Speaker
They help folks with booking hotels, flights, and other related travel services. The website they have is another channel for their business, not their sole channel. So it is an important service for them, but it is not a critical, this-site-should-never-go-down sort of service.
00:02:16
Speaker
which is important context for later on. The setup they had been using for a while was a bunch of Ruby on Rails applications which talked to a Postgres database, and those Rails applications were running on a Kubernetes cluster.
00:02:35
Speaker
And because the team did not have in-house expertise for development, they had hired a consulting firm who had initially set up the entire project for them.
00:02:47
Speaker
And as time passes, versions change, like seasons change. Eventually the database and Kubernetes versions both reached end of support. That's where me and a couple of other folks came in and tried to help figure that stuff out for them.

Necessity and Strategy for Migration

00:03:03
Speaker
The first thing that we'll talk about is upgrading Postgres. The Postgres they were using was an AWS RDS instance, and it was a Postgres 11 instance.
00:03:15
Speaker
It was a single node with... wait, it wasn't actually a single node. There were three or four different nodes, and each node had individual databases on it. And there were good reasons for splitting the workload across multiple databases when the original site had been written.
00:03:33
Speaker
The performance patterns for a couple of workloads were very spiky. The other stuff was relatively straightforward REST API traffic, hit by humans.
00:03:45
Speaker
The spiky stuff was hit by... well, because this is the travel industry, they end up interfacing with a bunch of third-party APIs, and with Google, as always, for advertising and traffic inflow.
00:03:59
Speaker
And as a result, that part of the API traffic ended up being fairly spiky. So at some point the team chose to run it as a separate workload, in a separate database.
00:04:11
Speaker
But because there were four nodes, all of them costing a decent amount, we decided to play around with those numbers a bit. We went for a slightly larger instance and ended up migrating all of the data from the three or four different instances onto a single node.

Migration Process and Challenges

00:04:28
Speaker
And for the couple of months that I was part of the engagement after the migration completed, the node was capable enough of handling everything. Version-wise, we went from Postgres 11 to Postgres 16.
00:04:41
Speaker
So I am pretty sure we had a bunch of performance improvements just from that upgrade, which allowed us to co-locate all of these things. It also helped with another thing: previously we had one database for each workload.
00:04:56
Speaker
Now we could have a standby replica, a cold replica, in case of a failure of the primary node. So the costs didn't go down significantly.
00:05:07
Speaker
The costs went down a little bit, but we got much better reliability for the entire setup. So that's the stage. Where do we want to dig in?
00:05:18
Speaker
Yeah. So you were brought in as consultants. Were there any engineers at the company who owned this setup or had familiarity with it?
00:05:32
Speaker
Or was this something where you were coming in pretty cold and had to pick it up? It was a lukewarm situation. The team had a couple of folks who were from an engineering background.
00:05:48
Speaker
Essentially, even though this was a boutique travel agency, they had a CTO role, and that person had a good amount of familiarity with the code itself, the application code.
00:06:00
Speaker
But then there were a bunch of database tables that were used and a bunch of database tables that were not used, and that was all tribal knowledge that had been lost.
00:06:11
Speaker
So we had to rediscover it and figure out: hey, this is safe to kill, or there are absolutely no queries here, or this table actually still has traffic coming in. And we took calls based on that.
00:06:24
Speaker
So, yes, it was a mixture. Some stuff we had support on; other stuff we had to figure out on the go. Got it. Yeah, a little bit of knowledge helps. So what pressure were they feeling to upgrade?
00:06:41
Speaker
Was there a single reason, or were there multiple reasons that all kind of came together? The primary and probably the only reason for them to start this entire effort was that Postgres 11, and this is late 2024 that we're talking about, was going to go out of support, both from the Postgres project itself and from RDS, which was actually hosting the database for them.
00:07:13
Speaker
And after a certain point of time, if anything happened to the database, AWS, the cloud provider here, would rightly not be able to help them fix anything.
00:07:25
Speaker
So that was the only reason they started this entire thing. The cost savings and the performance improvements that we talked about briefly a little while ago were a side effect of the entire effort.
00:07:39
Speaker
Nice to have. Yeah, nice to have, but they were not the goal. Yeah, the goal was: prevent Amazon from turning off the instance because it's end-of-life. Yeah, essentially. That drives everything.
00:07:50
Speaker
Amazing. So walk me through the process and the major milestones. You had five versions to upgrade. How did you approach that span of an upgrade?
00:08:04
Speaker
I was fortunate that just a few months before taking on this assignment, we had done a similar migration at my previous employer, Dite, where we had been running Postgres 11 and did a Postgres 15 migration.
00:08:23
Speaker
As part of that effort, although I did not execute on that project directly, a teammate had taken care of a bunch of that work, and I was with that person for all of the planning and feedback conversations and the prototyping.
00:08:39
Speaker
So I had a fair idea that doing a versioned, one-version-at-a-time, in-place upgrade was going to be super painful and not worth anyone's time.
00:08:53
Speaker
It's usually just easier to spin up another instance, figure out how to replicate data from the old instance to the new instance, and then figure out an application cutover plan.
00:09:05
Speaker
And that's essentially what we did both at Dite and at the company I was consulting with. The first item on the checklist was versions: each database ends up having a matrix of what versions are supported, and in that matrix they also say, hey, you can upgrade from A to B, but you cannot upgrade from one to two or two to three.
00:09:37
Speaker
So you need to get to the appropriate version numbers first. Fortunately, because RDS is a hosted database provider, they have generally taken good care of keeping databases at a reasonably recent version, the minor versions, not the major versions.
00:09:55
Speaker
So we basically came in and checked: is the database at the right minor version that we need to be at? Once we did that, we took a clone of the database and ran an in-place upgrade of the clone to get an estimate of the time duration.
00:10:13
Speaker
And it, of course, validated our earlier hunch that it's probably not a great idea to spin up a copy from a snapshot and then upgrade that in place.
00:10:27
Speaker
It's just easier to have a fresh node. That also helps with fixing a couple of other things. There's always technical debt in any setup, and a fresh instance helps make sure that all of the database parameters, for buffers and query behavior, are relevant to the new major version you're going to. Major versions are called major versions because these parameters and behaviors change significantly enough that you want to revisit the assumptions behind them.
00:11:03
Speaker
And the other reason we wanted to create a new, empty database at the new version and then replicate data over was this: historically, a lot of the systems had been set up with infrastructure as code, but that entire repository had rotted.
00:11:23
Speaker
There was a lot of bit rot, and there was essentially no way we could reuse any of that code to build out the new infrastructure for the database and the applications.
00:11:35
Speaker
So we elected to spin up new databases entirely. It's easier when you have a net-new application rather than trying to clone an existing thing.
00:11:47
Speaker
So we essentially spun up new infrastructure using AWS CDK in TypeScript, because that made the most sense for the team we were working with.
00:11:57
Speaker
What is CDK? AWS CDK is the Cloud Development Kit. I wouldn't call it a wrapper; it is a generator for CloudFormation.
00:12:15
Speaker
It lets you write code that looks procedural and then generates CloudFormation templates out of it.
00:12:25
Speaker
And those CloudFormation templates are still the declarative infrastructure as code that we're all used to. Terraform, Pulumi, a bunch of these are all declarative at the end of the day.
00:12:37
Speaker
Sorry, I'm not sure about Pulumi, but the others are definitely declarative. And you can use this to build out infrastructure. And CDK as a
00:12:49
Speaker
project supports multiple languages to write in. There's TypeScript, Python, Go, Java, and a couple of others that I don't recall right now.
00:13:01
Speaker
But TypeScript is sort of the first-class language, in the sense that the internal AWS CDK team develops in JavaScript and TypeScript first.
00:13:13
Speaker
Then for all of the other languages, they use foreign function bindings, FFIs, to generate the SDKs for those languages. Officially, all of the other languages are still first class, but it gives them a second-class look and feel.
00:13:35
Speaker
So TypeScript made the most sense. And coming back to where we were: we checked the versions, and after the versions were checked, we started setting up the new infrastructure repository and used CDK to actually spin up the new database.
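A minimal sketch of what that might look like in CDK TypeScript. Only the use of CDK in TypeScript and the Postgres 16 target come from the episode; the construct names, VPC, instance class, and sizes are illustrative assumptions:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'PgMigrationStack');

// In the real setup this would reference the existing VPC;
// a fresh one keeps the sketch self-contained.
const vpc = new ec2.Vpc(stack, 'Vpc');

// The new Postgres 16 target, sized to absorb all of the old workloads.
new rds.DatabaseInstance(stack, 'Postgres16', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16,
  }),
  vpc,
  // x86 at migration time; the move to Graviton came later in the story.
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.XLARGE),
  multiAz: true, // provides a standby along the lines of the cold replica mentioned earlier
  allocatedStorage: 200, // illustrative
});
```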
00:13:50
Speaker
We took that opportunity to revisit all of the parameters and look at which parameters were necessary for which workloads. This is where we realized that different databases and workloads have different characteristics, and tuning had been done previously to support that.
00:14:07
Speaker
So we copied that over; we figured out how to do that. In addition, for replication, Postgres recommends you set additional parameters which help a lot with managing the write-ahead log, specifically for a database replication scenario.
00:14:28
Speaker
That increases the slots available for replication, and so on. So you set up your first line of parameters, the application-required ones, and then a second line of parameters that are only necessary for the period of the data copy. RDS will merge them all and apply them to the DB, and once your migration is done, you can drop the replication-specific stuff but keep the application-specific stuff.
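A sketch of those two "lines" of parameters as a CDK parameter group. The parameter names are the standard RDS/Postgres ones; the values and grouping are illustrative assumptions:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';

const stack = new cdk.Stack(new cdk.App(), 'ParamsStack');

const engine = rds.DatabaseInstanceEngine.postgres({
  version: rds.PostgresEngineVersion.VER_16,
});

new rds.ParameterGroup(stack, 'MigrationParams', {
  engine,
  parameters: {
    // First line: application-required tuning (value illustrative).
    work_mem: '65536',
    // Second line: only needed while the data copy is running.
    // On RDS, `rds.logical_replication` is how wal_level=logical is enabled.
    'rds.logical_replication': '1',
    max_replication_slots: '10',
    max_wal_senders: '10',
  },
});
```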
00:14:55
Speaker
So we set up the DB, and then we did a bunch of copies from the different source workloads. And as with most setups, the staging or lower environment is not always in a great state.
00:15:16
Speaker
And cost is always a much more significant factor in the staging setup compared to production. So staging had one instance with all of the workloads, and production had four instances with four workloads.
00:15:28
Speaker
So that made replication slightly tricky to figure out. Also, the instructions for setting up instance-to-instance replication are a little confusing.
00:15:39
Speaker
AWS has its own native service called DMS, the Database Migration Service, which in the background still relies on Postgres copying in a couple of different forms.
00:15:53
Speaker
But that is also relatively opaque. It's not easy to troubleshoot when stuff breaks; you essentially have to restart the entire process, which does not give you a lot of control.
00:16:08
Speaker
So we looked at the options, and we used a Postgres-native approach for replication.
00:16:19
Speaker
On the source nodes we set up publishers, and on the target node we had the subscribers, and the data got replicated over.
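In Postgres terms that's logical replication: a publication on each Postgres 11 source and a subscription on the Postgres 16 target. A rough sketch, with object names and connection details as placeholders:

```typescript
import { Client } from 'pg';

async function setUpReplication(sourceUrl: string, targetUrl: string): Promise<void> {
  // On the source: publish every table.
  const source = new Client({ connectionString: sourceUrl });
  await source.connect();
  await source.query('CREATE PUBLICATION travel_pub FOR ALL TABLES;');
  await source.end();

  // On the target: creating the subscription copies the existing rows,
  // then streams ongoing changes until cutover.
  const target = new Client({ connectionString: targetUrl });
  await target.connect();
  await target.query(`
    CREATE SUBSCRIPTION travel_sub
    CONNECTION 'host=old-db.internal dbname=app user=replicator password=...'
    PUBLICATION travel_pub;
  `);
  await target.end();
}
```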

Verification and AI Tools in Migration

00:16:33
Speaker
Staging happened in a matter of a couple of hours. Production, of course, ended up taking longer, but that's OK. In staging, we also took the easy way out and just made sure the counts of rows in various tables were similar between the source and the target.
00:16:54
Speaker
That was mostly: run a simple query in a SQL shell and eyeball the numbers, and hey, if these are approximately the same, this is good. In production, we replicated the same process, which is: use CDK to spin up a production copy of the environment, set up the instance, then manually set up the publication and subscription for replication.
00:17:18
Speaker
But when the replication was close to finishing, I ended up working on a bunch of utility scripts that would connect to the source DB and the target DB and run queries for a bunch of different parameters and metadata.
00:17:34
Speaker
It would, of course, look at the number of rows, because that is the primary indicator that a database has been copied over. But we also looked at the space taken by each table here and there, and initially that sounds like a great parameter to compare across databases.
00:17:56
Speaker
When you're running the same versions, and this is probably foreshadowing, those numbers are going to be super close to each other. But when you're migrating across versions, we realized those numbers are going to be in the same ballpark, but they can be different enough that you notice.
00:18:15
Speaker
They're not going to be a small difference; they're going to differ enough that you want to stop paying attention to them. So you look at the counts of rows, and you look at the write-ahead log.
00:18:33
Speaker
That was the thing. There's a pointer on the write-ahead log, and you compare that the pointers on the source and the target DB are at the same position. So this script essentially queried: hey, give me the status on this node, give me the status on that node, and tell me if they are the same number.
00:18:50
Speaker
It's easy to do that in staging, but when you're doing it in production and you've got a downtime window, you don't want to be running queries manually. So I just ran this script, and it was set up so that it keeps printing the output until the numbers match, and then it exits.
00:19:11
Speaker
That's the sign that tells me: hey, the replication has caught up; I can now cut the application over from the old database to the new DB. So that's something that helped.
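As he mentions in a moment, the real script was written in Go; this TypeScript sketch shows the same polling idea for consistency with the other examples here. The subscription name and polling interval are assumptions:

```typescript
import { Client } from 'pg';

// Poll until the subscriber has consumed everything the publisher has written.
async function waitForCatchUp(sourceUrl: string, targetUrl: string): Promise<void> {
  const source = new Client({ connectionString: sourceUrl });
  const target = new Client({ connectionString: targetUrl });
  await source.connect();
  await target.connect();
  try {
    for (;;) {
      const src = await source.query('SELECT pg_current_wal_lsn()::text AS lsn');
      const tgt = await target.query(
        "SELECT received_lsn::text AS lsn FROM pg_stat_subscription WHERE subname = 'travel_sub'",
      );
      // Ask the source to compute the byte gap between the two WAL positions.
      const lag = await source.query(
        'SELECT pg_wal_lsn_diff($1::pg_lsn, $2::pg_lsn) AS bytes',
        [src.rows[0].lsn, tgt.rows[0].lsn],
      );
      console.log(`source=${src.rows[0].lsn} target=${tgt.rows[0].lsn} lag=${lag.rows[0].bytes}`);
      if (Number(lag.rows[0].bytes) <= 0) return; // caught up: safe to cut over
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }
  } finally {
    await source.end();
    await target.end();
  }
}
```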
00:19:22
Speaker
And something that's super relevant for Tern over here: this particular script was my first proper experiment with LLMs for coding. I played around with ChatGPT and Claude, and Claude gave me a bunch of help and support in making sure this script worked.
00:19:41
Speaker
Although all of the other stuff was written entirely in TypeScript, we chose Go for this particular case because it was super easy to build a static binary that I could just ship to one of the containers running in the VPC.
00:19:58
Speaker
And I did not have to open up the database to the entire world just to check the status of the replication. Gotcha. That makes sense.
00:20:11
Speaker
How was your experience with... I think writing tooling for any kind of migration is a critical step to build your confidence there. I want to ask about the LLMs, but before that, I think it's really interesting to think about which numbers have to match before and after to validate that it's working.
00:20:32
Speaker
Did you find that it was just time you needed to get the numbers to match, or did you run into situations where there was a setup issue and the number of rows never converged?
00:20:45
Speaker
There were cases where the numbers did not converge. On this particular assignment, a couple of those happened because
00:20:56
Speaker
the tables had data in an incorrect encoding or something of that sort. I'm sorry, I don't recall the exact specifics, but a row was essentially not getting copied over from here to there.
00:21:09
Speaker
And after inspecting those particular rows manually, we realized it was essentially test data that was not going to be relevant for the actual end users. It was still in production, but it was test data in production.
00:21:25
Speaker
So we chose to clean that stuff up in the source table and then run the migration again. That worked there. At the previous company, at Dite, we did have a couple of other cases where the numbers did not match for a little while.
00:21:45
Speaker
And if I remember correctly, at least one of them was not an encoding mismatch, but still in the same space: a data type mismatch. That was probably because the application code running at the time the row was inserted accepted it, but the newer database version was stricter about that validation and said, hey, this is not going to work. No bueno.
00:22:13
Speaker
So we had to fix those rows in the source DB and then copy stuff over. Got it. Was it obvious from just looking at the rows what the fix was, or did you have to go dig into old application code or otherwise find context?

Data Management and Efficiency

00:22:30
Speaker
At Dite, we did have to work with a couple of the application engineers to figure out what wasn't making sense and then fix the rows.
00:22:44
Speaker
At the consulting assignment, it was fairly obvious that, hey, this is test data, because the names of some entries in the rows made it clear this was not an actual human, just a testing thing in there.
00:23:01
Speaker
Got it. So you really had to go find that organizational knowledge about what the data in the database was before you could either safely delete it or migrate it.
00:23:11
Speaker
Yep, yep. Another thing that relates to this, and it's probably going to be a tangent, so feel free to cut me off: we found that there are a couple of tables, and this was the case in both scenarios, with a ton of data that is essentially analytics or usage tracking data
00:23:31
Speaker
that almost nobody ever looks at. All of that is built out with good intentions. At some point you want to track what's happening, and that data may also have been used at some point for a particular feature.
00:23:47
Speaker
But the feature got dropped and everything did not get cleaned up. So you're just carrying around a ton of baggage in the database, consuming resources in CPU, memory, and disk.
00:23:58
Speaker
So it's just easier to either prune that data completely or drop the table entirely, depending on the situation. But that, again, is knowledge that is not going to be obvious just from looking at the system. The decision of should we prune this, should we drop this, or do we need to keep this is something that humans know.
00:24:21
Speaker
It's not going to be something that a system can tell us. Yeah, the table is in the database, you're migrating the database, you should migrate the table, but that's not always true.
00:24:34
Speaker
How did you think about deploying the changes as you went? You mentioned eventually taking a downtime window to do the cutover. As you were working towards that cutover, how did you think about locking in progress and continuously working, so that you didn't arrive at that point and have to start from scratch?
00:24:58
Speaker
I'm going to try to answer this. Please tell me if I'm going in the right direction.
00:25:07
Speaker
As we approached the cutover, a couple of things were super interesting for us: getting a sense of which applications were only reading data and which applications were writing data. For the stuff that was only reading data, you could possibly spin up a second copy. You essentially do a blue-green deploy with the new database configuration and make sure that nothing in the logs is obviously wrong.
00:25:37
Speaker
That gives you some level of confidence that, hey, the migration is actually working, the database is up to the mark. You don't necessarily want customer traffic directed at it, in the sense that you do want to send the traffic in, but you don't want to send the responses out to the customers.
00:25:58
Speaker
Essentially, shadow reads and writes are what you want to do, and then compare that data. But that requires fairly mature infrastructure.
00:26:08
Speaker
So it may not always be present, and you have to take your best bet with the applications that only read data: you take a small risk and get a sense of, hey, this is working or this is not working.
00:26:25
Speaker
And at that point you have to rely on your regular deployment tooling to quickly roll forward or roll back as required. That's at least what I found comfortable enough.
00:26:37
Speaker
Make your changes small enough, and make them easy to deploy or roll back. When you have that, you can run in read-only mode for a while and monitor that everything is okay. Then, as you get close to the downtime window, you look at all of your writer applications, well, you've actually looked at them beforehand, and you say: hey, this one needs to be a singleton during the cutover period, to ensure that while data is being replicated from one node to the other
00:27:12
Speaker
you don't hit trouble. When you're looking at something like a Kubernetes deployment for that service, you will always have N replicas of that pod, and you can get into a situation where some are talking to the old database and some are talking to the new database while the deploy is happening.
00:27:26
Speaker
And you will have inconsistent reads of the data at that point. So you can choose to bring the count of those replicas down to one, and also make sure that the thread counts or database connection pool counts are fairly predictable, or low enough, in this situation.
00:27:45
Speaker
And you monitor that. That's also going to cause issues for customers, but that's where the usual downtime notifications help. You need to send out communications to all of your customers and give them a heads-up.
00:27:59
Speaker
Hey, these three features are going to keep working, these two features will run with a delayed response, and these other features will not work at all, or they might be incorrect, for a certain period.
00:28:11
Speaker
Setting up those expectations is super helpful in this sort of situation. And then you actually go ahead and do the migration. If you've planned it out well, and your writers are not stepping on each other's feet, you can have the cutover happen at a point where you don't have data consistency or integrity issues.
00:28:34
Speaker
At the consulting assignment, because the website was a second, additional channel for them, we could safely just shut down everything, keep an Apache or Nginx running on the front end, and put up a maintenance page for the five-ish minutes we were going to run that stuff.
00:28:50
Speaker
But at the previous place, we had to be a little more mindful about how we were doing the cutover. And those five minutes give you, sorry, I'll just close this out.
00:29:01
Speaker
Yeah, those five minutes give you the time to safely cut over, verify that the replication lag has caught up, and then open up the gates again.
00:29:11
Speaker
Yeah, that's it.

Managing Migration Impact

00:29:12
Speaker
I like it. I love the idea of pulling the thread all the way out to the user-visible application and asking: what really needs to work? And in the case of the travel website, the answer is none of it.
00:29:23
Speaker
It's okay for five minutes. But in other situations you might end up with a couple of key capabilities, and if you can winnow the traffic down to a small enough number, then...
00:29:35
Speaker
then great, you can still take a kind of soft downtime without taking the core capability of the site down. That's a really interesting approach. Let's come back to tooling.
00:29:47
Speaker
You mentioned writing scripts to check correctness, and you mentioned LLMs in that process. What was the most effective tooling that you wrote?
00:30:05
Speaker
The thing that helped me verify that replication had caught up across the different source databases.
00:30:18
Speaker
If you recall, I talked about how in the production setup there were four source databases on the lower version, and we chose to migrate all of them to just one node at the end.
00:30:30
Speaker
If it were a one-to-one mapping, it's fairly easy to just have two terminals open and run a command a couple of times: hit your up arrow, enter, click over to the other terminal, and do the same.
00:30:42
Speaker
But not when you have multiple sources of data flowing in, multiple sources of events. And the other tricky aspect of this scenario was that, for legacy reasons, a couple of the databases were in an entirely different VPC.
00:30:59
Speaker
There's VPC A and VPC B with the source databases, and then VPC C was the new one, which meant that running all of those queries from one location, like your local machine, was not going to be easy.
00:31:14
Speaker
So having it run somewhere in an automated manner is a lot more reliable. Yeah, I like that as a generic approach. That makes a lot of sense. But also, it sounds like this tool is very specific to this migration, right?
00:31:30
Speaker
That ended up happening eventually. Like, you can look at...
00:31:36
Speaker
Well, there are generic parts and specific parts to this. When you're looking at something like a Postgres migration, which is definitely not unique, there are metadata tables that you're interested in, which essentially tell you the count of rows.
00:31:53
Speaker
Those parts are generic. In fact, if you look at Stack Overflow, the Postgres forums, and a couple of other places, there are historical queries people have put together. Some of them are fairly complex unions and queries that will give you the state of the source database.
00:32:19
Speaker
And you can run the same thing on the target database, and it gives you essentially two tables to compare. So those are the generic parts. When it comes to the knowledge that, hey, this particular table has bad data so you're okay with only copying the first N rows out of it, or that you can entirely drop the data from this one,
00:32:48
Speaker
those are still relatively easy to make generic enough, but the knowledge itself is going to be application-specific, so you will end up customizing that. That makes sense.
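For the generic part, a usual starting point is the statistics view pg_stat_user_tables: run the same approximate-row-count query on the source and the target and diff the output. A sketch, with the comparison logic as an assumption:

```typescript
import { Client } from 'pg';

// Approximate per-table row counts from the statistics collector.
const ROW_COUNT_QUERY = `
  SELECT relname, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY relname;
`;

async function compareCounts(sourceUrl: string, targetUrl: string): Promise<void> {
  const fetch = async (url: string) => {
    const client = new Client({ connectionString: url });
    await client.connect();
    const { rows } = await client.query(ROW_COUNT_QUERY);
    await client.end();
    return new Map<string, number>(
      rows.map((r): [string, number] => [r.relname, Number(r.n_live_tup)]),
    );
  };
  const [src, tgt] = await Promise.all([fetch(sourceUrl), fetch(targetUrl)]);
  for (const [table, count] of src) {
    const other = tgt.get(table) ?? 0;
    // Only print tables whose counts diverge; eyeball the rest.
    if (other !== count) console.log(`${table}: source=${count} target=${other}`);
  }
}
```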
00:32:59
Speaker
As you were building these tools out, certainly some of this work was done manually and some with tooling.

Automation Philosophy and Tool Development

00:33:09
Speaker
What's your philosophy on when it's worthwhile to make the investment in automation?
00:33:15
Speaker
Oh, as a site reliability engineer, I think my answer is fairly simple. If I have to do a thing a third or a fourth time, then I'm going to automate it. If I'm going to do it just once or twice,
00:33:30
Speaker
I will probably err on the side of doing it manually. But when you're talking about a lower environment and a customer-facing production environment, it's fairly obvious that you're going to do it at least twice.
00:33:49
Speaker
And when you factor in all of the difficulties a typical migration includes, there are going to be multiple failed attempts and then one successful attempt in each environment, which easily bumps the number up to three or four plus.
00:34:05
Speaker
So if I'm thinking about migrations, this particular experience really drove home the point that, whether the tooling is generic or specific, always look for opportunities to automate and keep working on those opportunities.
00:34:23
Speaker
That makes a ton of sense. If you can be sure that you'll do it perfectly and there will never be a need to rerun it, then you can do it manually. But that's frequently not how software works.
00:34:41
Speaker
Looking back on this, now that you've done this migration twice, maybe it's worth automating. What would the perfect tool for Postgres migrations look like?
00:34:58
Speaker
That's an interesting question to think about.
00:35:07
Speaker
There are a couple of things that I would definitely want to
00:35:12
Speaker
keep an eye on, and use a tool to help me with. Every database system ends up having tribal knowledge associated with it: which data is actually relevant to the application, and which data is actually relevant to the customer.
00:35:30
Speaker
Those are two different sets of things. There will be overlap in the two answers, but they are different sets. And then: which of this data is relevant to the operators of the system, and which of this data is relevant to nobody?
00:35:45
Speaker
Those are the categories I would put it in. And a bunch of this can be guesstimated by looking at the statistics that all modern, mature database systems collect.
00:36:02
Speaker
How many queries are being sent? How many of them are reads? How many are writes? Which of them go to which tables? Because we're talking about relational databases, I'm saying tables, but I'm sure something equivalent exists in the NoSQL world.
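Much of that already sits in Postgres's statistics views. A sketch of the kind of query that yields those educated guesses; the ranking heuristic is my assumption, not something from the episode:

```typescript
import { Client } from 'pg';

// Rank tables by observed read and write activity since the stats were last reset.
const ACTIVITY_QUERY = `
  SELECT relname,
         seq_scan + COALESCE(idx_scan, 0)  AS reads,
         n_tup_ins + n_tup_upd + n_tup_del AS writes,
         n_live_tup                        AS approx_rows
  FROM pg_stat_user_tables
  ORDER BY seq_scan + COALESCE(idx_scan, 0) + n_tup_ins + n_tup_upd + n_tup_del DESC;
`;

async function rankTables(url: string): Promise<void> {
  const client = new Client({ connectionString: url });
  await client.connect();
  const { rows } = await client.query(ACTIVITY_QUERY);
  await client.end();
  // Tables with near-zero reads and writes are candidates for pruning or dropping.
  console.table(rows);
}
```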
00:36:16
Speaker
It is super helpful to have educated guesses about which tables are important for the customer and which are important for the application, because if you're coming in as an infrastructure person, whether as a consultant or a full-time employee or some mixture of the two, you do not always know the end-to-end user flow.
00:36:47
Speaker
So if you can get estimates that, hey, these are the 10 important tables, you can prioritize your correctness checks for those. Any tooling that helps me answer which tables are important and which are technical debt is something I would definitely want for the next migration.
00:37:09
Speaker
That's an easy answer for me. I'm sure if I rack my brains hard enough, I might come up with another answer or two, but this is what I have right now as a wish. But that's where you start. I think that's a fantastic insight, because I think it's very easy, if you haven't done one of these, to think that the hard part is moving the data, or some difference between the old version and the new version.
00:37:35
Speaker
But as you're describing, the real question is: why does this database exist for the purposes of the business? Why do these tables exist? That's a much harder question, and it does seem much more worthwhile to answer.
00:37:50
Speaker
I want to pick at that a little bit. Where does the answer to that question come from? How do you know that? You mentioned a couple of things, but where do you need to dig to find those answers?
00:38:03
Speaker
Oh, yeah. So the database statistics are just going to be the spark that ignites that fire. You still need to go and talk to people.
00:38:16
Speaker
If you have a very large setup, you'll go talk to the application engineers; they will have context. But at a large setup, most application engineers have deep but very narrow knowledge, if that makes sense: they know a lot about their specific feature set, but they don't always know how the rest of the thing works. If you're at a smaller place, there's a better chance that talking to your application developers is going to be very fruitful. You should still talk to all of them, and then you also want to look at whoever is the
00:38:54
Speaker
person with the product owner role; they can guide you towards the critical user flows. Another helpful thing at a mature organization is that your metrics, your alerting, or your SLOs are already going to cover what those critical flows are. That helps you narrow down, or verify, the data that came from your statistics, and say, hey, these are actually the ones relevant for me. Then you go from there to actually trying stuff out. Does that help? Is there something else you were interested in?
00:39:35
Speaker
That makes a ton of sense. You're almost pulling this up into a question of prioritization and organizational desires. At some point you need to flip it back down into: we're going to move this table, we're going to delete this one, or we're going to focus on these.
00:39:52
Speaker
How did you manage that process, especially as a contractor coming in, making sure that you got all of that information so you could do the right thing technically?
00:40:06
Speaker
That's an interesting conversation. For this particular scenario, the answer is going to be a bit of a cop-out, because I had a partner: the person who was taking care of the application side of things while I was doing the migration at the infrastructure layer.
00:40:28
Speaker
They had the
00:40:36
Speaker
permission to go ahead and make changes as necessary, as long as the core functionality of the application did not

Migration Completion and Technological Excitement

00:40:44
Speaker
break. And that gave both of us the freedom to take those calls ourselves.
00:40:51
Speaker
Of course, we did verify that stuff with the people holding the CEO and CTO roles in the company. But as long as we were able to make those changes in staging, and have a couple of folks who were familiar with all of the workflows use the staging application for a bit and give us confidence that, hey, nothing is broken here,
00:41:19
Speaker
we were able to go ahead and make those changes, take those calls of, hey, this table can stay, this table can be pruned, et cetera. Yeah, for sure. At some point you actually have to pull the trigger: you get all the information you can, you pull the trigger, and you validate as you go. And if it all looks right, then all's well that ends well.
00:41:42
Speaker
Yep. Very cool. So what's the state of the world now? You finished the migration, everything went peachy? Oh, yeah. The database migration was pretty painless.
00:41:57
Speaker
Like I mentioned at the beginning, we had started with multiple RDS instances, and each of them was the only node that had its data.
00:42:11
Speaker
Of course, RDS had automated snapshots set up, so in the very worst scenario you could restore from a backup, but that would mean losing the interim period since the last snapshot was taken.
00:42:23
Speaker
We got to a point where we had two nodes. And another thing we tried after the full migration was done: we migrated the RDS instances from the legacy, classic x86-64 hardware over to Graviton hardware.
00:42:45
Speaker
And because RDS is essentially a managed service, you can mostly assume, especially if yours are classic CRUD workloads with REST APIs, that most of those workloads are going to give you similar performance at a usually much cheaper cost.
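In CDK that move is essentially swapping the instance class, an x86 class for its Graviton (ARM) equivalent; the specific classes here are illustrative:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// The x86 class the instance ran on at migration time...
const onX86 = ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.XLARGE);

// ...and a Graviton equivalent swapped in afterwards.
const onGraviton = ec2.InstanceType.of(ec2.InstanceClass.M6G, ec2.InstanceSize.XLARGE);
```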
00:43:07
Speaker
So that's what ended up happening: we migrated all of the data over to modern instances and modern versions, and then we went to cheaper instances. We also made sure the parameters that were necessary for the larger, spiky workloads were tuned for the newer database version, and stuff has been running fine. There have been no database-related incidents since then. There were a couple of incidents at that place due to new code being rolled out, but those were not due to this migration.
00:43:46
Speaker
Awesome. Glad it all went smoothly. All right, we're just about out of time. This was fantastic; the answers you have around all this are amazing. It sounds like a really positive experience for everyone involved, which is not always the case.
00:44:07
Speaker
That is true. I was fortunate that I had seen a bunch of database migrations before from the sidelines. These were the first couple of times that I actually did the migrations myself.
00:44:21
Speaker
So there was some experience to lean on there.
00:44:27
Speaker
Right on. One final question: what are you most excited about in technology right now?
00:44:37
Speaker
Oh.
00:44:43
Speaker
Well, one is, of course, the work that I'm doing right now. I work for a company that makes it super easy to do what we're doing right now: talk to each other across continents and oceans.
00:44:57
Speaker
We essentially make WebRTC as a service. And I'm always excited about making it easier for people to work with each other, or just interact with each other.
00:45:11
Speaker
That just makes the world a better place, and that's always something that excites me. And if you look at it, like you mentioned at the beginning of the conversation, I've pretty much always worked at productivity companies of some form or another, whether mail or chat applications or video communication applications.
00:45:31
Speaker
That is always something that I enjoy working on. The other thing that I'm really looking forward to, and we did briefly touch on this:
00:45:42
Speaker
LLMs have their faults, but they also make it super easy to prototype ideas and get stuff out there. People have a bunch of ideas they have not been able to explore and experiment with, because writing code and dealing with systems is hard.
00:46:00
Speaker
LLMs don't make all of that easy, but they make at least part of it easy. They make it easy to prototype stuff. So I'm really looking forward to all of the cool ideas that people build out.
00:46:13
Speaker
And I know this will sound cliched, but I am currently optimistic about it. I absolutely agree. I think there are a ton of rough edges, and strengths and weaknesses of LLMs that most of us haven't internalized yet. We don't know how to use this tool yet, but there's not a technology I've been this excited about in a very long time.
00:46:39
Speaker
So we'll figure it out. Awesome. I'm looking forward to what comes out of this. Absolutely. Finally, if folks have follow-ups, where can they find you on the internet? They can find me on Mastodon, where I'm the most active. I'm happy to put links in the show notes or wherever it makes sense.
00:46:59
Speaker
I have a blog that I write on infrequently, and there's also X, or Twitter, which I check infrequently, but that's probably the easiest way for people to reach me.
00:47:11
Speaker
Great, we'll include those links. Awesome, thank you. All right, thanks so much, Ninad. This was fantastic.