
How Chick-fil-A adopts GitOps and K3s at the Edge

S3 E18 · Kubernetes Bytes

In this episode of Kubernetes Bytes, Bhavin Shah and Ryan Wallner interview Brian Chambers, Chief Architect at Chick-fil-A. Brian walks through some of the design decisions, challenges and architecture of how Chick-fil-A uses Kubernetes at the edge in their restaurants.

Join the Kubernetes Bytes slack using: https://bit.ly/k8sbytes

Ready to shop better hydration? Use code "kubernetesbytes" to save 20% off anything you order at liquidiv.com.

Try Nom Nom today, go to https://trynom.com/kubernetesbytes and get 50% off your first order plus free shipping.

Interested in attending Boston DevOps Days?

Timestamps

  • 01:05 Introduction
  • 06:22 Cloud Native News
  • 19:13 Interview with Brian Chambers
  • 01:13:20 Takeaways

Cloud Native News:

  • K8s 1.28 https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/
  • Default StorageClass assignment stable - https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/#automatic-retroactive-assignment-of-a-default-storageclass-graduates-to-stable
  • Non graceful shutdown stable - https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/#generally-available-recovery-from-non-graceful-node-shutdown
  • Ceph RBD and FS in tree deprecated
  • Control plane and node supported version skew goes from n-2 to n-3
  • Red Hat OpenStack services on OpenShift - https://www.redhat.com/en/blog/red-hat-openstack-services-openshift-next-generation-red-hat-openstack-platform
  • Alcion $21 million funding round: https://techcrunch.com/2023/09/19/alcion-which-provides-backup-and-security-services-to-enterprises-raises-21m/
  • Veeam led the funding round: https://www.techtarget.com/searchdatabackup/news/366552363/Veeam-leads-funding-round-for-SaaS-backup-provider-Alcion
  • Kubescape 3.0 - https://kubescape.io/blog/2023/09/19/introducing-kubescape-3/
  • GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances or MIG based sharing
  • https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances
  • https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/
  • Akuity launches Kargo - New Open Source project to automate declarative promotion of changes across multiple app environments - https://www.businesswire.com/news/home/20230918552920/en/Akuity-Launches-Kargo---a-New-Open-Source-Project-to-Automate-the-Declarative-Promotion-of-Changes-Across-Multiple-Application-Environments
  • OpenTofu - the Linux Foundation's alternative to Terraform - loads of community support
  • https://www.linuxfoundation.org/press/announcing-opentofu?hss_channel=lcp-208777
  • CFP is already open for KubeCon + CloudNativeCon Europe in Paris! https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/cfp/
Transcript

Introduction and Sponsor

00:00:00
Speaker
As long time listeners of the Kubernetes Bytes podcast know, I like to visit different national parks and go on day hikes. As part of these hikes, it's always necessary to hydrate during and after each one.
00:00:14
Speaker
This is where our next sponsor comes in, Liquid IV. I've been using Liquid IV since last year on all of my national park trips because it's really easy to carry and I don't have to worry about buying and carrying Gatorade bottles with me. A single stick of Liquid IV in 16 ounces of water hydrates two times faster than water and has more electrolytes than ever.
00:00:38
Speaker
The best part is I can choose my own flavor. Personally, I like passion fruit, but they have 12 different options available. If you want to change the way you hydrate when you're outside, you can get 20% off when you go to liquidiv.com and use code KubernetesBytes at checkout. That's 20% off anything you order when you shop better hydration today using promo code KubernetesBytes at liquidiv.com.

Meet the Hosts

00:01:08
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Walner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts.

Episode Overview: Cloud News and Data Management

00:01:20
Speaker
We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:01:38
Speaker
Good morning, good afternoon, and good evening wherever you are. We're coming to you from Boston, Massachusetts. Today is September 26, 2023. Hope everyone is doing good and staying safe. Let's dive into it. I mean, it being September already, we're nearing the end of the year. Yeah, we passed the autumnal equinox or something like that. And now we are officially in fall.

Fall Activities and NFL Season

00:02:04
Speaker
The weather certainly seems like fall. Sweater weather.
00:02:08
Speaker
Yeah, sweater weather for sure. Nice. Yeah, I can't complain about that. What's new, Bhavin? How are you doing? I'm doing good. Just keeping busy. As you said, fall is followed by the conference season, planning for things at KubeCon and other shows. But I don't know. I had a couple of slow weeks just at home, played board games with friends, with neighbors. I had a couple of friends over for dinner. NFL is back, so Sundays are always preoccupied and planned for.
00:02:38
Speaker
Yeah, I'm clearly behind on the DIY project. I haven't made any progress. So just an update: like, nothing. Bhavin hasn't done anything on his wall yet.
00:02:51
Speaker
Do you have an ETA for when you'll finish it up? If people start publicly shaming me, then I might need to pick it up and actually do it. Take a day off. I know Tim is listening to this. Tim, I'll take the day off, dude. I won't publicly shame you. I'll let other people do it then.
00:03:11
Speaker
How have you been, Ryan? It's good, been good. I went riding last weekend, I

DevOps Days Boston Plans

00:03:18
Speaker
think, down in Connecticut during the Hurricane Lee or whatever it was. Well, I think it was still a hurricane, right? Yeah, but it didn't really rain much, at least where I was. It probably rained in Boston a bunch.
00:03:29
Speaker
It was nice. I stayed dry. It was one of the hardest things I've done in a while. Left a little sore. Oh, wow. But it was a blast. Otherwise, yeah, it's good to slow down every now and then. I think that's good. We'll be at, I know you're going to be at DevOps Days Boston and so am I.
00:03:46
Speaker
Yeah, following your lead. Once you said you wanted to go for it, I was like, yeah, that's a good show. So if anybody who listens will be there, come say hi. We'd love to meet some local Bostonians who are listeners. That'd be really cool.
00:04:02
Speaker
They have been sharing speakers on the LinkedIn channel and some of those sounded really interesting. So yeah, I would love to meet people locally, maybe get some guests on the pod, right? Yeah, absolutely. I haven't been to DevOps Days Boston in a little while, which is surprising since it's a local conference, but I'm excited to be back. It's been a while. I think I was there in like the 2015 era timeframe, in those years. Oh, wow. Yeah.
00:04:29
Speaker
It's been going on for quite a while. I mean, obviously there was a little gap for everybody in the 2020 timeframe, but yeah, excited for that one.

Guest Brian Chambers and Chick-fil-A's Tech Evolution

00:04:37
Speaker
Speaking of being excited, we have a really great guest today. We won't introduce him just yet. We will dive into some cloud native news. There was a bit of news this week, it was a little less slow than the previous. I know, it's starting to pick back up. Well, it's conferencing. Or the fall, the fall version of it.
00:04:59
Speaker
If you've ever had a puppy and raised it to become a big dog, you know that changing food and finding the right food is hard to get right. Ultimately, you want them to feel good and act happy and be okay with what they're eating. They're part of your family, after all. I have an eight-year-old Golden Retriever named Roscoe, and he's always had a sensitive stomach, so finding the right food was kind of a pain. That's where Nom Nom comes in.
00:05:24
Speaker
Nom Nom's food is full of fresh protein that your dog loves, and the vitamins and nutrients they need to thrive. You can actually see proteins and vegetables like beef, chicken, pork, peas, carrots, kale, and more in the ingredients.
00:05:39
Speaker
So here's how it works. You tell them about your puppy, the age, breed, weight, allergies, protein preferences, chicken, pork, beef, and they'll tailor a specific amount of individually packaged Nom Nom meals and send them straight to you. If you're ready to make the switch

Edge Computing Solutions at Chick-fil-A

00:05:54
Speaker
to fresh, order Nom Nom today and go to https://trynom.com/kubernetesbytes
00:06:03
Speaker
and get your 50% off of your first order, plus free shipping. Plus, Nom Nom comes with a money-back guarantee. If your dog's tail isn't wagging within 30 days, Nom Nom will refund your first order. No fillers, no nonsense, just Nom Nom.
00:06:22
Speaker
Yeah, sure. So I wanted to start by talking about one of our previous guests, ARMO, and their open source project, which is also a CNCF project, Kubescape. They have a new 3.0 version now with a few new enhancements that warranted a major release. The first one being configuration scan results. You could do it one at a time. You can run the CLI utility or you can deploy the Kubescape agent on your Kubernetes cluster and do a scan once.
00:06:50
Speaker
and have the results displayed on the screen. There was no way to capture those results unless you were copy-pasting things. Now they have a couple of new CRDs, configuration scan summary and workload configuration scan summary, that will allow you to store results of the previous runs so you can actually compare things. I feel like that should have been a feature already in the tool, but I'm glad that they added it in 3.0. There you go.
00:07:12
Speaker
They also have a new eBPF engine for lower resource consumption, a new Prometheus exporter, and then image scanning, which was a feature for only the in-cluster version. Image scanning is now supported with the CLI version as well. If you have the Kubescape CLI installed on your Mac, for example, and you have configured a container registry, you can run a simple command to scan the images there or even do one at a time. Quite a bit of updates in Kubescape 3.0.
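To make the CLI workflow mentioned here a little more concrete, a rough sketch of how the Kubescape CLI might be invoked is below. The framework name and image reference are just illustrative examples, and exact flags can differ between versions:

```shell
# Scan cluster resources (or local manifests) against a compliance framework.
kubescape scan framework nsa

# Kubescape 3.0 adds image scanning directly from the CLI;
# 'nginx:1.25' is an illustrative image reference, not from the episode.
kubescape scan image nginx:1.25
```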
00:07:43
Speaker
Very, very cool. Then I wanted to talk about GPUs, and here come all of these buzzwords. AWS has been running a series, at least a couple of blogs that I found, that focuses on how you can use NVIDIA GPUs for multi-tenant EKS clusters. There are a few different ways to slice or share a GPU instance, or actually run an application using GPUs.
00:08:09
Speaker
The easy ones are just single process using CUDA, multi-process, and then the sharing part comes with either time-slicing, multi-instance GPU, or vGPU. We'll link a couple of blogs in the show notes, the first one talking about time-slicing and how it allows you to
00:08:29
Speaker
schedule these workloads one at a time. And then whenever Kubernetes Scheduler is scheduling pods, it will take into account the GPU resources available when making scheduling decisions if you have the NVIDIA Kubernetes device plugin installed. So that's goodness, like how you can share the same GPU for different applications.
00:08:47
Speaker
The second one being the multi-instance GPU. I know you'll need the expensive A100 GPUs in this use case, but it will allow each tenant or each application to have its own dedicated resources from a GPU memory compute perspective. So same cluster, but dedicated GPUs using the multi-instance GPU framework.
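The time-slicing approach described in this segment is driven by a config handed to the NVIDIA Kubernetes device plugin. As a minimal sketch (the ConfigMap name, namespace, and replica count here are illustrative assumptions, not values from the episode or the AWS blogs):

```yaml
# Sketch: NVIDIA device plugin time-slicing configuration.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources,
# so up to 4 pods can share one GPU (with no memory isolation, unlike MIG).
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # hypothetical name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Pods keep requesting `nvidia.com/gpu: 1` as usual; the scheduler simply sees four slots per physical GPU, which is the "shared GPU" scheduling behavior discussed above.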
00:09:08
Speaker
Cool. Sounds a lot like what we've been used to with Linux cgroups for CPUs, right? In terms of sharing those and slicing those and configuring how those are accessed. Yeah. Very cool. I know. I was just impressed because they are sharing hands-on examples of how to use these different architectures on EKS, like bringing Kubernetes into the picture, bringing
00:09:31
Speaker
EKS and making this available to all of the EKS customers today. I'm sure you can copy this and use it on different cloud or on your own if you have access to Nvidia GPUs in your on-prem data centers. But yeah, good techniques to follow or good blogs to read, I guess. Very cool.
00:09:48
Speaker
Yeah, next up, one more bigger announcement and then I'll just list a few. Akuity, right, the Argo CD enterprise company, or Argo enterprise company, they introduced a new open source project called Kargo, Cargo with a K, and that is
00:10:03
Speaker
their way of helping or improving developer experience and allowing developers to move their applications through different stages in the SDLC, or software development lifecycle. So instead of having to rely on custom scripting based on the CI tools that you're using, Kargo allows you to subscribe to artifacts like Git artifacts, image updates, Helm chart updates, and based on those updates, your application will move from, let's say, dev to
00:10:28
Speaker
staging to test, or dev to test, and then test to staging, QA, and so on and so forth, all the phases in the cycle. So interesting project to check out. It just came out, so I don't know how mature it is from an actual feature perspective, but knowing Akuity, they'll invest more into it as well.
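As a purely hypothetical sketch of the promotion idea being described, a Kargo stage that subscribes to image updates and promotes them by writing to a GitOps repo might look something like the following. The project was brand new at recording time, so every field name, repo URL, and branch here is an illustrative assumption rather than the actual CRD schema:

```yaml
# Hypothetical sketch of a Kargo Stage resource; field names are
# illustrative and may not match the real v1alpha1 schema.
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: test
  namespace: kargo-demo
spec:
  subscriptions:
    repos:
      images:
        # Watch a container registry for new tags (hypothetical image).
        - repoURL: ghcr.io/example/my-app
  promotionMechanisms:
    gitRepoUpdates:
      # On promotion, update the environment branch that the
      # GitOps controller (e.g. Argo CD) syncs for 'test'.
      - repoURL: https://github.com/example/gitops   # hypothetical repo
        writeBranch: env/test
```

The point of the sketch is the shape of the workflow: subscribe to an artifact, then declaratively promote it stage by stage instead of wiring custom CI scripts between environments.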
00:10:45
Speaker
And then a few different things to just quickly recap. OpenTofu. I really like the name. I don't know, for some reason, like being a vegetarian, maybe. The name's gotten a lot of flack, I feel like. I know it has, but it's catchier than OpenTF, you know?
00:11:02
Speaker
So the OpenTofu announcement. I know everybody who's active on social media might know this, but for people who didn't know, HashiCorp changed the licensing model for Terraform. So what the Linux Foundation did was it created a fork, or took that same repo, and now OpenTofu is available as an alternative to Terraform with loads and loads of community support. Even in the blog that we link, you will see a list of all the different vendors that say that, yep, we are going to support OpenTofu going forward.
00:11:31
Speaker
So it's not just one or two companies. It's like hundreds of companies that are saying that they'll support this open source project. So something to look out for. And then, CFPs for KubeCon Europe are already open. Dude, I can't believe that. Like, we're still in the planning phase for KubeCon North America, Chicago, 2023, and CFPs for 2024 are already open.
00:11:54
Speaker
Yeah, I don't know. I don't have any ideas right now, but I need to start thinking about this. It gives you a chance to submit a talk that might have been turned down to the next KubeCon. If you already worked on one, you could kind of... I know you're not supposed to just regurgitate what's been turned down, obviously, but you have an idea there that maybe you could drive on.
00:12:14
Speaker
I mean, I guess mine didn't get accepted. But if I add a few more things to it, maybe it gets accepted. So yeah, the acceptance ratio is really low. So good luck everyone.
00:12:26
Speaker
And then I know this will lead or provide a segue to your news item. But SpectroCloud, one of the vendors in the Kubernetes ecosystem, they announced a new funding round. They didn't really say what round it was, how much money there is, what their total valuation is. It's just like, OK, we have some new money from Qualcomm Ventures. And that's about it. So if you are at SpectroCloud or if you know more details about this, please share it with us. I don't know. This feels random. Like, oh, we have new money.
00:13:04
Speaker
But give us what it is. Speaking of KubeCon, actually, I'm going to go back before your news item. Yeah. And I heard that the North America one is supposed to be in Salt Lake City, Utah, a whole year from now. Oh, I didn't know that. OK. Yeah. Well, this is hearsay. I heard it from a few people, so maybe I'm wrong, but I
00:13:17
Speaker
But that's it. It's

Industry News: Red Hat and SpectroCloud

00:13:26
Speaker
think that would be very cool. I've never been, but I hear people go to that city quite often. So if that's true, I'm excited for it.
00:13:26
Speaker
exciting stuff, right? Anyway, congratulations.
00:13:32
Speaker
You know, my mind directly goes to, like, oh, Salt Lake City, November, good time to book a trip to Arches National Park after KubeCon. Like, dude, that's the only national park remaining in Utah for me. So that's perfect. There you go. Great minds.
00:13:48
Speaker
I like that. I like that. Cool. So yeah, I mean, in terms of funding rounds, I did have one on here, which is probably the one you were referencing. Alcion raised $21 million in a funding round. The major funder was Veeam. And the connection there is the founder was previously at Kasten, which was
00:14:12
Speaker
acquired by Veeam. There's probably some good relationships going on there. So congratulations to Niraj Tolia and Vaibhav Kamra on that funding round. Really exciting stuff there. If you don't know, their whole thing is around sort of AI-driven Microsoft 365 backup. It's a very interesting and different take for Niraj there. So very cool stuff going on over at Alcion, which I
00:14:39
Speaker
I think they might be local to Boston. Maybe I'm making that up in my head. It might be. The next one I had here was, we're going to throw back a little bit to OpenStack here. So Red Hat did announce sort of their next-gen OpenStack services, but this is specifically an early release for Red Hat OpenStack services on OpenShift. So I remember actually using the Kolla project
00:15:08
Speaker
I don't know if you were familiar with that one, but the Kolla project was basically bootstrapping OpenStack on Kubernetes, all in containers. You could use Kubernetes, you could use containers, but very cool stuff. I think this is targeted at the telcos a little bit more, but really a reimagination of the control plane for OpenStack and how to deliver it in an OpenShift environment where a lot of
00:15:38
Speaker
organizations and folks are going and running their applications. So I think this is sort of in line with a lot of what we've been hearing about, even with KubeVirt, right, where you get to run, you know, workloads that may be suited well for VMs. And I know a lot of telcos have
00:15:54
Speaker
you know, virtualization services and things that are kind of built for OpenStack. I think one of the articles references the percentage of 4G and 5G infrastructure that's running OpenStack. So, like, one of the articles says, if you're using your cell phone, you're likely using OpenStack.
00:16:12
Speaker
Okay, interesting. Ryan, I know you have experience with OpenStack, right? I get the whole KubeVirt on OpenShift thing. Yeah, that makes sense. That makes sense only if you're running OpenShift on bare metal. Don't do VMs or VMware and run OpenShift on top and then run VMs on top. That doesn't make sense. Why OpenStack on OpenShift? Doesn't OpenShift give you some of these services that OpenStack did?
00:16:37
Speaker
Well, I think the big benefit for OpenStack is really taking advantage of what's already there in terms of scheduling and scaling and those kind of things. The OpenStack services can take advantage of those for providing those and installing those and managing those. That being said, this goes back, I think, to more of Kubernetes being that data center operating system, right? That process that the way we hear about Borg being used at Google, where it kind of runs everything, right?
00:17:05
Speaker
Okay, so this I think is in line with that kind of perspective, where Kubernetes is the thing that is just a workload scheduler. We've really associated it with containers and cloud native apps. But as we get closer and closer and more of this kind of integration happens, I think it becomes more of that data center operating system, where, yeah, sure, go ahead and run this other platform, or go ahead and run a VM, or go ahead and run
00:17:29
Speaker
what's appropriate for the use case. So I think I'm all for it. I'd be curious to see what it does for the OpenStack community, whether it gets more people involved. It was obviously huge quite some time ago, but still clearly used in many, many cases.
00:17:47
Speaker
Okay. That makes sense. Thank you.

Chick-fil-A's Tech Stack Transformation

00:17:50
Speaker
Fun stuff. Little throwback there. And of course, we did mention it last time: Kubernetes 1.28 is out, and I won't go into all the details here, but we did put in a bunch of links. Default storage class assignment is stable, non-graceful shutdown is stable,
00:18:06
Speaker
and the Ceph RBD and FS in-tree drivers are officially deprecated. Control plane, yeah. Control plane and node version skew support goes back another version. So if your control plane is updated, nodes can be not only N minus two but N minus three, meaning up to three versions older, to kind of extend your support, because you want your workloads less disrupted, and that's the whole point.
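For the non-graceful shutdown feature mentioned here, recovery is triggered manually: once an operator has confirmed a node is really down (not just network-partitioned), they apply the out-of-service taint so that pods and volume attachments on it are forcefully cleaned up and stateful workloads can fail over. A sketch of that workflow, with a hypothetical node name:

```shell
# After confirming 'worker-2' is truly shut down, taint it out of service
# so its pods and volume attachments are forcibly cleaned up and rescheduled:
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Once the node is repaired and back online, remove the taint:
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```

The point of the feature going stable is that this cleanup path is now supported without the previous manual force-deletion of pods stuck in Terminating.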
00:18:26
Speaker
Yep. We'll put those links in there. But that's what I had for the news, Bhavin. So today we have a very exciting guest. His name is Brian Chambers. He is the chief architect at Chick-fil-A. So you might have heard of Chick-fil-A?
00:18:43
Speaker
They're pretty big, I think, in the last five years or so. Although I don't necessarily eat it a ton myself. I do. I have had it, and I have enjoyed it. But it's really interesting to see the technology side of things and how they're managing all the stores. And Brian's been a really great advocate for talking about this. He's talked at a number of different venues, as well as the Data on Kubernetes community and a bunch of others. So we got the chance to have him on the show. And I guess without further ado, let's get Brian on the show.
00:19:13
Speaker
All right, Brian, welcome to Kubernetes Bytes. Thanks for joining today. Before we dive into it, can you introduce yourself and tell the audience a little bit about who you are and what you do?
00:19:23
Speaker
Yeah, thanks for having me guys. Looking forward to it. So Brian Chambers, I'm the chief architect at Chick-fil-A. We are a quick service restaurant company, probably most people have heard of, but if not, we have about 3000 restaurants, mostly in North America. And leading the architecture practice basically means kind of two things like trying to contextualize a lot of the business problems and challenges that are going on and helping people understand like,
00:19:51
Speaker
Um, you know, what problem are we trying to solve together? And then how well is technology supporting that today? So we can figure out like where to invest. Um, and then also like looking at all the cool new emerging stuff and really making sure that we right size, uh, our investment in technology, um, across the business. So making sure that we have the right capabilities for, you know, software engineers to build and deploy things, um, as well as, you know, things that the business might use directly as well. So we're really focused on that kind of putting tools in the toolbox and then, um, consulting with people and helping them pick the right things, um, you know, as much as possible.
00:20:21
Speaker
Very cool. I know, that's awesome. And like, if people haven't heard about Chick-fil-A, guys, just go on Google Maps and find the closest one. I don't eat meat, but my wife really likes Chick-fil-A, so. You're missing out, Bhavin, apparently. Awesome.
00:20:36
Speaker
Okay, Brian, so I know you are on the Kubernetes Bytes podcast, so we're going to talk about Kubernetes in some way, shape, or form, right? But I wanted to talk about what things looked like before, like before 2017 when you went to KubeCon Austin and it snowed, and that's when you had the light bulb moment of "I need to use Kubernetes." So what did the Chick-fil-A architecture look like before, and then we can dive into what it looks like today.
00:20:59
Speaker
Yeah, there's really two parts. I think we're probably gonna talk a lot about the edge side of things, which is the restaurant. But, you know, we also, of course, had a non-edge footprint for a long time. So, I mean, we were one of the typical, I think, enterprise companies that did a lot of, like, big middleware tools and, you know, ERP-centric stuff a long time in the past, you know, just
00:21:25
Speaker
a lot of packaged applications in addition to some SaaS applications. And then, I mean, really, software engineering and development looked like building integrations and things like that on middleware platforms between these different tools. And I say all that because anybody who's worked in that world knows that those things can be really helpful in a lot of ways, but they can also be really challenging as well. So Kubernetes, which we'll get to, was a part of that modernization story.
00:21:54
Speaker
But probably the thing that most people have heard of related to Chick-fil-A and technology and that we'll spend a lot of time on today is our story about technology in the restaurant. And so really the history is, you know, the word edge, I guess is kind of just a novelty. It's been something that I think a lot of people have done for a really long time, like had compute close to the action in the store where it was needed. But for us, it was really our point of sale system came with a
00:22:21
Speaker
Like a Windows server that ran, and still does run, in many of our restaurants. It's on the way out, but it's still there, and it was a place to really run the point of sale. But what it didn't do is provide really any facilities for building and deploying custom things that did anything other than point of sale.
00:22:39
Speaker
So there was some stuff that had to get kind of jammed in there over time, but it was always like super risky to make changes and really long lead times and rollout times and just didn't have the flexibility. I mean, it was like deploying things directly into, into the windows operating system. Um, so that's kind of what that looked like before. And there were a lot of reasons that we wanted to revolutionize that and do, uh, different stuff, um, with our edge footprint. So that's kind of the two sides and what it looked like historically. And as cloud, uh, kind of came along.
00:23:07
Speaker
We started embracing some new paradigms, both for our primary applications that were traditionally, like, data center and such, and then also the way that we think about edge as well.
00:23:17
Speaker
Gotcha. Gotcha. Makes sense. I actually didn't know the history of Chick-fil-A. I had to, like, Google it. Because I just know the modern Chick-fil-A that I think we know today. I didn't realize it traces back to, like, you know, the early 1940s or something, I think I saw. Oh, wow. OK. Well, it had a different name, right? But yeah, no, not too much computing back there. Not too much.
00:23:41
Speaker
Back to Truett's original restaurant on the south side of Atlanta called the Dwarf Grill and grew up into Chick-fil-A eventually as it went to multiple restaurants. That's been our brand for a long, long time now, but certainly a lot of the expansion has been during the time I've been there, which is the last 20 years or so. We've grown from a handful of states to pretty much every state except for Alaska and North Dakota.
00:24:08
Speaker
It's definitely changed a lot. The tech has changed, but also the business has gotten a lot bigger and more diverse in the process as well. Got to get some Chick-fil-A in Alaska and North Dakota, I guess. Got to have goals.
00:24:23
Speaker
Cool. Well, I think as we talk about Edge a little bit more and why you went that direction, I think it'd be good to get the understanding of what the challenges were that made this re-architecture happen. Was there a point in time where you're like, flip the desk, we can't do this anymore? Too many outages. Tell us more about that.
00:24:47
Speaker
Yeah, it wasn't quite that dramatic. It was really driven by incoming business need, to be completely honest. So one of the things I really liked that Chick-fil-A started doing, you know, eight or 10 years ago,
00:25:01
Speaker
is we started having these things that we called big moves for a while. Now we call them strategic bets, but it's really just top down: what are the biggest things the business is trying to solve for right now, so that we can rally around them disproportionately? It doesn't mean it's the only thing we do. There's a lot of other stuff that isn't strategic bets that's critical to running the business and serving customers and all those things, but these are places where we feel like there's really big opportunity if we can either solve problems or take advantage of opportunities.
00:25:31
Speaker
Um, so we try and focus on those. So one of those several years ago is around, uh, what we call restaurant systems capacity. Doesn't mean just technology systems. It really means the whole way that the restaurant works. And for anyone who's listening, who's been to Chick-fil-A, you probably experienced, uh, the busyness of the restaurants, whether it's a drive-through line that's like wrapping around the property.
00:25:51
Speaker
It still moves fast for a bunch of reasons, but huge volume there. I mean, same story inside. It's just super busy. We've picked up new channels like delivery that we fulfill as well as partner delivery. And there's a ton of volume there. So we just have a lot of challenges that are growth related, which are really good ones. But it's really made it a challenge to continue to scale all of the systems that we had in the restaurant. And like our goal is to maintain really great experiences for all the people there.
00:26:18
Speaker
which of course is the customers, but it also includes that restaurant operator and their team members who are doing a lot of really hard work. So all that led us into starting to look at what could we do in the restaurant that would help us force multiply as much as possible, help us scale. And we're an organization which like most everybody these days believes that technology is one of the key elements. We would say people process data technology, but technology is a key
00:26:47
Speaker
element and thinking about how do we help the business get where it wants to go, achieve the things that it wants to achieve. And, um, so that kind of led us down this journey of, uh, getting into things like, uh, IOT, uh, enablement starting in our kitchen. Um, we really wanted to, uh, look at the kitchen as a bottleneck that we were trying to solve problems for decision-making for team members easier, et cetera. It led us into starting to connect things. IOT is a buzzword, but really just connecting all the things that exist there.
00:27:14
Speaker
Friars, grills, holding cabinets, all these kinds of things with the ability to start to work towards, you know, maybe automating stuff or at least providing more real time intelligence to the people working there.
00:27:26
Speaker
take away some of their thinking and cognitive load that they have to do to figure out what to cook and when, and try and systematize a lot of that stuff. So the really fast version of that is when we started doing things with IoT, it's like, okay, there's like two options here. One is you make the things smart, and you're feeding data into a fryer or a grill or whatever, and you're asking it to make its own decisions and maybe send some things back out.
00:27:47
Speaker
That sounded really bad. The other is you put the intelligence somewhere else, and we would have loved to have done things like that in the cloud. It's far easier for your constraints. One of the constraints of the cloud, though, is that if you're at a site that doesn't have super reliable connectivity, or you just can't tolerate latency or potential outages,
00:28:07
Speaker
you've got to come up with something else. And so that led us into the edge, and then ultimately to Kubernetes. But what we really wanted was a way to take applications and run them in that restaurant environment and let them interact with all the stuff that was there, all the things. Point of sale is something that's entering that ecosystem right now as we speak. So really building this ecosystem where everything's open, everything can share data.
00:28:31
Speaker
We can externalize intelligence from the things into the edge and build apps that can sort of like interact with all this stuff and make great things happen for the operator, the team member, the customers as well in the restaurant.
00:28:42
Speaker
Got it. Got it. Yeah. One point you made there was IoT and Edge being buzzwords, and we hear them all the time, I think more and more every day. Was Edge something where you heard the term, saw the architecture, and said, this is what we need?

Understanding Edge Computing in Restaurants

00:28:57
Speaker
Or, I mean, you just said you were kind of driven by growth and a bunch of other things. Did it just happen to be that the architecture you were driving towards was Edge? Yeah, I mean, of course, the way I'll answer it sounds like
00:29:11
Speaker
we're awesome, but I mean, the truth is that we didn't come up with that term the first time, and we didn't really know what to call what we were doing at first, other than just, like, computer in the restaurant. So we actually ended up using Edge pretty early on, but there were other terms at the time; Edge wasn't really big yet. Fog computing is what people were saying. Oh yeah. I'm glad that one went away. I understand the intent of it, but I was like, oh, we don't want to call it our fog compute.
00:29:41
Speaker
We kind of started using Edge. We had this nifty little logo, and we just called it the Edge Compute. We kind of branded it, but that was really driven, like you said, by a business need to manage the constraints that we had at the restaurant, and to try to create a place to run applications that was resilient, really as available as our LAN is, which is pretty good. A LAN in the restaurant is great.
00:30:04
Speaker
And our WAN connections are pretty decent; they're mostly up, but there are those times that they're not. And we anticipated business needs coming where that was not going to be okay. Like, if you have something that's a 10x force multiplier for a team member, which is the goal, and then it goes away even for an hour, and it happens to be during lunch, that could be really bad operationally and at service,
00:30:27
Speaker
you know, a bunch of issues, a lot of stress. And so we just wanted to mitigate some of those things, which led us to think about putting something on-prem. Maybe that's the other thing you could call it, but we liked the term edge. And as that sort of developed, it's like, yeah, that is what we're really trying to do. Our version of edge is at the store; other people's is different. My definition of it is just as close to the user as is necessary to create the experience you want, but no closer. So for a lot of people that's CDNs, or
00:30:53
Speaker
serverless functions at a CDN or whatever; for us it just happened to be that we needed to get into the restaurant due to the last-mile network. So that's what we did.
00:31:02
Speaker
That's an interesting use case. When I was thinking about the Chick-fil-A use case, I was mostly focused on point-of-sale systems and how you run those, but building a hub for all the devices that are inside the store, making sure orders get processed at the right time, and making sure all the intelligence is built at the edge. That's awesome. Those are new things that I learned today.
00:31:25
Speaker
Yeah, I don't know what your busiest time of day is for Chick-fil-A, because I know that you do breakfast as well. I haven't done that myself. But it's kind of like you have a Black Friday at certain times of the day, every day. I mean, in this world of sort of
00:31:40
Speaker
a bad dining experience, people, you know, go to the internet with their pitchforks, right? I mean, that could really affect the restaurant at the end of the day, right? Yeah, totally. I mean, we're, from 11:30 till like 2 o'clock roughly, it's crazy. And then we have, you know, smaller spikes for dinner. And it depends a little on the restaurant, but this is just sort of, you know, the median or whatever. Yeah, little bumps at breakfast time, and then a lull, and then lunch is crazy, and then a lull, and then you've got the dinner time that
00:32:10
Speaker
comes in the evening that isn't quite as busy as lunch, but can still be really busy. And some stores just based on where they are tend to have bigger dinners, but I'd say that's probably rare. But yeah, we've got these
00:32:20
Speaker
You know, big spikes of traffic that are happening. I mean, they're predictable. Roughly, we kind of know what we're going to deal with. We don't get 10x more customers most of the time than we would on a normal day. So it's not like your web-scale-systems type problems, but it is a whole bunch of people who are looking for a great experience, who we want to serve really well. And if the technology
00:32:42
Speaker
isn't reliable and doesn't do what it needs to do, that could impact that. And that's really what it all comes back to: the goal is just to make sure that people have a great experience. Nice. Yeah, I think Ryan's ulterior motive with this question was just figuring out the best time to show up. When to show up? Yeah. Yeah, we would love that. I had to get the data out of Brian.
00:33:04
Speaker
Okay, so Brian, we have teased Kubernetes enough, and Kubernetes at the edge enough. Let's talk about what the stack looks like today and why you chose K3s specifically to build this new stack architecture. Let's cover everything from the hardware all the way up to the apps.
00:33:20
Speaker
Okay. Yeah, sure. So maybe I'll just give you the high-level survey view first, and we can kind of dig in. So if you think about the stack, essentially what we have is: each restaurant has three Intel NUCs. And on those we're essentially running on bare metal. So you've got Ubuntu; we're on 18.04, which is
00:33:39
Speaker
the long-term support edition that was the newest at the time that we started. And then, I won't go into all the details because it can get really hard to talk about without a picture, but we did some stuff with OverlayFS: essentially, we lay down a partition scheme before we get to the operating system, so effectively the OS is there,
00:33:58
Speaker
the one that you're going to boot off of and all that kind of stuff. But you've got the OS there and you've got some other partitions, some of which are durable, so you could store certain things there that would last through the next thing, which we call a wipe. And then with OverlayFS, we did some tricks that basically let us build up our image on top of that, which includes K3s and patches and all these other kinds of things that may be on the box.
00:34:19
Speaker
But that's all wipeable remotely, back to its initial show-up state, the initial image that it had. The only things that would remain would be on that durable partition, which are pretty minimal. And essentially what that lets us do is wipe our whole stack back to zero if we're having operational issues or whatever, and start over, or rebuild on a different version of the OS potentially. We actually haven't had to do that to the long-term-support OS so far; we've just added patches,
00:34:47
Speaker
but it lets us do that, or a new version of K3s or whatever. So that's sort of the stack: the Intel NUCs, the partition scheme, the Ubuntu 18.04, and then we ultimately were running K3s on bare metal on top of that. And so you asked about why K3s.
00:35:04
Speaker
Time travel back to before that KubeCon that we talked about at the very beginning. Yeah. One thing that we knew is that we wanted these applications, I spoiled it, these applications that are going to run in store, to be containerized. And why?
00:35:19
Speaker
We really wanted to mirror the paradigms that our software engineers are used to in the cloud as much as possible. And container-based deployments have become the de facto standard for us, and it was obvious that was where the industry was going for the foreseeable future. But we didn't want to do it just because of that.
00:35:38
Speaker
With that comes all of the open source ecosystem, you know, all of the community engagement, all of the stuff that you get in addition to the technology itself, which we really put a high value on. So we knew we wanted containers. And so we looked at all the options, like HashiCorp Nomad, you know, Docker Swarm. We actually had a prototype
00:35:58
Speaker
in a couple of restaurants that was Docker Swarm based initially. But at that KubeCon that we talked about, the snowy 2017 KubeCon, it was just apparent that Kubernetes was going to win the day when it came to container orchestration. And again, kind of same point, but
00:36:14
Speaker
when something wins the day, you have other things that continue to exist that can still be great solutions, but a lot of the ecosystem comes with the winner. A lot of people want to work on that stack. You can't overstate that. The best solution isn't always the most technically pure solution. It's also the one that brings the things that make sense, something you can
00:36:35
Speaker
manage and support, which requires people. If you want people, you've got to have people who want to work on that thing. It's got to be a compelling job. So you wrap all that stuff together, and it was like, okay, we tried a bunch of stuff. We kind of had a perception that maybe Kubernetes was too heavy for our use case, but it looked like it was going to win the day. And then we found K3s, which, for anyone who's unfamiliar, is Kubernetes minus all the cloud orchestration stuff, in a single binary.
00:37:01
Speaker
I think it now supports etcd, but it was SQLite based in the early days. And it was just simple to bootstrap, stand up, and run. And so once we found it, we just really loved what it brought to the table and kind of embraced it full on. The last thing I'll say to that question is:
00:37:20
Speaker
one thing that I think we did with Kubernetes that has been really, really great for Chick-fil-A is we didn't get super cute about all the things that we could do. Our number one goal was: I have a containerized application, or maybe a foundational service like a database or something like that, and I want it to be resilient to a node failure, and I want it to be scheduled to run somewhere that's active as part of the cluster. And that's what we use it for.
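That "keep it simple" usage, a containerized app that survives a node failure and gets scheduled onto whichever nodes are healthy, is essentially stock Kubernetes scheduling. A minimal sketch of what such a Deployment could look like on a three-node cluster (the app name and image are hypothetical, not Chick-fil-A's actual manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kitchen-intelligence        # hypothetical app name
spec:
  replicas: 2                       # survives the loss of any one NUC
  selector:
    matchLabels:
      app: kitchen-intelligence
  template:
    metadata:
      labels:
        app: kitchen-intelligence
    spec:
      affinity:
        podAntiAffinity:            # spread replicas across different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kitchen-intelligence
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: registry.example.com/kitchen-intelligence:1.0.0
```

The `topologyKey: kubernetes.io/hostname` anti-affinity keeps the two replicas on different NUCs, so losing any one node still leaves a running copy.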
00:37:48
Speaker
There are a couple little additional things we've done over time. Like, we have an operator that does some stuff with restarting pods when secrets get updated, and things like that are convenient, but we really wanted to keep it as simple as possible, so that it was as easy to manage as possible, because we knew from the start we were dealing with, at the time it was probably 2,500, but 2,500 and growing,
00:38:08
Speaker
footprints of this, and you've just got to make a lot of trade-offs and try to keep it as simple as possible for it to be something that you can actually work with and manage at scale. So I think that was a really good decision for us: selecting K3s, but also the way that we chose to use it.
00:38:27
Speaker
And I think it makes sense. For full Kubernetes, if you want to run a cluster, you have minimum requirements from a CPU and memory perspective, which I think would make the NUCs unusable. You'd have to go with server-class hardware that's optimized for the edge, and that creates a whole lot of other issues. So K3s makes sense.
00:38:48
Speaker
Yeah. I'm curious about the NUCs, too. Was form factor or anything else sort of a gate to choosing the NUCs? Yeah, great question. We've loved the NUCs. They've been amazing. It's sad Intel discontinued them.
00:39:06
Speaker
They've been awesome. So yeah, form factor was one of the biggest things that we were considering, because we are very space constrained in the restaurant. That's one of those capacity constraints, actually: having room to keep all of the food, having a place in the office, having room for people to even get in there and do the things they need to do.
00:39:23
Speaker
So we're dealing with a very small physical space that we can put these things in, you know, along with the other networking gear and the point-of-sale server and stuff that's been there. So very small. So we couldn't go with, like, blade servers or something like that. We don't have a server closet or anything of that nature. So usually this is in the office, the back office in the restaurant, in a rack on the wall.
00:39:44
Speaker
So space, form factor, and then cost was a big thing, because this was brand new, you know, five and a half, six years ago. And we didn't know what was going to happen. We didn't know what kind of resources we were actually going to need. We knew we needed some bare-minimum stuff, and we knew some of the early use cases, but we weren't sure what we'd end up running, and we didn't want to make a massive capital investment in hardware that was just going to sit there. And so it's turned out that we've pretty well right-sized it for the season that we started with.
00:40:15
Speaker
And we're doing a refresh shortly and, you know, upping our specs a little bit. But for the most part, picking a small form factor and keeping it low cost. Another advantage to low cost, actually, is we can do the cattle-not-pets thing with our nodes. So if they're misbehaving, I mentioned wiping them and bringing them back.
00:40:35
Speaker
But if they don't cooperate and they have some sort of lingering issue, we just take them out of the cluster and drop-ship a replacement. And if we can't warranty it for some reason, we're out, like, a few hundred bucks. That's very different than being out $5,000 each time that happens. So that was a pretty awesome design decision; really, the confluence of our requirements and
00:40:56
Speaker
constraints, and then, you know, what was available at the time, kind of set us up to do some of these things. That said, I'm not sure we'd do this forever. If we have to do machine learning at the edge at some point, we're probably not throwing away NVIDIA boxes or something like that. It worked for a season. We'll see what happens in the future. But those are definitely some of the things that we were managing when we thought about the hardware.
00:41:20
Speaker
Got it. Yeah. So you said a few other things. The database is one I do want to come back to later on, because especially at that time, 2017, which, if I were to pinpoint a year of big changes in the container ecosystem, 2017 is definitely high on my list. Because at that time, Mesosphere was still doing really well. But then you had all these changes from, like you said, Swarm and everything else.
00:41:43
Speaker
You also mentioned you had, I think it was 2,500 to 3,000 restaurants. At this time, it was more like 2,500, as I mentioned. Were you rolling all of this out to those 2,500? What did the changes look like for a restaurant that already existed with technology, versus if you were to build a Chick-fil-A in the center of Boston?
00:42:07
Speaker
Yeah, actually, both of those were exactly the same, which is part of the design. So we did a phased rollout. So throughout the year of 2018 is when we were rolling out; it may have been a little in '17 and maybe a little in '19. I'm trying to remember the exact details, but
00:42:24
Speaker
but yeah, the architecture is the same. So essentially, and we didn't really hit on this before, but all that has to happen for these to enter a restaurant is they show up with a power and an Ethernet cable, and there's a designated port that they get plugged into in each of our switches, you know, in the network rack. So there's one NUC on each of the switches; we have three switches for resiliency and for a bunch of other things in store.
00:42:49
Speaker
So we're essentially resilient to node failure, switch failure. We have two routers in the store, so router failure. So there's a ton of resiliency baked into that network stack that we're able to be the beneficiaries of with this architecture.
00:43:00
Speaker
But that's all you have to do: basically plug it in, and everything else is taken care of automatically by reaching out to the cloud, through API interactions and stuff like that. So it's pretty straightforward. So for an existing store install, we did a mix of having somebody go in and install them, and shipping them to the store and having the team do it. Shipping to the store and having the team do it is easy, but it's another thing for them to do, and capacity, like we said. So we ended up
00:43:27
Speaker
mostly working with an installer to do that for us. But yeah, it's super simple. And so for a new store, when they install everything, they just plug those in, and whenever there's power, they come up, check in, register themselves, and off we go. So same approach either way. And we tried to make that, we labored intensively to make that, as simple and zero-touch as possible for the people at the store. We don't have IT technicians in the field to go manage these things.
00:43:57
Speaker
So the only time somebody's really visiting would be for an install or something like that the first time. But for replacements, generally we drop-ship replacements and have them plug them in, things like that. So we wanted it to be as easy as possible, to make that process easy on the people there. Got it. And in terms of the actual devices, the IoT devices in the restaurant, did they need awareness of, hey, this thing now lives in my restaurant? How did that work?
00:44:24
Speaker
Yeah, no magic. So there's a whole process for bringing the devices in that we were working on in parallel to the edge. Like I said, we kind of started with IoT, driving towards solving a lot of our capacity issues. So
00:44:38
Speaker
before we even started on the edge, the first thing we were doing was figuring out how we could connect all these devices that exist. We started working upstream with the manufacturers of all of those to figure out how that might work. And we ended up providing essentially an SDK that walks them through a provisioning and onboarding process. It's OAuth driven. And there's a bunch of other little details in there we won't get into. But essentially, they onboard into our environment,
00:45:06
Speaker
get credentials distributed to them to join the right network where the edge stuff lives, and then they start talking over some standard protocols. And generally speaking, we don't allow devices to talk out over the network, like to the cloud or anything like that. So they broker through the edge. They can fail over if the edge is down completely and still send data out, like telemetry data and such. But generally speaking, everything points to the edge first. So most of the communications are MQTT driven,
00:45:35
Speaker
as opposed to, like, a REST API type of world that we're all used to, or gRPC or things like that. This is all pretty much message driven with MQTT, a lightweight messaging protocol. And they pass things over that, so it's pub/sub. We host the broker at the edge, and everybody kind of talks there to get the messages. And then we also asynchronously send all that stuff up to the cloud so it can make its way into our data lake, analytics ecosystem, all that stuff.
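The store-and-forward behavior described here, buffer telemetry locally while the WAN is down and flush to the cloud when it returns, can be sketched roughly like this (all names are hypothetical stand-ins, not Chick-fil-A's actual code):

```python
from collections import deque

class StoreAndForward:
    """Buffer MQTT-style messages locally; flush upstream when online."""

    def __init__(self, send_upstream, max_buffer=10_000):
        self.send_upstream = send_upstream      # callable that ships one message to the cloud
        self.buffer = deque(maxlen=max_buffer)  # oldest messages drop if the outage runs long

    def publish(self, message, online):
        # Always queue first, so ordering is preserved across outages.
        self.buffer.append(message)
        if online:
            self.flush()

    def flush(self):
        # Drain the buffer; stop (and keep the rest) if a send fails mid-flush.
        while self.buffer:
            msg = self.buffer[0]
            try:
                self.send_upstream(msg)
            except ConnectionError:
                return  # WAN dropped again; retry on the next flush
            self.buffer.popleft()

# Usage sketch: simulate an outage, then reconnect.
shipped = []
saf = StoreAndForward(shipped.append)
saf.publish({"fryer": 1, "temp_f": 350}, online=False)  # buffered
saf.publish({"fryer": 2, "temp_f": 345}, online=False)  # buffered
saf.publish({"grill": 1, "temp_f": 400}, online=True)   # flushes all three
```

A real implementation would persist the buffer to disk and hook `online` to actual connectivity checks, but the queue-then-drain shape is the core of the pattern.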
00:46:02
Speaker
So we're collecting all the business data at all times and shipping it up whenever we can. For offline, it's sort of a store-and-forward; it ships it out when the connection comes back. Okay. So Brian, this sounds like a simple architecture, but difficult to design. I wanted to ask you about how you designed to avoid or eliminate single points of failure. Because you said everything eventually makes its way back to your core backend or core data center.
00:46:30
Speaker
How do you avoid any single points of failure in the stack? Yeah. So, I mean, it starts with the three NUCs, for sure, and then K3s on there, scheduling workloads across nodes that are available. So that's kind of part one. Part two would be definitely that network stack that we described. So having a resilient network is critical to having resilient infrastructure at the compute level. So we've got a lot of resiliency baked in there as well.
00:46:54
Speaker
And then obviously we do have the WAN network that we're dependent on. We have two connections there, but the failover is generally an LTE-type deal, so not something high throughput. So we actually don't usually send much activity over it, which is kind of a resiliency play for the restaurant. We don't want to saturate that network with IoT-related data, or
00:47:17
Speaker
even a lot of the point-of-sale-related data or things like that. We really want that for credit card processing, digital ordering, things that are critical to the business. And so we're aggregating things in the cluster during that time period and then shipping out later. And then there's everything that we do on our cloud control-plane side, which is where all of our, we'll probably talk about some of the other services there, but we're GitOps based. So all the tooling around that runs on the cloud side to orchestrate deployments and canaries and all that fun stuff:
00:47:46
Speaker
the aggregation of operational telemetry, shipping things out via Vector to other teams, and then also collecting operational telemetry for the platform team to use. All that stuff runs in Kubernetes proper on the cloud side, an EKS-based Kubernetes cluster, and we've been running that from day one as well. So while we've had a lot of edge experience, we're also running a decent-size cluster on the cloud side to support all of those edge footprints.
00:48:15
Speaker
I think those are probably the main things. Maybe the other is just taking advantage of those three nodes with a shared database. We've had Mongo as part of the stack from the beginning to back our MQTT broker. So little things like that as well. But that's really the majority of the resiliency story. And then we made some compromises on it, too, where our SLA on persistence is best effort. We don't make guarantees. And I think we'll talk about that some more, but we don't guarantee that we'll have
00:48:43
Speaker
things forever. That makes for a very complicated architecture to support in 3,000 restaurants with the footprint that we have. So there's a bunch of things like that that we made trade-offs on. Another would be trying to be highly recoverable. So recovery of a cluster, for example, or being able to wipe it away and bring it back, as opposed to trying to make it highly available, like it never goes down,
00:49:06
Speaker
which is what we would do in the cloud. You've got a lot more resources on the cloud side in terms of nodes that can be in a separate control plane and all those kinds of things, and we don't have that luxury. So we really traded off a lot of the availability things for recoverability and simplicity, which I think were really good trades for our use cases.
00:49:23
Speaker
And I think that makes sense, right? For the lunch hour that you mentioned, if a node goes down, it looks like, based on your architecture, the restaurant can continue operating on a two-node basis. And I think during our intro call you said even if a second node goes down, you can still fully function on a single node, which is great. And as you said, pull the plug, reinstall it, and it joins the cluster again. Yeah, we do graceful degradation whenever possible. And, I mean, I remember when I started getting into cloud stuff, the whole story was,
00:49:53
Speaker
you know, cattle, not pets, and anticipate things failing. Like, don't be friends with your infrastructure and all that. And we tried to apply that as much as possible to this world of having to care about the infrastructure. Still try to not expect anything to be there forever, try to not depend on it long-term, and try to build and design as much as we could for ephemerality. And again, with the
00:50:20
Speaker
footprint that we have, it would have been really hard to support it at scale without a massive team if we hadn't done a lot of those things. So I think they were a good trade off in our situation.
00:50:30
Speaker
I like that distinction between recoverability versus availability, right? Especially dealing with a physical restaurant, you know, Hurricane Idalia or Lee comes rolling through and blows away a restaurant. I think that makes a big difference, especially when you're working with compute at the edge. So that's just a neat point I wanted to make. Speaking of what's running at the edge, let's talk a little bit more about the application components that make up
00:50:56
Speaker
what's running in each restaurant. So I know I heard a broker, I've heard of a database. I'm sure there's a bunch of other stateless things kind of sending stuff somewhere. So give us a breakdown of what that looks like.
00:51:08
Speaker
Yeah, absolutely. So there's a number of services that the platform team operates, so I'll start with those as the examples. First of all, the ones you mentioned: MQTT is pretty central to our stack, so we run that broker at the edge. In addition to that,
00:51:28
Speaker
we mentioned all the devices; all the stuff, even the applications that run at the edge, all kind of follow this same onboarding process. They have to get credentials to be allowed to connect to the broker, or
00:51:42
Speaker
talk to cloud services or whatever else. So that's an important part: managing all that. So we also have an OAuth server that we actually wrote from scratch in Go on the cloud side. Why? Because of licensing costs at the time, and because we wanted the OAuth device code flow, which didn't exist in the commercial products at the time,
00:52:05
Speaker
because we wanted an operator-approval process for new devices coming into their store. So none of the things have to put authentication info on the screen; instead, it was like, go to this URL and do that whole flow everybody's familiar with on Apple TVs and Rokus and all that stuff. So we sort of used that model. But all of these things have to have credentials. And I should back up and say they're all JWT based, so permissions are embedded in the token and you can see all the resources.
00:52:34
Speaker
It's signed with the proper certificate on the cloud side.
00:52:38
Speaker
But all that sounds great until you're offline at the edge, it's time to refresh your token, and you can't reach the cloud. Are you just down? We knew that would happen. So there's also a local auth server at the edge. It doesn't contain everything; we didn't want to deal with all the complexity of syncing permissions down and trying to keep that up to date. That doesn't even make sense if you're offline anyway. So we did something I think is a cool little trick, which is we have a separate
00:53:06
Speaker
CA at the edge in the restaurant that can re-sign a valid token. So it checks to make sure the current token is valid, and then it can re-sign it with a different authority that says, hey, this one was signed by the edge instead. And it just mints the same permissions that already existed, so it's effectively an extension of the current token. So it can keep working in the store, and then hopefully the next time they come through and look for a refresh, whenever that is, depends on the device, hopefully we're back online.
00:53:35
Speaker
Any permission changes can get propagated, and off we go. But we have that running there as well. I guess we'll get into the database really fast. So, MongoDB, I mentioned, backs our broker.
00:53:47
Speaker
Over time, while we don't make persistence guarantees to applications running at the edge, we do offer persistence as a service, because for the most part, things last. So in theory, Postgres is there most of the time and it's available to store things. Or, you know,
00:54:07
Speaker
really, we encourage a rehydrate pattern. So when you're online, send things to the cloud and keep your state up there; should you get blown away and have to be recreated as an application, rehydrate. So think about an iPhone: a new one shows up, you log into your iCloud account, and poof, everything is back, right? We encourage that pattern a lot for edge applications, edge-native applications: don't really rely on persistence. Do application teams have to get
00:54:37
Speaker
vetted on how their architecture looks, so that it accounts for the fact that you don't guarantee persistence?

Data Management and Software Updates

00:54:43
Speaker
Yeah, there was a ton of working directly with the early customers when we started off, and then a lot of documentation. It's something we don't talk about in technology that much, but a lot of documenting the constraints, the expectations, the good patterns for
00:54:58
Speaker
delivery, et cetera. So we put a lot of effort into that early, and I think did a decent job at it. We certainly could have done better, but yeah, it is a little bit of a paradigm shift for teams, especially ones used to the cloud, where everything's here forever. It took a little learning, but I think it's gone decently well.
00:55:19
Speaker
There aren't a massive number of needs for a lot of long-term persistence. We're facilitating a lot of things that probably matter over a 30-second, a one-minute, you know, a five-minute window.
00:55:33
Speaker
The data probably doesn't matter that much after that for a lot of our use cases. It's communication across things in the restaurant. It's rolling things up and understanding demand, things like that. So the demand for fries an hour ago doesn't really matter, right? It mattered in the moment; it was useful. So that helps a lot, too: the nature of the use cases that we're supporting. This is not our source of truth for restaurant sales, for example. We don't do things like that.
00:56:01
Speaker
But nevertheless, Postgres is there for people who want it, who want to dump a bunch of data and use SQL, you know, rolling things up.
00:56:08
Speaker
Um, or things like that. So it's a helpful tool. So that's there as well. Um, another one, uh, will get us into, uh, GitOps a little bit. Um, I like to say that we stumbled into GitOps. Like we didn't actually know that was a thing. Um, we just thought, hey, we have a whole lot of clusters that we're going to have to manage the state of, and we know that we're going to be dealing with a bunch of like,
00:56:31
Speaker
YAML files and things that define the configuration of the cluster. How could we manage all of these deployments? We're like, what if we just have a Git repository per location? Everybody always asks about that, so I'll come back to it. Um, Git repo per location. Every restaurant has one. We always know the state that that cluster should be in by looking at that repo.
00:56:52
Speaker
We can reproduce it from scratch easily by just sucking down that source config again. Awesome. And then we basically just write a little agent that pulls from Git, applies the changes, and then closes the feedback loop on whether everything went well or not. And that's exactly what we ended up doing. So, um, Argo CD did not exist when we started, and Flux was early and didn't really fit what we were looking for. So we wrote our own. Um, this is a theme: we wrote our own a lot, even though we hate writing our own and would rather
00:57:22
Speaker
benefit from the ecosystem. We run our own. It's a super simple, lightweight Go app that just does exactly what I described. It pulls its location's repo, applies the changes, feeds back. And then on the cloud side of that, we have some tooling that effectively helps teams take a change that they want to make. So let's say application A, they want to roll it out to a group of stores. They can actually set up like a schedule, effectively, of like,
00:57:48
Speaker
Go to these hundred, then these 300, then these 500, and kind of, like, phase their, uh, rollout. And so they can define groups, um, and then take that change, the YAML change that they need to make for their app, like, write it once. And then we, you know, we stamp copies of it out into all those Git repos for them.
00:58:03
Speaker
And as that happens, it gets rolled out. And then if there's feedback that these are failing, that something's wrong, we have some built-in canary processes. So if we see a failure rate of a certain percentage that they define, then it'll stop the rollout and it won't template it out in any more repos and nobody else will apply it. They can go figure out the root cause, fix it, repeat.
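The phased rollout with a canary gate described here can be sketched as a small decision function. This is a hypothetical Python illustration (the actual tooling is a custom Go app); the phase groups, result reporting, and threshold handling are assumptions, not the real implementation.

```python
def plan_rollout_step(phases, stamped, results, max_failure_rate=0.05):
    """Decide the next action for a phased, canaried rollout.

    phases: list of store-ID groups, in rollout order (e.g. 100, then 300...).
    stamped: set of phase indexes whose repos already had the YAML stamped in.
    results: store ID -> "ok" | "failed", reported back by each store's
             pull agent after it applied the change.
    Returns ("release", i) to stamp phase i, ("wait", i) while phase i is
    still applying, ("halt", i) if the canary threshold tripped in phase i,
    or ("done", None) once every phase is out and healthy.
    """
    for i, stores in enumerate(phases):
        if i not in stamped:
            return ("release", i)   # template the change into this group's repos
        reported = [results.get(s) for s in stores]
        if any(r is None for r in reported):
            return ("wait", i)      # agents are still pulling and applying
        if reported.count("failed") / len(stores) > max_failure_rate:
            return ("halt", i)      # stop stamping; root-cause, fix, repeat
    return ("done", None)
```

Because the agents pull rather than being pushed to, halting simply means no further repos get the change stamped in; stores that already applied a bad change surface through the same feedback channel.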
00:58:23
Speaker
So you have a super repo that's stamped out to a bunch of smaller repos. Effectively, yeah. That part's meta. It doesn't really exist, but everybody's effectively converging on a gold standard over time. Gotcha. We let people basically deal with it.
00:58:43
Speaker
their, their version of an app. So I think of every app has a folder, uh, in the restaurant repo. Yeah. So what, what varies between, I mean, all these restaurants, um, what varies between the restaurants where you get benefits from having an individual repo per
00:58:58
Speaker
Yeah. So over the long term, nothing. Um, but in the short to medium term, it could be any number of things. So generally, um, you know, like, uh, in a cloud world, we would say it's probably not a good thing if you're not releasing, uh, and getting things into production, you know, with really small batches, small features every day, preferably many times a day. Right. Um, we, we love that paradigm, but when it comes to restaurants and some of the systems that could be involved,
00:59:24
Speaker
Sometimes the solutions that use this stack, the business applications, may require, like, a hardware installation that goes with them, or, like, point-of-sale related stuff may be too high risk to change everywhere at once across all stores. You know, just in the event that something went wrong, like, that's our, that's our revenue channel.
00:59:45
Speaker
Um, so there's a lot of things that have varying cadences of rollout. Like some people may hit the whole chain in a day with their change, um, if they're in a good steady state, and some people may be in an initial rollout and it may take them three months to do that. So, uh, on any given day, any given store could look different than another, um, you know, within the, uh, the list of possible apps, or maybe have something brand new that others don't have yet, but over time they would all converge on,
01:00:12
Speaker
uh, effectively a gold standard. And we have some stuff to kind of go audit and make sure that there's nobody who's been missed and is lingering and is drifting, um, from a config perspective, which is fairly easy. Like you're just ripping through Git repos and comparing stuff. So, um, Git also made it really easy to write a lot of that tooling. So that's all in a, um, a custom, not custom, but, uh,
01:00:32
Speaker
self-hosted GitLab that we run in that Kubernetes cluster on the cloud side. So nice. Nice. Yeah. I could imagine, you know, you get a new, uh, IoT fryer that has ChatGPT make your fries for you. That can't be installed in every one of your restaurants at the same time, you know, let alone the tech stack itself. Um, you know, I don't know. Maybe that is a future fryer that we can work on. We don't have that yet. Yeah. We could, we could definitely work on that. Well, uh,
01:01:01
Speaker
But speaking of the platform itself in terms of OS versions and the actual K3s versions and things like that, that's all handled by the GitOps process as well? That one is not GitOps. Great question. Yeah, so anything inside the cluster is handled by GitOps. So it assumes that the cluster exists. And Vessel is the name of the app. I'll probably end up saying it. Vessel runs in the cluster.
01:01:30
Speaker
its paradigm, its context as it lives in Kubernetes. So everything other than that is handled through that process I described before that allows us to like wipe and recreate the cluster. So if you're familiar with like the cloud init model that Ubuntu uses, it's very similar to that. So effectively,
01:01:49
Speaker
When a node comes online and checks in, it talks to a service called Hams. Here's another metadata service. One of the guys on the team thought that was a fun name. So yeah, you check in with Hams, and then Hams basically is, like, the controller that says, do you need to, like, wipe yourself back to a clean slate again or not? And if not, it basically dispenses a script to you, which is your thing to run to set yourself up. So basically it points you to it, you download it,
01:02:19
Speaker
it's S3-based. So essentially we could have any number of those going. Usually there's just one version at any given time, but if we were rolling out operating system patching, which should be resilient through wipes and recreates, or things like that, we can basically update that instruction set with new things. That could also include update K3s to the latest version, or things like that.
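The check-in flow, where a node phones home and the metadata service decides whether to wipe it or hand it the current bootstrap script, might look roughly like this controller-side decision. This is a hypothetical Python sketch; the field names, actions, and URL are assumptions for illustration, not the real service's API.

```python
def check_in_response(node, desired):
    """Controller-side decision when an edge node phones home.

    node: what the node reports, e.g. {"healthy": True, "script_version": 3}.
    desired: current bootstrap target, e.g. {"script_version": 4,
             "script_url": "https://example/bootstrap.sh"}  # S3-hosted script
    Mirrors a cloud-init style flow: wipe an unhealthy node back to a clean
    slate, hand a stale node the current setup script (which may apply OS
    patches or upgrade K3s), otherwise do nothing.
    """
    if not node.get("healthy", False):
        return {"action": "wipe"}
    if node.get("script_version") != desired["script_version"]:
        return {"action": "bootstrap", "script_url": desired["script_url"]}
    return {"action": "noop"}
```

Keeping the bootstrap script idempotent is what makes "wipe and re-run" a safe recovery path: the same instruction set rebuilds the node from scratch every time.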
01:02:46
Speaker
So that facilitates all of that process. The only thing that doesn't cover would be, like, a foundational base operating system upgrade. And our hunch turned out right. We thought with a long-term support edition of Ubuntu, we would be able to just continue to apply security patches and get to a hardware refresh before a major OS upgrade was needed. So we had a plan to do it, but we haven't had to do it.
01:03:11
Speaker
So we'll be doing that through hardware refresh. Gotcha. That's awesome. OK, so Brian, we spoke a lot about how things look like at the edge, and then we spoke about how developers are pushing things. But from an operations perspective, or from a control center perspective, how are you managing these 2,800 plus locations? What does the, I don't know, the control room look like?
01:03:36
Speaker
Yeah. Um, it looks like everybody's at home working on their laptop. There's no, like, overlord NOC anywhere, unfortunately. Um, it'd kind of be cool, but we have a couple, uh, facilities. So, um, I'll start on the technology side. So, um, one of the things I didn't mention that's in the clusters is Prometheus. So we're collecting metrics, uh, from all the apps that have a /metrics endpoint,
01:04:05
Speaker
and pulling all those, as well as, ultimately, pulling things via Vector from the Kubernetes log API. So basically grabbing all the operational telemetry stuff, both for the platform team as well as the customers of the platform, and then using Vector to ship that stuff out. The out is actually Vector again in our Kubernetes cluster on the cloud side.
01:04:26
Speaker
And then we do some fan, some tagging-based fan-out. So in our environment, we have a number of different application teams that have different preferences in terms of observability tooling. So the core platform team does most of their stuff with Grafana; they have a little bit of Datadog mixed in as well. But we have other teams that use only Datadog, or some CloudWatch here and there.
01:04:48
Speaker
So we wanted to support not forcing them into a particular tool to get their edge telemetry. So Vector fans those things out into those different teams' tools of preference for observability purposes. So the platform team, like I said, mostly Grafana, we get a lot of the telemetry data there. And then in addition to what we get from Vector and Prometheus from the edge, we also have some applications.
01:05:17
Speaker
They are apps, but they're really monitoring apps. So we have like a synthetic client that runs at the edge in each store that exercises a lot of our local services.
01:05:25
Speaker
So it'll exercise MQTT, local auth, database, things like that, and make sure that it's getting a good experience from those, that it can connect, and that gives us another data point or set of data points that we send up and aggregate in with that other operational telemetry to really get a picture of whether a store is healthy or not. And then we've got some stuff that comes from our network stack from an API perspective, so that, like, if a store just disappears, like, did the cluster go down? Is the hardware dead? Is the power out? Is it an internet outage?
01:05:56
Speaker
that, you know, something else we don't necessarily know from afar. So we have to pull in a lot of different data from different places to get a best effort picture. And then we have some support partnerships, our internal help desk that serves restaurants, being one of them. We also have a third party partner who we packaged up a lot of the
01:06:14
Speaker
basic, um, support functions and gave them to them. The main one of those is: is something wrong? Wipe the node, see if that fixes it. Um, we basically packaged that all up so we can send an incident over to them and have them do that, so our team could stay more focused on the, uh, engineering tasks, uh, and continuing to evolve the platform and things like that.
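The synthetic client described above, which exercises local services and rolls the results into a store-health datapoint, can be sketched like this. A hypothetical Python illustration; the probe names and the aggregation are assumptions, not the real client.

```python
def run_synthetic_checks(probes):
    """Run named probes against local services (MQTT broker, local auth,
    database, ...) and roll the results into one store-health datapoint
    to ship up with the rest of the operational telemetry.

    probes: name -> zero-argument callable that raises on failure.
    """
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    # One rolled-up signal: the store counts as healthy only if every
    # local service gave the synthetic client a good experience.
    healthy = all(v == "ok" for v in results.values())
    results["store_healthy"] = healthy
    return results
```

Shipping this alongside metrics and logs gives the cloud side an end-to-end signal, not just per-component health, which helps answer "is the store okay?" from afar.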
01:06:34
Speaker
Yeah, makes sense. And I know, obviously, the growth of the restaurant really skyrocketed in the last five years or so. But do you think changing this architecture has positively or negatively impacted the number of people that go into managing a restaurant, let's say?
01:06:53
Speaker
Uh, it probably hasn't had an impact on the, the number of people to run the restaurant, to be honest. Um, I hope it has made the job of a lot of those people a little bit easier. Um, so some of the, the assistance things, um, we haven't, I mean, I thought we would get into, if you asked me five years ago, if I thought we'd be doing any automated cooking stuff at all today, I would have said yes. Um, but we're actually not doing that yet. Um, some of those have proven to be just more business challenges, I would say than anything else. Like I think it's technically possible, but there's a lot to figure out. So, um,
01:07:23
Speaker
Yeah, we haven't done that. So we haven't really like reduced the need for humans and that's not really our primary goal. It's really make the experience of the humans who are there and who are working as frictionless and easy and comfortable as possible. And I think we've made some strides in that direction, but still have a long way to go in terms of additional things we can do to make the role easier and maybe tackle some of the new problems that are emerging, a lot of supply chain stuff. We really would love to get a much better
01:07:51
Speaker
call it observability picture of the restaurant. I don't mean systems observability, but really like business observability, like, you know, everything about the metrics that happen for our drive-through or, you know, time people spend in queues and all that kind of stuff.

Future of Edge Computing at Chick-fil-A

01:08:05
Speaker
We talked about the fly, that fly is flying right in front of the camera. He's pissing you off towards the end here. I know, he's harassing me. Doing a great job. I'm staying focused, but yeah, I think it's helped, but I honestly think it's sort of, I'm gonna steal the Amazon idea, like it's day one. I think it's still day one for us from an edge perspective and there's still a lot more to do and a lot of new challenges to solve where I think it's gonna really help us out. So yeah, it's a good start and we're gonna keep building on top of it.
01:08:35
Speaker
Cool, cool. Well, before we get some information about where people can learn more about what you're up to and some other things, we do always have a segment of the show where we ask ChatGPT something about the topic of the show. We let our guests either choose to answer it or let ChatGPT answer it itself. So the question we asked this time around was: if you had to build a new ad for Chick-fil-A, like the famous cow ones, what would it be?
01:09:03
Speaker
Right. Let's hear, I want to hear ChatGPT's answer. All right. So, uh, ChatGPT's answer was: creating a new ad for Chick-fil-A would require careful consideration of their brand values and messaging. And while I can't produce visual content, I can provide you with a script for a potential ad. So the title for this script is "Chick-fil-A: Where Flavor Meets Community." Okay.
01:09:31
Speaker
It says opening shot for a Chick-fil-A restaurant with families and friends and employees interacting happily. The voiceover comes over and says, Chick-fil-A, we believe in more than just great food, we believe in community. And then it cuts in close to a close up of a Chick-fil-A employee serving and smiling to a customer. The voiceover also says, our team members aren't just employees, they're your neighbors, your friends, and your biggest fans.
01:09:56
Speaker
And then it goes on to say, it shows various scenes of Chick-fil-A employees engaging with customers, such as high-fiving kids, assisting elderly customers, and sharing smiles. VoiceOver again comes in and says, we're not just a restaurant, we're a gathering place where people come together. And it goes through a bunch more. I'm going to skip some.
01:10:18
Speaker
But then it cuts to a shot of the iconic Chick-fil-A cow mascots. And cow one says, eat more chicken. And cow two says, holding a sign, share more love. Anyway, you get the concept. It actually goes on quite some time. And we can share it in the document in the show notes. But I'll go with the ending here. It says, this concept emphasizes Chick-fil-A's commitment to community, family service, and unique charm of their brand. It combines the elements of great food and a welcoming atmosphere.
01:10:47
Speaker
I think it pulled from the right sources if you ask it. Yeah. That's actually pretty good. This LLM stuff might be something after all. It hit some good elements. I think that's good. It maybe took the cheesy factor to the extreme, but. Yeah, exactly. Yeah. I mean, not a bad first try. Nicely done. This is 3.5 too, so, you know, brother. Yeah.
01:11:14
Speaker
Cool. Well, let's get some more information about any links that you want to talk about, share, we'll put in the description, any blogs, that kind of thing. Where can people find you? Will you be at KubeCon? Anything like that? Yeah, sure. So for Chick-fil-A, in terms of what's going on from a tech perspective, we're on Medium. It's medium.com slash chick-fil-atech, I believe. So people can find that in the notes, I think.
01:11:42
Speaker
Yeah, a lot of stories there about both things that we do from an architecture team perspective as well as stuff others are doing across our engineering community on a bunch of other projects. That's always fun. Definitely recommend that. People can connect with me on LinkedIn. Brian Chambers Chick-fil-A should find me. I'm not in a bike helmet on there anymore like I used to be.
01:12:02
Speaker
That's a baseball hat. Keeping it informal on LinkedIn. So you can find me there as well. And then I do some writing just for fun about tech on my Substack, which is brianchambers.substack.com. The name of it is The Chamber of Tech Secrets. Thanks to my LinkedIn audience who came up with it and voted for it. So that's fun. I'm pretty good about writing every week. Haven't done it yet this week.
01:12:28
Speaker
So those are probably the main places. And yeah, I will be at KubeCon, at least that's the plan. And I'm looking forward to connecting with a bunch of folks there. It should be a lot of fun. How about you guys?

Upcoming Events and Audience Engagement

01:12:38
Speaker
Yeah, we'll both be there at KubeCon. I believe, Bhavin, that's correct. Yeah, that's true. And what we usually do is actually we bring to our North American ones, we bring all of our podcast stuff with us. We grab people for 15 minutes. So we'll try to do that with you. And we make sort of a live-from-KubeCon episode. It'll be a lot of fun. Yeah, that'll be awesome.
01:12:58
Speaker
Sweet. Well, Brian, it was a pleasure. I think Bhavin and I mentioned this before we got on the call: we could probably talk about this for another two hours. I know you talk about it a lot. So we're very thankful and appreciate you coming out and talking to us and the audience. Yeah, guys, it's my pleasure. Really appreciate the invite and enjoyed the conversation. Thanks for all the great questions. And yeah, I had a lot of fun. Thanks, Brian. Thanks, y'all.
01:13:20
Speaker
Okay, Bhavin. That was a conversation that I feel like we could have had go on another two or three hours. I think there's a lot to unpack with what Brian and Chick-fil-A and everyone over there that is part of this kind of rollout and solution. Really interesting stuff. I mean, for me, just the real-world scale of managing the nearly 3,000 stores
01:13:42
Speaker
I think was a really interesting point. I know we brought up how they manage state, which is, again, not necessarily what we've been used to and talk about on the show, meaning that they have Postgres there, he said, but they were like, yeah, the data doesn't matter after an hour or 10 minutes or whatever it may be, right? Tens of minutes.
01:14:00
Speaker
which is a concept that I think goes against the grain in my own brain a little bit there, but very interesting kind of point to, yeah, we support it and we can do it even in these small form factor clusters. Speaking of form factors, they chose a very small piece of hardware on purpose. I know they might refresh that or change that in the future, who knows? But that was a very,
01:14:25
Speaker
kind of cool point, the fact that they're kind of hiring sort of installers to go out there and put these things in. They kind of come online, call home, and they take it from there. They didn't have to have people managing Kubernetes clusters in their stores, which, darn, I was really hoping to try that job.
01:14:41
Speaker
Finally, a fast food service job, like, managing Kubernetes. Yeah, exactly. I think the other thing is really around sort of edge and GitOps that surprised me, right? They were really early adopters of a lot of this technology. And, you know, with GitOps, he mentioned they kind of stumbled into it and they kind of built their own GitOps tool before, you know, the Argo CDs and everything
01:15:05
Speaker
and Flux came out, because they knew they had that need and kind of tackled that. And the same goes with edge, right? He talked about them sort of adopting fog computing, which I think was actually a little bit of a new term to me, but it totally makes sense, right? Fog at the edge. And I know he didn't like the term, but I thought it was kind of interesting. But it just shows, I think,
01:15:29
Speaker
to be where they are, they really had to get in early. And they even mentioned, you know, some of their initial sort of POCs or stores use swarm and stuff, but they clearly saw kind of Kubernetes being the kind of go forward plan. So yeah, lots, lots of interesting. What did you get? Lots of goodness, right? Like, man,
01:15:50
Speaker
just building this out and then scaling it to 3,000 stores, as you said, or 2,800 stores. Yeah, that's just crazy. Like, wow, I don't know how he manages that with, like, a single-digit-person platform team. I know they work with vendors that handle support and things like that, but still, that's a lot of responsibility and seeing this in production. I really like the fact that they have thought about resiliency and reliability at each layer. I know, as you said, they're not doing traditional data on Kubernetes, like short-term data and then shipping
01:16:20
Speaker
If that is the thing, yeah. Yeah. But like, oh, three nodes, two out of the three nodes can go offline, they'll still be up and running, application level resiliency, even network level resiliency, right? Like if the internet connection goes down, they have a failback or failover plan where they have like an LTE connection that can send the important bits of information, maybe the credit card transaction that the stores back to the main location. So I think, yeah, obviously,
01:16:47
Speaker
when doing things in the real world, it takes a lot of thought. Clearly, Brian and team have put that into it. My second takeaway was good for us, dude. I'm excited that we had the MLB episode where we spoke about how Kubernetes or Google Cloud Anthos in their case was being deployed at the edge at each of these stadiums.
01:17:07
Speaker
with 2,800 stores, how K3S clusters are being deployed. I'm more excited to get more and more of these end-user customer stories on our pod and share that with our listeners. I think I'll ask that as a favor. If you are a listener of Kubernetes Bites and you're using Kubernetes in production, DevTest, or figuring out how to use it as part of your day job, or if you know of somebody that we can talk to, please ping us. These are interesting episodes for us to record as well.
01:17:34
Speaker
Yeah. Yeah, I think the end user use cases are really impactful. Yeah. Because a lot of us are working on problems associated with our day jobs, and we can't really share all the details, but the kind of public knowledge sort of use case and real world use cases that we can relate to, I think is the key point there, are really kind of powerful to show you, right, you know, everything you're working on has, you know, a real impact and there's a lot going on there. Yep.
01:18:03
Speaker
Yeah, you don't have to do it at 2,800 clusters. If it's just one cluster for your one team, and if you have an interesting use case of how this is impactful, yeah, we'd love to have you on the pod. So fingers crossed. Absolutely. And you're right, both of these were sort of edge use cases. Yeah. We'll hopefully branch out from edge and mix it up a little bit across
01:18:25
Speaker
some new episodes. Fun episode with Brian, really exciting. Don't forget to join our Slack. Like and subscribe, leave us reviews on YouTube or Apple Podcasts, wherever you can. Also, if you're going to be at KubeCon Chicago, or even DevOps Days Boston, come say hi to us. We may or may not have all of our podcast equipment in both of those places. We'll try to get it
01:18:50
Speaker
to Chicago, hopefully. And without fail, we'll be able to interview some folks there as well. So I think that brings us to the end of today's episode, Bhavin. I'm Ryan. I'm Bhavin. And thanks for joining another episode of Kubernetes Bites. Thank you for listening to the Kubernetes Bites podcast.