CI/CD beyond YAML: The Evolution Towards Pipelines-as-Code


Summary

Conor Barber explores the evolution of infrastructure, focusing on the shift from YAML configurations to pipelines-as-code, covering modern CI/CD systems like GitHub Actions, GitLab, and CircleCI. This talk was recorded at QCon San Francisco 2023.

Bio

Conor Barber is a software engineer at Airbyte, bringing over a decade of experience in data and infrastructure engineering from leading tech companies, including Apple. At Airbyte, he develops scalable solutions for the complex CI/CD processes involved in managing hundreds of connectors and the ELT platform that the connectors run on.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Barber: We're going to talk about Bart and Lisa. Bart and Lisa can be anybody in the software world who's concerned with shipping software. Typically, this might be your engineer or platform engineer, or just a startup founder. They've been hard at work making some awesome software, and they've got a prototype. They're ready, let's go, let's ship it. We want to build and test and deploy that software. We're going to talk about the stories of Bart and Lisa in that context. Let's start with Bart's story. On day 1, Bart has created an awesome service. We're going to talk a lot about GitHub Actions. We use GitHub Actions at Airbyte, but this talk is broadly applicable to many other CI/CD systems: GitLab, Jenkins. Bart's created a back office service, and he's got CI for it. He's got a little bit of YAML he created, and he was able to get up and running really quickly. He's got this back office service: it's built, it's tested, it's deployed. Really quick, really awesome. That's day 1. Day 11, Bart's just been cranking out code, he's doing great. He's got a back office service. He's got a cart service. He's got a config service. He's got an inventory service. He's got a lot of different things going on. He's got several different YAML files now. He's got one to build his back office. He needs to talk to the config service. It's starting to pile up, but it's still manageable. He can manage with these. This is day 101. Bart's got quite a few services now. He has too many things going on, he doesn't really know what to do. Every time that he makes a change, there are just all these different things going on, and he just doesn't feel like he can handle it anymore. What about Lisa? Lisa decided to do things a little bit differently. What Lisa has done is the subject of our talk. The title of our talk is CI/CD beyond YAML, or, how I learned to stop worrying about holding all these YAML files and scale.

I am a senior software engineer. I work on infrastructure at Airbyte. Airbyte is an ELT platform. It is concerned with helping people move data from disparate sources such as Stripe, Facebook, Google Ads, and Snap into their data lakehouses, data warehouses, whatnot. Why CI/CD is important for us at Airbyte is because part of Airbyte's draw is that there are a lot of different connectors, as we call them: a Stripe connector, a Google Ads connector. For each one of these, we build and test and deploy these as containers. The more of them we have, the heavier the load on CI/CD that we have to deal with. I became quite interested in this problem when I came to Airbyte, because at the beginning, it seemed almost intractable. There were a lot of different moving gears, and it seemed like there was just no way that we could get a handle on it. I want to tell you the story of how we managed to segment out this problem in such a way that we feel we can scale our connector situation, and the building of our platform at Airbyte as well.

Roadmap

We'll start by going through what YAML looks like, at the very beginning of a company, or just in general of a software project, and how it can evolve over time into this day 101 scenario. Then what we'll do is we'll break down some abstract concepts of CI/CD. Maybe not some perfect abstractions, but just some ways to get us thinking about the CI/CD problem, and how to think about how to design these systems that may not involve YAML. Then we'll go into what this actually looks like in code. Then, lastly, we'll go over the takeaways of what we found.

Airbyte was Bart

Airbyte was Bart. We had 7,000 lines of YAML across two repositories. We have a quasi-monorepo setup at Airbyte. It was a rat's nest of untestable shell scripts. For those of you who are familiar with these systems, you may understand: it's pretty easy to just inline a shell script, whether it be Jenkins where you can just paste it into the window, or you can just throw it into your YAML, or you can call it from YAML as we'll see. There was a lot of subjective developer pain. We'd have a lot of people coming and complaining that our CI/CD was not great, it was not fun to use, it slowed everything down, and it was expensive. $5,000 a month may not be a lot for a lot of people in different organizations, but for us, at the size of our organization when we first approached this problem, which was about 40 to 50 engineers strong, it was a large bill. That gave us some motivation to really take a hard look at this problem. Then, 45 minutes of wall clock time: again, for some organizations, maybe this is not a lot. I think infamously it took 5 hours or something to build Chrome at one point. For us, having a fast feedback loop was important for us to be able to ship and iterate quickly. We looked at this as one thing that we wanted to improve.

YAML Day 1, YAML Day 101

Let's jump into day 1 versus day 101. This is how YAML can start off great and easy, and get out of control after a little while. You start off with a really simple workflow. You've just got one thing, and it's one file, and it's doing one thing, and it works great. You built it, and you were able to get it up and running really quickly. It's simple and straightforward. It's YAML. It's easy to read. There's no indirection or anything. By day 101, because of the way that we've built out these actions and these workflows (I'm using GitHub Actions terminology here, but it's applicable to other systems as well), you've got YAML calling other YAML. Suddenly, you've got a stack. You've got a call stack in YAML, and you don't have jump to definition. It's difficult to work as a software engineer when you don't have jump to definition. You're afforded only the rudimentary tools of browser windows and jumping back and forth, and file names, and it's just a cumbersome experience. This is another example from GitHub Actions, again. It's got these units of code that are pretty great for certain things. This is one we use at Airbyte. On the left, you'll see test-reporter. What this does is it parses JUnit files and uploads them somewhere. It's a great piece of code. The problem with this is that there's a lot of complexity in this GitHub Action. There are probably 4,000 or 5,000 lines of Python in just this GitHub Action, parsing the different coverage reports and such. When you have a problem that's within the action, you have to jump into the action itself. You've got this black box. Again, when there's only one, there's no problem. When it gets to day 101, and you have several hundred of these, suddenly you've got a system where you don't really know all the third-party integrations that are around, and the tooling is not there to really tell you. There's just a bunch of black boxes of things that can break for no apparent reason. One very typical scenario we end up with at Airbyte is that some breaking change gets pushed to one of these actions, and it's not pinned properly, and you don't really have the tools or mechanisms to find that out except after the fact. Your CI/CD is just broken.

Another example. On the left here on day 1, we've got a very straightforward thing that is very realistic. This is what your first Node app build might look like. Again, it's easy to grok what's going on here. We're just installing some dependencies. We're running some tests. We're building a production version of this app. We're deploying it to stage. We're deploying it to prod. Very easy, very straightforward to understand. By day 101, when you're doing a bunch of these, in this example here we've got 20-plus, it's really hard to understand. Again, you can't really isolate these things. You can't really test whether deploy to staging is working with these YAML scenarios. We don't even know what's going to run anymore when we make a change. If you touch a single file in your repo, let's just say it's a single JavaScript file somewhere, how many of these on the right are even going to run, and where, and why? The tooling to surface this information for you is not there yet. One final example here. We have some simple Boolean logic examples on the left here. This is a very simple example where we just check and see if a JavaScript file was modified, and then we run the CI. On the right, we've got some actual Airbyte code that ran for a long time, to figure out whether or not we should run the deployment. You can see here, what we're trying to do is express some complicated Boolean logic concepts in YAML. We end up being limited by the constructs of what this DSL is offering us. It can get very complicated and very hard to understand.
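To make that concrete, here is a minimal sketch of what that kind of gating logic could look like as ordinary, unit-testable code instead of a YAML expression. The path rules and function name are hypothetical, not Airbyte's actual logic:

```python
from pathlib import PurePosixPath

def should_run_js_ci(changed_files: list[str]) -> bool:
    """Run the JavaScript CI only when relevant files were modified."""
    relevant_suffixes = {".js", ".jsx", ".ts", ".tsx"}
    for name in changed_files:
        path = PurePosixPath(name)
        if path.suffix in relevant_suffixes or path.name == "package.json":
            return True
    return False

# Plain unit tests -- something a YAML `if:` expression never gets.
assert should_run_js_ci(["src/app.ts", "README.md"]) is True
assert should_run_js_ci(["docs/intro.md"]) is False
```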

Pulling some of these ideas together: why is this becoming painful? In general, we have a lack of modularity. We don't have an easy way to isolate these components. We're limited to what YAML is doing, and we're limited to what the specific DSL is doing. They're not unit testable, so we don't really have a way of detecting regressions before we throw them up there. All we can do is run them against this remote system. We have some ecosystem-related flaws. Everybody has their own quasi-DSL. If I take GitHub Actions code and I want to run it in CircleCI, I can't just do that. I have to port it. Maybe some concepts don't exist in CircleCI, maybe some concepts don't exist in GitLab. It's not portable. It's intractable. It's this proprietary system where you don't really see all of the source code of what's going on, and there are these settings and configuration that are magical that come in. It's hard to emulate locally. The point I'll make here, that I think is key when we start thinking about CI/CD systems, is you want to be able to iterate quickly. This ripples out in so many different ways. I'm working in a platform engineering role right now. I want the developers I'm working for to adopt tooling, to be excited about working on tooling. When they can't work with the tools that they have locally, it's a very bad developer experience for them. There is a tool for GitHub Actions specifically that gets brought up about this, it's called act. act works. It's better than nothing. It has some limitations. Those limitations segue back into the fact that it's a proprietary system that it's trying to emulate, so, A, it's always going to be behind the latest changes. B, there are just some things that don't translate very well when you're running things locally.

CI/CD Breakdown

With that said, we've talked enough about pain points, let's break down what CI/CD is, and some maybe not perfect abstractions, but just some high-level system layers to think about. It's like, what is this YAML actually doing? We have this giant pile of YAML we've talked about, what is it actually doing? We've chosen to break this out into what I'm calling layers here. These are the six different roles that a system like GitHub Actions, or a system like GitLab, or a system like Circle is providing for you with all this YAML. It goes back to what Justin was saying about the complexity of CI/CD. There's a lot of stuff buried in here. When somebody talks about GitHub Actions, they could be talking about any one of these things. They could be talking about all of these things, or a subsection of these things. Pulling these out into some sort of layer helps us to make this into a more tractable problem, instead of just being able to say, it's complicated.

What are we calling an event trigger? This is how CI jobs are spawned. We're just responding to an event somewhere. Your CI job has to run somehow. This is pushing to main. This is commenting on a pull request. This is creating a pull request. This is the thing that tells your CI/CD system to do something. You can also think of this as the entry point to your application. Every CI/CD has this. One thing I want to call out, and we'll come back to later, is that this is going to be very CI/CD platform specific. What I mean by that is that what GitHub calls an issue is not the same thing as what GitLab calls an issue, which is not the same thing as Jenkins' equivalent abstraction. There are all these terms, and there's no cohesive model where you can say, a GitHub Issue translates to a GitLab issue. Some of these concepts exist and mostly match each other, like a pull request and a merge request, but not all of them. Then, at the bottom here, I've added a rough estimate of how hard this was for us when we tackled these layers at Airbyte, so I added this 5% number here. This is an arbitrary number. I want to use it to highlight where we put the most effort and struggle as we were figuring out the lay of the land on this. We put 5% here.

Authentication. You need to talk to remote systems in a CI/CD. You often need some secret injection. You're usually logging in somewhere. You might be doing a deployment. You might be uploading to some bucket. You might be doing some logging. Your system needs to know about this. This is important because if we're going to design something that's perhaps a little bit more locally faced, we need to have that in our mind when we think about secrets, how they're going to be handled on the local developer's machine, how we're going to do secret stores, and the like. The orchestrator, this is a big part. This is where we tell the executor what to run and where. Some examples are, I want to run some job on a machine. This is what jobs and steps are in GitHub Actions. This is also state management. You have the state passed. You have the state completed. You have the state failed. You have the state skipped. It's usually easy to think about this as a DAG. Some workflow orchestration is happening in your CI/CD, where you've broken out things into steps, into jobs. Then you're doing things based on the results. A very simple example here, again, GitHub Actions, is we've got build_job_1, we've got build_job_2, and then we have a test job that depends on build_job_1 and build_job_2 succeeding. We've got both parallelization and state management in this example.
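As a point of comparison, the same fan-out-then-join DAG is easy to express in ordinary code. A minimal illustrative sketch using plain asyncio, with job names mirroring the example above:

```python
import asyncio

async def build_job_1() -> None:
    print("building component 1")

async def build_job_2() -> None:
    print("building component 2")

async def test() -> None:
    print("running tests")

async def pipeline() -> None:
    # build_job_1 and build_job_2 run in parallel; if either fails,
    # gather raises and test never starts -- the same state management
    # a CI orchestrator gives you with job dependencies.
    await asyncio.gather(build_job_1(), build_job_2())
    await test()

asyncio.run(pipeline())
```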

Caching, this is typically tied to your execution layer. I'm saving nearly the best for last. It's hard. I think that most people who are engineers can resonate with that. It's very difficult to get right. It's important that you choose the right tooling and the right approach to do this. There are two broad categories that you deal with in CI/CD. One is called the build cache. In any good CI/CD system, you want your system to act the same way if you run it over again with the same inputs. Idempotent is the fancy term for this. If your CI system works like that, and you know what's going in, and you know the output already, because you built it an hour ago, or a week ago, then you don't even have to build it. That's the concept of the build cache. Then, the second part of this is the dependency cache. This is your typical package manager: npm, PyPI, Maven, whatnot. This is just the concept of: instead of downloading from PyPI over and over again, why don't we keep the dependencies either on the same machine, so that we're not redownloading them, or close by, in the same data center, so that we can shuffle things in there quicker and not rely on a third-party system. This is about 10% overall. When you think about it overall, for us, it was about 10% of the overall complexity.
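To illustrate the build cache idea (only the idea; this is not how BuildKit implements it), here is a toy sketch: hash the inputs, and if an artifact already exists for that digest, skip the work entirely:

```python
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".build-cache")

def input_digest(sources: list[Path]) -> str:
    """Identical inputs always produce an identical digest."""
    h = hashlib.sha256()
    for src in sorted(sources):
        h.update(src.read_bytes())
    return h.hexdigest()

def build_with_cache(sources: list[Path],
                     do_build: Callable[[list[Path]], bytes]) -> Path:
    """Return the cached artifact if inputs are unchanged, otherwise rebuild."""
    CACHE_DIR.mkdir(exist_ok=True)
    artifact = CACHE_DIR / input_digest(sources)
    if artifact.exists():                       # cache hit: reuse previous output
        return artifact
    artifact.write_bytes(do_build(sources))     # cache miss: do the work once
    return artifact
```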

The executor, this is the meat of your CI/CD. This is actually running your build, your test, your deploy. This is what it looks like in a very simple example in GitHub Actions, or Circle. You're just executing a shell script here. That shell script may be doing any arbitrary number of things. This is a great example because it actually highlights how you can hide very important logic away in a shell script, and you don't really have a good way of jumping into it. A couple of other examples: you might run a containerized build with BuildKit. That's another way to run things. You can delegate to an entirely different task executor via a command line call, like Gradle. For those of you who are in the JVM world, you're probably familiar with this tool. This is a very large part of your CI/CD system. We'll go into what we want to think about with this a little bit later. Lastly, there are reusable modules. I put this in a layer by itself, because it's kind of the knot that ties the other ones together. It's the aspect of reusability that you need, because everybody is going to do common things. Why should we do the same work as somebody else for a common thing? For instance, an example I gave here is that everybody probably needs to upload to an S3 bucket. We have an action that you can just go grab off the marketplace, and it's great. You can just upload and you don't have to maintain your own code. Some other examples: this is the GitHub Actions Marketplace. Jenkins has this concept of plugins, which isn't exactly like the marketplace, but it's a similar concept. Then GitLab has some integrations as well that offer this. This is about 10% of the overall complexity of what to think about.

The Lisa Approach: CI/CD as Code

We managed to break it down. We picked some terms and we put things into some boxes so that we can think about CI/CD piece by piece and pick the right approach for each piece. Now we're going to talk about the Lisa approach. What is the Lisa approach? The Lisa approach is to think about CI/CD as something that we don't want to use YAML for. Something that solves these problems of scalability. Something that is actually fun to work on. Something where you don't have to push and pray. Anybody who's worked on GitHub Actions or Circle or Jenkins knows this feedback loop, where you're pushing to a pull request, commit by commit, over and over, just to get this one little CI/CD thing to pass. Now you've got 30, 40, 50 commits, and they're all little one-line changes in your YAML file. Nobody wants that. What if we had a world without that? How would we even go about doing it? The first thing to think about here, tying it back to what we were talking about before, is that we want to think local first. It comes back to the reasons that we mentioned before: developers want it more. We get faster feedback loops if we're not having to do the push and pray. As we mentioned earlier, it got pretty pricey for Airbyte to be doing CI/CD over and over again. If we can do those feedback loops locally on a developer's machine, it's much cheaper than renting machines from the cloud to do it. We get better debugging. We can actually step through code, which to me is a fundamental tool that every developer needs in a proper development environment. If they cannot step through code, then in my opinion, you don't really have a development environment. Then, you get better modularity. You get to extend things and work on things, and, going back to our key tool earlier, jump to definition.

One of the key design aspects that I wanted to call out, that we learned when we were thinking about this at Airbyte, is that you need to start from the bottom layer if you're going to design a CI/CD system. The reason for that is because the caching and the execution layer are so critical to the core functionality of what your CI/CD is doing that everything else, layer by layer on up, needs to be informed by that. Just to give you some examples, I mentioned Gradle earlier. If you were going to build your reusable, modular CI/CD system on Gradle, you would have to make certain design decisions about the runner infrastructure that you are going to run on, and about the reusable modules. Are you going to use Gradle plugins? All of that stuff is informed by this bottom layer. That answer is going to be different than if you, like we did, build it on BuildKit. There are different design decisions, different concerns, and they're all informed by whether you use Gradle, Bazel, or BuildKit, which are the three that we're going to talk about.

The last key point that I want to make here is that this concept of event triggers, of something running your CI/CD, is very platform specific, as I mentioned earlier. Again, Jenkins doesn't have the same way of running things and the same terminology and the same data model as GitLab does, and as GitHub Actions does, but the rest of the system doesn't have to be. We're going to introduce a way to box up our layers. This is going to be a simple way of thinking about it. The idea here is just: wrap them up in a CLI. Why? Because developers like CLIs. They're developer friendly. They know how to use them. They can be created relatively easily. There are a lot of good CLI libraries out there already. They're also good for a machine to understand. I'll demonstrate in a bit here that you can design a CI/CD system that basically just clears its throat and says, go CLI. That is the beauty of being able to wrap it up in something that a machine easily understands and that's also configurable. This design approach tries to ride the balance between these. How can we make it machine understandable and configurable, but also still friendly and available for developers? We'll start from the bottom up, and we'll talk about some of the design decisions we made in the Lisa story here at Airbyte. When we went through this design decision, we came from Gradle land. A lot of Airbyte is built on the Java platform. Gradle is a very powerful execution tool, but it didn't work very well for us. We chose to use Dagger instead. What is Dagger? Dagger is a way for you to run execution in containers. It is an SDK on top of BuildKit. It's a way for you to define your execution layer entirely as code. One of the more interesting design decisions here is that, because it's backed by BuildKit, it has an opt-out caching approach. From a DX perspective, caching, going back to what we were saying before, is a hard problem. There are a lot of different things that you have to do to do caching right. We feel that giving developers caching by default is the right way. One of the interesting things to point out here is that when you think about your execution layer and you think about your caching layer, they must fundamentally go together. One of the limitations of GitHub Actions is that they're not: caching is something you bring in entirely separately, and it's something that's opt in. Choosing a tool, whether it be Dagger, or BuildKit, or Bazel, or Gradle, needs to have this caching aspect baked into the very core of what it is for it to be successful. Again, for caching, we leverage BuildKit under Dagger. There are a few other mature options. I mentioned them already: Gradle and Bazel. They each have their pros and cons. Gradle eventually ended up not working for us because it was too complicated for our developers to understand.
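For a flavor of what execution-as-code looks like, here is a minimal sketch using the Dagger Python SDK, roughly as it existed around 2023 (API details may differ between versions; the base image and commands are illustrative, not Airbyte's actual pipeline):

```python
import sys
import anyio
import dagger

async def run_tests() -> None:
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        src = client.host().directory(".", exclude=[".git", ".venv"])
        out = await (
            client.container()
            .from_("python:3.11-slim")          # every step runs in a container
            .with_directory("/app", src)
            .with_workdir("/app")
            .with_exec(["pip", "install", "-e", "."])
            .with_exec(["pytest", "-q"])
            .stdout()                           # unchanged inputs hit the BuildKit cache
        )
        print(out)

if __name__ == "__main__":
    anyio.run(run_tests)
```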

Let's talk a little bit about orchestration. Orchestration the Lisa way: going back to what orchestration is, this is an arbitrary term that we put on it. It can be a general-purpose language construct in your design, or it can be a specific toolchain. At Airbyte, we came from a very data transfer, data heavy background, so we chose Prefect for this. This could also just be your typical asynchronous programming construct in your language of choice, whether it be Golang, whether it be Python, whether it be JavaScript. There are other tools out there. Gradle and Bazel have this aspect built into their platforms. There's also Airflow, a workflow orchestration tool. There are a couple of others, Dagster and Temporal. A few examples of how you can get some visualization and things like that from your workflow orchestration tool are below. Lastly, the event triggers. These don't change. Because these are the very CI-specific part, this is the part of the data model that is very Jenkins specific, very GitLab specific. We push anything that's specific to that model up into that layer, and it becomes a very thin layer. Then, running locally is just executing another event. By doing this, we can actually mock the remote CI locally, by mocking the GitHub Actions events, in our case. Lastly, reusable modules. These are just code modules. This gives us all the things that we wanted before and didn't have in the Bart approach: jump to definition, unit testable, no more black boxes. For packaging, we can use our language's constructs, which are probably a bit better and a little bit more robust than GitHub Actions' distribution. Instead, maybe I want to use Rust crates to do distribution. You can actually implement a shim for migration. One thing I'll bring up here is, maybe you want to move to a system like this incrementally. You can implement a shim that will invoke GitHub Actions marketplace actions locally. The Dagger team is also working on a solution for this. I'm excited to see what they have to offer. Build whatever code constructs you need for your end users. One of the points that I want to drive home here is that the important thing when you're thinking about CI/CD is that your end users need to want to use it and be very comfortable with it: being able to work with the language of their choice. The frontend people at Airbyte want to work in TypeScript, because that's what they're familiar with. Thinking about a solution that is comfortable for your end users is going to give you a lot more power, and you're going to see a lot more uptake when designing a CI/CD system for others.
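As an example of the reusable-module layer as plain code, the marketplace-style "upload to S3" step could become an ordinary function you import, unit test with a stubbed client, and jump to definition on. A hedged sketch using boto3 (function and parameter names are illustrative):

```python
import boto3

def upload_artifact(path: str, bucket: str, key: str) -> str:
    """Upload a build artifact to S3 and return its URI.

    Lives in the repo like any other module: testable, reviewable,
    and discoverable with jump to definition, unlike a marketplace action.
    """
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"
```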

Demo (aircmd)

Let's do a quick demo. What I'll do is I'll show a couple of quick runs of the CLI system. We will also do a quick step-through of the code, just to see what this actually looks like when you start to abstract some of these ideas out. This isn't necessarily the only canonical way to approach this, but it is a way. We've got a CLI tool that I built over at Airbyte, it's called aircmd. This is just a click Python wrapper. It has some very simple commands. It builds itself and it runs its own CI. What we're going to do is we're just going to run a quick command to run CI. The interesting thing here is, when we do this, one thing we can do from a UX perspective is to make these commands composable. What I mean by composable, and we'll look at this in the code, is that there are actually three things happening here: build, test, and CI. The CI does some things, but it calls test. The test does some things, but it calls build. Your end users can get little stages of progress as they run. If they only care about the build stage, they only have to run the build stage. They don't have to wait for anything else. You can see here that a lot of stuff was cached. What you're seeing here is actually BuildKit commands. We can actually go look at the run. This is the Dagger cloud here. We can see it here. This is one of the DX things that I think is incredibly important to highlight when it comes to caching, because caching is so difficult. Being able to easily see what's cached and what's not is pretty key to having your end users be able to write performant code. Here, you can see that almost all of this run was cached, so it took almost no time at all. Then, what we can do is we can actually bust the cache. We're going to go take a quick peek at the code over here. Let's go ahead and start at ci.yaml. This is the event triggers layer that we were talking about earlier. There's some minimal stuff in here. The idea here is that you have some CI/CD specific layer, but all it's doing is essentially calling CI. You can see here that that's what we're doing. We're setting some environment variables. We're clearing our throat, and we're running the CLI. Then once we do that, we get into the orchestration layer. I left auth out of this demo for purposes of brevity, but you can think of auth as just being a module in this setup. In our orchestration layer, we've defined three commands. Those three commands are being submitted asynchronously, and they're calling each other. This is just regular Python. We're leveraging some Prefect constructs, but under the hood, it's just regular Python. Then we jump into the execution/caching layer. What's happening here is interesting. Essentially, what we're doing is loading containers: you know them as Docker containers, but they're OCI-compliant containers. We're loading those, and we're just running commands inside of a container. Now we get unit testability. We get the ability to specify cache layers. One easy way to think about what's going on here: it's not perfect, but each one of these commands is roughly analogous to a Dockerfile RUN command under the hood. You could replicate this with buildx in Docker as well. It's not quite one to one, but pretty close. What are we going to do? We're going to do something that busts the cache. What's happening is we're taking files, and we're loading them into a container.
If I just make a change, we'll just add a whitespace, maybe a comment, it's going to change the inputs to the container. Remember, we go back into what was happening before, we talked about caching. If the inputs change, we're not cached anymore. We're going to rerun the same command after we've added our comment, and we're going to see that it's not cached. It's going to actually do some work here, instead of skipping everything.

Now we'll take a quick look at the visualization tool here. These are hosted in the cloud. They can be local, but these are hosted in the cloud. What's happening is the client is actually streaming this data to them in real time, so we can actually see the job running in real time as we go along. You can see we've already gotten pretty far in this one, and it's running the pip install. That's the core facet of the demo here: we want to demonstrate that we have some tools to visualize what's going on. We want to give our end users the ability to peek into the core facet of our CI/CD system, which is the cache layer here. Again, once we're in, we can just jump through all of our code. If we want to see what these commands are doing, we just go to definition, we get our jump to definition. We have a stronger DevEx than we did with just pure YAML. We can walk all the way down the stack to the BuildKit base.
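For readers who want a sense of the shape of such a CLI, here is a hypothetical sketch (not the actual aircmd code) of composable click commands where ci invokes test, which invokes build, so developers can run only the stage they care about:

```python
import click

@click.group()
def cli() -> None:
    """Entry point: the CI event trigger only has to call this CLI."""

@cli.command()
def build() -> None:
    click.echo("building containers...")

@cli.command()
@click.pass_context
def test(ctx: click.Context) -> None:
    ctx.invoke(build)              # test depends on build
    click.echo("running tests...")

@cli.command()
@click.pass_context
def ci(ctx: click.Context) -> None:
    ctx.invoke(test)               # ci depends on test (and therefore build)
    click.echo("publishing results...")

if __name__ == "__main__":
    cli()
```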

Key Takeaways

Let's talk about Lisa's light at the end of the tunnel. What does this actually mean for Airbyte? We spent some time and we refactored some stuff; what results were we able to show? How were we able to improve things for our end users? This is the output of the tool cloc, which counts lines of code. Over the course of three months, we were able to reduce the amount of YAML by 50%. What we did is we worked towards this incrementally. We're still incrementally working towards it. We picked the nastiest, most painful builds first. We wrote up a proof of concept that was a very thin framework that handled those nasty, painful builds. Then we incrementally started bringing things over. We added unit tests every step of the way. One of the key performance indicators that we wanted to track was the count of regressions. How hard is it to make a pipeline that doesn't regress? We were able to achieve 90% cost savings. Part of this isn't really indicative of the tooling that we chose, but more that we were able to leverage the tooling that we chose and the system that we made to really easily go in. Going back to the caching GUI from earlier, we were able to really easily point in and identify bottlenecks where caching was being really painful. To describe exactly what this meant: for instance, we were trying to test 70 connectors, each of which is a container output that gets tested every single day. In the previous time, in the Bart times, we were spawning one machine in parallel in GitHub Actions for each one of those connectors, which was a very expensive endeavor. After we did this, we were able to leverage caching such that the layers were all shared with each other, and we switched to a single machine approach. In the wall clock time it previously took all 70 connectors to run, we were able to get that same time on one machine. That was a 70x reduction in machine time. That's how we were able to achieve such massive cost savings. We're able to test connectors more often. Our developers are much happier because they have an environment where they can actually debug problems. We've gotten away from push and pray, which for us was a major DX win.

Pulling it all together: YAML is a great tool. It works really well when you're doing something simple. In some cases, it's going to be enough. If all you're doing is maybe a side project, or what you're doing is never going to be very complex, maybe YAML is the answer. When you start to scale, when you start to grow beyond a certain point, it really starts to fall apart. For us, it happened right at about the time that our system became complex enough. Day 101 was probably at year 1.5 for us. It did take a while for us to really sit back and acknowledge it. It's key to recognize when you're starting to get into the area where it's going to get a little crazy. We saw some major reductions in machine days, and just had some huge cost savings wins, huge performance wins. When you think about this, again, think about it from the ground up. Think about it from the execution and caching layer first, because how you actually run these things on a remote system has a lot of nuances and caveats as well. It's like, ok, I have a cache, how do I get it onto the machine where it's running in a fast and easy way? Build the tooling and constructs your developers know. This goes back to the Airbyte frontend people wanting to work in TypeScript. Giving people an environment they're comfortable with is going to give you adoption and mindshare, and your developers will feel empowered and feel like they want to work on the system, instead of it being something that just gets pushed to the wayside, as happens in a lot of organizations. Incremental migration is possible. When you take this whole thing, this whole blob, it seems a little intimidating. There are a lot of moving gears in a CI/CD system. It is possible to build a bare bones thing. What we did is we built a bare bones system, and then we ran the same job in parallel for a week or two, and then we switched. The last thing is, get away from push and pray. We want to be able to do this locally.

Questions and Answers

Participant 1: I feel like you alluded to this earlier in your talk, but this tends to be a slippery slope. You start off with one YAML file, and then some more accrue, down to the 101st day, where things are sprawling. At that point, it just always feels like such a lift to say, ok, that's it, stop this madness [inaudible 00:39:27], as if you are just piling on. Organizationally, personally, how do you solve that problem? How do you get buy-in for saying, this is madness, this needs to stop. How do you do that?

Barber: This was definitely something that came up at Airbyte as well. The key methodology for success in our case was to demonstrate the win. Especially in this area, there's a lot of like, this is the way that people do CI/CD. When you're coming to the table with a new approach, you need to be able to demonstrate a very dramatic win, like we were able to do. Finding the right proof of concept is pretty key. For us, it was finding the nastiest, slowest, flakiest CI/CD job and saying, let's just take a week, a week and a half and spike out what this can look like. Just getting something out there that's bare bones that shows some very dramatic wins, I think is the key political way to help drive adoption and get people to see the possibility of what you can do.

Participant 2: Was there anyone who actually pushed back and said, "I like YAML."

Barber: I think that some people do appreciate the simplicity of it. There are some benefits to having something that's standardized and quick and easy to grok. Contextually, once you get to the point in the organization where, in some of our most egregious examples, people were inlining Python code in the YAML, it was very easy for people in our organization to see how it had gotten out of control. Maybe other organizations haven't gotten to that point yet. When it is bad, it's very obvious that it's bad.

Participant 3: Has the developer experience of not having to repush and iterate on CI changed developers' thinking about how CI should be?

Barber: Yes. In our organization, there's been a larger adoption, just a big push in general towards local-first tooling. This is part of that story. There are some organizations I've worked in, in the past, where the build gets shoved over to the DevOps team. What we're seeing in our organization is that more people are becoming more actively involved in wanting to work on these systems and contribute to them. Really, our pie in the sky goal from the get-go was to bridge these gaps between different organizations and not just have the DevOps guys who do the pipeline, but have everybody work on the pipeline together. We're all engineers and we all work on the platform together.

Participant 4: One thing I would like to add is the portability of the solution that you're using. Because with GitLab and Azure DevOps in one company, 1,000 people on one and 1,000 people on the other, trying to migrate between platforms is hard, because they have two teams maintaining the stuff and running the stuff. Putting this into a good CI/CD tool-independent approach depends on [inaudible 00:43:58].

Barber: That was something that when we looked at different ways to approach this, we looked at some other platforms and stuff like Earthly and the like. Being able to take a step back and not be married to one specific platform was a big selling factor for us.

Participant 4: The cost of moving from one YAML definition to the other is just lost work for the teams, and it's very painful. This just might be a way out of it. Going to the marketplace with this sort of tool is still very volatile, where Jenkins, or some predecessor [inaudible 00:44:44] different tools. It will be imperative to have something vendor-independent in place.

Participant 5: What do you see as next steps? You've got Air, what are the pains you still have? What's the roadmap forward?

Barber: Where's the long tail in this? No solution is perfect. Where we have hit the long tail, where we're hoping that there can be some improvement, is especially if you work in a microservices-like setup. I think this is just an unsolved problem in DevOps and CI/CD systems in general: if you want to have these reusable modules, whether they be your own, because you'll need to make your own, or whether they're maintained by somebody else, how do you manage the distribution of all these modules? Let's take a typical microservices example, like Bart's example. He's got 30 different services. How do we manage this tool, and the versions of this tool, and all the packages and all the plugins and all of that, and be able to deploy updates and test those updates, and do that in a robust manner? That's the longer-tail challenge that we're still coming up against. Trying to tackle that has been one of our priorities, because a lot of the work of maintaining that ecosystem falls on the platform engineers. Being able to solve that problem would take a pretty big load off of platform engineering orgs.

Participant 2: Are there any disadvantages to using Dagger? I've definitely seen that with YAML in Kubernetes versus something like Pulumi. There are always advantages for code versus declarative information. I'm super [inaudible 00:46:48], but I haven't played with it much yet. Are there any footguns I should be aware of, or any challenges?

Barber: Part of it is that it's a very new technology. It has all the caveats that come with new technology. I think the one thing when it comes to Dagger, and in general just BuildKit, that we've run into challenge-wise is that it's a new concept for a lot of people. BuildKit itself is relatively new technology as well. Getting people to think about things the container way may be a challenge for some organizations that are unfamiliar with it. For us, coming from Gradle, it felt like less of a lift, because Gradle, for us, was almost interminable sometimes. The two primary challenges that we had with Dagger were that it's new technology, with all the facets that come with that, and that containerized builds like these are just new technology in general and people aren't as familiar with them.

This talk was recorded at QCon San Francisco 2023

 

 

Recorded at:

Jul 31, 2024
