I firmly believe that operations’ main goal is to increase developer velocity. This means focusing on crafting the tools and systems needed to expedite application deployments to production. A lot of this work revolves around removing operations bottlenecks from that deployment pipeline – the less the developer has to ask ops for, the more they can do on their own.
It would only make sense, then, that developers should master their own application dockerfiles (images). This would allow them to create and modify their application runtime environment on the fly, without having to submit tickets for operations to upgrade or install dependencies on the host machines. I used to fully support this methodology, and helped enact it at a previous employer. Hell, the DockerCon 2017 keynote even pushed the idea.
But, when I started working on the Great Docker Migration of 2016 (GDM) at a new job, I found myself turning to the dark side of managed images.
When the GDM began, I pushed for this paradigm again. It had worked wonderfully previously, and had seriously increased developer productivity. However, as the design process of the GDM progressed, and details were fleshed out, I started to see some serious setbacks to this approach.
Many of the issues that came up were caused by a completely different engineering environment between the two companies. Previously I had worked at a smaller, cloud-based company. Everyone was a “full-stack” engineer, and managed their own application infrastructure via AWS. There were no limitations on tech or language when creating a project – every service was made a-la-carte by the team managing it. My new company was much larger, had application-specific engineers, hosted their own infrastructure, and already had a set of tools in place to automate and restrict both project creation and deployment.
Ultimately, those issues were enough to tip the scales in favor of ops-managed images for the GDM. I still believe that the developer-managed images were the correct solution for the previous company I was at, but I no longer believe they’re the correct solution for 100% of use cases. A brief overview of the issues that caused that change of philosophy is listed below.
Security
Multi-Tenant Environment
If you’re running your containers in the cloud (specifically a third-party provider’s cloud), you can probably skip ahead to the application security section. In this scenario, the burden of protecting a multi-tenant environment falls on the provider, so you don’t have to be nearly as cautious as you would be if you were managing your own. #cloudwins
If you’re in the not-so-fortunate boat of hosting your own resources, you have a set of real security concerns you need to contend with.
While many security patches were put in place after Dirty Cow, container escape vulnerabilities have continued to appear. No system is guaranteed perfect, so it’s important to properly manage and restrict container access to the host machine itself. This helps prevent malicious containers not only from compromising the physical host itself, but also from affecting other applications running on that same docker daemon.
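As one rough sketch of what “restricting container access” can look like (the image name, user ID, and limit values below are made up, and none of this makes escapes impossible), the docker CLI already exposes a number of per-container restrictions:

```sh
# Hypothetical hardened run command for an app container on a shared host:
#   --read-only                  immutable root filesystem
#   --cap-drop=ALL               drop all Linux capabilities (re-add only what's needed with --cap-add)
#   no-new-privileges            block setuid-style privilege escalation
#   --pids-limit/--memory/--cpus cgroup limits on what the container can consume
#   --user                       run the app as an unprivileged user, not root
docker run -d \
  --read-only \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  --pids-limit 256 \
  --memory 512m \
  --cpus 1 \
  --user 10001:10001 \
  my-app:1.0.0
```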
One solution to mitigate this risk is to run containers inside of some variation of VM technology. Even if a process escapes a container inside of a VM, it only has access to the VM’s virtualized kernel, and can’t cause harm to other applications. So in this scenario, it’s perfectly fine to let developers throw whatever dockerfiles they want at the host machine – everything is safely isolated.
However, running a container inside of a VM not only violates the fundamental lightweight-runtime-environment docker principle, it also comes with a new set of ops-related problems – How do you manage cgroup resource limitations of a container inside a VM, when those resource limitations apply to the host machine? How do you safely expose the host docker daemon inside the VM? Why are you making us deal with VMs again?
This video does a really great job of explaining the shared-resource issue, and the downsides of a traditional VM based solution. It also happens to be a pitch for Intel’s Clear Containers, but, y’know, you just don’t find many shared-kernel-container-exploits-and-concerns videos nowadays.
By having operations master a set of runtime images, every image can be vetted for vulnerabilities, and patched at any time. Shellshock 2.0 comes around? Just upgrade your set of managed images and redeploy! No need to hunt down every application team that swore they’d have it upgraded last week, and awkwardly beg them to upgrade immediately even though you have no real way of forcing the upgrade beyond performing it yourself and are like, 3 months into the job, and are only doing this because your team thought it was time for your initiation. Because, y’know, that’s a totally valid and 100% theoretical situation that could happen to anyone.
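As a rough sketch of what that patch day can look like (the registry, image names, tags, and the GIT_SHA variable below are all invented), the fix boils down to one rebuild of the managed base, followed by each application’s normal build on top of it:

```sh
# Hypothetical patch workflow: rebuild and push the ops-managed runtime image once.
docker build -t registry.example.com/ops/runtime:stable base-images/runtime/
docker push registry.example.com/ops/runtime:stable

# Then re-trigger every application's normal CI build. `--pull` forces the build
# to fetch the freshly patched base layers, so as long as app Dockerfiles build
# FROM the managed :stable tag, no app-side change is needed before redeploying.
docker build --pull -t registry.example.com/team/app:"${GIT_SHA}" app/
docker push registry.example.com/team/app:"${GIT_SHA}"
```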
LinuxKit & oKernel
I wanted to include a brief aside here about the advent of LinuxKit and oKernel, which were released publicly during DockerCon 2017. LinuxKit is a Docker open source project, which allows you to build your own Linux OS from scratch, including as much (or as little) of the OS as your application needs to run. oKernel is one of those building blocks, and is currently under development by HP. Essentially, oKernel is a two-part split kernel, where one half is a virtualized view of the host kernel, and the other half is the actual host kernel itself. Containers then run on the virtualized kernel, so, theoretically, container escapes cannot access the host kernel itself. This limits the attack surface to whatever resources are made available through the virtualized kernel. The project is still in beta, but looks to be a promising alternative to the container-in-VM approach.
Developer Velocity
Having ops-managed images seems, at first glance, counterintuitive to increasing developer velocity. If developers need their application runtime environment modified, they are still dependent on ops to make and deploy that change for them. It seems like we’re just moving the ops bottleneck to a new location.
However, it turns out application developers aren’t terribly interested in being responsible for ensuring their service plays nicely with the host it’s running on. They actually like to scope that worry, effort, and pager-duty hours to the application itself. Weird, right?
There are a few things that must be in place to ensure developer velocity is increased, and not hindered, by ops-managed images:
- local docker development
  - Developers must work locally inside the docker container that their application will be deployed in. Otherwise, you’re opening up numerous possibilities for things to go wrong on both the dev and ops sides.
  - Having an operations “runtime” script that builds and runs the developer’s application image locally, in the same manner that it’s run on both CI and production servers, is really helpful for bumping that velocity metric (a rough sketch of such a script follows this list). Without a script, you’re slowing development time (devs will have to manually start app containers) and opening up the possibility for human error that will prevent an application from running correctly in production (“it ran on my machine, but I forgot to load this one config in before it started”, etc.).
- standard project creation
  - There must be a standard for project creation. This includes project language, structure, CI pipeline, and execution commands. If these limits are not in place, ops will end up making a large number of images for every team’s custom setup, and ultimately won’t be able to sustain support for every image.
  - It helps to have a project creation tool, similar to Netflix’s Spinnaker, that can set up an entire deployment pipeline for a particular project type. This increases velocity for obvious reasons, but also helps scope down the number of necessary managed images.
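For illustration only, here’s what a minimal version of that runtime script might look like; the tag, env file, port, and resource limits are stand-ins, and a real script would pull them from the project’s standard config:

```sh
#!/usr/bin/env bash
# Rough sketch of an ops-provided run script (all names and values are hypothetical).
# Goal: `./run.sh` builds and starts the app locally exactly the way CI and
# production start it, so local behavior matches what actually ships.
set -euo pipefail

APP_NAME="$(basename "$PWD")"
IMAGE_TAG="${APP_NAME}:dev"

# Build the image the same way the CI pipeline does; --pull grabs the latest
# ops-managed base layers instead of whatever happens to be cached locally.
docker build --pull -t "${IMAGE_TAG}" .

# Run it with the same config file, port mapping, and resource limits used in
# the deployed environments.
docker run --rm -it \
  --env-file ./config/local.env \
  --memory 512m \
  --cpus 1 \
  -p 8080:8080 \
  "${IMAGE_TAG}"
```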
Space & Caching
Lastly, a risk you run by allowing developers to build their own images is the creation of many inefficient, bloated images. Your developers don’t care about the resources their images are running on. They don’t care that your data center syncs are falling way behind trying to push massive images across the network. They care that most of their image is cached locally, and builds quickly for them.
Forcing developers to use a set of maintained images, where the only custom build step in that image is copying their application binary into the container, ensures that operations can create and manage concise, lightweight images.
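In practice, the application’s entire Dockerfile can collapse down to two lines; a hypothetical CI build step (the registry, base image, paths, and GIT_SHA variable are placeholders) might look like:

```sh
# Hypothetical CI step: the app's whole Dockerfile is the ops-managed base plus a
# single COPY of the build artifact. In this sketch, the managed base already
# defines the entrypoint that runs whatever lands in /opt/app.
cat > Dockerfile <<'EOF'
FROM registry.example.com/ops/jvm-runtime:stable
COPY build/libs/app.jar /opt/app/app.jar
EOF

docker build --pull -t registry.example.com/team/app:"${GIT_SHA}" .
```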
It also means that registry storage space is minimized – at a high level, images are stored in the docker registry as a JSON blob of layer hashes. Each of those hashes points to the actual binary for that layer somewhere in that same data store. If multiple images share the same hashes, that binary is not duplicated. Instead, there are just many image hashes pointing to the same binary. So, if every application image is unique only by its application binary, your registry only grows by the size of those code artifacts (plus any new images operations decides to add in the future).
This also speeds up the syncing of images across data centers (nine times out of ten you’re only copying a code artifact and a JSON file across the network), and improves build and deploy times on developer, CI, and production machines (if all applications are sharing a common set of images, those common image layers are going to be cached on the CI/production machines almost 100% of the time).
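If you want to sanity-check that layer sharing on a given host, comparing the layer digests of two application images makes it visible (image names below are placeholders):

```sh
# Two app images built on the same managed base should differ only in their
# final artifact layer; everything before it comes from the shared base and is
# stored and cached once.
docker image inspect --format '{{json .RootFS.Layers}}' registry.example.com/team-a/app:1.0
docker image inspect --format '{{json .RootFS.Layers}}' registry.example.com/team-b/app:1.0
# All but the last sha256 digest in the two lists should match.
```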
So,
TL;DR
An ops-managed image solution is not the right choice for every company. A developer-managed image solution is not the right choice for every company. Depending on the size, development workflow and systems architecture at a company, one approach may be better than the other.
Either way, we shouldn’t write off ops-managed images anytime soon.
It seems that there could also be a middle-ground approach: ops-defined images that people can use and that will work for 90% of use cases, and then developer-defined images that use the ops image as their base but add/modify a few things.
If a specific team needs some niche piece of software, they can make their own Dockerfile, based on an ops image, that installs that niche piece of software. Hopefully the ops images will be made in such a way that the majority of teams will just use them as-is and not need to make any modifications.