Containerize your Node.js application – Part 1
Node.js is one of the most popular backend tech stacks at the moment, yet containerizing a Node.js application is still a chore for many developers. Here is my attempt at a guide on how to do it, efficiently enough for production.
The scope of Node.js-based applications is wide:
- It can be a backend application, or even a React application, since the build step runs on Node.js.
- TypeScript might be used.
For the containerization examples, I will be using Docker, the most popular container tooling. However, the ideas below will still apply if you use other container tools like Podman, containerd, etc.
Refresher on containerization
Containerization is currently the de facto standard for deploying backend applications. To newcomers, containerization feels a lot like working with virtual machines: you start by building an Image, which is then used to create Containers, which behave much like VMs. For your daily interactions with Containers, this is a fine mental model to keep.
But let's dig a bit deeper into what a container really is:
- A container is a wrapper around a process, which is an invocation of a program. When this process ends, the container stops.
- A container (and, by extension, its image) contains all the files needed for the execution of the wrapped process.
- Containers are isolated using a Linux kernel mechanism called namespaces, which lets each process see only the resources that belong to it. This means that, from the host machine's perspective, a containerized process isn't much different from a normal process (a quick way to check this is shown below).
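Below is a minimal way to see that for yourself on a Linux host; the image and container names are arbitrary examples.
# run a long-lived Node.js process in a container
docker run -d --name ns-demo node:18-alpine node -e "setInterval(() => {}, 1000)"
# docker top shows the process tree as the host kernel sees it
docker top ns-demo
# on a native Linux host, the same process also appears in a plain ps listing
ps aux | grep "node -e"
# clean up
docker rm -f ns-demo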
Armed with these definitions, let's continue by defining what a well-engineered Node.js image/container entails:
- Fast build time.
- Lightweight: it contains only the bare minimum of files required for its execution.
- Fully configurable from outside the container, just like a normal process.
- Exits cleanly.
Below are some of my best practices for achieving those goals.
1st Criterion: Fast Build Time
Optimizing your image build time is a very important but often overlooked way to improve your operations as a whole. The consequences of a slow image build are far-reaching:
- The development loop becomes slow.
- CI/CD becomes slower and more error-prone.
- You consume more CI/CD resources.
So how do you improve your image build time? Three tenets:
- Prune build context
- Cache ruthlessly
- Only include the files you absolutely need for the next step
Build Context
If you are familiar with Docker, you will know that running docker build starts by sending the code folder to the builder. This is called the Build Context. The files in the build context are the ones available to the COPY command.
Now, every Dockerfile command can be cached, using a digest calculated from the command itself, the previous command's digest, and the input files. So, if you run COPY . . to copy the whole context into the builder, this command will be invalidated every time you change any file inside your context. This means you are not following the 2nd tenet of caching ruthlessly.
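As a point of reference, the build context is simply the path you pass to docker build; a typical invocation looks like this (the image tag is an arbitrary example).
# "." is the build context: everything under the current directory is sent to the builder
docker build -t my-node-app .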
Before doing anything to optimize the COPY command's caching behavior, we can start by pruning unnecessary files from the context. For a typical Node.js project, the default context may include a lot of unneeded things:
- The node_modules folder. This folder shouldn't be included in the production image; the dependencies should be reproduced during the build process instead.
- Cache directories for specific tools or frameworks. These might include .nx, .yarn_cache, …
- The Git directory: .git
You can ignore those files using a .dockerignore file, which follows the same syntax as .gitignore.
What I find most helpful for writing an efficient .dockerignore file is the ignore-all-then-negate pattern. This allows you to selectively include only what you need in the build. For example:
# ignore everything by default
*
# include only the source directory
!src
# include the custom bash scripts
!script
# include database migration files
!migrations
# include package.json & package-lock.json
!package*.json
This pattern requires you to run the build a few times until you get a working image, adding more files as you find they are missing.
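One way to shorten that trial-and-error loop is to inspect what actually ends up in the build context. A throwaway debug stage can do this; the sketch below is only an illustration, and the stage name and base image are arbitrary.
# temporary debug stage: copy the pruned context and list its contents
FROM alpine:3.18 AS context_check
WORKDIR /ctx
COPY . .
RUN find . -maxdepth 2 | sort
You can then build just this stage and read the output of the RUN step:
docker build --target context_check --no-cache --progress=plain .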
Cache ruthlessly
Image builds use a cascading cache model: if a step's cache is invalidated, the caches of all subsequent steps are invalidated as well. So for every build, you need to ensure that as much cache as possible is reused. What this usually means is that you should:
- Process files that change more frequently lower down the chain.
- Ensure each step depends only on what it needs (tenet #3).
For example:
- Because package.json & package-lock.json (or yarn.lock) are used for dependency resolution and change less frequently than your source code, copy only those files in before the npm install step.
- If you have library patch files, apply them in the next step. Don't do this in the same command as the previous step.
- Copy the source code in and build/bundle afterward.
This can result in your Dockerfile looking something like this:
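(The snippet below is only a rough sketch of that ordering, assuming an npm-based project that uses patch-package; the stage name, paths, and scripts are placeholders to adapt to your project.)
FROM node:18-alpine AS builder
WORKDIR /app
# 1. dependency manifests first: they change the least often
COPY package.json package-lock.json ./
RUN npm ci
# 2. library patch files next, in their own layer (skip if you have none)
COPY patches ./patches
RUN npx patch-package
# 3. source code last: it changes the most often
COPY . .
RUN npm run build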
If you know that the build is handled by BuildKit, you can make use of a cache mount. This improves the run speed of commands that rely on an external cache, like npm or yarn. The idea is to specify a volume, privately managed by BuildKit, that stores only this command's cache during build time.
RUN --mount=type=cache,target=/root/.npm_cache NPM_CONFIG_CACHE=/root/.npm_cache npm ci
# or if you use yarn
RUN --mount=type=cache,target=/root/.yarn YARN_CACHE_FOLDER=/root/.yarn yarn --frozen-lockfile
Ideas taken from this StackOverflow answer
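Keep in mind that --mount=type=cache only works when BuildKit drives the build. Recent Docker versions enable it by default, but you can also force it explicitly:
# force BuildKit for this build (harmless if it is already the default)
DOCKER_BUILDKIT=1 docker build -t my-node-app .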
Include only the files you need for the next step
Now, this point has already been somewhat elaborated on in the previous section, but I can expand on it a bit more, particularly regarding the package.json file. This file (along with the dependency lockfile) is of particular importance when it comes to optimizing the image build of a Node.js application, because both files are needed for the npm/yarn install command, which takes a long time to finish.
Naturally, you will want to do this step as early as possible and reduce the cache miss ratio on it. This means you want to ensure that the npm/yarn install command is only re-run when it definitely has to be. Those scenarios are:
- Dependencies change in one of these sections: dependencies, devDependencies, peerDependencies.
- The postinstall script changes, because this script is invoked when running npm/yarn install.
- The lockfile changes (of course).
Any other change to the remaining parts of the package.json file should not invalidate this step. One trick I found very useful for this is to use jq to extract only the relevant content of the package.json file. We also need a separate build stage for this action. This results in a Dockerfile like the one below:
# this stage will be used to extract a trimmed package.json file only for dependency installation
FROM endeveit/docker-jq AS deps_extraction
COPY package.json /tmp
RUN jq '{ dependencies, devDependencies, peerDependencies, scripts: (.scripts | { postinstall }) }' < /tmp/package.json > /tmp/deps.json
# now we use the trimmed package.json file in this stage
FROM node:18.10.0-alpine3.15 as builder
WORKDIR /app
COPY --from=deps_extraction /tmp/deps.json ./package.json
# bring in package-lock.json from the build context
COPY package-lock.json ./
RUN npm ci # clean-install
# now that npm has successfully pulled all dependencies
# we bring in the source code & the original package.json
COPY . .
# Proceed ...
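To make the effect of that jq filter concrete, here is what it would produce for a small, made-up package.json (the values are purely illustrative):
# given a package.json containing, say:
# { "name": "my-service", "version": "1.4.2",
#   "scripts": { "build": "tsc", "test": "jest", "postinstall": "patch-package" },
#   "dependencies": { "express": "^4.18.0" },
#   "devDependencies": { "typescript": "^4.8.0" } }
jq '{ dependencies, devDependencies, peerDependencies, scripts: (.scripts | { postinstall }) }' < package.json
# the trimmed result keeps only the install-relevant fields:
# { "dependencies": { "express": "^4.18.0" },
#   "devDependencies": { "typescript": "^4.8.0" },
#   "peerDependencies": null,
#   "scripts": { "postinstall": "patch-package" } }
Bumping the version field or editing the build/test scripts now leaves the trimmed file, and therefore the cached npm ci layer, untouched.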
This simple trick will certainly boost your cache hit ratio on the installation step, which may improve your average build time by a staggering 3–5 minutes (that's how long your installation may run for). Pair this with the npm cache mount trick from the previous section, and you will barely notice the npm install runtime on container builds anymore.