No time to waste! How we improved caching and parallelization in GitHub Actions

You finished writing your code, and everything works as intended. All tests are green. This time you verified that they were failing before implementing the feature. Success!

Time to push the commit and... make a coffee.

Ten minutes later, you return with a fresh cup of coffee. But your pipeline is still running. You wait until all checks are green and quickly get some approvals in code review. You want to hit the merge button.

Suddenly you realize that someone else has just merged their code! Now, you have to update your branch with their changes, and the whole waiting game starts all over again.

There are at least two problems in this situation - the pipeline finish time and the general approach to merging changes into the main branch. In this article, I want to share how we improved one of them: the slow, unscalable pipeline.

This article contains a high-level view with some configuration examples. If you are interested in detailed documentation, you can find it on this GitHub page.

Where do we start?

Our main goal was to build and test an application built with Next.js. Here's how our workflow setup looked in GitHub at the beginning:

Initial pipeline graph

All steps ran in a single queue, so adding anything to it would increase the overall finish time. Not so good - we can do better.

Let's cache!

The first opportunity was to find out which parts could be cached. In GitHub Actions, there are a few rules about how the cache works. In our case, we were interested in sharing the cached results of the main branch, which was the most popular pull-request target.

To reuse cached files instead of installing dependencies again and again, we need to specify a key that locates the right entry in the cache. Using the GitHub cache action, we had to align the cache key so it was the same between workflows running on the main branch and any development branch. Initially, the cache was not accessible because the cache key contained a hash of the workflow file. Since we had different workflow files running on the production branch and in development, different hashes were generated. This prevented us from having the same cache key and benefiting from cached results.

Removing the hashing of the workflow file allowed us to share the cache between workflows. That means that if there was no change to the yarn.lock file, we could already save around 2 minutes of running time.

Before:

Node modules step before

After:

Node modules step after
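For illustration, here is a minimal sketch of such a cache step using the actions/cache action - the step name and cached path are assumptions, not necessarily our exact setup:

```yaml
- name: Cache node modules
  uses: actions/cache@v3
  with:
    path: node_modules
    # The key depends only on the OS and the lockfile - not on a hash of the
    # workflow file - so main and development branches produce the same key.
    key: ${{ runner.os }}-node-modules-${{ hashFiles('yarn.lock') }}
```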

Doing more than one thing at a time

In our workflow, building a Docker image was a completely independent process - everything happened inside the container. Since GitHub Actions cannot run two steps of one job at the same time, we were forced to create two separate jobs. And since we need to install node modules for both of them, it made sense to extract the installation step into its own job.

Pipeline graph v2

If building the Docker image is independent, why wait for the node modules installation, you may ask? Since the Docker build installs the modules by itself, there is no reason to try to build an image if the general installation might fail.
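A sketch of what this job split could look like - the job names and commands are illustrative, not our exact configuration:

```yaml
jobs:
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: yarn install --frozen-lockfile

  test:
    needs: install   # waits for the shared installation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: yarn test

  build-docker:
    needs: install   # independent build, but gated on a healthy installation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build -t app:${{ github.sha }} .
```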

After this change, the total pipeline time dropped to 8 minutes, making the tests our next bottleneck. I also marked the potential improvement in green. Building the Docker image looks like one big, long task, but it's still shorter than the longest chain of jobs and steps. Can we parallelize the steps in the testing part?

Pipeline graph v3

After separating each test step into its own job, every type of test starts once the installation job is complete and its results are stored in the cache. As every job runs in its own independent environment, it needs to retrieve the cached results and then use them to perform its actions. This adds a little overhead, so the total running time increases, but the entire pipeline finishes faster.

Pipeline graph v4
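As a sketch, each of these test jobs could look roughly like this, restoring the node modules before running its own script (the test:unit script name is an assumption):

```yaml
unit-tests:
  needs: install
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    # Retrieve the node modules populated by the install job.
    - uses: actions/cache@v3
      with:
        path: node_modules
        key: ${{ runner.os }}-node-modules-${{ hashFiles('yarn.lock') }}
    - run: yarn test:unit
```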

With every pipeline optimization, our bottleneck might move to another place, setting new limits on how fast the pipeline can finish. So far, the longest task was building the Docker image, but with BuildKit we can improve our caching strategy there as well.
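One possible way to use BuildKit layer caching in GitHub Actions is the GitHub Actions cache backend of docker/build-push-action - a sketch, assuming Buildx is set up and with an illustrative image tag:

```yaml
- uses: docker/setup-buildx-action@v2
- uses: docker/build-push-action@v4
  with:
    context: .
    tags: app:${{ github.sha }}
    # Reuse image layers cached by previous runs and store new ones.
    cache-from: type=gha
    cache-to: type=gha,mode=max
```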

It scales!

Having well-structured, separated jobs, we can repeat what we have done so far while adding more actions. The final example shows how we introduced three more changes:

  • parallelization of our visual tests, which were taking more and more time (see the sketch after this list)
  • a PR comment that keeps our developers informed about how their changes impact the amount of JavaScript we ship to customers
  • a new SonarCloud code quality scan, which required test coverage - but as we had already run the tests earlier, there was no need to run them again
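For the first point, a hypothetical sharding setup using a matrix strategy - the test:visual script and the --shard flag (as offered by test runners such as Playwright) are assumptions:

```yaml
visual-tests:
  needs: build-application
  runs-on: ubuntu-latest
  strategy:
    matrix:
      shard: [1, 2, 3, 4]   # four shards run in parallel
  steps:
    - uses: actions/checkout@v3
    - run: yarn test:visual --shard=${{ matrix.shard }}/4
```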

Adding every new step was possible without increasing the overall pipeline finish time.

Pipeline graph v5

This time we could not follow the same caching settings as with the node modules. Do we want to retrieve unit test results from the main branch? Probably not. We also want to build the application once per commit to make sure it includes all recent changes, so we don't want to get cached results from the main branch in this scenario. We need to define different cache keys, as follows:

Application build step

Instead of a hash of the lockfile, we pick the unique commit SHA. With this maneuver, we can save time both for the visual tests and for the statistics about our bundled code, both of which require the application to be built first.
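A minimal sketch of such a per-commit cache step - the .next path assumes a default Next.js build output:

```yaml
- name: Cache application build
  uses: actions/cache@v3
  with:
    path: .next
    # Keyed by the commit SHA, so every commit gets a fresh build
    # while later jobs in the same pipeline can reuse it.
    key: build-${{ github.sha }}
```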

That's it! While we added more tasks to our workflow, the pipeline finish time did not increase. With this new setup, we optimized the individual parts, and we get feedback on the progress of each step.

With the presented approach, you can analyze your current workflow and figure out how to reduce task repetition. Maybe you can cache some temporary directories so the app builds faster every time? Maybe you don't need to run some jobs at all in certain situations?

Only you can find out.