Remote caching with sbt and S3 across multiple machines

2022-7-11
Scala
sbt

sbt 1.4.x introduced remote caching, a feature for sharing your build output in order to save time during compilation. You can think of it as an extension of incremental compilation: if you change one file in a project containing hundreds, you'd rather not have to recompile everything. Remote caching takes this one step further. Rather than being limited to your own machine, you can reuse build output across multiple machines (or across multiple branches on your local machine, or what have you).

sbt's remote caching feature is technically still marked experimental, but we've been using it at work for over a year now and haven't run into issues so far. We originally set it up only in CI (in our case Jenkins), but recently I thought to myself, "Wouldn't it be nice to also re-use Jenkins' build output on my local machine, so that I never have to do a full recompile on this project ever again?" Considering the project in question is 300,000+ LOC and takes roughly 7 minutes to perform a clean build on my underpowered laptop, I figured this would be a huge win. And spoilers... it was! 😄 Now onto the setup:

Choosing the storage method

You have many options as to where to store the build cache. sbt publishes the cache to a Maven repository, so anywhere you can host one is an option. That means a private Nexus repository would also work. In fact, here's a blog post from Muki Seiler which goes into that (as well as using MinIO as an alternative).

In our case, we didn't already have a private Nexus repository, nor were we too keen on setting one up, as that would mean another server running 24/7. We decided to try S3 instead: less infrastructure to set up, and less costly.

The setup

Luckily, there already exists an sbt plugin for resolving and publishing artifacts via S3: fm-sbt-s3-resolver. Simply add the following to your plugins.sbt (or wherever you keep your plugins):

addSbtPlugin("com.frugalmechanic" % "fm-sbt-s3-resolver" % "0.20.0")

Then create an S3 bucket where you intend to store the build cache. In the example below, the bucket is your-s3-bucket and the cache lives under an sbt_build_cache/ prefix inside it.
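
If you use the AWS CLI, one way to create the bucket is the following one-liner (the bucket name here is a placeholder; yours must be globally unique):

aws s3 mb s3://your-s3-bucket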

Once that's done, you can set the pushRemoteCacheTo sbt setting in your build file like so:

pushRemoteCacheTo := Some("S3 Remote Cache" at "s3://your-s3-bucket/sbt_build_cache/")

Note that it's called pushRemoteCacheTo but this works in both directions (push and pull). There is no pullRemoteCacheFrom.
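
If you'd rather not hard-code the bucket, one option is to derive the setting from an environment variable. Here's a minimal sketch, where SBT_REMOTE_CACHE_BUCKET is a hypothetical variable name of my own choosing:

// Only enable remote caching when the (hypothetical) SBT_REMOTE_CACHE_BUCKET
// environment variable is set, e.g. on CI and on machines that have opted in.
pushRemoteCacheTo := sys.env
  .get("SBT_REMOTE_CACHE_BUCKET")
  .map(bucket => "S3 Remote Cache" at s"s3://$bucket/sbt_build_cache/")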

Credentials for S3

fm-sbt-s3-resolver has a section about credentials, but personally I didn't see the need for any of that. The default AWSCredentialsProvider should already do everything you need. For your local machine, just make sure you've already configured your credentials file (the one usually located at ~/.aws/credentials). In my case, I had already done this a long time ago, which meant I didn't need to do anything else, but if you've never done it before, follow the steps described here.
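
For reference, that file is just a couple of lines of INI; a default profile whose keys are allowed to read and write the bucket is enough (the values below are placeholders):

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>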

Similarly, our Jenkins instance was basically already configured, as it runs on EC2 and has an IAM role attached that can access the S3 bucket. The default AWSCredentialsProvider takes it from there and performs the auth. If you haven't done this before, the following article may be helpful.

And that's it. No need to set up a shared account or commit anything extra to the repository. Each developer uses their own AWS credentials to access the S3 bucket, and those can be managed (and revoked) separately.

While I've not tried this with GitHub Actions, I imagine the easiest solution would be to use the configure-aws-credentials action and follow the instructions described there.

Actual usage

In CI, when running your sbt command, be sure to include the pullRemoteCache and pushRemoteCache tasks. Something like pullRemoteCache; Test/compile; pushRemoteCache; test, or however you'd prefer to order the tasks (such as pushing the remote cache only after your tests have passed).
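
As a concrete example (adjust the ordering to taste), a CI step that only publishes the cache once the tests have passed could invoke sbt in batch mode like this:

sbt "pullRemoteCache; Test/compile; test; pushRemoteCache"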

As for usage on my local machine, I simply call pullRemoteCache followed by a Test/compile in sbt. What was once 7 minutes to compile everything now takes 15 seconds. Almost all that time is spent downloading the artifacts. The compile step itself takes 2 seconds.

Note the download from S3 only happens once per content hash. The artifacts are stored in your local sbt cache after the first download. So if you do another clean followed by a pullRemoteCache, it'll be much faster the second time.

The tricky parts

Everything up to this point went smoothly for me. As I mentioned earlier, we've been using sbt remote caching for over a year now in CI. The difference this time is that our local machines use a different environment from Jenkins. So when I pulled the remote cache from S3 (that Jenkins had uploaded) onto my Windows dev machine, I ran into unforeseen issues. While the pullRemoteCache task worked fine, whenever I went to compile anything it would do a full recompile each time.

What I later realized is that all the cache files were getting invalidated by sbt/Zinc during the incremental compilation step. The reason was that what was being passed into scalacOptions differed slightly between my machine and Jenkins. For example, if my machine uses a -Ybackend-parallelism value of 16 while Jenkins uses 4, then Zinc will invalidate all the files and recompile everything.

An even worse example is that on Windows the plugin paths are sent as -Xplugin:target\compiler_plugins\wartremover_2.13.8-3.0.5.jar and not -Xplugin:target/compiler_plugins/wartremover_2.13.8-3.0.5.jar. The backslash difference is enough to cause everything to get invalidated.
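
If you suspect something similar, a good first step is to compare the exact compiler options between the two environments. From the sbt shell on each machine you can run:

show Compile/scalacOptions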

As a workaround I did the following:

incOptions := incOptions.value
  .withIgnoredScalacOptions(
    incOptions.value.ignoredScalacOptions ++ Array(
      // Plugin jar paths differ between machines (and use backslashes on Windows).
      "-Xplugin:.*",
      // Parallelism differs between machines but doesn't affect the output.
      "-Ybackend-parallelism [\\d]+"
    )
  )

Luckily this fixed it and everything works fine now.

It took me a while to find, but if you're struggling with similar issues, the following will enable the logs you need for debugging:

logLevel := Level.Debug
incOptions := incOptions.value.withApiDebug(true)

You'll be able to see exactly what is causing the invalidation. Beware though, the logs are noisy; you may want to redirect the output to a file.
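
For example, something along these lines keeps the noise out of your terminal (the log file name is arbitrary):

sbt "pullRemoteCache; Test/compile" > zinc-debug.log 2>&1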

Additional thoughts

Consider adding a lifecycle rule to your S3 bucket that expires old objects, so these cache files don't accumulate forever. We arbitrarily went with a two-week lifetime, but use whatever value you think makes sense.

As for sbt's incremental compilation, I feel like it should be more aware of which scalac options are "safe" (i.e. don't ruin repeatable builds) and which are unsafe. Right now it seems far too aggressive to me, as any difference causes an invalidation. At the very least it should normalize paths. But perhaps this was never a concern before the remote caching feature was introduced; maybe it's something that makes sense to address now.

Conclusion

And that's it! I hope that was helpful and you're able to get remote caching working. It's definitely worth it for large projects. Surprisingly I rarely see this feature mentioned, and that's a shame because it's been tremendously helpful for cutting down our CI times. And now with local compilation too!