Remote caching with sbt and S3 across multiple machines

2022-7-11
Scala
sbt

sbt 1.4.x introduced remote caching, a feature for sharing build output between machines in order to save time during compilation. You can think of it like incremental compilation: if you change one file in a project containing hundreds, you'd rather not recompile everything. Remote caching takes this one step further. Rather than being limited to a single machine, you can reuse build output across multiple machines (or across multiple branches on your local machine, or what have you).

sbt's remote caching feature is technically still marked experimental, but we've been using it at work for over a year now and haven't run into issues so far. We originally only set up this feature in CI (in our case Jenkins) but recently I thought to myself, "Wouldn't it be nice to also re-use Jenkins' build output on my local machine too so that I never have to do a full recompile on this project ever again?" Considering the project in question is 300,000+ LOC and takes roughly 7 minutes to perform a clean build on my underpowered laptop, I figured this would be a huge win. And spoilers... it was! 😄 Now onto the setup:

Choosing the storage method

You have many options as to where to store the build cache. sbt can cache to any Maven repository, so anywhere you can host one is an option. That means a private Nexus repository would also work. In fact, here's a blog post from Muki Seiler which goes into that (as well as using MinIO as an alternative).

In our case, we didn't already have a private Nexus repository, nor were we too keen on setting one up as that meant needing another server running 24/7. We decided to try S3 instead, as it would mean less infrastructure to set up as well as being less costly.

The setup

Luckily there already exists an sbt plugin for resolving and publishing artifacts using S3 called fm-sbt-s3-resolver. Simply add the following to your plugins.sbt (or wherever you store it):

addSbtPlugin("com.frugalmechanic" % "fm-sbt-s3-resolver" % "0.20.0")

Then create an S3 bucket where you intend to store the build cache, along with a prefix (folder) within it for the cache artifacts. In the example below, the bucket is your-s3-bucket and the prefix is sbt_build_cache.

Once that's done, you can set the pushRemoteCacheTo sbt setting in your build file like so:

pushRemoteCacheTo := Some("S3 Remote Cache" at "s3://your-s3-bucket/sbt_build_cache/")

Note that it's called pushRemoteCacheTo but this works in both directions (push and pull). There is no pullRemoteCacheFrom.
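
If your build has many subprojects, you'll likely want to scope the setting at ThisBuild so that every subproject pushes to and pulls from the same location. A minimal sketch, reusing the placeholder bucket and prefix from above:

ThisBuild / pushRemoteCacheTo := Some("S3 Remote Cache" at "s3://your-s3-bucket/sbt_build_cache/")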

Credentials for S3

fm-sbt-s3-resolver has a section about credentials, but personally I didn't see the need for any of that. The default AWSCredentialsProvider should already do everything you need. For your local machine, just make sure you've already configured your credentials file (the one usually located at ~/.aws/credentials). In my case, I had already done this a long time ago, which meant I didn't need to do anything else, but if you've never done it before, follow the steps described here.
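
For reference, the credentials file is just a small INI-style file. A minimal sketch with a default profile and placeholder keys (substitute your own):

# ~/.aws/credentials (placeholder values)
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx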

Similarly, our Jenkins instance was basically already configured, as it runs on EC2 and has an IAM role attached to it that can access the S3 bucket. The default AWSCredentialsProvider takes it from there and will perform the auth. If you haven't done this before, the following article may be helpful.

And that's it. No need to set up a shared account or commit anything extra to the repository. Each developer should have their own AWS account with access to the S3 bucket, which can be managed (and revoked) separately.

While I've not tried this with GitHub Actions, I imagine the easiest solution would be to use the configure-aws-credentials action and follow the instructions described there.

Actual usage

In CI, when running your sbt command, be sure to include the pullRemoteCache and pushRemoteCache tasks. Something like pullRemoteCache; Test/compile; pushRemoteCache; test, or however you'd prefer to order the tasks (such as pushing the remote cache only after your tests have passed).
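
For reference, a CI step could invoke sbt in batch mode along these lines (same ordering as the example above; adjust to taste):

sbt "pullRemoteCache" "Test/compile" "pushRemoteCache" "test"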

As for usage on my local machine, I simply call pullRemoteCache followed by a Test/compile in sbt. What was once 7 minutes to compile everything now takes 15 seconds. Almost all that time is spent downloading the artifacts. The compile step itself takes 2 seconds.

Note the download from S3 only happens once per content hash. The artifacts are stored in your local sbt cache after the first download. So if you do another clean followed by a pullRemoteCache, it'll be much faster the second time.

The tricky parts

Everything up to this point went smoothly for me. As I mentioned earlier, we've been using sbt remote caching for over a year now in CI. The difference this time is that our local machines use a different environment from Jenkins. So when I pulled the remote cache from S3 (that Jenkins had uploaded) onto my Windows dev machine, I ran into unforeseen issues. While the pullRemoteCache task worked fine, whenever I went to compile anything it would do a full recompile each time.

What I later realized is that all the cache files were getting invalidated by sbt/Zinc during the incremental compilation step. The reason was that what was being passed into scalacOptions differed slightly between my machine and Jenkins. For example, if my machine uses a -Ybackend-parallelism value of 16 while Jenkins uses 4, then Zinc will invalidate all the files and recompile everything.

An even worse example is that on Windows the plugin paths are sent as -Xplugin:target\compiler_plugins\wartremover_2.13.8-3.0.5.jar and not -Xplugin:target/compiler_plugins/wartremover_2.13.8-3.0.5.jar. The backslash difference is enough to cause everything to get invalidated.
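
If you want to see exactly what's being passed to scalac on each machine, you can run the following in the sbt shell on both sides and diff the output:

show Compile/scalacOptions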

As a workaround I did the following:

// Tell Zinc to ignore scalac options that legitimately differ between machines
// (the entries are regexes): compiler plugin paths and the parallelism setting.
incOptions := incOptions.value
  .withIgnoredScalacOptions(
    incOptions.value.ignoredScalacOptions ++ Array(
      "-Xplugin:.*",
      "-Ybackend-parallelism [\\d]+"
    )
  )

Luckily this fixed it and everything works fine now.

It took me a while to find, but if you're struggling with similar issues, enable the debug logs with the following options:

logLevel := Level.Debug
incOptions := incOptions.value.withApiDebug(true)

You'll be able to see what is causing the invalidation. Beware though: the logs are noisy, so you may want to pipe them to a file.

Additional thoughts

Consider adding an expiration policy to your S3 bucket so that these cache files don't accumulate forever. We arbitrarily went with a 2 week lifetime, but use whatever value you think makes sense.

As for sbt's incremental compilation options, I feel like it should be more aware of which scalac options are "safe" (i.e. don't ruin repeatable builds) and which are unsafe. Right now it seems way too aggressive to me as any difference will cause an invalidation. At the very least it should normalize paths. But perhaps this was never a concern before the remote caching feature was introduced. Maybe it's something that makes sense to address now.

Conclusion

And that's it! I hope that was helpful and you're able to get remote caching working. It's definitely worth it for large projects. Surprisingly I rarely see this feature mentioned, and that's a shame because it's been tremendously helpful for cutting down our CI times. And now with local compilation too!


Getting to the bottom of Pikachu's boobs

2022-4-29
Translation
Games
Pokemon

Many Pokemon fans may already be familiar with the often cited story that when Pokemon was first pitched to be localized in the US, the American staff proposed replacement designs for Pikachu because they thought Pikachu was "too cute" for the American market. Supposedly they suggested turning Pikachu into "a tiger with huge breasts" as they thought this would appeal more to American kids.

Upon reading this for the first time, I thought to myself, "There's absolutely no way this could be true". However, when I followed the source (from this Did You Know Gaming? YouTube video), to my surprise it was Tsunekazu Ishihara, president of The Pokemon Company, who said this. And I did confirm that the original Japanese interview was not mistranslated (other than potentially translating an implicit "we" as "I").

"The first time we showed off some Pokémon in the US, we were told they were 'too cute'. The staff in America submitted their ideas for replacement designs, but we just couldn't believe the kind of stuff they were proposing.

"They turned Pikachu into something like a tiger with huge breasts. It looked like a character from the musical Cats. When I asked 'how is this supposed to be Pikachu??' they said, 'Well look, there's its tail right there'... Seriously, that was the kind of stuff being proposed.

Source (2020/7/9): Did You Know Gaming?

"How could this be!?", I thought to myself. The story makes no sense, yet here it is straight from the president's mouth.

I did some more digging as I wasn't ready to accept this. I found two old interviews with Satoru Iwata, from 2008 and 2009 respectively, where he clearly states that the Pikachu redesign they received was a large muscular Pikachu, not "a tiger with huge breasts" like the musical Cats.

They saw Pocket Monsters and said, "Creatures this cute can't be called 'monsters'. A monster needs to be muscular and more fearsome." When I received their muscular Pikachu redesign I thought to myself, "There's no way we can show this to the people who created Pokemon". I remember this like it was yesterday, even though about 10 years have passed since then.

Source (2008/10/31): Nintendo Official Website

When we were looking to expand Pokemon overseas, I can't forget what we were told. They said, "Something this cute can't be called a 'monster'". They said if you want success in the American market you need to go with a muscular redesign, such as the one they sent us.

The president of Nintendo at the time, Hiroshi Yamauchi, told me, "If there is no example [of a cute design like this] working in America, that's even more reason to give it a try. When it comes to entertainment, there is value to standing out from the rest. If something is unique and sees success, it has the potential to become really big. Go with the Japanese designs as is."

Source (2009/5/18): Toyokeizai Online

So what's going on? How are two official sources published by Nintendo on their own website disagreeing with one another?

The answer likely lies in the Japanese word 胸, which can mean both chest and breasts/boobs. What appears to have happened is that the phrase "big chest" got misinterpreted on the Japanese side first. Iwata meant "big chest" as in large pectoral muscles, but somewhere down the line in retelling the story this turned into "huge boobs", like a bad game of telephone.

I'm inclined to believe Satoru Iwata's telling of the story, as he appears to be the primary source, and because he specifically says "even today I remember their words vividly" and "How am I supposed to show these redesigns to the people who created Pokemon?". So it stands to reason that a lot of the creators of Pokemon didn't actually see the proposed redesigns and all their information is second-hand, hence why the details are different. In the following Iwata Asks, it is stated that Game Freak was busy with the development of Pokemon Gold and Silver, and that Nintendo along with Iwata (who didn't work for Nintendo at the time but was the go-between for the two) were the ones doing the work to bring Pokemon overseas.

Between Iwata's words and the sheer implausibility of Americans (who have always been sensitive about sexual imagery, especially in the 90s) suggesting such a change for a kids' show, the muscular version of the story is the one I believe. When you consider cartoons airing at the time like Teenage Mutant Ninja Turtles and Street Sharks, it's not difficult to imagine exactly what they were proposing for Pikachu's redesign.

JAWSOME! JAWSOME!

To think what could have been.