Author
Name
Ben Johnson
@benbjohnson
Image by Annie Ruygt
Weâre Fly.io. We run apps for our users on hardware we host around the world. This post isnât about our platform. Rather, itâs an elaborate plot to get you to write some code just for the hell of it.
In the field of computer science, the industry is represented by two separate yet equally important groups: the software developers who build Rails applications and mobile games, and the academics who write theory papers about why the problems those apps try to solve are NP-hard. This is a story about both.
Distributed systems span the practical-academic divide. Reading a stack of MIT PhD dissertations may be a good Friday night, but it wonât equip you for debugging a multi-service outage at 2am. That requires real-world experience.
Likewise, building a fleet of microservices wonât give you the conceptual tools to gracefully & safely handle failure. Many failure scenarios are rare. They donât show up in unit tests. But theyâre devastating when they do show up. Nailing down the theory gives you a fighting chance at designing a correct system in the first place.
The practical and academic tracks seldom converge. To fix this, we teamed up with Kyle Kingsbury, author of Jepsen, to develop a series of distributed systems challenges that combine real code with the academic rigor of Jepsenâs verification system.
We call these challenges the Gossip Glomers.
What the f$#* is a Glomer?
Itâs an elaborate pun about the CAP theorem.
How it works
You know Kyle Kingsbury from his âCall Me Maybeâ blog posts that eviscerate distributed databases. You may also have known about Jepsen, the Clojure-based open-source tooling Kyle uses to conduct these analyses. Well, Kyle also wrote another tool on top of Jepsen called Maelstrom.
Maelstrom runs toy distributed systems on a simulated network. It easily runs on a laptop. Kyle uses it to teach distributed systems. We all thought itâd be neat to build a series of challenges that would teach people around the Internet Maelstrom, and, in turn, some distributed systems theory.
Each challenge is composed of several parts:
- The workload acts as a set of clients to your distributed systems. These clients send different types of messages as defined by the challenge and expect certain constraints to be met. These workloads can vary between a simple distributed counter all the way to multi-operation, transactional database systems.
- The simulated network injects network partitions or slows messages between nodes.
- The verification system uses Jepsen to check consistency and availability constraints required by the challenge.
- And finally, the binary for the node which is written by you!
Pathway to distributed systems enlightenment
Our challenges start off easy and get more difficult as you move along. Theyâre organized into six high-level challenges with many of those having several smaller challenges within them.
First, youâll start with the Echo challenge. This is the âhello worldâ of distributed systems challenges. It gets you up and running and helps you understand how these challenges work.
Next, youâll build a totally-available, distributed unique ID generator. In this challenge, nodes will need to be coordination-free and independently generate a unique identifier for any number of clients.
After that, the difficulty starts to ramp up with the broadcast challenge. In this challenge, youâll need to propagate messages out to all the nodes in the cluster. Youâll need to ensure fault tolerance in the face of network partitions and then work to optimize your message delivery to minimize the number of messages sent within your system.
Once youâve made it past broadcast, youâll implement a grow-only counter, or g-counter. The tricky part with this challenge is that youâll need to build on top of Maelstromâs sequentially consistent key/value store.
Then youâll dive into the world of replicated logs by building a Kafka-like system. This challenge will build on the linearizable key/value store provided by Maelstrom but youâll need to figure out how to not only make it correct but also efficient.
Finally, youâll wrap up with the totally-available transactions challenge where youâll build a transactional database on various consistency levels.
A bit of history
Over the past year, weâve been growing like gangbusters. Thatâs great. But it also means weâve been hiring, and hiring is hard.
We hire resume-blind, based on work-sample tests: we have people write code and design systems, and then score those submissions based on a rubric. Weâve got criteria set up for early-career, mid-level, and team-lead developers. But we didnât have strong criteria for hiring staff engineers.
So we began tossing around ideas. In a previous life, some of us had success with a series of cryptography challenges called Cryptopals, so we figured weâd try something similar, but with a distributed systems flavor.
That sounded great but how do you actually test distributed systems to know if someone passed or failed? For weeks, we wrote up one iteration after another but none of them felt right.
Finally, we had a brilliant idea. Letâs find someone who lives and breathes distributed system validation! That someone is Kyle Kingsbury.
After working on these challenges with Kyle, we realized that they are too much fun to keep to ourselves as an internal evaluation tool. So weâre releasing them for anyone to play with.
But wait⌠thereâs more!
If you scoff in the face of cascading failures, if you bend consistency levels to your will, and if you read k8s.af post-mortems as bedtime stories to your kids, you may be interested in trying our hardest challenge.
We reserved this last challenge for evaluating our staff engineers at Fly.io. So if you think youâd be up to the challenge, weâd love to talk to you.
Next post â
Previous post â