Using TLA⁺ at Work

Designing a snapshot coordination system

Using TLA⁺ at Work

Designing a snapshot coordination system

Here’s a short report of a time I used TLA⁺ at work, with interesting results. TLA⁺ is a formal specification language that is particularly effective when applied to concurrent & distributed systems. TLA⁺ made it tractable for an ordinary software engineer to reason about a tricky distributed systems problem, and it found a bug introduced by an “optimization” I tried to add (classic). The bug required 12 sequential steps to occur and would not have been uncovered by ordinary testing.

The Setting

The year was 2017 and I had just switched teams at Microsoft to work on the Azure DNS backend. Azure DNS was served from an in-memory tree database called RingMaster, with consensus handled by a Paxos implementation called the Replicated State Library (RSL). A RingMaster cluster is a group of five nodes, one of which is elected leader (aka primary). The primary is responsible for handling all reads and writes. Writes are sent through the usual Paxos consensus dance with the other four secondaries. If the primary dies, one of the secondaries is promoted to primary and life goes on.

Every atomic change to the database is backed up to cloud storage in a transaction log. However, since bringing new nodes online takes forever if they first have to replay every transaction since the beginning of time, secondary nodes also occasionally take snapshots of their actual state that can be used to rehydrate new nodes. The rehydrated nodes then replay any transactions occurring after the snapshot to deterministically reach the most-current database state. This snapshot + diff model works well.

In Paxos, you need to perform both reads & writes through the primary to guarantee your reads reflect all committed transactions. Secondaries can lag the primary in both receiving & applying transactions, so arbitrarily-stale reads are a possibility. However, directing all reads through the primary is quite a bottleneck! There are some applications (especially in DNS) where stale reads don’t matter that much, as long as they’re bounded somehow. So there was interest in supporting read operations on secondary nodes.

The Problem

Snapshot generation posed an obstacle to read-from-secondaries working as well as it could. When a secondary decides to take a snapshot, it continues to participate in consensus and buffers transactions but blocks applying those transactions until the database snapshot is serialized & uploaded. Serializing & uploading the database snapshot takes quite a long time for production workloads! So, the secondary taking the snapshot falls far behind the most-current state. It is thus undesirable to read from a secondary while it’s taking a snapshot.

For simplicity secondaries take snapshots at periodic intervals plus some random time offset, uncoordinated. It was therefore possible for multiple secondaries to take a snapshot simultaneously. Ordinarily this wasn’t a big deal but if secondaries started being used for reads it would be very sub-optimal! So I was tasked with developing a coordination & leasing system to ensure only one secondary could take a snapshot at a time.

The Solution

After a number of false starts I ended up with a randomized turn-based design that seemed workable and easily implemented. The design introduced a new transaction type specifying which secondary node had a right to take a snapshot, along with a wall-clock timeout by which the snapshot must be completed or the secondary abort the snapshot process. The timeout window was adaptive, extending if the previous snapshot attempt timed out or contracting if it completed well before the time limit. The snapshot lease transaction originated at the primary node and was sent through the same Paxos consensus mechanism as any other transaction. Secondary nodes reported their snapshot completion status with a 64-bit transaction ID piggybacking on consensus messages denoting the highest transaction ID applied to their last snapshotted database.

A wall clock timeout is an interesting choice within the field of distributed systems, where theorists favor logical clocks composed of monotonically-increasing counters. The ideal distributed system passes the black hole test: it continues operating even if time passes at wildly different rates among its nodes, as though some of them are located near a black hole. This isn’t fully theoretical; I’ve personally debugged issues where nodes lost connection to their NTP servers and their clocks drifted enough to cause issues with auth token expiration. It was considered acceptable to use wall-clock timeouts here because (1) snapshots were not critical to any consensus process and (2) the cluster was programmed to kick out any node whose clock diverged substantially from the cluster mean. Really what this meant is the ultimate fallback was a software engineer (me) getting paged in the middle of the night if the cluster lost quorum, but that hadn’t yet been caused by time drift in several years of continuous operation.

The overall solution sounded simple, but in distributed systems complexity always lurks within the vast combinatorial state explosion of possible execution orders. I worried about two aspects in particular: primary failure during an active snapshot lease, and new nodes being rehydrated & joining consensus using the same snapshots generated by this process that itself used consensus to generate the snapshots. For this reason I chose (and was encouraged) to invest time formally specifying the system in TLA⁺!

The TLA⁺ Specification

I’d taught myself TLA⁺ from Leslie Lamport’s book Specifying Systems during one of Microsoft’s annual week-long hackathons. My formal specification skills had subsequently seen use in small projects but nothing yet of this magnitude. It was a very fun assignment! I modeled the following events in the spec:

  • A node going down, then recovering and rehydrating from the last snapshot
  • Network connections between nodes failing, then recovering
  • The primary failing, then a new primary being elected
  • The primary extending a new snapshot lease
  • Secondaries receiving a snapshot lease, completing the snapshot, or timing out

I leaned heavily on the strong assumptions afforded by building on top of Paxos consensus. Then I defined the system safety invariants, which are the core of any formal specification. Invariants are statements which must remain true in every possible state, validated by the model checker. The invariants of this system were:

  1. The primary never takes a snapshot
  2. Two nodes cannot, simultaneously, believe they are eligible to take a snapshot

The model checker found a case where the second invariant was violated! Simple as my design was, still it found death within the cold unfathomable depths of the combinatorial state space. The bug lay in the critical logic of when the primary should override an existing snapshot lease and issue another; I had specified it should do so if:

  1. the secondary specified in the current lease completed a snapshot & reported this to the primary
  2. the current lease timed out
  3. the current lease was extended to the primary itself

Points 1 and 2 were solid. The issue lay with point 3, which I thought a clever optimization: since the primary never takes a snapshot, if a secondary happened to be elected primary while also being in possession of the snapshot lease it should simply override that lease with another. A fatal error! Here is a twelve-step sequence of events leading to disaster found by the model checker, for a three-node system:

  1. Initial state with nodes n1, n2, and n3
  2. n1 elected primary
  3. n1 sends transaction 1 assigning snapshot lease to n2
  4. n1 dies
  5. n2 executes transaction 1 and marks self as able to take snapshot
  6. n2 elected primary
  7. n2 sees self in possession of snapshot lease, so (using optimization rule) pre-empts it and issues transaction 2 assigning snapshot lease to n3
  8. n2 dies
  9. n2 recovers
  10. n2 executes transaction 1 again, marks self able to take snapshot
  11. n3 executes transaction 1, sees snapshot lease does not apply to itself
  12. n3 executes transaction 2, marks self able to take snapshot

So our invariant was violated, as in steps 10 and 12 we see both n2 and n3 marking themselves as able to take a snapshot. This is all due to our “optimization” used in step 7. Readers might wonder why n2 didn’t just execute transaction 2 and thus see it no longer possessed the lease; recall that secondaries can lag behind arbitrarily in both received & executed transactions. Plugging that hole would just generate a longer error trace where n2 only had possession of transaction 1 and not transaction 2.

Thankfully after removing the “optimization” the TLA⁺ model checker gave the design its stamp of approval! This doesn’t necessarily mean the design is fully correct for all possible cluster sizes and transaction log lengths; we’d need formal proofs to show that (of which TLA⁺ is capable). However, the exhaustive testing of all possible execution orders of the events I’d defined within a finite model gave great confidence of correctness. Worth noting, the aspects of the design I’d worried about before writing the spec were only tangentially related to the areas of actual trouble. With specification complete it was on to implementation!

The Implementation

When people learn TLA⁺ they invariably ask how they can tell whether the final system written in a non-TLA⁺ language (C++, C#, Rust, etc.) actually implements their formal TLA⁺ specification. They want an ironclad chain of reasoning showing one is connected to the other! This is possible, but it carries a heavy price. Formal verification still isn’t widespread; SPARK Ada only sees use in a handful of safety-critical industries. Dafny still struggles to find purchase outside of research projects. However, the wild success of Rust shows there is some appetite among programmers for writing proofs of memory safety, as long as you’re sneaky and don’t call it that. Indeed, there is now work being done with a draft RFC to add SPARK Ada-style contracts to Rust, enabling formal verification of other properties!

It sounds strange, but this gap between formal TLA⁺ specification and implementation can be a strength. Writing a formal specification without having to prove your final implementation refines that spec gives great bang for your buck. Here is the crux: for professional programmers, writing code that implements an unambiguous formal specification is easy. Much of the difficulty when writing code comes from unclear requirements and uncertainty about how the code will interact with the rest of the system. Working from a pre-existing formal specification ensures all that heavy lifting has already been done! You’ll never halt mid-line as you’re struck by some case you didn’t consider. You just write the code as it’s been boiled down to terse logical statements in the spec. You don’t even have to pause to think about whether a comparison should be \(<\) or \(\leq\); that’s already been done for you, and exhaustively checked!

So, actually implementing & testing the specified snapshot coordination design was probably the smoothest development experience of my career. It was novel to be making changes to core systems and feeling confident doing so. It only took a handful of days. All told this project took slightly less than a month from conception to merge.

Conclusion

The project was a very positive experience and cemented TLA⁺ in my mind as the go-to tool when working on concurrent & distributed systems. I’ve continued to use TLA⁺ in the half-decade that has passed since this project, including writing TLA⁺ specs for clients as an independent contractor. If you’re interested in this service or others in my portfolio, please do contact me! I am currently open for work.

The ease of implementing a formal TLA⁺ spec has made me think of the task as more in the domain of fuzzy language processing rather than chains of hard logic. For this reason I would be interested in seeing whether LLMs could be used to check whether a program implements a TLA⁺ specification. I can see this idea being anathema within the formal methods community but believe there is utility to be found. A follow-up post might try to retroactively apply that here.

You can see the spec described in this post and its associated model files here. If you’re interested in learning TLA⁺ yourself, you can find a list of resources here. Thanks for reading!

Discussions