TLA⁺ is more than a DSL for breadth-first search

Although it isn’t usually taught that way, many TLA⁺ newcomers develop the understanding that TLA⁺ is just a fancy domain-specific language (DSL) for breadth-first search. If you want to model all possible executions of a concurrent system - so the thinking goes - all you have to do is define the following (see the sketch after this list):

  1. The set of variables modeling your system
  2. The values of those variables in the initial state(s)
  3. Possible actions changing those variables to generate successor states
  4. Safety invariants you want to be true in every state
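
To make this concrete, here is a minimal sketch of such a spec for a hypothetical bounded counter (the module and all names in it are illustrative, not taken from any real system):

  ---- MODULE Counter ----
  EXTENDS Naturals

  VARIABLE count               \* (1) the variables modeling the system

  Init == count = 0            \* (2) their values in the initial state

  Increment ==                 \* (3) an action generating successor states
    /\ count < 3               \*     a guard keeping the state space finite
    /\ count' = count + 1

  Next == Increment

  Safety == count <= 3         \* (4) an invariant to hold in every state
  ====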

The model checker will then use breadth-first search (BFS) to churn through all possible states (& thus execution orders) of your system, validating your invariants. If one of your invariants fails, BFS gives you the shortest execution path reaching that state. It even checks for deadlock by finding states with no possible successor states.
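
For concreteness, here is what the accompanying TLC model configuration might look like - a hypothetical Counter.cfg telling the model checker which definitions play which role in the sketch above:

  \* TLC checks Safety in every state reachable from Init via Next,
  \* and reports deadlock if some reachable state has no successor.
  INIT Init
  NEXT Next
  INVARIANT Safety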

You can get a lot of value out of TLA⁺ using it this way! TLA⁺ qua BFS DSL is a perfectly fine way to think of the language. If you specify your system in TLA⁺, you will have a terse, precise, readable specification that can also be checked to ensure it is free of concurrency bugs hidden deep in the wilds of combinatoric state explosion. Most working software engineers have learned BFS at some point, so they can even understand what the model checker is doing under the hood. However, a BFS DSL is not all that TLA⁺ is. In brief, here are three features of TLA⁺ that require going beyond this limited conception:

  1. Liveness checking
  2. Refinement checking
  3. Formal proofs

The second way of TLA⁺

Like all things worthy of fascination, there are several totally orthogonal ways to look at a TLA⁺ spec. It is worth knowing all of them so you can switch your perspective depending on the context. A finite state machine explored with BFS is the first way. The second way is as a set of infinite system behaviors, or execution traces.

Instead of viewing your system as running around a fixed finite state space like a rabbit, occasionally returning to states previously visited, you instead view it as generating a set of possible execution traces. These execution traces all start in the same state or set of initial states, then evolve according to the rules of your system - forever. Termination just means a given behavior loops endlessly in the same state for eternity.
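
In TLA⁺ this view is captured by writing the whole spec as a single temporal formula - a predicate on infinite behaviors. Continuing the hypothetical counter sketch from above:

  \* True of an infinite behavior iff its first state satisfies Init and
  \* every step either satisfies Next or leaves count unchanged (a
  \* "stuttering" step) - so a terminated counter simply stutters forever.
  Spec == Init /\ [][Next]_count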

Why think this way? It is necessary if you want to fully grasp liveness, fairness, and refinement. Liveness in the simplest sense is just the specification of reachability properties - does your system always eventually reach some desired state? - using something called a temporal formula. Counterexamples to liveness always take the form of your system getting stuck in an infinite loop and failing to make progress. Fairness is an assumption used to rule out specific loops as counterexamples - for example, you can assert fairness on a fail-retry action that sends a message over the network, to inject the reasonable assumption that your system eventually successfully delivers a message. Both of these things are still easily understood within the conception of your spec as a finite state machine. However, the more complicated your liveness properties become, the more you’ll find yourself forced to think in terms of a set of infinite system behaviors instead of a finite state machine.
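
As a sketch, again using the hypothetical counter: the property below asserts the counter eventually reaches its bound, which only holds if we assume weak fairness on Increment - otherwise a behavior that stutters forever is a valid counterexample.

  \* Weak fairness: if Increment stays enabled, it eventually happens.
  FairSpec == Init /\ [][Next]_count /\ WF_count(Increment)

  \* A liveness property: the counter eventually reaches its bound.
  Termination == <>(count = 3)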

The most complicated type of liveness property is refinement. Abstraction is a familiar idea to computer scientists, and refinement is how you apply it in TLA⁺. You can write a very abstract spec devoid of implementation details - network, storage, multiple processes - then refine or implement your abstract spec by writing a more concrete spec including those things. You can then use the model checker to validate that your concrete spec refines the abstract spec, and you do this by writing a temporal formula mapping the large behavior set of the concrete spec onto the smaller behavior set of the abstract spec (to be pedantic, both sets are generally infinite, but multiple concrete behaviors will map to each abstract behavior). In the simplest case this is done by ignoring some of your concrete spec’s variables. If you want, you can create a whole hierarchy of specs that are as abstract or refined as you like.
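
Mechanically, this is done with an INSTANCE substitution. Suppose a hypothetical abstract module AbstractCounter with a single variable value; inside the concrete spec, the refinement mapping and claim might look like:

  \* Map concrete state onto abstract state; in the simplest case this
  \* just renames or ignores variables.
  Abstract == INSTANCE AbstractCounter WITH value <- count

  \* The refinement claim: every behavior of the concrete spec, viewed
  \* through the mapping, is a behavior of the abstract spec. TLC can
  \* check Abstract!Spec as a temporal property of the concrete spec.
  THEOREM Spec => Abstract!Spec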

Software engineers tend not to use refinement very often, but it is incredibly powerful - it is the only way to express certain critical, non-trivial properties that constrain your system’s entire execution rather than individual states, and without it a formal specification language is much diminished.

The third way of TLA⁺

The third way to think about TLA⁺ is as a pure symbolic mathematical formula. Here we finally arrive at an understanding of why var' = newval is not just assignment and why every action takes the form of a list of conjuncts. Actions are not a set of instructions modifying system variables - actions are logical formulas that are true or false of a pair of states conjured from the mathematical ether of all possible values assigned to all possible variables. This is how we view TLA⁺ when we are writing a formal proof of an invariant’s correctness, not just for a restricted finite set of values that the model checker can handle, but for all possible values.
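
A small illustration, reusing the hypothetical counter: because an action is a predicate on pairs of states, the two definitions below are exactly the same formula - something no assignment statement could claim.

  IncrementA == count' = count + 1   \* reads like assignment, but is not
  IncrementB == count + 1 = count'   \* the same relation, flipped around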

TLA⁺ added a machine-checkable formal proof language with the release of version 2 in 2008. A joint Microsoft-Inria project developed an OCaml program called the TLA⁺ Proof Manager (TLAPM), which translates this language into proof obligations for discharge by a set of backend provers including Isabelle, Z3, and Zenon. It also offers easy extension if a researcher wants to add support for their own prover. This meta-proof-language approach is unique and has seen success, but it also suffers drawbacks: proofs written in TLA⁺ are considerably more human-readable than those written for interactive theorem provers like Rocq or Lean, but the reason a given proof obligation fails is often far more opaque than when using an interactive prover. So, it’s harder to write proofs but much easier to read them. I personally quite enjoy this tradeoff, since I can go back and look at proofs I’ve written and actually understand what the hell they’re saying.
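
To give a flavor of the proof language, here is a sketch of the standard invariance-proof shape, applied to a type invariant for the hypothetical counter spec from earlier (TLAPM splits each step into obligations for the backend provers):

  TypeOK == count \in Nat

  THEOREM TypeCorrect == Spec => []TypeOK
  <1>1. Init => TypeOK
    BY DEF Init, TypeOK
  <1>2. TypeOK /\ [Next]_count => TypeOK'
    BY DEF TypeOK, Next, Increment
  <1>3. QED
    BY <1>1, <1>2, PTL DEF Spec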

Formal proof use in industry is by all accounts even rarer than refinement, and this is not helped by the current state of TLAPM development in 2024. Unfortunately the Inria grant lapsed, and the project has been in maintenance mode for the past several years, with updates only from a small number of volunteer contributors. Now that we have the TLA⁺ Foundation and its associated grant program, we can hope to see improvements flow again soon! If you’re looking for an underserved & technically dense FOSS project to dedicate yourself to, this is a real plum option in my opinion.

A successor to TLA⁺?

Software engineers often take offense at some aspect of TLA⁺ language design. It has an abstruse, mathy vibe that, once you’ve come to believe it’s just a BFS DSL, naturally gives rise to the feeling that the language is unnecessarily complicated. This might be true! The language was not even designed with model checking in mind; quoting Lamport from the abstract of the 1999 TLC model checker paper:

Despite my warning him that it would be impossible, Yuan Yu wrote a model checker for TLA+ specifications. He succeeded beyond my most optimistic hopes. This paper is a preliminary report on the model checker. I was an author, at Yu’s insistence, because I gave him some advice on the design of the model checker (more useful advice than just don’t do it).

The paper then goes on to say:

The design and implementation of TLC were partly motivated by the experience of the first and third authors, Mark Tuttle, and Paul Harter in trying to prove the correctness of a complicated cache coherence protocol. A two man-year effort generated a 1900-line TLA+ specification of the protocol and about 6,000 lines of proof—mostly, proofs of invariance. We saw that a model checker could be used to check proposed invariants.

So from the start TLA⁺ was intended to be used as the basis for writing proofs. Not even its modern machine-checked formal proofs, just old-fashioned “read it and convince yourself it’s correct” math proofs! In the late ’90s the only tooling that existed was a parser that checked basic syntax and converted TLA⁺ to \( \LaTeX \) so it could be published in papers. A model checker was only considered as a way to ensure that the things people wanted to prove were correct (for some finite model) before actually trying to prove them (as Lamport says, it is very difficult to prove something that is not true). Fast-forward 25 years, and now the model checker is core to the way almost everybody learns & writes TLA⁺. Somehow Lamport hit it out of the park so hard that he developed a language that became popular due to a feature that wasn’t even considered when designing it!

People in the TLA⁺ community often voice the refrain that “TLA⁺ is math, not code”, which can put them in tension with software engineers who see it as a BFS DSL - not quite code, but pretty close to code. This often crops up when discussing adding a type system to TLA⁺. Almost every TLA⁺ spec includes a type invariant, which asserts the broad contours of variables’ possible values: this one is a natural number, this other one will always be a set, this one is a function from tuples to strings, etc. The objection is that TLA⁺ is so unrestrictive - and I do think this is probably true - that any attempt to bolt on a type system is bound to unduly constrain your wild imagination. Might as well just let the model checker validate your type invariant at runtime.
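
Such a type invariant usually looks like the sketch below (all names illustrative); the model checker simply evaluates it in every reachable state, no static type system required:

  TypeOK == /\ clock \in Nat                         \* a natural number
            /\ done \subseteq Nodes                   \* always a set
            /\ route \in [Nodes \X Nodes -> STRING]   \* function from tuples to strings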

But all these conceptions we have about TLA⁺ might not be equally fundamental to its real value. Languages do not last forever in their popularity; they are usually superseded by something that takes the best of their ideas, refines them, then combines them with other great ideas to make something new. I cannot gatekeep language design by declaring that all future formal languages must contain the three aspects of TLA⁺. Clearly some of its aspects are more popular & useful than others. Most excitingly, perhaps other, as-yet-unconceived aspects exist out in the formal specification design space. This is the task set for other people. I will continue working on the TLA⁺ tooling and writing TLA⁺ specs professionally; room exists for substantial improvements in development tool quality to really take TLA⁺ to the next level, especially for the proof language! I am also interested in the practicality of support for temporal quantification, to make refinement even more powerful.

The release of Crafting Interpreters has bashed open the gates for amateur language designers. Will you be the one to write a language that supersedes TLA⁺?
