Here I’ll talk about a type of TLA⁺ contract I’ve worked on a few times, and why it didn’t work out as well as hoped. I’m not trying to torpedo other peoples’ contracts here - I just hope to share this experience so others can structure their TLA⁺ contracts differently, hopefully leading to greater success for both parties and industry usage of TLA⁺ as a whole.
The proposal
The contract proposal goes like this: a client wants to build a distributed system, and has read that TLA⁺ is effective at modeling such things. Conventional wisdom holds that it’s always cheaper to catch bugs earlier in the development pipeline, and what’s earlier than the design stage? However, they don’t have any in-house TLA⁺ knowledge and are unsure about whether they want to invest in training or hiring. Instead, they email me a proposal: could I do a short 1-3 month contract where I specify their system in TLA⁺ for them? The benefits of this setup are generally reasoned as follows:
- Client gets an expert TLA⁺-powered audit of their distributed system design with scoped cost & duration, catching issues early
- Client does not lose developer time by training team members on TLA⁺ & getting them to write a newbie-level TLA⁺ spec
- Client gets an unambiguous formal specification of their design for later implementation reference or conformance testing using model-based testing or trace validation
Perceptive readers will immediately notice a tension between the last two points. What is the point of having a TLA⁺ specification if nobody on the development team is trained to use it? It is true that it’s easier to learn how to read TLA⁺ than write it, so the client generally reasons that if this scoped TLA⁺ trial proves valuable then it won’t require much investment to train dev team members to that level. However, this isn’t even the greatest flaw in our premises.
The contract
I accept the contract, because it’s in my wheelhouse and short-term contracts pay well. I have a few initial virtual meetings with the dev team and they send over their plain-language system specification documents. I ask lots of initial questions, both in the meetings and in follow-up emails, then we settle into a weekly or biweekly update cadence. I go through all the ups & downs of the TLA⁺ specification process, oscillating between feeling that a system component is intractable to model and then figuring out a satisfying way to capture it. Often I write an under-specified sub-component of the system in a way that seems reasonable, but turns out to be different from what the team had in mind. The weekly meetings are usually dominated by me asking “what do you do if this happens” and the team being “we hadn’t thought of that, how about this” which seems valuable. The list of identified problems & changes in the original specification grows, which also seems valuable.
The problems
It isn’t hard to detect a certain flagging enthusiasm from the dev team as time passes and the TLA⁺ specification grows in complexity. I will share my screen to show them a section of the spec so we can nail down the logic, and I’ll see their eyes bug out of their heads when they are assaulted by a soup of symbols they haven’t seen since university. Even if I walk them through it I’ll get a lot of glassy “right, yeah"s. My answers to questions like “why don’t we model retries” don’t satisfy them, because they don’t understand the underlying computational model enough to see how the way I’ve specified the network already models retries (and multiple receipt, and message loss, and…) implicitly. The “what do we do in this case” discussions remain productive, but it’s clear when the dev team begins to believe they’ll receive value from the specification process as conducted by me and not the final TLA⁺ specification artifact.
At some point it seems like the TLA⁺ spec has reached a sufficient level of completion and the contract ends. I send a final update, receive payment (usually after some emailed reminders, such is the life of a contractor), and that’s it.
Post-mortem
So overall this seems like a decent contract. The client got an audit of their design and a formal specification of their system. I got paid enough to cover life expenses for the next few months while making FOSS contributions. Why, then, am I writing this post?
I am writing this post because I suspect the TLA⁺ spec I wrote is never looked at again. If I hear from the client it’s because they become interested in hiring me full-time, not because they want to do more TLA⁺ work. And this gets at the core of the problem. While specifying the client’s system, I developed a very detailed theory in my head of how the system worked. By the end of the contract it was not an exaggeration to say I understood the system better than anybody on the dev team. And then I left! Sure, the client got some nonzero value from when I distilled the contents of my head into the final update email. That is nothing compared to the value the client would have received if I had remained embedded in the development process and used the theory in my head to inform the system’s implementation.
So I found myself playing the role of the proverbial forklift in the gym, lifting weights for the dev team. What mattered here wasn’t the weights being moved or the TLA⁺ spec being produced - it was the thinking that writing the TLA⁺ specification required. The important change happened within me. And I can’t even claim ignorance of this! This is something Leslie Lamport has been harping on forever, that the real value of specifying a system in TLA⁺ is that it forces you to think deeply about your system:
This all clicked when I was at Antithesis Bug Bash a few months ago and saw this talk by Ankush Desai, developer of the P specification language:
Ankush repeatedly emphasized that the real value to Amazon engineers of specifying their system in P was that it forced them to think deeply about their system. I had neglected such “soft” considerations because I was so dazzled by the TLA⁺ model-checker! This despite reading & listening to more of Leslie Lamport’s thoughts on TLA⁺ than all but a handful of people on the planet. I think it’s easy for computer scientists to become “tool-brained”, because our field has been so successful at formalizing things so we don’t need to think very hard about them anymore. Thinking, it turns out, remains incredibly powerful, and I should put more trust & value in it.
Alternatives
How could this type of contract have been more successful, given that I would rather not join the company as a full-time developer? Hillel Wayne, who runs TLA⁺ workshops for companies, says he pair-programs TLA⁺ specs with a member of the dev team instead of writing the spec himself. I think this would work although requires clients to commit to investing internal developer time, a harder up-front sell than engaging a detached contractor. That sell may simply be necessary. Contracts beget contracts; companies talk to one another, and if one says “we tried TLA⁺ but didn’t get much out of it” that’s a very strong signal that will put other companies off from trying it. So I might get paid now, but will lose out on future contracts that never materialize, and the reputation of TLA⁺ in industry suffers damage.
I do not plan to take further contracts of this type if offered, and will have to think hard about what ingredients could make similar ones work in the future.
Discussions
Discuss this post here: