A few days ago I published a short post about two bugs I wrote while developing the C++ external scanner for my TLA⁺ tree-sitter grammar. Reactions were mixed! Many people were supportive, but there were of course the usual drive-by claims by developers that the bugs were trivial, they would’ve found & fixed them inside of 20 minutes, and I was laughably incompetent for having written them in the first place. Maybe so! I’m a fan of formal methods primarily so I don’t have to be a genius to write correct code. In that same vein of building tools to save us from ourselves, one user suggested building the tree-sitter grammar with the LLVM address & undefined behavior sanitizers enabled. I’d used valgrind a long time ago but had never played around with sanitizers. I was also doing some closely-associated work to build the grammar for fuzzing with LLVM’s libFuzzer, so it seemed a fun detour to check whether those sanitizers would have saved me days of debugging pain!
The build process
Unless you want to enable this on your own tree-sitter grammar you can skip this section.
Usually tree-sitter grammars are built automatically when running tree-sitter test
or tree-sitter parse
.
However, the compilation flags are hardcoded in the CLI so I needed to write a script to build it with the flags I wanted.
Thankfully tree-sitter grammars consist of only three relevant files: a gigantic generated src/parser.c
file, a hand-written src/scanner.cc
file, and a header file src/tree_sitter/parser.h
.
These are all compiled into a shared library tlaplus.so
which is loaded by the consuming program.
It was easy enough to compile tlaplus.so
with clang, along with the -fsanitize=address,undefined -fno-omit-frame-pointer
flags.
Supposedly the environment variables TREE_SITTER_DIR
or TREE_SITTER_LIBDIR
can be used to set where tree-sitter CLI looks for the tlaplus.so
file, but I couldn’t get those to work so just copied it to the default location ~/.cache/tree-sitter/lib/
(as a funny aside, while writing this up in the build script I managed to instead create a directory named ~
so had to figure out how to delete that without nuking home).
Time to let it rip!
I ran node_modules/.bin/tree-sitter test
, and saw:
Error opening dynamic library "/home/ahelwer/.cache/tree-sitter/lib/tlaplus.so"
Caused by:
/home/ahelwer/.cache/tree-sitter/lib/tlaplus.so: undefined symbol: __asan_report_store4
Alas! The tree-sitter CLI is a rust program that dynamically loads the tlaplus.so
file, and it wasn’t loading ASAN.
I added the tree-sitter repo itself as a submodule of my grammar and set to fixing it.
Unfortunately after bashing my head against it for over an hour I couldn’t figure out how to load ASAN from rust; if you know, please answer this StackOverflow question!
Instead, I whipped up a quick C++ program that links against the tree-sitter C library directly and is itself compiled with sanitizers.
This is unfortunate; the tree-sitter CLI contains some finicky logic for parsing & running the test corpus, which I am not going to replicate in C++, so my program only reads a TLA⁺ file, parses it, and prints out the parse tree.
You can see the build script and C++ front end here.
Initial cleanup
Before I could get started replicating my bugs I had to fix all the other bugs the sanitizers found!
One class of issue came from this line in the std::vector<T>::data()
docs:
Notes: If size() is 0, data() may or may not return a null pointer.
Indeed my empty vector was returning a null pointer, which I then passed into memcpy
as variously the source or the target.
This only happened when the number of bytes to copy was also zero, so no harm done I thought.
On the memcpy
docs page though, we see:
If either dest or src is an invalid or null pointer, the behavior is undefined, even if count is zero.
Well thank goodness it didn’t decide the vector should contain the entire contents of my address space again.
Wrapping memcpy
in an if-statement fixed the problems.
Surprisingly, that was the end of it! I ran my program against the entire tlaplus/examples test corpus without any additional issues popping up. Time to reproduce those bugs.
Bug #1: the perils of null-terminated strings
Recall this one happened because I passed a std::vector<char>.data()
pointer into atoi()
, and std::vector<char>
doesn’t null-terminate its data (why would it?).
My hypothesis was this would be easily detected even when adjacent memory was unused, because the null byte atoi()
reads to terminate the “string” is itself uninitialized memory.
To fix this bug I had written my own version of atoi()
to parse the raw_level
char vector:
level = 0;
int32_t multiplier = 1;
for (size_t i = 0; i < raw_level.size(); i++) {
const size_t index = raw_level.size() - i - 1;
const int8_t digit_value = raw_level.at(index) - 48;
level += digit_value * multiplier;
multiplier *= 10;
}
At this point in the code the contents of raw_level
were all guaranteed to be ASCII numbers from 0-9, so not too bad.
I replaced it with our cursed line of code:
level = atoi(raw_level.data());
It’s worth looking at how this bug manifests; we can parse the following TLA⁺ proof:
---- MODULE Proof ----
THEOREM P => Q
PROOF
<1> P
<1> Q
PROOF BY P
<1> QED
======================
Our buggy code is concerned with parsing the number in the <1>
proof step IDs, which is the proof level.
Proofs in TLA⁺ are hierarchical with various steps themselves having sub-proofs to prove their correctness.
Ordinarily, this should give the following parse tree:
(source_file (module (header_line) (identifier) (header_line)
(theorem (bound_infix_op (identifier_ref) (implies) (identifier_ref))
(non_terminal_proof
(proof_step (proof_step_id (level) (name)) (suffices_proof_step (identifier_ref)))
(proof_step (proof_step_id (level) (name)) (suffices_proof_step (identifier_ref)
(terminal_proof (use_body (use_body_expr (identifier_ref))))
))
(qed_step (proof_step_id (level) (name)))
)
)
(double_line)))
However, about one in twenty times it generates a parse error as one of the proof steps is a completely different level.
At the time my initial thoughts were along two tracks: one, the logic in the newly-written proof step ID handler was quite complicated and an obvious source of possible bugs; two, the external scanner had previously exhibited nondeterministic behavior when incorrectly initializing its state during deserialization.
Unfortunately both of these hypotheses were way off the mark which led to a lot of frustration & wasted time.
Could sanitizers have come to the rescue?
It should point us to line 275 in src/scanner.cc
.
Let’s check!
$ test/sanitize/out/parse_tlaplus Test.tla
==929719==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000031 at pc 0x55ba5836b6f4 bp 0x7ffdd637d2b0 sp 0x7ffdd637ca70
READ of size 2 at 0x602000000031 thread T0
#0 0x55ba5836b6f3 in StrtolFixAndCheck(void*, char const*, char**, char*, int) asan_interceptors.cpp.o
#1 0x55ba5836bbd4 in __interceptor_strtol (/home/ahelwer/src/tlaplus/tree-sitter-tlaplus/test/sanitize/out/parse_tlaplus+0xffbd4) (BuildId: 1061028f7f02d346004ffa4692ca74e9d92b5cad)
#2 0x55ba5845e1f2 in atoi /usr/include/stdlib.h:364:16
#3 0x55ba5845e1f2 in (anonymous namespace)::ProofStepId::ProofStepId(std::vector<char, std::allocator<char>> const&) /home/ahelwer/src/tlaplus/tree-sitter-tlaplus/
src/scanner.cc:275:17
...
Easy-peasy. Sanitizers would have led me right to it!
Bug #2: undefined behavior creating a black hole
Recall this one resulted from calling std::vector<T>::pop_back()
on an empty vector, which is undefined behavior.
The undefined behavior was given a somewhat hilarious definition by declaring the vector to then include the entire computer address space.
The actual crash happened later on in a memcpy
when I tried to serialize this impossibly-large vector, now part of the external scanner state.
The circumstances leading to calling pop_back()
on an empty vector resulted from insufficiently-hardening my external scanner against invalid syntax.
Tree-sitter grammars are designed to be error-tolerant so they can keep functioning even while the user is in the middle of typing some code.
This extends to your external scanner code, so you have to be careful making assumptions about parser state when you encounter a given keyword.
In this case the bug came from this function:
bool handle_qed_keyword_token(TSLexer* const lexer) {
last_proof_level = get_current_proof_level();
proofs.pop_back();
lexer->result_symbol = QED_KEYWORD;
lexer->mark_end(lexer);
return true;
}
This would generate undefined behavior on this input file:
---- MODULE Test ----
op == [ ]
THEOREM TRUE
<*> A
<*> QED
====
The invalid syntax of op == [ ]
would lead to a desperate error-recovery attempt that ultimately sent the QED
keyword careening into that function when the proofs
vector was empty.
The fix was to check whether tree-sitter was expecting a QED
keyword before taking any rash actions:
bool handle_qed_keyword_token(
TSLexer* const lexer,
const bool* const valid_symbols
) {
if (valid_symbols[QED_KEYWORD]) {
last_proof_level = get_current_proof_level();
proofs.pop_back();
}
lexer->result_symbol = QED_KEYWORD;
lexer->mark_end(lexer);
return true;
}
This one didn’t take as long to find as the first one, because it was easy to see from parser debug output that the external scanner saw QED
shortly before crashing.
A quick hop into this function with gdb
led to an absurd sight (one can easily imagine that array including some nice private keys!):
$ gdb node_modules/tree-sitter-cli/tree-sitter
(gdb) break handle_qed_keyword_token
(gdb) run parse test/crash_regressions/QEDErrorRecovery.tla
Thread 1 "tree-sitter" hit Breakpoint 1, (anonymous namespace)::Scanner::handle_qed_keyword_token (valid_symbols=0x7ffff7aafe84 <ts_external_scanner_states+68>, lexer=0x555556127210, this=0x555556120518) at /home/ahelwer/src/tlaplus/tree-sitter-tlaplus/src/scanner.cc:1539
1539 last_proof_level = get_current_proof_level();
(gdb) next
(gdb) print proofs
$1 = std::vector of length -1, capacity 1 = {0, 5, 0, 0, 7566700, 21845, 33, 0,
1444139744, 21845, 14, 16, 7566700, 0, 129, 0, 55013537, 21840, -1925187476,
1208781847, 0, 0, 97, 0, 1444025872, 21845, -134776064, 32767, 0, 0, 65, 0,
1444025872, 21845, -134776064, 32767, 0, 0, 33, 0, 1444025872, 21845,
-134776064, 32767, 144, 0, 65, 0, 1836017711, 1751199589, 1702325349,
1663971186, 1701340001, 1701999663, 1769155941, 1919251572, 1651076143,
1634497536, 1937075312, 7303982, -43521, 0, 353, 0, 55014209, 21840,
-1925187476, 1208781847, 9, 0, 9, 0, 257, 0, 0, 0, 124, 0, 31, 0, 1444029712,
21845, 6, 0, 22, 0, 1444035344, 21845, 28, 0, 22, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 0,
1836017711, 1751199589, 1702325349, 1920151410, 1819553635, 1970040929,
1920216947, 1932354917, 1702130793, 1819553138, 1970040929, 32627, 64, 0, 65,
0, 1836017711, 1751199589, 1702325349, 1663971186, 1701340001, 1701999663,
1769155941, 1919251572, 1651076143, 1634497583, 1937075312, 7303982,
692846591, 0, 353, 0, 55014881, 21840, -1925187476, 1208781847, 5, 0, 5, 0,
514, 0, 0, 0, 26, 0, 31, 0...}
As an aside to fledgling programmers who are reading this post - be nice to yourself, learn how to use the debugger for the language you’re working in. Don’t do debug-by-print-statement. I’ve been down that road and learned my lesson! It takes a bit longer to set up your debugger than to write another print statement, but you should be optimizing for iteration time not setup time. And you only have to pay the setup time once. If you’re debugging, use the right tool for the job - a debugger!
On to the main event.
Will sanitizers catch this?
Confidence is high!
We’re looking for line 1540 of src/scanner.cc
:
$ test/sanitize/out/parse_tlaplus test/crash_regressions/QEDErrorRecovery.tla
==932149==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020000000b4 at pc 0x560eb67428fe bp 0x7ffcd192f4d0 sp 0x7
ffcd192ec90
READ of size 4294967292 at 0x6020000000b4 thread T0
#0 0x560eb67428fd in __asan_memcpy (/home/ahelwer/src/tlaplus/tree-sitter-tlaplus/test/sanitize/out/parse_tlaplus+0x1558fd)
(BuildId: e87ee56651ecec128eaaf4d960ff363fcf8aabdb)
#1 0x560eb67de62a in (anonymous namespace)::Scanner::serialize(char*, bool) /home/ahelwer/src/tlaplus/tree-sitter-tlaplus/
src/scanner.cc:951:40
...
Alas the error points us to the serialization function, not the undefined behavior.
I would have hoped UBSan would catch that.
I was compiling with -fsanitize=undefined
, which includes nearly all the checks; I tried adding some others but no change.
Oh well, one out of two ain’t bad.
If you know how to tweak the settings so it’s detected, please let me know!
Update!!! someone did let me know - lobste.rs user Sam James suggested building with -D_GLIBCXX_ASSERTIONS
, which Gentoo uses for their hardened profiles.
Re-running the above, we see (simplified):
$ test/sanitize/out/parse_tlaplus test/crash_regressions/QEDErrorRecovery.tla
.../stl_vector.h:1319: void std::vector<int>::pop_back() [...]:
Assertion '!this->empty()' failed.
Aborted (core dumped)
We can get the stack trace with gdb
:
$ gdb test/sanitize/out/parse_tlaplus
(gdb) run test/crash_regressions/QEDErrorRecovery.tla
.../stl_vector.h:1319: void std::vector<int>::pop_back() [...]:
Assertion '!this->empty()' failed.
(gdb) backtrace
...
#5 0x000055555574d9a7 in std::vector<int, std::allocator >::pop_back (this=this@entry=0x6070000000c0) at .../stl_vector.h:1319
#6 0x0000555555749a1e in (anonymous namespace)::Scanner::handle_qed_keyword_token (this=, lexer=lexer@entry=0x61b000000080, valid_symbols=) at
src/scanner.cc:1540
...
There it is, line 1540 of src/scanner.cc
!
Conclusion
Bit of a mixed bag. I certainly don’t feel as safe writing this code as I would in rust, even with these sanitizers enabled. Hopefully tree-sitter will add support for writing external scanners in rust at some point, but primary maintainer Max Brunsfeld is mostly occupied with the Zed editor these days (I appreciate the name, as a Canadian!) so it’ll be up to us if we want it. An easier short-term goal would be better support from the tree-sitter CLI for compiling with sanitizers or other sorts of warnings. As mentioned in the introduction, this post is a digression from my current project of subjecting the grammar to fuzzing with libFuzzer. This is my first time fuzzing anything, and it was surprisingly easy to set up - the most difficult part is figuring out what the heck is going on with the weird crashes it generates!