Here’s a short post about two bugs I wrote while writing C++ code for the external scanner of my TLA⁺ tree-sitter grammar. External scanners use handwritten C or C++ code to parse the non-context-free parts of your language. I’ll try not to dump on C++ too hard but both of the bugs are highly ridiculous and exemplify why I hope to write as little of the language as possible for the rest of my career. These aren’t bugs with C or C++ themselves (although honestly this point could be argued) but I share them in the hopes someone finds entertainment in my misery.
Bug #1: the perils of null-terminated strings
This one came from using the C standard library function atoi, which takes a C string and parses it into an int.
One great feature of tree-sitter grammars is they are easily compiled with emscripten into webassembly for web demos; however, the way this is implemented in the tree-sitter build-wasm command is by hardcoding a fixed list of C and C++ standard library dependencies.
So when you’re writing your external scanner you have to be very particular about what standard library functions you use.
All of this is to say I stored the string I wanted to parse in a std::vector<char> rather than a std::string, then used the std::vector<char>::data() function to get a pointer to the underlying data to pass to atoi().
Now, if you know anything about C at all you know that strings are “null-terminated”, which means that, unlike strings in most modern languages, they don’t store their length alongside the string data.
Instead, a C string is just a pointer to some memory address, and the string is whatever you read starting at that address until hitting a byte with value 0.
If you think this sounds like an unbelievably massive footgun your intuition is correct.
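To make that concrete, here is a minimal sketch of how the length of a C string gets discovered. The name c_string_length is made up for illustration; the standard library’s strlen does the same job.

```cpp
#include <cstddef>

// Hypothetical stand-in for strlen: walks forward from the pointer until it
// finds a zero byte. Nothing here knows how big the underlying buffer is.
std::size_t c_string_length(const char* s) {
  std::size_t n = 0;
  while (s[n] != '\0') {
    ++n;
  }
  return n;
}
```

Given the bytes '4', '2', 0, '9' this reports a length of 2. If the 0 were missing, the loop would keep walking past the end of the buffer.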
My tactic of passing the .data() pointer to atoi() worked… until it didn’t.
You see, memory that has never been used is usually zero-filled.
The memory adjacent to the end of my std::vector<char> was usually unused and thus 0, so atoi() stopped reading there.
But sometimes the memory next to my vector wasn’t unused.
Other data was stored there.
Non-zero data.
So rarely, atoi() would return an int value totally unrelated to the string value I was giving it!
The atoi() function would start reading at the .data() pointer address I passed in, happily trucking along arbitrarily far into totally unrelated memory until finally being halted by a 0 byte.
I was using this atoi() call to determine the hierarchy level of TLA⁺’s formal proof constructs, so as you can imagine it led to some very weird randomly-occurring parse errors.
It took two full days for me to track down the root cause and another two days to recover from the realization of what game I was playing.
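One way out, sketched here under my own assumptions rather than copied from the actual scanner, is to skip atoi() entirely and parse the digits while respecting the vector’s size. The function parse_level and its -1 error value are hypothetical, and the sketch assumes the <limits> header is available in your build. Unlike atoi(), it rejects whitespace, signs, and trailing non-digits instead of quietly tolerating them. Appending a 0 byte to the vector before calling atoi() would be another option.

```cpp
#include <limits>
#include <vector>

// Hypothetical helper: parses a run of ASCII digits held in a vector and
// never reads past the vector's own size.
// Returns -1 if the input is empty, holds a non-digit, or would overflow int.
int parse_level(const std::vector<char>& digits) {
  if (digits.empty()) {
    return -1;
  }
  int value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return -1;
    }
    const int d = c - '0';
    // value * 10 + d must stay within int; check before multiplying.
    if (value > (std::numeric_limits<int>::max() - d) / 10) {
      return -1;
    }
    value = value * 10 + d;
  }
  return value;
}
```

The range-based for loop is the important part: the vector knows how long it is, so nothing depends on what happens to live in the neighboring memory.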
Bug #2: undefined behavior creating a black hole
This one came from the behavior of the std::vector<T>::pop_back() function, which removes the last element of a vector.
The documentation says that calling pop_back on an empty container results in undefined behavior.
What is undefined behavior?
Anything at all!
In the implementations I’ve seen though, calling pop_back on an empty vector just leaves the vector unchanged but decrements the size of the vector.
The size of an empty vector is zero.
One less than zero is -1.
But std::vector<T>::size() returns an unsigned size_t, so a vector with size -1 will, when size() is called, wrap around and report the largest possible value of size_t.
Your vector now includes absolutely everything in memory after its starting point.
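The wraparound itself is ordinary unsigned arithmetic and can be shown without invoking any undefined behavior. In this sketch one_less_than and kMaxSize are made-up names, and the code only models the size calculation; it deliberately does not call pop_back on an empty vector.

```cpp
#include <cstddef>
#include <limits>

// Unsigned subtraction wraps around modulo 2^N, which is well-defined in C++.
// This models the size arithmetic only; it is not what any particular
// standard library is guaranteed to do after undefined behavior.
std::size_t one_less_than(std::size_t n) {
  return n - 1;
}

// The largest value a size_t can hold, for comparison.
constexpr std::size_t kMaxSize = std::numeric_limits<std::size_t>::max();
```

Here one_less_than(0) equals kMaxSize, which on a typical 64-bit platform is 18446744073709551615.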
Imagine you reached into a bag of chips only to find you’d already eaten the last one. However, your futilely grasping hand somehow triggers the incursion of every object in the known universe into the chip bag. Within nanoseconds a black hole forms and you are annihilated. This is roughly the experience of writing C++.
Anyway, calling pop_back on an empty vector was the result of a logic bug.
However, it took a while to track this down because the error manifested much later on, when I was trying to serialize the external scanner state and return control to the main grammar code.
I was calling memcpy trying to cram the entire contents of my computer’s address space into a poor buffer that had no idea what was about to hit it.
Thankfully a segfault inside memcpy saved it from this impossible obligation.
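In hindsight, two small guards would probably have surfaced the problem much closer to its cause. The following is a sketch under my own assumptions, not the scanner’s real code: try_pop_back, serialize_state, and the capacity parameter are all hypothetical names, with capacity standing in for whatever size limit the destination buffer has.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical guard: pops only when there is something to pop and reports
// whether it did, so the logic bug shows up at the call site.
template <typename T>
bool try_pop_back(std::vector<T>& v) {
  if (v.empty()) {
    return false;
  }
  v.pop_back();
  return true;
}

// Hypothetical serializer: copies the state into buffer only if it fits.
// Returns the number of bytes written, or 0 if the state is empty or too big.
// The early return also avoids handing memcpy the data() pointer of an
// empty vector.
std::size_t serialize_state(const std::vector<char>& state, char* buffer,
                            std::size_t capacity) {
  const std::size_t n = state.size();
  if (n == 0 || n > capacity) {
    return 0;
  }
  std::memcpy(buffer, state.data(), n);
  return n;
}
```

The first guard is the real fix, because once undefined behavior has happened no later check can be trusted. The size check before memcpy is a second line of defense against an ordinary oversized state.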
Conclusion
From now on I will go to tremendous lengths to only ever write low-level system code in Rust or SPARK Ada. I suppose none of this code is “modern” C++ but it’s nevertheless C++ code I wrote in the modern day. If you love C++ then all the more opportunities for you; I never again want to worry about my program’s pointers rambling off into the memory hills like a runaway truck with a brick on the gas pedal.