[info] it's a good thing I like cold turkey
Eugen Leitl
<eugen at leitl.org> on
Tue Dec 11 17:56:00 UTC 2007
http://blog.kickin-the-darkness.com/2007/12/coding-war-story-whats-your-point.html
A Coding War Story: What's Your Point?
I had been assigned the task of porting a fairly large (about 400 KSLOC)
missile launch command and control system to an upgraded OS version and new
compiler and language version. Specifically, from Solaris 2.5.1 to Solaris 7,
and from the Verdix Ada Development System (VADS), which was Ada 83, to
Rational Apex Ada, which was Ada 95. VADS had been bought out by Rational,
and its product obsoleted, although Rational did a pretty good job
implementing compatible versions of VADS-specific packages to ease the
transition to the Apex compiler.
Three other guys helped with the initial compilations, just to get clean
compiles of the code, which took about two weeks, and then I was on my own to
actually make the whole system work. Long story short, it was the worst
design and implementation of a software system I'd ever seen, and so took
about two more months to successfully complete the port. It was then handed
over for formal testing, which took several months as well. I fairly steadily
fixed the bugs that were found as testing got going, but that rate quickly
declined as it progressed (the original code was a production system after
all, so its functionality was pretty solid, I just had to kill the bugs that
came about due to adapting to the new compiler). Eventually I was reassigned
to another project once everything appeared to be working as well as the
original.
Then came the phone call on the Friday before Thanksgiving.
There was a missile test scheduled in about three weeks, and during a lab
countdown test the command sequencing had locked up. In real life this would
cause a test abort, and if this lock-up occurred within seconds of ignition,
a number of irreversible actions would have taken place in support systems,
causing a lengthy--and expensive--delay for reprocessing the missile. The
missile would not have launched, but there would have been many, many very
unhappy people seriously distressed over issues of time and much, much money.
(Don't let anyone ever tell you that the Defense Department is cavalier about
spending money--I've yet to meet a contract manager for whom budget wasn't
their number 1 or 2 priority, with schedule being the other.)
Now this countdown test and many variations of it had been run hundreds of
times in the preceding months, with only a handful of minor glitches. So this
problem had a very low probability of occurrence, but unfortunately possessed
a very high cost of occurrence. Multiply those together and the product was a
bad Thanksgiving week for me and dozens of other engineers and managers.
As the guy who did the port this put the spotlight right on me.
Like most safety-critical defense systems like this, a lot of logging is
captured, so it was fairly easy to locate the handful of lines of code that
had been most recently executed when the system froze. And of course there
was absolutely nothing questionable in those lines of code, and these same
statements had already successfully executed literally thousands of times
during that same run.
We put the Apex guys at Rational on notice, since it was their compiler and
some of their vendor-supplied routines were being called in this area, and it
was impressed on them (and everyone) that this was a problem of literally
national importance that had to be tracked down. So they got their
Thanksgiving week trashed as well.
Since the logs could only tell us so much, we needed to try to repeat the
problem in the local lab. For something that pops up in only 1 in a 1000 test
runs that's not going to be easy. Amongst the conjectures as to root cause
was that a call into a vendor-supplied mutex (part of a VADS migration
package) Unlock function was not unlocking. The processing thread that made
this call was handling a heartbeat message that nominally arrived every few
seconds. So we tried upping the rate on that heartbeat to 10 Hz, i.e., 10 per
second, and kicked it off. About an hour later the system locked up. And,
when reviewing the logs we saw that the same sequence of logged messages was
occurring as had taken place in the failed run. Several runs were made, and
it would consistently lock up sometime between 45 and 90 minutes after
starting, and each time had the same log trace. So even though we were not
now technically running the same code--because of the increased heartbeat
rate--the behavior was consistent and so we had high confidence that this
stressing scenario was triggering the same problem.
The trick now was to figure out exactly where in the sequence of candidate
statements the lock up was occurring.
The implementation of this system used Ada tasking, and used it
extraordinarily poorly. Tasking is Ada's high-level concurrency construct,
sorta like threads, only built into the language itself. When two tasks
communicate, they do it by "rendezvousing", at which time they should
exchange any data of interest, and then break the rendezvous and resume their
independent executions. This system wasn't implemented that way. Instead,
once rendezvous had been made with a target task, that target task would then
rendezvous with another task, which in turn would rendezvous with another
task, and so on, until eventually some processing would get done, after which
all the rendezvous would be broken and each of the tasks would go on their
merry way. So what you ended up with was the world's most expensive function
calls, bringing an entire, "multi-tasking" process to a halt while it
processed a piece of incoming data. It was only because the normal throughput
was so low that this hadn't caused performance problems in the past.
The point of this digression about tasking, though, is that when a rendezvous
is requested or awaited upon, a "task switch" can occur. This means that the
CPU can start processing a different task that's ready to run. So when one
task becomes ready to rendezvous with another, a different task may jump in
line and get executed, with control eventually getting passed back around to
the rendezvousing tasks. Now there are other events that can also cause a
task switch, one of which is calling an OS function, like what happens with
printing or performing a mutex.
So in tracking down exactly which line was causing the problem I had to find
a way to record the progress through the sequence of statements--while not
triggering a task switch, which could prevent the problem from occurring. So
doing Put_Line() was not an option, no system I/O of any sort could be done.
I could set a counter variable or something like that, but how do I see what
its value is to tell me how far it got, since I can't print it out?
Now one thing that had been observed in the log files about this executable
was that while this heartbeat processing froze--which ultimately led to the
process' I/O getting all blocked up, and preventing other necessary
processing from occurring--other independent tasks within the executable
continued to run. So the process as a whole wasn't getting blocked, just a
(critical) task chain within it.
This was the wedge needed to get at locating the offending statement.
I created an Ada package containing an enumeration type, a global variable of
that type, and a task. The enumeration literals were keyed to the specific
statements in the problematic code sequence (like
"Incrementing_Buffer_Index", "Locking_Mutex", "Mutex_Unlocked", etc.) and
then into that sequence were inserted assignment statements that assigned the
corresponding enumeration to the global variable. Because the object code for
this was nothing more than storing a constant into a memory location it was
extremely unlikely that a task switch could occur by executing such a
statement. In fact, our primary suspicions centered on those statements that
involved task switches, since the locking up behavior was consistent with
execution not resuming (for some reason) after a task switch back.
The monitoring task then itself did nothing more than loop and periodically
check to see if the global variable had changed value. Every time it did, it
printed out the value to a file. It then delayed for a small interval, and
made its next cheecifically each of those that the variable last held before
a task switch occurred. And when the freeze happened, that global variable
value should no longer change and the last one recorded will indicate from
exactly which statement execution never resumed.
Ran the instrumented executable. It froze up. And the monitoring worked like
a charm.
The logging of the progress monitoring variable displayed exactly the
anticipated sequence, which eventually ceased with a value corresponding to
having made a call to the Mutex Unlock function, with the value that should
have been stored signaling the resumption of the task never showing up--like
it had in the thousands of previous invocations.
So over to you Rational. The Apex engineers during this time had been
feverishly analyzing their code and had found a place in the mutex code where
it could theoretically block for good, but the odds of that happening were
very remote because of everything that had to happen with the right
sequencing and timing. Murphy's Law, guys, Murphy's Law.
What I did to work-around this was to replace the calls to the vendor's mutex
functions (which were built atop the OS' mutex functionality) protecting this
particular sequence of code with a quick little native Ada mutex package,
using that to control mutex access to the relevant area.
I put this into the code and reran the test. Seven hours later it was still
running.
My mutex package code was given to Rational who compiled and disassembled it
and verified that it was not using the same approach that the problematic
mutex functions were using.
I then had the most well attended code inspection of my career :-) There were
nearly a dozen engineers and managers in the room with me, and at least
another dozen dialed in from all over the country, all to inspect about 20
lines of code.
It passed, the new executables were formally built, and it was handed over to
the test organization for formal regression testing. A couple weeks later the
missile countdown proceeded flawlessly and away it went.
It's a good think I like cold turkey.
------------------------------------------------------------------------
Okay, this is all well and fine, but what's really the point of a coding
working on this problem, though having done the original port I was of course
the primary focus. But even though I did the porting, that doesn't mean I had
intimate knowledge of hundreds of thousands of lines of code--or even a
decent overview of it. Other engineers around the country were looking
through the code and the logs as well, but I found that when they proposed a
hypothesis to me about a root cause, it never took more than 30 seconds on my
part to dismiss it, likewise when I was requested to provide various analyses
I would shove it off on to someone else because it was clear to me they were
on the wrong track. Sound like arrogance on my part? Well, yeah, it does, but
that's not why I dismissed these hypotheses and requests.
It was because I knew what the nature of the problem was. I didn't know
exactly where it was occurring, nor why it was occurring, but I did know what
was happening.
I've built up a lot of experience and knowledge over the years--I was an
early adopter of Ada, understand concurrency and its pitfalls, I know how Ada
runtime libraries handle tasking and concurrency, and I understand low-level
programming at the level of raw memory, registers, and assembly language. In
other words, I have deep knowledge of my niche of the industry. All of that
was brought to bear in successfully tracking down this problem--not just
working around the bug, but understanding how to put together an approach to
finding the bug in a very sensitive execution environment.
The specifics of a coding war story probably aren't all that interesting to
those who aren't familiar with the particulars of its nature and environment,
but they are useful for gleaning an understanding of what it takes to solve
really difficult problems.
To solve the really difficult problems you need to be more than a coder, you
have to understand the "fate" of that code, how it interacts with its
environment, and how its environment itself operates.
Then you too can get your Thanksgiving holiday all messed up.
More information about the info
mailing list