Wednesday, October 15, 2008

atexit handlers in mixed-mode applications

Recently I came across a strange problem in a mixed-mode application: some handlers registered with atexit were no longer being called. When I stepped through the destructors of global and static objects that run at application shutdown, the loop calling the atexit handlers was aborted at a seemingly arbitrary point, with no instruction in the code that would do so and without any exception being thrown. Handlers that were still reached in one debugging session were not reached in another.

Perhaps you guessed it; it was a kind of threading problem. But wait: why should there be several threads active when the application is shut down? Certainly our code didn't do this. No, as I found out, the .NET framework code is responsible. As Chris Brumme wrote a few years ago in an excellent blog post:

"We run most of the above shutdown under the protection of a watchdog thread. By this I mean that the shutdown thread signals the finalizer thread to perform most of the above steps. Then the shutdown thread enters a wait with a timeout. If the timeout triggers before the finalizer thread has completed the next stage of the managed shutdown, the shutdown thread wakes up and skips the rest of the managed part of the shutdown. It does this by calling ExitProcess."

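To make the watchdog pattern from the quote more concrete, here is a minimal C++ sketch. It is not the CLR's code: the helper name run_stage_with_watchdog, its parameters, and the use of a detached std::thread are all assumptions made for illustration. The calling thread plays the role of the shutdown thread, the worker plays the role of the finalizer thread.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not CLR code): runs `stage` on a worker thread and
// waits at most `timeout` for it to finish.
// Returns true if the stage completed in time, false if the watchdog fired.
bool run_stage_with_watchdog(std::function<void()> stage,
                             std::chrono::milliseconds timeout) {
    assert(stage && "stage must be callable");

    struct State {
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
    };
    // Shared ownership keeps the state alive even if we stop waiting
    // while the worker is still running.
    auto state = std::make_shared<State>();

    std::thread worker([state, stage = std::move(stage)] {
        stage();
        {
            std::lock_guard<std::mutex> lock(state->m);
            state->done = true;
        }
        state->cv.notify_one();
    });
    worker.detach();  // the waiting thread never joins a stuck worker

    // Wait until the stage reports completion or the timeout expires.
    std::unique_lock<std::mutex> lock(state->m);
    return state->cv.wait_for(lock, timeout, [&] { return state->done; });
}
```

In this sketch a timeout is merely reported to the caller. According to the quote, the CLR's shutdown thread reacts to it by calling ExitProcess, which is why a stage that hangs just disappears without any visible error.
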
So the atexit handlers are called by the finalizer thread. This was the first surprise for me. It seems to have been done this way on purpose (the CRT exit function is called in a handler for an AppDomain.DomainUnload event), but I wonder whether it is correct. Isn't native code allowed to use thread-local storage in an atexit handler (or in the destructor of a global)? Well, I now know that it isn't allowed in mixed-mode assemblies, though that wasn't our problem.
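
Why thread-local storage is a problem here can be shown with a small standard C++ sketch. It does not involve the CLR at all; the variable tls_log_level and both functions are hypothetical names, and a plain std::thread stands in for the finalizer thread.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread setting, for illustration only.
thread_local int tls_log_level = 0;

// Stand-in for a cleanup handler that reads thread-local state.
int handler_sees_log_level() { return tls_log_level; }

// Runs the handler on a separate thread, the way a finalizer thread
// would, and returns the value the handler observed there.
int run_handler_on_other_thread() {
    int seen = -1;
    std::thread other([&seen] { seen = handler_sees_log_level(); });
    other.join();
    assert(seen != -1);  // the handler did run
    return seen;
}
```

If the main thread sets tls_log_level to 3, a handler called on the main thread sees 3, while the same handler called through run_handler_on_other_thread sees the other thread's own copy, which still holds the initial value 0. A handler that expects to find the main thread's data would silently work with the wrong state.
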

The other surprise is that the application is killed by that watchdog thread after some timeout. I honestly can't imagine the reasoning behind this. It's horrible for debugging and error detection. If an application takes too long to shut down, well, then it can be killed from outside, but it should never be arbitrarily killed by the framework infrastructure.

In my case, it turned out that one developer had written an unregistration mechanism which was used by code in the destructor of a function-static object. The unregistration used a hand-written loop, and he had simply forgotten to increment the loop variable. (That's why I tell people not to write loops by hand. Use STL algorithms or Boost.Foreach.)
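
I can't show the original code, so the following is only a guess at the shape of the bug, with made-up names (Listener, unregister_listener). The broken loop appears as a comment; the function itself uses the erase-remove idiom, which leaves no loop variable to forget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical listener type, for illustration only.
struct Listener { int id; };

// The kind of hand-written loop that never terminates once it meets an
// entry that is not `target`, because `i` is never advanced:
//
//   for (std::size_t i = 0; i < registry.size(); /* ++i forgotten */) {
//       if (registry[i] == target) registry.erase(registry.begin() + i);
//   }

// Removes every registration of `target` and returns how many were removed.
std::size_t unregister_listener(std::vector<const Listener*>& registry,
                                const Listener* target) {
    assert(target != nullptr);
    const std::size_t old_size = registry.size();
    // std::remove moves the entries to keep to the front;
    // erase then drops the leftover tail.
    registry.erase(std::remove(registry.begin(), registry.end(), target),
                   registry.end());
    return old_size - registry.size();
}
```
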

Normally, such an error is easily detected, because the application simply hangs. During shutdown, however, the watchdog killed the application, so it just took a bit longer to close, which wasn't really noticeable unless you looked for it. And because that watchdog is active even while debugging, it wasn't easy to find that loop either.

  1. Don't write your own loops.
  2. Be careful what you do in destructors.
  3. Don't do unnecessary cleanup in global or function-static objects. The application is going down anyway, so why unregister at all?
  4. Don't put application logic in atexit handlers. That's very late and risky; e.g. you can't use global objects any more. It is better to call that code once from your own "application shutdown" logic.
  5. Debugging errors which occur at application shutdown is hard.
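
As a sketch of what point 4 could look like in practice: instead of handing cleanup code to atexit, collect it in an object you own and run it explicitly before main returns, while all globals are still alive. The class name ShutdownTasks and its interface are assumptions made for this example, not an existing library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: collects cleanup tasks and runs them on demand,
// in reverse registration order (the same order atexit would use).
class ShutdownTasks {
public:
    void add(std::function<void()> task) {
        assert(task && "task must be callable");
        tasks_.push_back(std::move(task));
    }

    // Runs all pending tasks once and returns how many were run.
    // A second call finds nothing left to do and returns 0.
    std::size_t run() {
        std::vector<std::function<void()>> pending;
        pending.swap(tasks_);
        std::for_each(pending.rbegin(), pending.rend(),
                      [](const std::function<void()>& task) { task(); });
        return pending.size();
    }

private:
    std::vector<std::function<void()>> tasks_;
};
```

Calling run() as the last step of your own shutdown sequence keeps the cleanup on a thread you control and outside the watchdog's timeout, so a hanging task shows up as an ordinary hang that can be debugged.
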