Release of Quay 1.1 has been stuck at the “one more bug” stage, as I’ve mentioned. Today I’ve finally unterstood what is happening, and a fix should be easy – I only have to rebuild most of my inter-thread communication. Herewith the cautionary tale.

Quay 1.01 was (as its help file boasted) “just a simple Cocoa app”. Indeed, clicking on a Quay item in the Dock opened the “QuayMenu” background application (which uses the NSUIElement flag). QuayMenu was optimized for quick startup – it read that item to get the folder it was pointing to, read the folder contents, showed a popup menu with them, and quit again when the menu was closed. Nothing easier; a very linear process.

Then in 1.1, all this changed. Now QuayMenu runs all the time in the background. There are two threads (plus a few more that don’t matter for this discussion). One thread listens (via Quartz Event Taps) for clicks on the Dock’s icons, finds out which icon it was, and if Quay should handle it, does a performSelectorOnMainThread: passing data about the icon to the main thread.

The main thread, in its turn, waits until it gets such an event and then pretty much does what the original 1.01 did: puts up a menu with the folder contents. It also shows an arrow over the clicked icon, both to show which icon it was and to allow the user to distinguish Quay menus from Dock menus.

Of course the arrow has to be in a window to be shown onscreen. 1.01 had an invisible window, since a popup menu has to be associated with a window – but now the window has to be shown only when the menu is onscreen, and hidden again when the menu is taken down.

Then around 1.1b2 I got complaints that, after a menu was canceled, the application that was active at the time of the initial wasn’t properly restored to the background – easy to fix by hiding the QuayMenu app after the menu went down. Soon after that, I thought that it would be prudent to imitate the Dock behavior when successive icons were clicked; that is, clicking on one icon, then on another, would produce two successive menus without any need to cancel the first one. And it also would be nice to be able to cancel a long menu while it was being built.

OK, so I introduced some inter-thread communication – mostly some flags and a cancel message – to take care of that. And after some fiddling it seemed to work, some conveniently ignored exceptions excepted. Then my MacBook Air arrived (my first multi-core Mac), and I noticed that the exceptions were more frequent; in fact, about 1 in 10 times, when I clicked on a second icon the first memu was taken down, and the second one either didn’t appear at all or it only flickered onscreen very briefly.

Well, this of course indicated a threading problem – as my experience with interthread communication was a little limited, that was to be expected – and I tried to find out where I was doing something dumb like not properly locking a shared resource or whatever. Since the bug never happened when I tried to break somewhere and single-step from there, it was indeed related to timing. But where?

Tons of NSLog()s were sprinkled about and everything seemed to be working properly. The messages from the first thread left at the proper time, the main thread caught them, the first menu was closed when necessary, the second menu was popped up as expected, but it mysteriously closed down immediately (1 in 10 attempts) or not (9 in 10 attempts). Everything else was as expected. Moreover, no canceling message was sent, or left over, that could close the second menu! It… just curled up and closed. Argh.

Finally – and it was really dumb of me not to think of that immediately – it occurred to me to look at the value returned to my Carbon Menu event handler for the kEventMenuEndTracking event. (Yes, I have a Carbon event handler because for complex UI reasons I need more control – and a shorter build time – than a Cocoa menu could give me.) Turns out that 9 out of 10 times the menu gets closed by kHIMenuDismissedBySelection (when the user picks a menu item); by kHIMenuDismissedByUserCancel (when the user closes the menu without selecting any item); or by kHIMenuDismissedByCancelMenuTracking (when my first thread canceled the menu because the user clicked on another Dock icon).

However – and here was the smoking gun – 1 out of 10 times the menu turned out to close by kHIMenuDismissedByAppSwitch (meaning QuayMenu went into hiding to put another application into the foreground). Now, that should indeed be happening briefly after the first menu was closed, but before the second menu was opened QuayMenu should jump into the foreground again. Now that I considered it, that’s a somewhat costly process that should be optimized out. But why was the application hiding after the second menu popped up instead of before? The sequence of events shouldn’t allow that at all!

Well, at least now I had a clue of where to put my NSLog()s and after some hours of sweating I had the answer: putting all that window hiding and showing in with application hiding and showing, and trying to optimize that by balancing the load over two threads – in brief, putting all urgent stuff into the first thread and all the potentially slow stuff and UI into the main thread – was fundamentally broken. It shouldn’t have worked at all, ever! So why was it failing only 1 in 10 times?

At this point I was visiting my mother and I was attempting to explain to her what was going on. Now, my mom is 95 – she’s still very sharp but knows absolutely zilch about programming. On the other hand, she’s an excellent work optimizer. I suppose 80+ years of kitchen duty will do that to you. At any rate, I was trying to explain what I was doing by resorting to a metaphor.

Assume two blind cooks working in a restaurant. The specialty is pancakes. An order for a pancake comes in to the first cook. He checks out (by Braille I suppose) what goes into the pancake, throws the ingredient into his pan, waves them briefly over his stove, then with a practiced flip of his hand he flicks the pancake to the second blind cook who’s supposed to finish the pancake.

Since they’ve practiced this a lot, and worked out some sort of signals, the second cook knows when to put up his frying pan just in time and in the right position to catch the pancake flying into his direction, and after that it’s easy. He tosses the pancake to a waiter who in turn serves it to the client.

Now, what was happening is that when a second order comes in before the first pancake is finished the second cook has to clean his pan. OK, we may have ridden the metaphor here until it breaks down (hehe) but let’s assume that he usually manages to do so, and proceeds to finish the second pancake in peace. Maybe the client has changed his mind about what sort of pancake he wanted, or something.

Only 1 times in 10 something weird happens and the second pancake is also thrown out unfinished. The first cook notices nothing amiss; he’s getting orders, setting up, and throwing the pancakes. The second cook notices nothing amiss; he’s catching the pancakes like he’s supposed to, cleaning his pan and all when necessary. It’s just 1 in 10 clients that complain because they never get their pancake!

Well, said my mom, so who was stealing the pancake then? The only explanation I found was that the kitchen shouldn’t have worked at all – it was working just by coincidence! In other words, the second cook wasn’t tossing the pancake to the waiter; he was tossing it out of the window! It just happened that at that crucial time a providential windmill wing swung by in front of the window, the pancake bounced off that and, who’da thought, right on the waiter’s plate!

Of course such a thing can’t work reliably and now and then one of the cook’s timing was off just enough to make the pancake miss the windmill wing, and the pancake would be gone with none of them noticing.

So my mom and I had a good laugh (“I can’t believe you actually think of your work that way!”, she said) and I went home. Tomorrow I must redo the whole kitchen routine. Stand by for news.