This is a story about a recent issue discovered in pacman, the Arch Linux package manager, and the difficulties we had hunting it down…  The story is long, but so was the process of finding the bug.
It all started on a warm summer’s night (in my timezone and location… – it was probably cold and daytime for the other main pacman developers) with the reporting of FS#27805: “[pacman] seg faults when removing firefox”.  Of course, my initial reaction was “bull shit” as we all know there are no bugs in the pacman code.  But this was only a couple of weeks since pacman-4.0 was moved into the Arch Linux [core] repo so there was an ever so slight possibility it was real.
Luckily for us, the user reporting the bug was very helpful and installed a version of pacman with debugging symbols and gave us a full backtrace.  It was very clear where the segfault was occurring:
#0 0xf7fbd4e7 in _alpm_pkg_cmp (p1=0x8128aa0, p2=0x0) at package.c:644
That function is called in the package removal process when we check that a file that is going to be removed with a package is not also owned by another package (which would require someone using -Sf when they should not).  If the package in the local database is the same as the one being removed, we do not need to run this check, and hence the test.  As you can see above, for some reason _alpm_pkg_cmp is being passed a null pointer as the package from the local database and KABOOM!
So the question was, how do we get a null value for the package from our local database?  Given pacman runs through the list of local packages on each package removal, this null entry must have been generated on the removal of the previous package.  Here is a bit of background on how package information is stored in pacman.  Package information is stored in a hash table that also provides access to the data as a linked list.  This provides us with fast look-up by a package’s name but also allows us to loop through the (generally sorted) package list.  Now the hash table code is fairly new (first introduced in pacman-3.5) and the removal of items from a hash with collision resolution done by linear probing is not straightforward, so there could be a bug.  Dan pointed his finger my way as I wrote the original hash table code and I pointed my finger his way as he made optimizations to the removal part.  But it turns out that neither of us was thinking too hard.  It is the list that is being corrupted, and items are removed from the list using code that has been around for years.  Despite that, the whole hash table and linked list removal code got an in-depth review and no issues were found.
We were stumped.  Looking at the debug output from pacman, we could see that a file that actually did not exist on the system was being “removed” right before the crash, but that is not uncommon and appeared to be handled correctly so was unlikely to be the cause.  So back to the reporter to see if we could get more information to replicate.  He was very helpful and provided us with a copy of his local package database.  We created a chroot with exactly the same packages and had no luck replicating.  The user even provided us with a complete copy of his chroot where the error was occurring, but again there was no luck replicating.  It must be something specific to that user’s system.  Right?  Well, even re-extracting the tarball of the chroot the user provided us onto his own system made the bug go away.  All in all, a great candidate for being “not a bug”….
Until on another warm summer’s evening, while being my usual extremely helpful self on IRC, someone mentioned they were getting a segfault while removing packages.  A bug report was filed and, again, the user was extremely helpful and the backtrace provided was exactly the same.  A core dump showed us there was definitely something wrong with the linked list.  Well…  bugger!  This bug appears real.  Again the red herring of the removal of a non-existent file was shown in the debug log, but it would be very, very strange for that to break the linked list of package information so it was ruled out.
It was time to find a reproducer!  So I created a chroot and set this script running:
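# Install and then remove random packages from [extra] until a removal fails
ret=0
while (( ! ret )); do
  # pick one random package name from the extra repo
  pkg=$(pacman -Sql extra | shuf -n1)
  pacman -S --noconfirm "$pkg"
  # -s also removes dependencies that are no longer needed
  pacman -R --noconfirm -s "$pkg"
  # exit status of the removal; non-zero (e.g. a segfault) ends the loop
  ret=$?
done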
Within five minutes I could replicate the segfault.  (It turns out I was very lucky as I ran the same script again for over four hours and did not strike the issue.)  Now it was time to get debugging!
The first thing I did was print some debugging info in the linked list node removal code, but for some reason the node removal just before the segfault did not print anything.  I was only printing information when removing a node from the middle of the list (because that is where the package causing this issue was located), but just to be sure I also added debug statements for the case of removing the head and tail nodes.  And then pacman told me it was removing a node from the end of the list…  “Why do you think that package is at the end of the list, pacman?”, I asked.  “Because the head node’s prev entry tells me it IS the end of the list”, replied pacman.  “Oh, crap”, I said. “So it does!”  Something was clearly wrong here.
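To make that exchange concrete, here is a minimal sketch of how removal code can be misled by a stale tail pointer. It assumes a list convention in which the head node's prev entry points at the tail and the tail's next entry is NULL; the names node_t, position_t, classify and demo_stale_tail are hypothetical and are not taken from the pacman source.

```c
#include <stddef.h>

/* Hypothetical node type. Assumed convention: head->prev points at the
 * tail node, and tail->next is NULL. */
typedef struct node {
    void *data;
    struct node *prev;
    struct node *next;
} node_t;

typedef enum { POS_HEAD, POS_TAIL, POS_MIDDLE } position_t;

/* Decide which removal case applies to item, trusting head->prev. */
position_t classify(const node_t *head, const node_t *item)
{
    if (item == head) {
        return POS_HEAD;
    }
    if (item == head->prev) {
        return POS_TAIL;
    }
    return POS_MIDDLE;
}

/* Build a <-> b <-> c, but let the head wrongly record b as the tail,
 * then report how the middle node b is classified. */
position_t demo_stale_tail(void)
{
    node_t a = {0}, b = {0}, c = {0};

    a.next = &b; b.prev = &a;
    b.next = &c; c.prev = &b;
    a.prev = &b;               /* stale: should be &c */

    return classify(&a, &b);
}
```

With the stale pointer in place, the middle node is treated as the tail, so any tail-specific unlinking logic runs on a node that still has a successor.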
It was time to investigate all removal operations on that list.  So I printed the entire linked list before and after each package removal and found the error actually occurred before the removal operation even started.  The initial list of the local database passed to the removal operation was already broken with the pointer to the tail entry not pointing to the tail.  That was good to know as we had thoroughly reviewed the removal code and not found any issues.
This led me to believe that the error must occur when reading in the local database.  Next step: print out the linked list at the end of reading in the local database.  But that was completely fine.  So somewhere between reading in the local database and using it, things got broken.  And what do we do with the local database between reading it in and removing items from it?  The only place where we modify the local database between those points is when it gets sorted by the package names.  Sure enough, the pointer to the tail of the linked list is good going into the sort and bad coming out.
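The before-and-after comparison can be reduced to a single invariant check. The following is only a sketch of that idea, again assuming the convention that the head's prev entry should point at the last node; node_t and tail_pointer_ok are hypothetical names, not functions from pacman.

```c
#include <stddef.h>

/* Hypothetical node type; assumed convention: head->prev is the tail. */
typedef struct node {
    void *data;
    struct node *prev;
    struct node *next;
} node_t;

/* Return 1 if head->prev really is the last node reachable by following
 * next pointers from head, 0 otherwise. An empty list is trivially fine. */
int tail_pointer_ok(const node_t *head)
{
    const node_t *last = head;

    if (head == NULL) {
        return 1;
    }
    while (last->next != NULL) {
        last = last->next;
    }
    return head->prev == last;
}
```

Calling a check like this immediately before and after each operation that touches the list narrows down which operation breaks it.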
This limited the error to two functions: alpm_list_msort or alpm_list_mmerge.  These implement a merge sort.  Essentially alpm_list_msort recursively calls itself, dividing the list up into smaller pieces until it cannot be divided any further, and the pieces are then merged in sorted order by alpm_list_mmerge.  I had just started staring at the code when I saw something that seemed too obvious for such a hard to track down bug.  My exact words on IRC were “I think I can fix this…”.  And sure enough I could.
It turns out that when alpm_list_msort split a list into two, it did not set the pointer to the tail nodes in the two new lists correctly (or at all…).  So a two-line addition and we have the bug fixed.  It turns out this bug had been present since the start of 2007.  So I am still slightly amazed that we did not see it before now and that, when it did appear, we got a second report of it so quickly.
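For readers who want to see the shape of the problem, here is a small merge sort over a doubly linked list that keeps the tail pointer in the head's prev entry. It is a sketch under that assumption and not the pacman code: node_t, link_array, list_merge and list_sort are hypothetical names, and the two marked assignments only play the role of the fix described above. The merge step trusts each half's recorded tail, which is why a split that forgets to set it produces a bad tail pointer in the result.

```c
#include <stddef.h>

/* Hypothetical node type; assumed convention: head->prev is the tail. */
typedef struct node {
    int value;
    struct node *prev;
    struct node *next;
} node_t;

/* Link an array of n >= 1 nodes into one list and return its head. */
node_t *link_array(node_t *nodes, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
        nodes[i].prev = (i > 0) ? &nodes[i - 1] : &nodes[n - 1];
    }
    return nodes;
}

/* Merge two non-empty sorted lists; relies on each half's head->prev. */
node_t *list_merge(node_t *left, node_t *right)
{
    node_t *left_tail = left->prev, *right_tail = right->prev;
    node_t *head, *tail;

    if (left->value <= right->value) {
        head = left; left = left->next;
    } else {
        head = right; right = right->next;
    }
    tail = head;
    while (left != NULL && right != NULL) {
        node_t *pick;
        if (left->value <= right->value) {
            pick = left; left = left->next;
        } else {
            pick = right; right = right->next;
        }
        tail->next = pick;
        pick->prev = tail;
        tail = pick;
    }
    /* Exactly one half has nodes left; append it and adopt its tail. */
    if (left != NULL) {
        tail->next = left; left->prev = tail; tail = left_tail;
    } else {
        tail->next = right; right->prev = tail; tail = right_tail;
    }
    head->prev = tail;
    return head;
}

/* Sort a list of n nodes by splitting it in half and merging. */
node_t *list_sort(node_t *list, size_t n)
{
    if (n > 1) {
        size_t half = n / 2, i;
        node_t *left = list, *last_left = list, *right;

        for (i = 1; i < half; i++) {
            last_left = last_left->next;
        }
        right = last_left->next;
        last_left->next = NULL;

        right->prev = left->prev;  /* fix: right half keeps the old tail */
        left->prev = last_left;    /* fix: left half ends at last_left */

        list = list_merge(list_sort(left, half), list_sort(right, n - half));
    }
    return list;
}
```

Without the two marked assignments, the left half's head would still point at the tail of the whole list and the right half's head would point back into the left half, so the merged list could come out correctly ordered but with the wrong node recorded as its tail.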
And why could we not reproduce the issue even with a copy of a chroot where it was occurring?  It is entirely dependent on the order the directory entries are returned from the disk.  This determined which package was pointed to as the “tail” of the sorted package list.  The package incorrectly referred to as the tail had to be removed during a removal operation, and also not be the last package removed, to expose the bug.  Given most systems will have many hundreds of packages on them and removal operations tend to involve one or a few packages, this is a fairly rare occurrence.  But even if it occurred in only a fraction of a percent of removal operations, I think we should have run into this bug before now.  I guess more people probably did experience the issue, but then could not immediately replicate it and did not experience it again, so did not report it.
And that is the end of the story of one of the most frustrating bugs I have ever managed to track down. A big thank you to the two users who installed versions of pacman with debug symbols and provided us backtraces, coredumps and entire chroots!  Without their help, we would probably still be not entirely convinced that the bug was real and it would still be hiding away in the pacman source code.