Advantage of a Simple “Database” Format

One fairly common criticism of the pacman package manager is that it is very slow because it does not use some sort of binary database as its backend. I found suggestions to use sqlite dating back to 2005 (although I am sure they go back further), and mailing list activity peaked around late 2007. Speed is one of pacman’s main features – and it beats the competition by a wide margin according to Linux Format – but I guess people want it even faster.

The problem is that we use a filesystem based “database” where each package has its information stored in multiple files. This means that we can get fragmentation of our “database” and the reading of all these files from the filesystem can be quite slow. Usually most of this is cached by the kernel after the first read so speed improves markedly after the first usage.

This was improved a lot in the pacman 3.5 release (March 2011). The sync databases started to be read directly from the downloaded tarball and the local database had the “desc” and “depends” files for each package merged into one file. This increased the speed of reading from the sync databases massively and was a reasonable improvement to the local database too.

So the local package “database” could be improved by reducing it to one or a few files. But every time I think about changing it, I am reminded why I like the plain text file format. I was updating a reasonably out-of-date computer when I had an issue with the python-pygame package being renamed to python2-pygame. All packages needing it in the Arch Linux repos were rebuilt with the new dependency name, so it did not need a provides entry. But my solarwolf package from the AUR still depended on the old name:

$ pacman -S python2-pygame
resolving dependencies...
looking for inter-conflicts...
:: python2-pygame and python-pygame are in conflict. Remove python-pygame? [y/N] y
error: failed to prepare transaction (could not satisfy dependencies)
:: solarwolf: requires python-pygame

As we have a file based database, adjusting the dependency is easy without rebuilding the package. Just open the relevant file and edit away (or use sed…)

$ vim /var/lib/pacman/local/solarwolf-1.5-5/desc
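For the scripted route, a minimal sketch might look like the following. It assumes the local desc file keeps its fields under `%SECTION%` header lines with one entry per line, and that the dependency is listed without a version constraint. The `rename_dep` helper is a hypothetical name, not something that ships with pacman, and the demo deliberately runs on a scratch file rather than the live database.

```shell
# rename_dep DESC OLD NEW  (hypothetical helper, not part of pacman)
# Replace an exact dependency entry OLD with NEW, but only inside the
# %DEPENDS% section of a desc-style file.
rename_dep() {
    desc=$1 old=$2 new=$3
    tmp=$(mktemp) || return 1
    awk -v old="$old" -v new="$new" '
        /^%.*%$/ { in_deps = ($0 == "%DEPENDS%") }   # track the current section
        in_deps && $0 == old { $0 = new }            # exact, whole-line match only
        { print }
    ' "$desc" > "$tmp" && cat "$tmp" > "$desc"       # write back, keeping the file mode
    status=$?
    rm -f "$tmp"
    return $status
}

# Try it on a scratch file first (assumed layout of a desc file).
sample=$(mktemp)
printf '%s\n' '%NAME%' 'solarwolf' '' '%DEPENDS%' 'python-pygame' '' > "$sample"
rename_dep "$sample" python-pygame python2-pygame
cat "$sample"
rm -f "$sample"
```

Restricting the match to the `%DEPENDS%` section and to whole lines avoids touching the package's own name or description by accident, which a bare `sed 's/old/new/'` would not guarantee.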

Now I can see my local database has an issue using the handy testdb tool – solarwolf depends on python2-pygame, but that is not installed.

$ testdb
missing python2-pygame dependency for solarwolf
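Because the database is plain text, a rough version of that check fits in a few lines of shell. The sketch below is not how testdb is implemented; it is an illustration that assumes one desc file per package directory, `%NAME%`, `%PROVIDES%` and `%DEPENDS%` sections with one entry per line, and it ignores version constraints entirely. `check_deps` is a hypothetical name.

```shell
# check_deps DBDIR  (hypothetical helper, not the real testdb)
# Report every dependency that no package in DBDIR names or provides.
check_deps() {
    awk '
        FNR == 1 { pkg = ""; section = "" }          # new desc file: reset state
        /^%.*%$/ { section = $0; next }              # section header line
        $0 == "" { next }                            # blank separator
        section == "%NAME%"     { pkg = $0; have[$0] = 1 }
        section == "%PROVIDES%" { sub(/[<>=].*/, ""); have[$0] = 1 }
        section == "%DEPENDS%"  { sub(/[<>=].*/, ""); want[pkg SUBSEP $0] = 1 }
        END {
            for (k in want) {
                split(k, p, SUBSEP)
                if (!(p[2] in have))
                    print "missing " p[2] " dependency for " p[1]
            }
        }
    ' "$1"/*/desc
}

# Demo on a scratch directory laid out like the local database.
db=$(mktemp -d)
mkdir "$db/solarwolf-1.5-5"
printf '%s\n' '%NAME%' 'solarwolf' '' '%DEPENDS%' 'python2-pygame' '' \
    > "$db/solarwolf-1.5-5/desc"
check_deps "$db"
rm -rf "$db"
```

Pointing it at a copy of /var/lib/pacman/local would be the natural next step, keeping in mind that a satisfied name with an unsatisfied version would slip through this simplified check.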

But now I update as usual, installing python2-pygame which removes python-pygame, and my local pacman database is fully consistent.

I am sure all of this would still be possible if the database was in some other format, but it would have required more tools than a simple text editor. Of course, most people should never need to edit their local database, but I have introduced changes to it several times during pacman development and I consider being able to easily fix or revert these in the category of a “good thing”. And yes, I develop and test directly on my production system…

Of course, it is better to use a real database in performance critical situations. But pacman really does not fall into that category.

9 thoughts on “Advantage of a Simple “Database” Format”

PotatoesMaster on 2012/12/17 at 2:33 AM said:

Hi Allan.

I put pacman database (/var/lib/pacman) on its own partition. It uses reiserFS and is 500MB large (only 50MB is occupied).
I am really pleased with pacman speed.
~ $ time pacman -Qi >/dev/null
pacman -Qi > /dev/null 0,39s user 0,09s system 76% cpu 0,624 total

P.S.: This is my first comment here.
Thanks for all you do. 😉
cippaciong on 2012/12/17 at 3:52 AM said:

Don’t worry, I think pacman is fast enough for most of us. =)
ben on 2012/12/17 at 6:06 AM said:

Rather than trying to speed up pacman’s database, it would help much more to streamline its UI…

If I do a full package update after I haven’t done one in a while, I sometimes get into the following situation:

1) After clicking “y” lots of times (“do you really want to replace X with Y”, etc.), and downloading all packages, pacman takes a lot of time to do the “checking package integrity and file conflicts” step. After a while it finds a conflict, and aborts with an error.
2) I manually fix the file conflict (usually caused by an AUR package).
3) I have to click “y” lots of times again, and again wait a long time while the “checking package integrity and file conflicts” is executed again from the beginning. It finds a second file conflict, and aborts again.
4) I manually fix the second file conflict.
5) …and so on…

This could be streamlined:
a) Make it list all file conflicts in one go, so you only have to re-run it once (at the most).
b) If package integrity has already been checked, don’t check again if the database’s mtime has not changed since.

And these are just two examples, there are more possible UI improvements that would make pacman faster to use in practice, which would make a much bigger difference than speeding up database access.
- Allan on 2012/12/17 at 9:52 AM said:
  
  Hmm… I seem to remember a patch that performed complete conflict checking before aborting. So I would guess that improvement is going to be in pacman-4.1.
  
  As for not rechecking the checksums, I’d say that is never going to happen. We would have to store a list of package files and the database they were checked with, and then verify those files have not been tampered with in the meantime… which brings us back to checking their integrity.
  - Pierre on 2012/12/17 at 2:06 PM said:
    
    There’s no point in checking local files. If they were tampered with locally, the attacker can modify the live system too. You only need to check at least once after downloading (or during, as one might argue).
Guilherme de Sousa on 2012/12/17 at 9:31 AM said:

I was reading this post a few hours ago, and now I’ve just received the good news that you will be attending a conference at my university (Instituto Superior Técnico, Lisbon, Portugal) next February. Will definitely be there to hear what you have to say! Tks!

Best regards,

Guilherme
- Allan on 2012/12/17 at 9:53 AM said:
  
  Yay – it has been announced! I will have to figure out what I am talking about…
  - Guilherme de Sousa on 2012/12/17 at 12:03 PM said:
    
    Just try to put some sense into the non linux crowd 😀
  - diaz on 2012/12/19 at 3:18 PM said:
    
    Wait, what, you’re coming here and you weren’t saying anything? C’mon. I’ll need to check it out 😛
    
    (I’m not from IST but that doesn’t matter 😛 )

Allan McRae

Inventor of the word "plagiarism"
