I am providing here two additional modules and two patches for the standard Python library.
Those two modules are ZODBhash and MKhash. They provide a dbm-like interface based on ZODB and MetaKit. They are intended to be used by anydbm, so I am also providing corresponding patches for anydbm.py and whichdb.py.
Download mzhash.zip - it contains the modules, patches and simple tests.
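For readers unfamiliar with what "dbm-like" means here: anydbm expects its backends to be mapping objects with string keys and values plus a few extra methods. The sketch below is a purely illustrative in-memory stand-in showing that protocol; the class name and method bodies are mine, not taken from ZODBhash or MKhash.

```python
# Illustrative sketch of the dbm-style mapping protocol that anydbm
# expects of a backend (what ZODBhash and MKhash provide on top of
# ZODB and MetaKit).  An ordinary dict stands in for real storage.

class DictHash:
    """Minimal dbm-like object: item access, keys(), has_key(), close()."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def keys(self):
        return list(self._data)

    def has_key(self, key):          # the historical dbm API spelling
        return key in self._data

    def close(self):
        pass                         # real backends flush to disk here

db = DictHash()
db["passing"] = "pass"
assert db.has_key("passing")
```

Any object implementing this handful of methods can be returned by an anydbm-style open() function, which is why a patched anydbm can dispatch to the new modules transparently.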
Also, I made a patch for the shelve.py module. I created two additional shelves - CompressedShelf and CompressedKeysShelf. These shelves use zlib to compress/decompress data. CompressedShelf compresses only the data, and CompressedKeysShelf compresses both data and keys.
Below is the long story of why I created all this and how I compared them.
I started with the need to create an ispell-like hash with all forms of every word. I needed this for full-text search. (BTW, I think it'd be nice to include this kind of search in ZCatalog; I'll think about it later.) I looked into the ispell and htdig sources and manuals, and found that I'd better write my own programs and libraries instead of trying to wrap those.
I found (in the ispell manual) that I can generate a simple text file with all the necessary information: ispell -e <russian.dict | sort >russian.words. So the task is to construct a hash for fast access to this information.
Very easy, thanks to Python! Just read every line, split it and put it into a disk-based hash (anydbm!).
I wrote the program in a minute. The program generates two hashes. One hash, words2root, maps every word to its normal form ("passing" => "pass"). The other, root2words, maps a normal form to the list of all its forms ("pass" => ["pass", "passed", "passing", "passes", "passable", "impassable"]). The hashes are named after htdig, of course.
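The two-hash construction can be sketched as below. It assumes each line of russian.words holds a root followed by all of its forms (the exact `ispell -e` output format may differ), and plain dicts stand in for the anydbm files so the sketch is self-contained; the function name is mine.

```python
# Sketch: build words2root and root2words from expanded dictionary
# lines, assuming each line is "root form1 form2 ..." with the root
# repeated as its own first form.

def build_hashes(lines):
    words2root = {}   # "passing" => "pass"
    root2words = {}   # "pass"    => ["pass", "passed", ...]
    for line in lines:
        forms = line.split()
        if not forms:
            continue                  # skip blank lines
        root = forms[0]
        root2words[root] = forms
        for form in forms:
            words2root[form] = root
    return words2root, root2words

w2r, r2w = build_hashes(["pass passed passing passes"])
assert w2r["passing"] == "pass"
assert r2w["pass"] == ["pass", "passed", "passing", "passes"]
```

Swapping the dicts for anydbm.open(...) objects (with the values flattened to strings) gives the disk-based version described above.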
The first run was a surprise. It ran for 5 hours, swapping a lot, and finally generated two 85-megabyte files (Berkeley DB hashes). 170 megs from a 10M text file! Wow!!!
So I started to think about experimenting with other disk-based hashes, and I wanted to find a way to speed things up and lower the disk requirements.
The next thing I tried was ZODB. ZODB is itself a hash (of sorts), so I easily wrote the ZODBhash wrapper. I reran my program. It failed. ZODB ate /tmp very fast - 700 megabytes in one hour. I tried to commit subtransactions or even transactions during writes (__setitem__), but this was not much help, and my program stopped with IOError, "no space left on device" :(
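The commit-during-write attempt can be sketched generically: count writes in __setitem__ and commit after every N of them. The commit callback below stands in for ZODB's transaction commit (get_transaction().commit() in the old ZODB API); the class and parameter names are mine, not ZODBhash's.

```python
# Illustrative pattern: commit the pending transaction after every
# `every` writes, as attempted inside ZODBhash's __setitem__.

class PeriodicCommitDict:
    """Wrapper that invokes a commit callback every `every` writes."""

    def __init__(self, backend, commit, every=1000):
        self.backend = backend    # the underlying hash / BTree
        self.commit = commit      # e.g. a ZODB transaction commit
        self.every = every
        self._writes = 0

    def __setitem__(self, key, value):
        self.backend[key] = value
        self._writes += 1
        if self._writes % self.every == 0:
            self.commit()

commits = []
d = PeriodicCommitDict({}, lambda: commits.append(1), every=2)
for i in range(5):
    d[str(i)] = i
assert len(commits) == 2          # committed after the 2nd and 4th write
```

As the text notes, this limits how much uncommitted state piles up in memory, but it cannot help when the committed storage itself (ZODB's append-only FileStorage) is what fills the disk.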
Then I tried to write compressed data to the hash. I created the two shelves - CompressedShelf and CompressedKeysShelf - and tried them with bsddb. I cleared my computer of all other jobs, stopped X Windows, etc. - and reran the program twice - with Shelf and with CompressedKeysShelf. Shelf created two 85-meg files in 3 hours, and CompressedKeysShelf created two files - one 85 megs and the other 21 megs - in 3.5 hours. A win in disk space (not much) and a loss in time.
I tried to use gdbm instead of bsddb. Again, I ran the program twice. The results: Shelf - 120 and 50 megs in 5 hours; CompressedKeysShelf - 120 and 13 megs in 4 hours. Some win and some loss. During the runs my computer swapped a bit less than when I used Berkeley DB, so it seems gdbm