m_lib/hash/www/xxx.txt

   1
   2    I am providing here two additional modules and two patches for the
   3 standard library.
   4
   5    Those two modules are ZODBhash and MKhash. They provide dbm-like
   6 interface based on ZODB and MetaKit. They are intended to be used by
   7 anydbm, so I am also providing corresponding patches for anydbm.py and
   8 whichdb.py.
   9
  10    Download mzhash.zip - it contains modules, patches and simple tests.
  11
  12    Also I made a patch for the shelve.py module. I created two additional
  13 shalve - CompressedShelf and CompressedKeysShelf. These shelve use zlib to
  14 compress/decompress data. CompressedShelf compresses only data, and
  15 CompressedKeysShelf compresses both data and keys.
  16
  17    Download mshelve.zip.
  18
  19    Below is a long story why I created all this and how I compared them.
  20
  21    I started with the need to create ispell-like hash with all forms of
  22 every word. I needed this for full-text search. (BTW, I think it'd be nice
  23 to include this kind of search into ZCatalog; I'll think about it later). I
  24 looked into ispell and htdig sources and manuals, and found that I'd better
  25 write my own programs and libraries instead of trying to wrap those
  26 complex ones.
  27
  28    I found (in ispell manual) I can generate simple text file with all
  29 neccessary information: ispell -e <russian.dict | sort >ruusian.words. So
  30 the task is to construct a hash for fast access to this information.
  31
  32    Very easy, thanks Python! Just read every line, split it and put into
  33 disk-based hash (anydbm!).
  34
  35    I wrote the program in a minute. The program generates two hashes. One
  36 hash, words2root maps every word to its normal form ("passing" => "pass").
  37 Another, root2words maps normal form to the list of all forms ("pass" =>
  38 ["pass", "passed", "passing", "passes", "passable", "impassable"]). The
  39 hashes are named after htdig, of course.
  40
  41    The first run was a surprise. It was running for 5 hours, swapping a
  42 lot, and finally it generates two 85-megabytes files (Berkeley DB hashes).
  43 170 megs from 10M text file! Wow!!!
  44
  45    So I started to think I want to experiment with other disk-based hashes,
  46 and I wanted to find a way to speed things up and lower disk requirements.
  47
  48    Next thing I tried - ZODB. ZODB is itself hash (a sort of), so I easily
  49 wrote ZODBhash wrapper. I reran my program. It failed. ZODB ate /tmp very
  50 fast - 700 megabatyes by one hour. I tried to commit subtransactions or
  51 even transactions during write (__setitem__), but this was of not much
  52 help, and my program stopped by IOError, "no space left on device" :(
  53
  54    Then I tried to to write compressed data to the hash. I created two
  55 shelve - CompressedShelf and CompressedKeysShelf and tried them with bsddb.
  56 I cleared my computer from all jobs, stopped XWindows, etc - and reran the
  57 program two times - with Shelf and CompressedKeysShelf. Shelf created 2 85
  58 megs files in 3 hours, and CompressedShelf created 2 files - one 85 and the
  59 other 21 megs - in 3.5 hours. Win in disk space (not much) and loose in
  60 time.
  61
  62    I tried to use gdbm instead of bsddb. Again, I ran the program two
  63 times. Result: Shelf - 120 and 50 megs in 5 hours, CompressedKeysShelf -
  64 120 and 13 megs in 4 hours. Some win and some loose. During the runs my
  65 computer swapped a bit less than when I used Berkeley DB, so it seems gdbm
  66 uses less memory.