Mifluz
******

`Mifluz' is a full text indexing library.

1 Introduction
**************

First of all, `mifluz' is at beta stage.

   This program is part of the GNU project, released under the aegis of
GNU.

   The purpose of `mifluz' is to provide a C++ library to store a full
text inverted index. To put it briefly, it allows storage of
occurrences of words in such a way that they can later be searched. The
basic idea of an inverted index is to associate each unique word with a
list of the documents in which it appears. This list can then be
searched to locate the documents containing a specific word.
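
   As a rough illustration of the idea (not of how `mifluz' stores its
data), the following C++ sketch keeps a tiny inverted index in memory.
The names InvertedIndex, index_document and lookup are invented for this
example and are not part of the `mifluz' API.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical in-memory inverted index: each unique word maps to the
// sorted set of identifiers of the documents that contain it.
using InvertedIndex = std::map<std::string, std::set<int>>;

// Split `text` on whitespace and record every word under `doc_id`.
void index_document(InvertedIndex& index, int doc_id, const std::string& text) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        index[word].insert(doc_id);
    }
}

// Return the documents containing `word` (empty if the word is unknown).
std::set<int> lookup(const InvertedIndex& index, const std::string& word) {
    auto it = index.find(word);
    if (it == index.end()) return {};
    return it->second;
}
```

   After indexing document 1 as "the quick fox" and document 2 as "the
lazy dog", looking up "the" yields both documents and looking up "fox"
yields only document 1.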

   Implementing a library that manages an inverted index is a very easy
task when there is a small number of words and documents. It becomes a
lot harder when dealing with a large number of words and documents.
`mifluz' has been designed with the following upper limits in mind: 500
million documents, 100 giga words, 18 million document updates per day.
In the present state of `mifluz', it is possible to store 100 giga
words using 600 gigabytes. The best average insertion rate observed as
of today is 4000 keys/sec on a 1 gigabyte index.

   `mifluz' has two main characteristics: it is very simple (one might
say stupidly simple :-) and uses 100% of the size of the indexed text
for the index. It is simple because it provides only a few basic
functions. It does not contain document parsers (HTML, PDF, etc.). It
does not contain a full text query parser. It does not provide result
display functions or other user friendly features. It only provides
functions to store word occurrences and retrieve them. The fact that it
uses 100% of the size of the indexed text is rather atypical. Most well
known full text indexing systems only use 30%. The advantages `mifluz'
has over most full text indexing systems are that it is fully dynamic
(update, delete, insert), uses only a controlled amount of memory while
resolving a query, has higher upper limits and has a simple storage
scheme. These are achieved at the cost of consuming more disk space.

2 Architecture
**************

In the following figure you can see the place of `mifluz' in a
hypothetical full text indexing system.

`Query'
     Resolve full text queries. The optimization makes sure the least
     frequent terms are scanned first and that redundant query
     specifications are merged together.

`Mifluz'
     Manage efficient storage of the inverted index permanent data.

`Parser Switch'
     Transform raw documents into a list of terms.

`Indexer'
     Call the Parser Switch to get a list of terms and feed it to
     `mifluz'.


3 Constraints
*************

The following list shows all the constraints imposed by `mifluz'.  It
can also be seen as a list of functions provided by `mifluz' that is
more general than the API specification.

`Now Available'
        * In-place dynamic update of the index.

        * Use of an in-memory cache to perform heavy index updates
          without stressing the disk too much.

        * The library can be linked into a C or C++ application,
          dynamically or statically.

        * The memory usage is completely controlled. The application
          can specify the maximum total memory usage. The application
          can specify that the memory cache will be shared among
          processes.

        * The library is thread safe.


`Future'
        * Transaction logs for backup recovery.

        * Index integrity check and repair function.

        * Indexing up to 500 million documents and support up to 18
          million document updates per 24h. The average size of a
          document is 4 kilo bytes and contains 200 indexable words.


`Constraints and Limitations'
        * No atomic datum is bigger than a size known in advance.  This
          postulate is essential for disk storage optimization.  If an
          atomic datum may have a size of 10 MB, it is impossible to
          guarantee that a query/indexing process controls the memory
          it is using.

          An atomic datum is something that must be manipulated as a
          whole, with no possibility of splitting it into smaller
          parts. For instance a posting (word, document identifier and
          position) is an atomic datum: to manipulate it in memory it
          has to reside completely in memory.  By contrast a postings
          list is not atomic. Manipulating a postings list can be done
          without loading the whole postings list in memory.

        * The cost of an update is O(log_m(N)), the logarithm of N in
          base m, where m is the average number of entries in a page
          and N the total number of pages. This figure has to be
          considered whether the pages are in memory or on disk.

        * The inverted index data is sorted to fit the most typical
          search pattern. The structure of the inverted index key can
          be defined at run time to fit a usage pattern.

        * No lock mechanism is provided beyond an individual word
          occurrence. It is assumed that the library is linked in a
          central server that serializes all the requests or in a
          program that provides its own lock mechanism.
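
   To get a feel for the logarithmic update cost listed above, the
sketch below counts how many levels a tree needs in order to address a
given number of pages with a given number of entries per page. The
function name and the figures in the note that follows are illustrative
assumptions, not measurements of `mifluz'.

```cpp
#include <cassert>
#include <cstdint>

// Smallest height h such that fanout^h >= pages, that is the number of
// pages an update has to traverse in a tree with `fanout` entries per page.
unsigned tree_height(std::uint64_t pages, std::uint64_t fanout) {
    assert(fanout >= 2);
    unsigned height = 0;
    std::uint64_t reach = 1;  // pages addressable with `height` levels
    while (reach < pages) {
        ++height;
        if (reach > pages / fanout) break;  // one more level covers all pages
        reach *= fanout;
    }
    return height;
}
```

   With an assumed 200 entries per page, 100 million pages are reached
in 4 levels, since 200^3 is 8 million and 200^4 is 1.6 billion.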



4 Document name scheme
**********************

In all of the literature dealing with full text indexing a collection of
documents is considered to be a flat set of documents containing words.
Each document has a unique name. The inverted index associates terms
found in the documents with a list of unique document names.

   We found it more interesting to consider that the document names
have a hierarchical structure, just like path names in file systems.
The main difference is that each component of the document name (think
path name in file system) may contain terms.

   As shown in the figure above we can consider that the first
component of the document name is the name of a collection, the second
the logical name of a set of documents within the collection, the third
the name of the document, the fourth the name of a part of the document.

   This logical structure may be applied to URLs in the following way :
there is only one collection, it contains servers (document sets)
containing URLs (documents) containing tags such as TITLE (document
parts).

   This logical structure may also be applied to databases in the
following way: there is one collection for each database, it contains
tables (document sets) containing fields (documents) containing records
(document parts).

   What does this imply for full text indexing? Instead of having only
one dictionary to map the document name to a numerical identifier (this
is needed to compress the postings for a term), we must have a
dictionary for each level of the hierarchy.

   Using the database example again:

   * A dictionary for database names

   * A dictionary for table names

   * A dictionary for field names

   * Since records are already identified by a number, no dictionary is
     needed.


   When coding the document identifier in the postings for a term, we
have to code a list of numerical identifiers instead of a single
numerical identifier. Alternatively one could see the document
identifier as an arbitrary precision number sliced in parts.

   The advantages of this document naming scheme are:

   * A `uniq' query operator can be trivially implemented. This is
     mostly useful to answer a query such as : I want URLs matching the
     word foo but I only want to see one URL for a given server (avoid
     the problem of having the first 40 URLs for a request on the same
     server).

   * The posting lists are traditionally ordered according to the
     document number.  This is a must to have an efficient query
     mechanism. With a hierarchical document name, each level of the
     hierarchy is sorted. Therefore the postings are sorted in multiple
     ways: sorted by collection first, then document set, then
     document, then document part.

   * Searching document paths is facilitated by the structure of the
     key.  For instance: I only want to search TITLEs.


   Of course, the suggested hierarchy semantic is not mandatory and may
be redefined according to sorting needs. For instance a relevance
ranking algorithm can lead to a relevance ranking number being inserted
into the hierarchy.

   The space overhead implied by this name scheme is quite small for
databases and URL pools.  The big dictionary for URL pools maps URLs to
identifiers. The dictionary for tags (TITLE, etc.) has only 10-50
entries at most. The dictionary for site names (www.domain.com) will be
~1/100 of the dictionary for URLs, assuming you have 100 URLs for a
given site. For databases the situation is even better: the big
dictionary would be the dictionary mapping rowids to numerical
identifiers. But since rowids are already numerical we don't need this.
We only need the database name, field name and table name dictionaries
and they are small. Since we are able to encode small numbers using
only a few bits in postings, the overhead of hierarchical names is
acceptable.

5 Data Storage Spec
*******************

Efficient management of the data storage space is an important issue in
the management of inverted indexes. The needs of an inverted index are
very similar to the needs of a regular file system. We need:

   * A cache associated with an LRU list to keep the most recently
     used entries in memory.

   * To group postings into pages of fixed size to optimize I/O on disk.

   * A locking mechanism to prevent race conditions between threads or
     multiple processes accessing the same data.

   * A transaction system to ensure data integrity and atomicity of
     logical operations.

   * Transparent compression of pages to reduce I/O bottleneck for large
     volumes of data and reduce disk usage as a bonus.

   * To create indexes using up to 1 terabyte.


   All these functionalities are provided by file systems and kernel
services. Since we also wanted the `mifluz' library to be portable we
chose the Berkeley DB library that implements all the services above.
The transparent compression is not part of Berkeley DB and is
implemented as a patch to Berkeley DB (version 3.1.14).

   Based on these low level services, Berkeley DB also implements a
Btree structure that `mifluz' uses to store the postings. Each posting
is an entry in the Btree structure. Indexing 100 million words implies
creating 100 million entries in the Btree. When transparent compression
is used and assuming we have 6 byte words and a document identifier
using 7 * 8 bits, the average disk size used per entry is 6 bytes.

   Unique word statistics are also stored in the inverted index.  For
each unique word, an entry is created in a dictionary and associated
with a serial number (the word identifier) and the total number of
occurrences.

6 Cache tuning
**************

The cache memory used by `mifluz' has a tremendous impact on
performance.  It is set by the *wordlist_cache_size* attribute (see
WordList(3) and mifluz(3)).  It holds pages from the inverted index in
memory (uncompressed if the file is compressed) to reduce disk access.
Pages migrate from disk to memory using an LRU policy.

   Each page in the cache is really a node of the B-Tree used to store
the inverted index entries. The internal pages are intermediate nodes
that `mifluz' must traverse each time a key is searched. It is
therefore very important to keep them in memory.  Fortunately they only
count for 1% of the total size of the index, at most.  The size of the
cache must at least include enough space for the internal pages.

   The other factors that must be taken into account in sizing the
cache are highly dependent on the application. A typical case is
insertion of many random words in the index.  In this case two factors
are of special importance:

`distribution of unique words'
     When filling an inverted index it is very likely that the
     dictionary of unique words occurring in the index is limited.
     Let's say you have 1 000 000 unique words in a 100 000 000
     occurrences index. Now assume that 90 000 000 occurrences are only
     using 20 000 unique words, that is 90% of the index is filled with
     2% of the complete vocabulary. If you are in this situation, the
     indexing process will spend 90% of its time updating 20 000 pages.
     If you can afford 20 000 * pagesize bytes of cache, you will have
     the maximum insertion rate.

     The general rule is: estimate or calculate how many unique words
     fill 90% of your index. Multiply this number by the pagesize and
     increase your cache by that amount.  See the *wordlist_page_size*
     attribute in WordList(3) or mifluz(3).

`order of numbers following the word'
     The cache calculation above is fine as long as the words inserted
     are associated with increasing numbers in the key. If the numbers
     following the word in the key are random, the cache efficiency
     will be reduced. Where possible the application should therefore
     make sure that when inserting two identical words, the first is
     followed by a number that is lower than the second. In other
     words, insert

          foo 100
          foo 103

     rather than

          foo 103
          foo 100


   This hint must not be considered in isolation but with careful
analysis of the distribution of the key components (word and numbers).
For instance it does not matter much if a random key follows the word
as long as the range of values of the number is small.

   The conclusion is that the cache size should be at least 1% of the
total index size (uncompressed) plus a number of bytes that depends on
the usage pattern.
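
   The sizing rule of this chapter can be written down as a small
helper. This is only a sketch of the rule of thumb: the function and
parameter names are invented here, and the 4096 byte page size used in
the note that follows is an assumption, not a `mifluz' default.

```cpp
#include <cstdint>

// Rough cache size estimate: 1% of the uncompressed index for the
// internal B-Tree pages, plus one page for each of the frequent unique
// words that account for about 90% of the occurrences.
std::uint64_t estimate_cache_bytes(std::uint64_t index_bytes,
                                   std::uint64_t hot_unique_words,
                                   std::uint64_t page_bytes) {
    const std::uint64_t internal_pages = index_bytes / 100;
    const std::uint64_t hot_pages = hot_unique_words * page_bytes;
    return internal_pages + hot_pages;
}
```

   For a 1 000 000 000 byte index, 20 000 frequent words and 4096 byte
pages this gives 10 000 000 + 81 920 000 = 91 920 000 bytes. The result
would then be used as the value of *wordlist_cache_size*.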

7 Key Specification
*******************

The key structure is what uniquely identifies each word occurrence that
is inserted in the inverted index. A key is made of a string (which is
the word being indexed), and a document identifier (which is really a
list of numbers), as discussed above.

   The exact structure of the inverted index key must be specified in
the configuration parameter `"wordlist_wordkey_description"'. See the
WordKeyInfo(3) manual page for more information on the format.

   We will focus on three examples that illustrate common usage.

   First example: a very simple inverted index would associate each
word occurrence with a URL (coded as a 32 bit number). The key
description would be:

     Word 8/URL 32

   Second example: if building a full text index of the content of a
database, you need to know in which field, table and record the word
appeared. This makes three numbers for the document id.

   Only a few bits are needed to encode the field and table name (let's
say you have a maximum of 16 field names and 16 table names, 4 bits
each is enough). The record number uses 24 bits because we know we
won't have more than 16 M records.

   The structure of the key would then be:

     Word 8/Table 4/Field 4/Record 24

   When you have more than one field involved in a key you must choose
the order in which they appear. It is mandatory that the *Word* is
first.  It is the part of the key that has highest precedence when
sorting. The fields that follow have lower and lower precedence.
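
   The effect of field order on sorting can be illustrated by packing
the numerical fields of the database example (4 bits for the table, 4
bits for the field, 24 bits for the record) into a single integer. This
is not the encoding `mifluz' uses internally. It is a minimal sketch,
with an invented function name, showing that a field placed earlier in
the key dominates the comparison.

```cpp
#include <cassert>
#include <cstdint>

// Pack the numerical part of a key into one 32 bit integer, with the
// field of highest precedence in the most significant bits.
// Assumed widths: Table 4 bits, Field 4 bits, Record 24 bits.
std::uint32_t pack_doc_id(std::uint32_t table, std::uint32_t field,
                          std::uint32_t record) {
    assert(table < (1u << 4));
    assert(field < (1u << 4));
    assert(record < (1u << 24));
    return (table << 28) | (field << 24) | record;
}
```

   Comparing two packed values gives the same result as comparing the
table first, then the field, then the record: any key of table 0 sorts
before any key of table 1, whatever the field and record numbers are.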

   Third example: we go back to the first example and imagine we have a
relevance ranking function that calculates a value for each word
occurrence. By inserting this relevance ranking value in the inverted
index key, all the occurrences will be sorted with the most relevant
first.

     Word 8/Rank 5/URL 32

8 Internals
***********

8.1 Compression
===============

Compressing the index reduces disk space consumption and speeds up the
indexing by reducing I/O.

   Compressing at the `mifluz' level would imply choosing complicated
key structures, slowing down and complicating insert and delete
operations. We have chosen to do the compression within Berkeley DB in
the memory pool subsystem. Berkeley DB keeps fixed size pages in a
memory cache; when the cache is full it writes the least recently used
pages to disk. When a page is needed Berkeley DB looks for it in memory
and retrieves it from disk if it is not in memory. The
compression/uncompression occurs when a page moves between the memory
pool and the disk.

8.1.1 Compression inside Berkeley DB
------------------------------------

Berkeley DB uses fixed size pages.  Suppose, for example, that our
compression algorithm can compress by a factor of 8 in most cases. We
then use a disk page size that is 1/8 of the memory page size.  However
there are exceptions. Some pages won't compress well and therefore
won't fit on one disk page. Extra pages are therefore allocated and
linked into a chained list. Allocating extra pages also implies that
some pages may later become free as a result of better compression.
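
   The number of disk pages taken by one compressed memory page can be
estimated as shown below. The sketch ignores any header or link
overhead in the chained pages, which is an assumption, and the function
name is invented for the example.

```cpp
#include <cstddef>

// Number of fixed size disk pages needed to hold one compressed memory
// page: one primary page, plus chained extra pages when the compressed
// data does not fit. Per-page header and link overhead is ignored.
std::size_t disk_pages_needed(std::size_t compressed_bytes,
                              std::size_t disk_page_bytes) {
    if (compressed_bytes == 0) return 1;
    return (compressed_bytes + disk_page_bytes - 1) / disk_page_bytes;
}
```

   With 8192 byte memory pages and 1024 byte disk pages, a page that
compresses to 900 bytes uses one disk page, while a page that only
compresses to 3000 bytes needs the primary page and two extra pages.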

8.1.2 Page compression in Mifluz
--------------------------------

The `mifluz' classes WordDBCompress and WordBitCompress do the
compression/decompression work. From the list of keys stored in a page
they extract several lists of numbers. Each list of numbers has common
statistical properties that allow good compression.

   The WordDBCompress_compress_c and WordDBCompress_uncompress_c
functions are C callbacks that are called by the page compression code
in Berkeley DB. The C callbacks then call the WordDBCompress
compress/uncompress methods. The WordDBCompress object creates a
WordBitCompress object that acts as a buffer holding the compressed
stream.

   Compression algorithm.

   Most DB pages contain redundant data because `mifluz' chose to store
one word occurrence per entry.  Because of this choice the pages have a
very simple structure.

   Here is a real world example of what a page can look like: (key
structure: word identifier + 4 numerical fields)

     756     1 4482    1  10b
     756     1 4482    1  142
     756     1 4484    1   40
     756     1 449f    1  11e
     756     1 4545    1   11
     756     1 45d3    1  545
     756     1 45e0    1  7e5
     756     1 45e2    1  830
     756     1 45e8    1  545
     756     1 45fe    1   ec
     756     1 4616    1  395
     756     1 461a    1  1eb
     756     1 4631    1   49
     756     1 4634    1   48
     .... etc ....

   To compress we chose to only code differences between adjacent
entries.  A flag is stored for each entry indicating which fields have
changed.  When a field is different from the previous one, the
compression stores the difference which is likely to be small since the
entries are sorted.

   The basic idea is to build columns of numbers, one for each field,
and then compress them individually. One can see that the first and
second columns will compress very well since all the values are the
same. The third column will also compress well since the differences
between the numbers are small, leading to a small set of numbers.
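
   The column idea can be sketched as follows. This is not the
WordDBCompress code: the types and functions are invented for the
illustration, differences are kept as signed 64 bit numbers, and the
final bit-level coding of each column is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Word identifier followed by 4 numerical fields, as in the page above.
constexpr std::size_t kFields = 5;
using Entry = std::array<std::int64_t, kFields>;

struct DeltaPage {
    std::size_t count = 0;            // number of encoded entries
    Entry first{};                    // first entry, stored verbatim
    std::vector<std::uint8_t> flags;  // per entry: bit f set if field f changed
    std::array<std::vector<std::int64_t>, kFields> columns;  // deltas per field
};

// Split a sorted list of entries into change flags and delta columns.
DeltaPage encode(const std::vector<Entry>& entries) {
    DeltaPage page;
    page.count = entries.size();
    if (entries.empty()) return page;
    page.first = entries[0];
    for (std::size_t row = 1; row < entries.size(); ++row) {
        std::uint8_t flag = 0;
        for (std::size_t f = 0; f < kFields; ++f) {
            const std::int64_t delta = entries[row][f] - entries[row - 1][f];
            if (delta != 0) {
                flag = static_cast<std::uint8_t>(flag | (1u << f));
                page.columns[f].push_back(delta);
            }
        }
        page.flags.push_back(flag);
    }
    return page;
}

// Rebuild the original entries from the flags and the delta columns.
std::vector<Entry> decode(const DeltaPage& page) {
    std::vector<Entry> entries;
    if (page.count == 0) return entries;
    entries.push_back(page.first);
    std::array<std::size_t, kFields> next{};  // read position in each column
    for (std::uint8_t flag : page.flags) {
        Entry entry = entries.back();
        for (std::size_t f = 0; f < kFields; ++f) {
            if (flag & (1u << f)) entry[f] += page.columns[f][next[f]++];
        }
        entries.push_back(entry);
    }
    return entries;
}
```

   Applied to the first three entries of the example page, the columns
of the first, second and fourth fields stay empty, the third column
holds the single difference 2 and the last column holds 55 and -258.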

9 Development
*************

The development of `mifluz' is shared between `Senga' (www.senga.org)
and the `Ht://dig' Group (dev.htdig.org). Part of the distribution
comes from the `Ht://dig' CVS tree and part from the `Senga' CVS tree.
The idea is to share efforts between two development groups that have
very similar needs. Since `Senga' and `Ht://dig' are both developed
under the GPL licence, such cooperation occurs naturally.

   To compile a program using the `mifluz' library use something that
looks like the following:

     g++ -o word -I/usr/local/include word.cc -L/usr/local/lib -lmifluz

10 Reference
************

10.1 htdb_dump
==============

10.1.1 htdb_dump NAME
---------------------

dump the content of an inverted index in Berkeley DB fashion

10.1.2 htdb_dump SYNOPSIS
-------------------------


     htdb_dump [-klNpWz] [-S pagesize] [-C cachesize] [-d ahr] [-f file] [-h home] [-s subdb] db_file

10.1.3 htdb_dump DESCRIPTION
----------------------------

htdb_dump is a slightly modified version of the standard Berkeley DB
db_dump utility.

   The htdb_dump utility reads the database file *db_file* and writes
it to the standard output using a portable flat-text format understood
by the `htdb_load' utility. The argument *db_file* must be a file
produced using the Berkeley DB library functions.

10.1.4 htdb_dump OPTIONS
------------------------

`-W'
     Initialize WordContext(3) before dumping. Combined with the *-z*
     flag, this allows dumping inverted indexes that use the mifluz(3)
     specific compression scheme. The MIFLUZ_CONFIG environment
     variable must be set to a file containing the mifluz(3)
     configuration.

`-z'
     The *db_file* is compressed. If *-W* is given the mifluz(3)
     specific compression scheme is used. Otherwise the default gzip
     compression scheme is used.

`-d'
     Dump the specified database in a format helpful for debugging the
     Berkeley DB library routines.

     `a'
          Display all information.

     `h'
          Display only page headers.

     `r'
          Do not display the free-list or pages on the free list.  This
          mode is used by the recovery tests.

     The output format of the *-d* option is not standard and may
     change, without notice, between releases of the Berkeley DB
     library.

`-f'
     Write to the specified *file* instead of to the standard output.

`-h'
     Specify a home directory for the database.  As Berkeley DB
     versions before 2.0 did not support the concept of a `database
     home'.

`-k'
     Dump record numbers from Queue and Recno databases as keys.

`-l'
     List the subdatabases stored in the database.

`-N'
     Do not acquire shared region locks while running.  Other problems
     such as potentially fatal errors in Berkeley DB will be ignored as
     well.  This option is intended only for debugging errors and
     should not be used under any other circumstances.

`-p'
     If characters in either the key or data items are printing
     characters (as defined by *isprint*(3)), use printing characters
     in *file* to represent them.  This option permits users to use
     standard text editors and tools to modify the contents of
     databases.

     Note, different systems may have different notions as to what
     characters are considered `printing characters', and databases
     dumped in this manner may be less portable to external systems.

`-s'
     Specify a subdatabase to dump.  If no subdatabase is specified, all
     subdatabases found in the database are dumped.

`-V'
     Write the version number to the standard output and exit.

   Dumping and reloading Hash databases that use user-defined hash
functions will result in new databases that use the default hash
function.  While using the default hash function may not be optimal for
the new database, it will continue to work correctly.

   Dumping and reloading Btree databases that use user-defined prefix or
comparison functions will result in new databases that use the default
prefix and comparison functions.  *In this case, it is quite likely
that the database will be damaged beyond repair, permitting neither
record storage nor retrieval.*

   The only available workaround for either case is to modify the
sources for the `htdb_load' utility to load the database using the
correct hash, prefix and comparison functions.

10.1.5 htdb_dump ENVIRONMENT
----------------------------

*DB_HOME* If the *-h* option is not specified and the environment
variable DB_HOME is set, it is used as the path of the database home.

   *MIFLUZ_CONFIG* file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz*.

10.2 htdb_stat
==============

10.2.1 htdb_stat NAME
---------------------

displays statistics for Berkeley DB environments.

10.2.2 htdb_stat SYNOPSIS
-------------------------


     htdb_stat [-celmNtzW] [-C Acfhlmo] [-d file [-s file]] [-h home] [-M Ahlm]

10.2.3 htdb_stat DESCRIPTION
----------------------------

htdb_stat is a slightly modified version of the standard Berkeley DB
db_stat utility which displays statistics for Berkeley DB environments.

10.2.4 htdb_stat OPTIONS
------------------------

`-W'
     Initialize WordContext(3) before gathering statistics. Combined
     with the *-z* flag, this allows gathering statistics on inverted
     indexes generated with the mifluz(3) specific compression scheme.
     The MIFLUZ_CONFIG environment variable must be set to a file
     containing the mifluz(3) configuration.

`-z'
     The *file* is compressed. If *-W* is given the mifluz(3)
     specific compression scheme is used. Otherwise the default gzip
     compression scheme is used.

`-C'
     Display internal information about the lock region.  (The output
     from this option is often both voluminous and meaningless, and is
     intended only for debugging.)

     `A'
          Display all information.

     `c'
          Display lock conflict matrix.

     `f'
          Display lock and object free lists.

     `l'
          Display lockers within hash chains.

     `m'
          Display region memory information.

     `o'
          Display objects within hash chains.

`-c'
     Display lock region statistics.

`-d'
     Display database statistics for the specified database.  If the
     database contains subdatabases, the statistics are for the
     database or subdatabase specified, and not for the database as a
     whole.

`-e'
     Display current environment statistics.

`-h'
     Specify a home directory for the database.

`-l'
     Display log region statistics.

`-M'
     Display internal information about the shared memory buffer pool.
     (The output from this option is often both voluminous and
     meaningless, and is intended only for debugging.)

     `A'
          Display all information.

     `h'
          Display buffers within hash chains.

     `l'
          Display buffers within LRU chains.

     `m'
          Display region memory information.

`-m'
     Display shared memory buffer pool statistics.

`-N'
     Do not acquire shared region locks while running.  Other problems
     such as potentially fatal errors in Berkeley DB will be ignored as
     well.  This option is intended only for debugging errors and
     should not be used under any other circumstances.

`-s'
     Display database statistics for the specified subdatabase of the
     database specified with the *-d* flag.

`-t'
     Display transaction region statistics.

`-V'
     Write the version number to the standard output and exit.

   Only one set of statistics is displayed for each run, and the last
option specifying a set of statistics takes precedence.

   Values smaller than 10 million are generally displayed without any
special notation.  Values larger than 10 million are normally displayed
as *<number>M*.

   The htdb_stat utility attaches to one or more of the Berkeley DB
shared memory regions.  In order to avoid region corruption, it should
always be given the chance to detach and exit gracefully.  To cause
htdb_stat to clean up after itself and exit, send it an interrupt
signal (SIGINT).

10.2.5 htdb_stat ENVIRONMENT
----------------------------

*DB_HOME* If the *-h* option is not specified and the environment
variable DB_HOME is set, it is used as the path of the database home.

   *MIFLUZ_CONFIG* file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz*.

10.3 htdb_load
==============

10.3.1 htdb_load NAME
---------------------

load the content of an inverted index in Berkeley DB fashion

10.3.2 htdb_load SYNOPSIS
-------------------------


     htdb_load [-nTzW] [-c name=value] [-f file] [-h home] [-C cachesize] [-t btree | hash | recno] db_file

10.3.3 htdb_load DESCRIPTION
----------------------------

The htdb_load utility reads from the standard input and loads it into
the database *db_file*.  The database *db_file* is created if it does
not already exist.

   The input to htdb_load must be in the output format specified by the
htdb_dump utility, or as specified for the *-T* option below.

10.3.4 htdb_load OPTIONS
------------------------

`-W'
     Initialize WordContext(3) before loading. Combined with the *-z*
     flag, this allows loading inverted indexes that use the mifluz(3)
     specific compression scheme. The MIFLUZ_CONFIG environment
     variable must be set to a file containing the mifluz(3)
     configuration.

`-z'
     The *db_file* is compressed. If *-W* is given the mifluz(3)
     specific compression scheme is used. Otherwise the default gzip
     compression scheme is used.

`-c'
     Specify configuration options for the DB structure ignoring any
     value they may have based on the input.  The command-line format is
     *name=value*.  See `Supported Keywords' for a list of supported
     words for the *-c* option.

`-f'
     Read from the specified *input* file instead of from the standard
     input.

`-h'
     Specify a home directory for the database.  If a home directory is
     specified, the database environment is opened using the
     `DB_INIT_LOCK', `DB_INIT_LOG', `DB_INIT_MPOOL', `DB_INIT_TXN' and
     `DB_USE_ENVIRON' flags to DBENV->open. This means that htdb_load
     can be used to load data into databases while they are in use by
     other processes. If the DBENV->open call fails, or if no home
     directory is specified, the database is still updated, but the
     environment is ignored, e.g., no locking is done.

`-n'
     Do not overwrite existing keys in the database when loading into an
     already existing database.  If a key/data pair cannot be loaded
     into the database for this reason, a warning message is displayed
     on the standard error output and the key/data pair are skipped.

`-T'
     The *-T* option allows non-Berkeley DB applications to easily
     load text files into databases.

     If the database to be created is of type Btree or Hash, or the
     keyword *keys* is specified as set, the input must be paired
     lines of text, where the first line of the pair is the key item,
     and the second line of the pair is its corresponding data item.
     If the database to be created is of type Queue or Recno and the
     keyword *keys* is not set, the input must be lines of text, where
     each line is a new data item for the database.

     A simple escape mechanism, where newline and backslash (\)
     characters are special, is applied to the text input.  Newline
     characters are interpreted as record separators.  Backslash
     characters in the text will be interpreted in one of two ways: if
     the backslash character precedes another backslash character, the
     pair will be interpreted as a literal backslash.  If the backslash
     character precedes any other character, the two characters
     following the backslash will be interpreted as a hexadecimal
     specification of a single character, e.g., \0a is a newline
     character in the ASCII character set.

     For this reason, any backslash or newline characters that naturally
     occur in the text input must be escaped to avoid misinterpretation
     by htdb_load.

     If the *-T* option is specified, the underlying access method type
     must be specified using the *-t* option.

`'
     *-t *

     Specify the underlying access method.  If no *-t * option is
     specified, the database will be loaded into a database of the same
     type as was dumped, e.g., a Hash database will be created if a
     Hash database was dumped.

     Btree and Hash databases may be converted from one to the other.
     Queue and Recno databases may be converted from one to the other.
     If the *-k * option was specified on the call to htdb_dump then
     Queue and Recno databases may be converted to Btree or Hash, with
     the key being the integer record number.

`'
     *-V *

     Write the version number to the standard output and exit.
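
   The escape rules described for the *-T * option can be sketched as
follows. This is a standalone illustration, not code from htdb_load:
the function names `escape_item' and `unescape_item' are hypothetical,
and the sketch assumes that only backslash and newline need escaping
when producing input.

```cpp
#include <cctype>
#include <string>

// Escape one key or data item for "htdb_load -T" text input:
// a backslash becomes "\\" and a newline becomes "\0a".
std::string escape_item(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        if (c == '\\')      out += "\\\\";
        else if (c == '\n') out += "\\0a";
        else                out += c;
    }
    return out;
}

// Value of one hexadecimal digit (caller guarantees isxdigit(c)).
static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(c) - 'a' + 10;
}

// Reverse the escaping: "\\" is a literal backslash, "\hh" is the
// character whose hexadecimal code is hh. Returns false if a
// backslash is not followed by a valid escape.
bool unescape_item(const std::string& text, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '\\') { out += text[i]; continue; }
        if (i + 1 < text.size() && text[i + 1] == '\\') {
            out += '\\';
            ++i;
            continue;
        }
        if (i + 2 >= text.size()) return false;
        const unsigned char hi = static_cast<unsigned char>(text[i + 1]);
        const unsigned char lo = static_cast<unsigned char>(text[i + 2]);
        if (!std::isxdigit(hi) || !std::isxdigit(lo)) return false;
        out += static_cast<char>(hex_value(hi) * 16 + hex_value(lo));
        i += 2;
    }
    return true;
}
```

   An application producing a text file for *-T * would call
`escape_item' on each key and each data item before writing it on its
own line.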

   The htdb_load utility attaches to one or more of the Berkeley DB
shared memory regions.  In order to avoid region corruption, it should
always be given the chance to detach and exit gracefully.  To cause
htdb_load to clean up after itself and exit, send it an interrupt
signal (SIGINT).

   The htdb_load utility exits 0 on success, 1 if one or more key/data
pairs were not loaded into the database because the key already existed,
and >1 if an error occurs.

10.3.5 htdb_load KEYWORDS
-------------------------

The following keywords are supported for the *-c * command-line option
to the htdb_load utility. See DB->open for further discussion of these
keywords and what values should be specified.

   The parenthetical listing specifies how the value part of the
*name=value * pair is interpreted.  Items listed as (boolean) expect
value to be *1 * (set) or *0 * (unset).  Items listed as (number)
convert value to a number.  Items listed as (string) use the string
value without modification.
`bt_minkey (number)'
     The minimum number of keys per page.

`db_lorder (number)'
     The byte order for integers in the stored database metadata.

`db_pagesize (number)'
     The size of pages used for nodes in the tree, in bytes.

`duplicates (boolean)'
     The value of the DB_DUP flag.

`h_ffactor (number)'
     The density within the Hash database.

`h_nelem (number)'
     The size of the Hash database.

`keys (boolean)'
     Specify if keys are present for Queue or Recno databases.

`re_len (number)'
     Specify fixed-length records of the specified length.

`re_pad (string)'
     Specify the fixed-length record pad character.

`recnum (boolean)'
     The value of the DB_RECNUM flag.

`renumber (boolean)'
     The value of the DB_RENUMBER flag.

`subdatabase (string)'
     The subdatabase to load.
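
   The three interpretations of the value part of a *name=value * pair
can be sketched as follows. This is a standalone illustration rather
than the htdb_load implementation: `Kind', `Keyword' and
`parse_keyword' are hypothetical names, and parsing numbers in base 10
is an assumption.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

enum class Kind { Boolean, Number, String };

struct Keyword {
    std::string name;
    std::string text;  // value exactly as written
    long number = 0;   // meaningful for Boolean (0 or 1) and Number
};

// Split "name=value" and interpret the value according to `kind`.
// Returns std::nullopt when the argument is malformed.
std::optional<Keyword> parse_keyword(const std::string& arg, Kind kind) {
    const std::size_t eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) return std::nullopt;

    Keyword kw;
    kw.name = arg.substr(0, eq);
    kw.text = arg.substr(eq + 1);

    switch (kind) {
    case Kind::Boolean:
        // (boolean) items expect exactly 1 (set) or 0 (unset).
        if (kw.text != "0" && kw.text != "1") return std::nullopt;
        kw.number = (kw.text == "1") ? 1 : 0;
        break;
    case Kind::Number: {
        // (number) items are converted to a number (base 10 assumed).
        char* end = nullptr;
        kw.number = std::strtol(kw.text.c_str(), &end, 10);
        if (kw.text.empty() || *end != '\0') return std::nullopt;
        break;
    }
    case Kind::String:
        // (string) items use the value without modification.
        break;
    }
    return kw;
}
```

   For instance `parse_keyword("db_pagesize=8192", Kind::Number)'
yields the number 8192, while `parse_keyword("duplicates=yes",
Kind::Boolean)' is rejected.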

10.3.6 htdb_load ENVIRONMENT
----------------------------

*DB_HOME * If the *-h * option is not specified and the environment
variable DB_HOME is set, it is used as the path of the database home.

   *MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz *.

10.4 mifluzdump
===============

10.4.1 mifluzdump NAME
----------------------

dump the content of an inverted index.

10.4.2 mifluzdump SYNOPSIS
--------------------------


     mifluzdump file

10.4.3 mifluzdump DESCRIPTION
-----------------------------

mifluzdump writes to *stdout * a complete ASCII description of the
*file * inverted index using the `WordList::Write ' method.

10.4.4 mifluzdump ENVIRONMENT
-----------------------------

*MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz *.

10.5 mifluzload
===============

10.5.1 mifluzload NAME
----------------------

load the content of an inverted index.

10.5.2 mifluzload SYNOPSIS
--------------------------


     mifluzload file

10.5.3 mifluzload DESCRIPTION
-----------------------------

mifluzload reads from *stdin * a complete ASCII description of the
*file * inverted index using the `WordList::Read ' method.

10.5.4 mifluzload ENVIRONMENT
-----------------------------

*MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz *.

10.6 mifluzsearch
=================

10.6.1 mifluzsearch NAME
------------------------

search the content of an inverted index.

10.6.2 mifluzsearch SYNOPSIS
----------------------------


     mifluzsearch -f words [options]

10.6.3 mifluzsearch DESCRIPTION
-------------------------------

mifluzsearch searches a mifluz index for documents matching an Alt*Vista
expression (simple syntax).

   Interpreting the debugging information: a cursor is opened in the
index for every word, and the cursors are stored in a list. The list of
cursors is always processed in the same order, as a singly linked list.
With -v, each block is an individual action on behalf of the word shown
on the first line. The last line of the block is the conclusion of the
action described in the block. REDO means the same cursor must be
examined again because the conditions have changed. RESTART means we go
back to the first cursor in the list because it may no longer match the
new conditions. NEXT means the cursor and all the cursors before it
match the conditions and we may proceed to the next cursor. ATEND means
the cursor cannot match the conditions because it is at the end of the
index.

10.6.4 mifluzsearch ENVIRONMENT
-------------------------------

*MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz *.

10.7 mifluzdict
===============

10.7.1 mifluzdict NAME
----------------------

dump the dictionary of an inverted index.

10.7.2 mifluzdict SYNOPSIS
--------------------------


     mifluzdict file

10.7.3 mifluzdict DESCRIPTION
-----------------------------

mifluzdict writes to *stdout * a complete ASCII description of the
dictionary of the *file * inverted index using the
`WordList::WriteDict ' method.

10.7.4 mifluzdict ENVIRONMENT
-----------------------------

*MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz *.

10.8 WordContext
================

10.8.1 WordContext NAME
-----------------------

read configuration and setup mifluz context.

10.8.2 WordContext SYNOPSIS
---------------------------


     #include <mifluz.h>

     WordContext context;

10.8.3 WordContext DESCRIPTION
------------------------------

The WordContext object must be the first object created.  All other
objects (WordList, WordReference, WordKey and WordRecord) are allocated
via the corresponding methods of WordContext (List, Word, Key and
Record respectively).

   The WordContext object contains a *Configuration * object that holds
the configuration parameters used by the instance.  If a configuration
parameter is changed, the `ReInitialize ' method should be called to
take the change into account.

10.8.4 WordContext CONFIGURATION
--------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_monitor {true|false} (default false)'
     If true create a `WordMonitor ' instance to gather statistics and
     build reports.

10.8.5 WordContext METHODS
--------------------------

`WordContext()'
     Constructor. Read the configuration parameters from the
     environment.  If the environment variable *MIFLUZ_CONFIG * is set
     to a pathname, read it as a configuration file. If *MIFLUZ_CONFIG
     * is not set, try to read the `~/.mifluz ' configuration file or
     `/usr/etc/mifluz.conf '. See the mifluz manual page for a complete
     list of the configuration attributes.

`WordContext(const Configuration &config)'
     Constructor. The *config * argument must contain all the
     configuration parameters, no configuration file is loaded from the
     environment.

`WordContext(const ConfigDefaults *array)'
     Constructor. The *array * argument holds configuration parameters
     that will override their equivalent in the configuration file read
     from the environment.

`void Initialize(const Configuration &config)'
     Initialize the WordContext object. This method is called by every
     constructor.

     When calling *Initialize * a second time, one must ensure that all
     WordList and WordCursor objects have been destroyed. WordList and
     WordCursor internal state depends on the current WordContext that
     will be lost by a second call.

     For those interested in the internals, the *Initialize * function
     maintains a Berkeley DB environment (DB_ENV) in the following way:

     First invocation:
          Initialize -> new DB_ENV (thru WordDBInfo)

     Second invocation:
          Initialize -> delete DB_ENV -> new DB_ENV (thru WordDBInfo)

`int Initialize(const ConfigDefaults* config_defaults = 0)'
     Initialize the WordContext object.  Build a `Configuration '
     object from the file pointed to by the MIFLUZ_CONFIG environment
     variable or ~/.mifluz or /usr/etc/mifluz.conf.  The
     *config_defaults * argument, if provided, is passed to the
     `Configuration ' object using the *Defaults * method.  The
     *Initialize(const Configuration &) * method is then called with the
     `Configuration ' object.  Return OK if success, NOTOK otherwise.
     Refer to the `Configuration ' description for more information.

`int ReInitialize()'
     Destroy internal state except the `Configuration ' object and
     rebuild it. May be used after the configuration has changed to
     take the changes into account.  Return OK on success, NOTOK
     otherwise.

`const WordType& GetType() const'
     Return the *WordType * data member of the current object as a
     const.

`WordType& GetType()'
     Return the *WordType * data member of the current object.

`const WordKeyInfo& GetKeyInfo() const'
     Return the *WordKeyInfo * data member of the current object as a
     const.

`WordKeyInfo& GetKeyInfo()'
     Return the *WordKeyInfo * data member of the current object.

`const WordRecordInfo& GetRecordInfo() const'
     Return the *WordRecordInfo * data member of the current object as
     a const.

`WordRecordInfo& GetRecordInfo()'
     Return the *WordRecordInfo * data member of the current object.

`const WordDBInfo& GetDBInfo() const'
     Return the *WordDBInfo * data member of the current object as a
     const.

`WordDBInfo& GetDBInfo()'
     Return the *WordDBInfo * data member of the current object.

`const WordMonitor* GetMonitor() const'
     Return the *WordMonitor * data member of the current object as a
     const.  The pointer may be NULL if the wordlist_monitor attribute
     is false.

`WordMonitor* GetMonitor()'
     Return the *WordMonitor * data member of the current object.  The
     pointer may be NULL if the wordlist_monitor attribute is false.

`const Configuration& GetConfiguration() const'
     Return the *Configuration * data member of the current object as a
     const.

`Configuration& GetConfiguration()'
     Return the *Configuration * data member of the current object.

`WordList* List()'
     Return a new *WordList * object, using the WordList(WordContext*)
     constructor. It is the responsibility of the caller to delete this
     object before the WordContext object is deleted. Refer to the
     *wordlist_multi * configuration parameter to know the exact type
     of the object created.

`WordReference* Word()'
     Return a new *WordReference * object, using the
     WordReference(WordContext*) constructor. It is the responsibility
     of the caller to delete this object before the WordContext object
     is deleted.

`WordReference* Word(const String& key0, const String& record0)'
     Return a new *WordReference * object, using the
     WordReference(WordContext*, const String&, const String&)
     constructor. It is the responsibility of the caller to delete this
     object before the WordContext object is deleted.

`WordReference* Word(const String& word)'
     Return a new *WordReference * object, using the
     WordReference(WordContext*, const String&) constructor. It is the
     responsibility of the caller to delete this object before the
     WordContext object is deleted.

`WordRecord* Record()'
     Return a new *WordRecord * object, using the
     WordRecord(WordContext*) constructor. It is the responsibility of
     the caller to delete this object before the WordContext object is
     deleted.

`WordKey* Key()'
     Return a new *WordKey * object, using the WordKey(WordContext*)
     constructor. It is the responsibility of the caller to delete this
     object before the WordContext object is deleted.

`WordKey* Key(const String& word)'
     Return a new *WordKey * object, using the WordKey(WordContext*,
     const String&) constructor. It is the responsibility of the caller
     to delete this object before the WordContext object is deleted.

`WordKey* Key(const WordKey& other)'
     Return a new *WordKey * object, using the WordKey(WordContext*,
     const WordKey&) constructor. It is the responsibility of the
     caller to delete this object before the WordContext object is
     deleted.

`static String ConfigFile()'
     Return the full pathname of the configuration file. The
     configuration file lookup first searches for the file pointed to
     by the *MIFLUZ_CONFIG * environment variable, then *~/.mifluz *
     and finally */usr/etc/mifluz.conf *. If no configuration file is
     found, return the empty string.
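
   The lookup order described for `ConfigFile ' can be sketched as
follows. This is a standalone model, not the mifluz implementation:
`find_config_file' and its parameters are hypothetical, the existence
check is injected so that the sketch does not touch the file system,
and falling through to the next candidate when *MIFLUZ_CONFIG * names a
missing file is an assumption.

```cpp
#include <functional>
#include <string>
#include <vector>

// Return the first existing configuration file, or "" if none exists.
//   env_value: content of MIFLUZ_CONFIG ("" if the variable is unset)
//   home:      home directory of the user ("" if unknown)
//   exists:    predicate telling whether a path names a readable file
std::string find_config_file(
    const std::string& env_value,
    const std::string& home,
    const std::function<bool(const std::string&)>& exists) {
    std::vector<std::string> candidates;
    if (!env_value.empty()) candidates.push_back(env_value);
    if (!home.empty()) candidates.push_back(home + "/.mifluz");
    candidates.push_back("/usr/etc/mifluz.conf");

    // Candidates are tried in the documented order.
    for (const std::string& path : candidates) {
        if (exists(path)) return path;
    }
    return "";
}
```

   In a real program `env_value' would come from getenv("MIFLUZ_CONFIG")
and `exists' would test the file system.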

10.8.6 WordContext ENVIRONMENT
------------------------------

*MIFLUZ_CONFIG * file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz * or */usr/etc/mifluz.conf *.

10.9 WordList
=============

10.9.1 WordList NAME
--------------------

abstract class to manage and use an inverted index file.

10.9.2 WordList SYNOPSIS
------------------------


     #include <mifluz.h>

     WordContext context;

     WordList* words = context.List();

     delete words;

10.9.3 WordList DESCRIPTION
---------------------------

WordList is the `mifluz ' equivalent of a database handler. Each
WordList object is bound to an inverted index file and implements the
operations to create it, fill it with word occurrences and search for
an entry matching a given criterion.

   WordList is an abstract class and cannot be instantiated.  The *List
* method of the class WordContext will create an instance using the
appropriate derived class, either WordListOne or WordListMulti. Refer
to the corresponding manual pages for more information on their
specific semantics.

   When doing bulk insertions, mifluz creates temporary files that
contain the entries to be inserted in the index. Those files are
typically named `indexC00000000 '. The maximum size of a temporary
file is *wordlist_cache_size * / 2. When the maximum size of the
temporary file is reached, mifluz creates another temporary file named
`indexC00000001 '. The process continues until mifluz has created 50
temporary files. At this point it merges all temporary files into one
that replaces the first `indexC00000000 '. Then it starts creating
temporary files again and keeps following this algorithm until the bulk
insertion is finished. When the bulk insertion is finished, mifluz has
one big file named `indexC00000000 ' that contains all the entries to
be inserted in the index. mifluz inserts all the entries from
`indexC00000000 ' into the index and deletes the temporary file when
done. The insertion is fast since all the entries in
`indexC00000000 ' are already sorted.

   The parameter *wordlist_cache_max * can be used to prevent the
temporary files from growing indefinitely. If the total cumulative size
of the `indexC* ' files grows beyond this parameter, they are merged
into the main index and deleted. For instance, setting this parameter
to 500Mb guarantees that the total size of the `indexC* ' files will
not grow above 500Mb.
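
   The temporary file scheme can be modelled in memory as follows. This
is a simplified sketch, not the mifluz code: the class `BatchModel' is
hypothetical, each sorted vector stands for one `indexC* ' file, the
capacity is counted in entries instead of bytes, and the
*wordlist_cache_max * limit is not modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

class BatchModel {
public:
    // run_capacity: entries per temporary "file"
    // max_runs:     number of "files" that triggers a merge (50 in the text)
    BatchModel(std::size_t run_capacity, std::size_t max_runs)
        : run_capacity_(run_capacity), max_runs_(max_runs) {}

    void insert(const std::string& entry) {
        pending_.push_back(entry);
        if (pending_.size() >= run_capacity_) flush_pending();
    }

    // End of the batch: merge everything into one sorted run and return
    // it, ready to be inserted in the index in key order.
    std::vector<std::string> finish() {
        flush_pending();
        merge_runs();
        std::vector<std::string> result;
        if (!runs_.empty()) result = std::move(runs_.front());
        runs_.clear();
        return result;
    }

    std::size_t run_count() const { return runs_.size(); }

private:
    // Sort the pending entries and store them as a new run.
    void flush_pending() {
        if (pending_.empty()) return;
        std::sort(pending_.begin(), pending_.end());
        runs_.push_back(std::move(pending_));
        pending_.clear();
        if (runs_.size() >= max_runs_) merge_runs();
    }

    // Merge all runs into a single sorted run that replaces the first.
    void merge_runs() {
        if (runs_.size() < 2) return;
        std::vector<std::string> merged;
        for (const std::vector<std::string>& run : runs_) {
            std::vector<std::string> next;
            std::merge(merged.begin(), merged.end(), run.begin(), run.end(),
                       std::back_inserter(next));
            merged = std::move(next);
        }
        runs_.clear();
        runs_.push_back(std::move(merged));
    }

    std::size_t run_capacity_;
    std::size_t max_runs_;
    std::vector<std::string> pending_;
    std::vector<std::vector<std::string> > runs_;
};
```

   Because every run is sorted before it is stored, merging is a linear
pass and the final run can be fed to the index in key order, which is
the reason given above for the fast final insertion.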

10.9.4 WordList CONFIGURATION
-----------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_extend {true|false} (default false)'
     If *true * maintain reference count of unique words. The
     *Noccurrence * method gives access to this count.

`wordlist_verbose <number> (default 0)'
     Set the verbosity level of the WordList class.

     1 walk logic

     2 walk logic details

     3 walk logic lots of details

`wordlist_page_size <bytes> (default 8192)'
     Berkeley DB page size (see Berkeley DB documentation).

`wordlist_cache_size <bytes> (default 500K)'
     Berkeley DB cache size (see Berkeley DB documentation). The cache
     makes a huge difference in performance. It must be at least 2% of
     the expected total data size. Note that if compression is
     activated the data size is eight times larger than the actual file
     size. In this case the cache must be scaled to 2% of the data
     size, not 2% of the file size. See *Cache tuning * in the mifluz
     guide for more hints.  See WordList(3) for the rationale behind
     cache file handling.

`wordlist_cache_max <bytes> (default 0)'
     Maximum cumulative size of the cache files generated when doing
     bulk insertion with the *BatchStart() * function. When this limit is
     reached, the cache files are all merged into the inverted index.
     The value 0 means infinite size allowed.  See WordList(3) for the
     rationale behind cache file handling.

`wordlist_cache_inserts {true|false} (default false)'
     If true all *Insert * calls are cached in memory. When the
     WordList object is closed or a different access method is called
     the cached entries are flushed to the inverted index.

`wordlist_compress {true|false} (default false)'
     Activate compression of the index. The resulting index is eight
     times smaller than the uncompressed index.
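
   The sizing rule given for *wordlist_cache_size * amounts to simple
arithmetic, sketched below. The function name `min_cache_bytes' is
hypothetical; the 2% ratio and the factor of eight for compressed
indexes are the figures quoted above, taken as rules of thumb.

```cpp
#include <cstdint>

// Smallest cache size (in bytes) satisfying the 2% rule of thumb.
//   index_file_bytes: expected size of the index file on disk
//   compressed:       true if wordlist_compress is activated
std::uint64_t min_cache_bytes(std::uint64_t index_file_bytes,
                              bool compressed) {
    // With compression the data is taken to be eight times the file size.
    const std::uint64_t data_bytes =
        compressed ? index_file_bytes * 8 : index_file_bytes;
    // 2% of the data size, rounded up to a whole byte.
    return (data_bytes * 2 + 99) / 100;
}
```

   For a 1 giga byte index file this gives 20 mega bytes without
compression and 160 mega bytes with compression.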

10.9.5 WordList METHODS
-----------------------

`inline WordContext* GetContext()'
     Return a pointer to the WordContext object used to create this
     instance.

`inline const WordContext* GetContext() const'
     Return a pointer to the WordContext object used to create this
     instance as a const.

`virtual inline int Override(const WordReference& wordRef)'
     Insert *wordRef * in index. If the `Key() ' part of the *wordRef *
     exists in the index, override it.  Returns OK on success, NOTOK on
     error.

`virtual int Exists(const WordReference& wordRef)'
     Returns OK if *wordRef * exists in the index, NOTOK otherwise.

`inline int Exists(const String& word)'
     Returns OK if *word * exists in the index, NOTOK otherwise.

`virtual int WalkDelete(const WordReference& wordRef)'
     Delete all entries in the index whose key matches the `Key() '
     part of *wordRef *, using the `Walk ' method.  Returns the number
     of entries successfully deleted.

`virtual int Delete(const WordReference& wordRef)'
     Delete the entry in the index that exactly matches the `Key() '
     part of *wordRef *.  Returns OK if deletion is successful, NOTOK
     otherwise.

`virtual int Open(const String& filename, int mode)'
     Open inverted index *filename *.  *mode * may be `O_RDONLY ' or
     `O_RDWR '.  If mode is `O_RDWR ' it can be or'ed with `O_TRUNC '
     to reset the content of an existing inverted index.  Return OK on
     success, NOTOK otherwise.

`virtual int Close()'
     Close inverted index.  Return OK on success, NOTOK otherwise.

`virtual unsigned int Size() const'
     Return the size of the index in pages.

`virtual int Pagesize() const'
     Return the page size.

`virtual WordDict *Dict()'
     Return a pointer to the inverted index dictionary.

`const String& Filename() const'
     Return the filename given to the last call to Open.

`int Flags() const'
     Return the mode given to the last call to Open.

`inline List *Find(const WordReference& wordRef)'
     Returns the list of word occurrences exactly matching the `Key() '
     part of *wordRef.  * The `List ' returned contains pointers to
     `WordReference ' objects. It is the responsibility of the caller
     to free the list. See List.h header for usage.

`inline List *FindWord(const String& word)'
     Returns the list of word occurrences exactly matching the *word.
     * The `List ' returned contains pointers to `WordReference '
     objects. It is the responsibility of the caller to free the list.
     See List.h header for usage.

`virtual List *operator [] (const WordReference& wordRef)'
     Alias to the *Find * method.

`inline List *operator [] (const String& word)'
     Alias to the *FindWord * method.

`virtual List *Prefix (const WordReference& prefix)'
     Returns the list of word occurrences matching the `Key() ' part of
     *prefix *.  In the `Key() ', the string (accessed with `GetWord()
     ') matches any string that begins with it. The `List ' returned
     contains pointers to `WordReference ' objects. It is the
     responsibility of the caller to free the list.

`inline List *Prefix (const String& prefix)'
     Returns the list of word occurrences matching *prefix *.  The
     *prefix * string matches any word that begins with it. The `List '
     returned contains pointers to `WordReference ' objects. It is the
     responsibility of the caller to free the list.

`virtual List *Words()'
     Returns a list of all unique words contained in the inverted
     index. The `List ' returned contains pointers to `String '
     objects. It is the responsibility of the caller to free the list.
     See List.h header for usage.

`virtual List *WordRefs()'
     Returns a list of all entries contained in the inverted index. The
     `List ' returned contains pointers to `WordReference ' objects. It
     is the responsibility of the caller to free the list. See List.h
     header for usage.

`virtual WordCursor *Cursor(wordlist_walk_callback_t callback, Object *callback_data)'
     Create a cursor that searches all the occurrences in the inverted
     index and calls *callback * with *callback_data * for every match.

`virtual WordCursor *Cursor(const WordKey &searchKey, int action = HTDIG_WORDLIST_WALKER)'
     Create a cursor that searches all the occurrences in the inverted
     index that match *searchKey *.  If *action * is set to
     HTDIG_WORDLIST_WALKER, calls *searchKey.callback * with
     *searchKey.callback_data * for every match. If *action * is set to
     HTDIG_WORDLIST_COLLECT, pushes each match in the
     *searchKey.collectRes * data member as a *WordReference * object.
     It is the responsibility of the caller to free the
     *searchKey.collectRes * list.

`virtual WordCursor *Cursor(const WordKey &searchKey, wordlist_walk_callback_t callback, Object * callback_data)'
     Create a cursor that searches all the occurrences in the inverted
     index that match *searchKey * and calls *callback * with
     *callback_data * for every match.

`virtual WordKey Key(const String& bufferin)'
     Create a WordKey object and return it. The *bufferin * argument is
     used to initialize the key, as in the WordKey::Set method.  The
     first component of *bufferin * must be a word that is translated
     to the corresponding numerical id using the WordDict::Serial
     method.

`virtual WordReference Word(const String& bufferin, int exists = 0)'
     Create a WordReference object and return it. The *bufferin *
     argument is used to initialize the structure, as in the
     WordReference::Set method.  The first component of *bufferin *
     must be a word that is translated to the corresponding numerical
     id using the WordDict::Serial method.  If the *exists * argument
     is set to 1, the method WordDict::SerialExists is used instead,
     that is no serial is assigned to the word if it does not already
     have one.  Before translation the word is normalized using the
     WordType::Normalize method. The word is saved using the
     WordReference::SetWord method.

`virtual WordReference WordExists(const String& bufferin)'
     Alias for Word(bufferin, 1).

`virtual void BatchStart()'
     Accelerate bulk insertions in the inverted index. All insertions
     done with the *Override * method are batched instead of updating
     the inverted index immediately.  No update of the inverted index
     file is done before the *BatchEnd * method is called.

`virtual void BatchEnd()'
     Terminate a bulk insertion started with a call to the *BatchStart
     * method. When all insertions are done the *AllRef * method is
     called to restore statistics.

`virtual int Noccurrence(const String& key, unsigned int& noccurrence) const'
     Return in *noccurrence * the number of occurrences of the string
     contained in the `GetWord() ' part of *key.  * Returns OK on
     success, NOTOK otherwise.

`virtual int Write(FILE* f)'
     Write on file descriptor *f * an ASCII description of the index.
     Each line of the file contains a `WordReference ' ASCII
     description.  Return OK on success, NOTOK otherwise.

`virtual int WriteDict(FILE* f)'
     Write on file descriptor *f * the complete dictionary with
     statistics.  Return OK on success, NOTOK otherwise.

`virtual int Read(FILE* f)'
     Read `WordReference ' ASCII descriptions from *f *, returns the
     number of inserted WordReference or < 0 if an error occurs.
     Invalid descriptions are ignored as well as empty lines.

10.10 WordDict
==============

10.10.1 WordDict NAME
---------------------

manage and use an inverted index dictionary.

10.10.2 WordDict SYNOPSIS
-------------------------


     #include <mifluz.h>

     WordList* words = ...;
     WordDict* dict = words->Dict();

10.10.3 WordDict DESCRIPTION
----------------------------

WordDict maps strings to unique identifiers and frequency in the
inverted index. Whenever a new word is found, the WordDict class can be
asked to assign it a serial number. When doing so, an entry is created
in the dictionary with a frequency of zero. The application may then
increment or decrement the frequency to reflect the inverted index
content.

   The serial numbers range from 1 to 2^32 inclusive.

   A WordDict object is automatically created by the WordList object and
should not be created directly by the application.

10.10.4 WordDict METHODS
------------------------

`WordDict()'
     Private constructor.

`int Initialize(WordList* words)'
     Bind the object to a WordList inverted index. Return OK on success,
     NOTOK otherwise.

`int Open()'
     Open the underlying Berkeley DB sub-database. The enclosing file
     is given by the `words ' data member. Return OK on success, NOTOK
     otherwise.

`int Remove()'
     Destroy the underlying Berkeley DB sub-database. Return OK on
     success, NOTOK otherwise.

`int Close()'
     Close the underlying Berkeley DB sub-database. Return OK on
     success, NOTOK otherwise.

`int Serial(const String& word, unsigned int& serial)'
     If the *word * argument exists in the dictionary, return its
     serial number in the *serial * argument. If it does not already
     exist, assign it a serial number, create an entry with a frequency
     of zero and return the new serial in the *serial * argument.
     Return OK on success, NOTOK otherwise.

`int SerialExists(const String& word, unsigned int& serial)'
     If the *word * argument exists in the dictionary, return its
     serial number in the *serial * argument. If it does not exist, set
     the *serial * argument to WORD_DICT_SERIAL_INVALID.  Return OK on
     success, NOTOK otherwise.

`int SerialRef(const String& word, unsigned int& serial)'
     Short hand for Serial() followed by Ref().  Return OK on success,
     NOTOK otherwise.

`int Noccurrence(const String& word, unsigned int& noccurrence) const'
     Return the frequency of the *word * argument in the *noccurrence *
     argument.  Return OK on success, NOTOK otherwise.

`int Normalize(String& word) const'
     Short hand for words->GetContext()->GetType().Normalize(word).
     Return OK on success, NOTOK otherwise.

`int Ref(const String& word)'
     Short hand for Incr(word, 1)

`int Incr(const String& word, unsigned int incr)'
     Add *incr * to the frequency of the *word *.  Return OK on
     success, NOTOK otherwise.

`int Unref(const String& word)'
     Short hand for Decr(word, 1)

`int Decr(const String& word, unsigned int decr)'
     Subtract *decr * from the frequency of the *word *. If the
     frequency becomes lower than or equal to zero, remove the entry
     from the dictionary and lose the association between the word and
     its serial number.  Return OK on success, NOTOK otherwise.

`int Put(const String& word, unsigned int noccurrence)'
     Set the frequency of *word * to the value of the *noccurrence *
     argument.

`int Exists(const String& word) const'
     Return true if *word * exists in the dictionary, false otherwise.

`List* Words() const'
     Return a pointer to the associated WordList object.

`WordDictCursor* Cursor() const'
     Return a cursor to sequentially walk the dictionary using the
     *Next * method.

`int Next(WordDictCursor* cursor, String& word, WordDictRecord& record)'
     Return the next entry in the dictionary. The *cursor * argument
     must have been created using the `Cursor ' method. The word is
     returned in the *word * argument and the record is returned in the
     *record * argument.  On success the function returns 0, at the end
     of the dictionary it returns DB_NOTFOUND. The *cursor * argument
     is deallocated when the function hits the end of the dictionary
     or an error occurs.

`WordDictCursor* CursorPrefix(const String& prefix) const'
     Return a cursor to sequentially walk the entries of the dictionary
     that start with the *prefix * argument, using the *NextPrefix *
     method.

`int NextPrefix(WordDictCursor* cursor, String& word, WordDictRecord& record)'
     Return the next entry of the dictionary matching the prefix. The
     *cursor * argument must have been created using the `CursorPrefix '
     method. The word is returned in the *word * argument and the
     record is returned in the *record * argument. The *word * is
     guaranteed to start with the prefix specified to the *CursorPrefix
     * method.  On success the function returns 0, at the end of the
     dictionary it returns DB_NOTFOUND. The *cursor * argument is
     deallocated when the function hits the end of the dictionary or an
     error occurs.

`int Write(FILE* f)'
     Dump the complete dictionary to the file descriptor *f *.  The
     format of the dictionary is `word serial frequency ', one per line.
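
   The serial number and frequency bookkeeping described above can be
modelled with a plain map, as in the sketch below. This is not the
WordDict implementation (which is backed by a Berkeley DB
sub-database): the class `DictModel' is hypothetical, `kSerialInvalid'
is a stand-in for WORD_DICT_SERIAL_INVALID whose real value is not
given here, and refusing to increment an unknown word is an assumption.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for WORD_DICT_SERIAL_INVALID.
const unsigned int kSerialInvalid = 0;

class DictModel {
public:
    // Like Serial(): return the serial of `word`, assigning a new one
    // (with a frequency of zero) if the word is not known yet.
    unsigned int serial(const std::string& word) {
        auto it = entries_.find(word);
        if (it == entries_.end())
            it = entries_.emplace(word, Entry{next_serial_++, 0u}).first;
        return it->second.serial;
    }

    // Like SerialExists(): never assigns a serial.
    unsigned int serial_exists(const std::string& word) const {
        auto it = entries_.find(word);
        return it == entries_.end() ? kSerialInvalid : it->second.serial;
    }

    // Like Incr(): add n to the frequency (assumed to fail if unknown).
    bool incr(const std::string& word, unsigned int n) {
        auto it = entries_.find(word);
        if (it == entries_.end()) return false;
        it->second.frequency += n;
        return true;
    }

    // Like Decr(): subtract n; when the frequency would reach zero or
    // below, the entry and its serial number are forgotten.
    bool decr(const std::string& word, unsigned int n) {
        auto it = entries_.find(word);
        if (it == entries_.end()) return false;
        if (n >= it->second.frequency) entries_.erase(it);
        else it->second.frequency -= n;
        return true;
    }

    // Like Noccurrence(): frequency of `word`, 0 if unknown.
    unsigned int frequency(const std::string& word) const {
        auto it = entries_.find(word);
        return it == entries_.end() ? 0u : it->second.frequency;
    }

private:
    struct Entry {
        unsigned int serial;
        unsigned int frequency;
    };
    std::map<std::string, Entry> entries_;
    unsigned int next_serial_ = 1;  // serial numbers start at 1
};
```

   Note the consequence of the `Decr ' rule: once the frequency of a
word drops to zero, asking for its serial again yields a new number
rather than the old one.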

10.11 WordListOne
=================

10.11.1 WordListOne NAME
------------------------

manage and use an inverted index file.

10.11.2 WordListOne SYNOPSIS
----------------------------


     #include <mifluz.h>

     WordContext context;

     WordList* words = context.List();
     // or, constructing the derived class directly:
     WordList* words = new WordListOne(&context);

10.11.3 WordListOne DESCRIPTION
-------------------------------

WordList is the `mifluz ' equivalent of a database handler. Each
WordList object is bound to an inverted index file and implements the
operations to create it, fill it with word occurrences and search for
an entry matching a given criterion.

   The general behaviour of WordListOne is described in the WordList
manual page. It is preferred to create a WordListOne instance by setting
the `wordlist_multi ' configuration parameter to false and calling the
*WordContext::List * method.

   Only the methods that differ from WordList are listed here.  All the
methods of WordList are implemented by WordListOne and you should refer
to the WordList manual page for more information.

   The *Cursor * methods all return a WordCursorOne instance cast to a
WordCursor object.

10.11.4 WordListOne METHODS
---------------------------

`WordListOne(WordContext* ncontext)'
     Constructor. Build inverted index handling object using run time
     configuration parameters listed in the *CONFIGURATION * section of
     the *WordList * manual page.

`int DeleteCursor(WordDBCursor& cursor)'
     Delete the inverted index entry currently pointed to by the
     *cursor.  * Returns 0 on success, Berkeley DB error code on error.
     This is mainly useful when implementing a callback function for a
     *WordCursor.  *

10.12 WordKey
=============

10.12.1 WordKey NAME
--------------------

inverted index key.

10.12.2 WordKey SYNOPSIS
------------------------


     #include <WordKey.h>

     #define WORD_KEY_DOCID    1
     #define WORD_KEY_LOCATION 2

     WordList* words = ...;
     WordKey key = words->Key("word 100 20");
     WordKey searchKey;
     words->Dict()->SerialExists("dog", searchKey.Get(WORD_KEY_WORD));
     searchKey.Set(WORD_KEY_LOCATION, 5);
     WordCursor* cursor = words->Cursor(searchKey);

10.12.3 WordKey DESCRIPTION
---------------------------

Describes the key used to store an entry in the inverted index.  Each
field in the key has a bit in the *set* member that says if it is set
or not. This bit makes it possible to mark a particular field as
`undefined' regardless of the actual value stored. The methods
*IsDefined*, *SetDefined* and *Undefined* are used to manipulate the
`defined' status of a field. The *Pack* and *Unpack* methods are used to
convert to and from the disk storage representation of the key.
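
   The defined/undefined bookkeeping described above can be pictured
with a small stand-alone sketch. It is not the mifluz implementation:
the `SketchKey' type, its members and the choice of a 32 bit mask are
assumptions made for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a key: one value per field plus a bit mask
// recording which fields are defined. This is not the mifluz WordKey class.
struct SketchKey {
    std::vector<std::uint32_t> values;
    std::uint32_t set;  // bit i is 1 when field i is defined

    explicit SketchKey(int nfields) : values(nfields, 0), set(0) {}

    bool IsDefined(int position) const { return ((set >> position) & 1u) != 0; }
    void SetDefined(int position) { set |= (1u << position); }
    void Undefined(int position) { set &= ~(1u << position); }

    // In this sketch, storing a value also marks the field as defined.
    void Set(int position, std::uint32_t val) {
        values[position] = val;
        SetDefined(position);
    }
    std::uint32_t Get(int position) const { return values[position]; }
};
```

   Calling `Undefined' on a field of this sketch only flips the bit; the
stored value stays in place, which mirrors the statement that a field
can be undefined regardless of the value it holds.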

   Although constructors may be used, the preferred way to create a
WordKey object is by using the *WordContext::Key* method.

   The following constants are defined:
`WORD_KEY_WORD'
     the index of the word identifier within the key, for the Set and
     Get methods.

`WORD_KEY_VALUE_INVALID'
     a value that is invalid for any field of the key.

10.12.4 WordKey ASCII FORMAT
----------------------------

The ASCII description is a string with fields separated by tabs or
white space.
     Example: 200 <UNDEF> 1 4 2
     Field 1: The word identifier or <UNDEF> if not defined
     Field 2 to the end: numerical value of the field or <UNDEF> if
                         not defined
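
   As an illustration of the format, the following sketch splits such a
description into fields. It assumes every field, including the word
identifier, is written as a decimal number or as <UNDEF>; the
`SketchParseKey' function and the `SketchField' type are hypothetical
and are not part of mifluz.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One parsed field: `defined` is false when the text was <UNDEF>.
struct SketchField {
    bool defined;
    std::uint32_t value;
};

// Parse a description such as "200 <UNDEF> 1 4 2".
// Note: std::stoul throws std::invalid_argument on a non-numeric token.
std::vector<SketchField> SketchParseKey(const std::string& desc) {
    std::vector<SketchField> fields;
    std::istringstream in(desc);
    std::string token;
    while (in >> token) {  // operator>> skips tabs and spaces alike
        SketchField field;
        field.defined = (token != "<UNDEF>");
        field.value = field.defined
            ? static_cast<std::uint32_t>(std::stoul(token))
            : 0;
        fields.push_back(field);
    }
    return fields;
}
```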

10.12.5 WordKey METHODS
-----------------------

`WordKey(WordContext* ncontext)'
     Constructor. Build an empty key.  The *ncontext * argument must be
     a pointer to a valid WordContext object.

`WordKey(WordContext* ncontext, const String& desc)'
     Constructor. Initialize from an ASCII description of a key.  See
     `ASCII FORMAT ' section.  The *ncontext * argument must be a
     pointer to a valid WordContext object.

`void Clear()'
     Reset to empty key.

`inline int NFields() const'
     Convenience function to access the total number of fields in a
     key (see `WordKeyInfo(3)').

`inline WordKeyNum MaxValue(int position)'
     Convenience function to access the maximum possible value for the
     field at *position* in a key (see `WordKeyInfo(3)').

`inline WordContext* GetContext()'
     Return a pointer to the WordContext object used to create this
     instance.

`inline const WordContext* GetContext() const'
     Return a pointer to the WordContext object used to create this
     instance as a const.

`inline WordKeyNum Get(int position) const'
     Return value of numerical field at *position* as const.

`inline WordKeyNum& Get(int position)'
     Return value of numerical field at *position*.

`inline const WordKeyNum & operator[] (int position) const'
     Return value of numerical field at *position* as const.

`inline WordKeyNum & operator[] (int position)'
     Return value of numerical field at *position*.

`inline void Set(int position, WordKeyNum val)'
     Set value of numerical field at *position* to *val*.

`int IsDefined(int position) const'
     Returns true if field at *position* is `defined', false
     otherwise.

`void SetDefined(int position)'
     Value in field *position* becomes `defined'. A bit is set in
     the bit field describing the defined/undefined state of the value
     and the actual value of the field is not modified.

`void Undefined(int position)'
     Value in field *position* becomes `undefined'. A bit is cleared in
     the bit field describing the defined/undefined state of the value
     and the actual value of the field is not modified.

`int Set(const String& bufferin)'
     Set the whole structure from ASCII string in *bufferin*. See
     `ASCII FORMAT' section.  Return OK if successful, NOTOK
     otherwise.

`int Get(String& bufferout) const'
     Convert the whole structure to an ASCII string description in
     *bufferout*. See `ASCII FORMAT' section.  Return OK if
     successful, NOTOK otherwise.

`String Get() const'
     Convert the whole structure to an ASCII string description and
     return it.  See `ASCII FORMAT ' section.

`int Unpack(const char* string, int length)'
     Set structure from disk storage format as found in the *string*
     buffer of length *length*. Return OK if successful, NOTOK
     otherwise.

`inline int Unpack(const String& data)'
     Set structure from disk storage format as found in *data* string.
     Return OK if successful, NOTOK otherwise.

`int Pack(String& data) const'
     Convert object into disk storage format and place the result in
     the *data* string.  Return OK if successful, NOTOK otherwise.

`int Merge(const WordKey& other)'
     Copy each `defined' field from *other* into the object, if the
     corresponding field of the object is not defined.  Return OK if
     successful, NOTOK otherwise.

`int PrefixOnly()'
     Undefine all fields found after the first undefined field. The
     resulting key has a set of defined fields followed by undefined
     fields.  Returns NOTOK if the word is not defined because the
     resulting key would be empty and this is considered an error.
     Returns OK on success.

`int SetToFollowing(int position = WORD_FOLLOWING_MAX)'
     Implement ++ on a key.

     It behaves like arithmetic but follows these rules:
          . Increment starts at field <position>
          . If a field value overflows, increment field *position* - 1
          . Undefined fields are ignored and their value untouched
          . When a field is incremented all fields to the left are set to 0
     If position is not specified it is equivalent to NFields() - 1.
     It returns OK if successful, NOTOK if *position* is out of range or
     WORD_FOLLOWING_ATEND if the maximum possible value was reached.
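
     The carry behaviour can be sketched as follows. This is one
     possible reading of the rules, not the library code: it only
     models the carry toward field 0 and the skipping of undefined
     fields, and it leaves out the rule that resets other fields to 0.
     The `SketchField' type, the status constants and the function
     name are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a key field: value, largest storable value and
// whether the field is defined. Not the mifluz data layout.
struct SketchField {
    std::uint32_t value;
    std::uint32_t max;
    bool defined;
};

// Made-up status codes standing in for OK, NOTOK, WORD_FOLLOWING_ATEND.
enum { SKETCH_OK = 0, SKETCH_NOTOK = -1, SKETCH_ATEND = 1 };

// Increment the key starting at `position`, carrying toward field 0.
// Assumption: undefined fields are skipped while carrying.
int SketchSetToFollowing(std::vector<SketchField>& key, int position) {
    if (position < 0 || position >= static_cast<int>(key.size()))
        return SKETCH_NOTOK;
    for (int i = position; i >= 0; --i) {
        if (!key[i].defined) continue;   // ignored, value untouched
        if (key[i].value < key[i].max) {
            ++key[i].value;              // room left: increment and stop
            return SKETCH_OK;
        }
        key[i].value = 0;                // overflow: wrap, carry to i - 1
    }
    // Every defined field was at its maximum; in this sketch the
    // wrapped fields are left at 0.
    return SKETCH_ATEND;
}
```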

`int Filled() const'
     Return true if all the fields are `defined ', false otherwise.

`int Empty() const'
     Return true if no fields are `defined ', false otherwise.

`int Equal(const WordKey& other) const'
     Return true if the object and *other * are equal.  Only fields
     defined in both keys are compared.

`int ExactEqual(const WordKey& other) const'
     Return true if the object and *other* are equal.  All fields are
     compared. If a field is defined in *other* and not defined in
     the object, the keys are not considered equal.

`int Cmp(const WordKey& other) const'
     Compare *object* and *other* as in strcmp. Undefined fields are
     ignored. Returns a positive number if *object* is greater than
     *other*, zero if they are equal, a negative number if *object*
     is lower than *other*.

`int PackEqual(const WordKey& other) const'
     Return true if the object and *other* are equal.  The packed
     strings are compared. An `undefined' numerical field will be 0 and
     therefore indistinguishable from a `defined' field whose value is
     0.

`int Outbound(int position, int increment)'
     Return true if adding *increment * in field at *position * makes
     it overflow or underflow, false if it fits.

`int Overflow(int position, int increment)'
     Return true if adding positive *increment* to field at *position*
     makes it overflow, false if it fits.

`int Underflow(int position, int increment)'
     Return true if subtracting positive *increment* from field at
     *position* makes it underflow, false if it fits.

`int Prefix() const'
     Return OK if the key may be used as a prefix for search.  In other
     words return OK if the fields set in the key are all contiguous,
     starting from the first field.  Otherwise returns NOTOK.
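
     The contiguity test, together with the trimming done by
     PrefixOnly, can be sketched on a plain list of defined flags. The
     function names are hypothetical, and treating a key with no
     defined field as unusable is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// `defined[i]` tells whether field i of a hypothetical key is defined.
// The key is usable as a search prefix when its defined fields form one
// contiguous run starting at field 0.
bool SketchIsPrefix(const std::vector<bool>& defined) {
    bool seen_undefined = false;
    for (bool d : defined) {
        if (!d) seen_undefined = true;
        else if (seen_undefined) return false;  // defined field after a gap
    }
    // Assumption: a key with no defined field is not a valid prefix.
    return !defined.empty() && defined[0];
}

// Undefine every field found after the first undefined field.
void SketchPrefixOnly(std::vector<bool>& defined) {
    bool seen_undefined = false;
    for (std::size_t i = 0; i < defined.size(); ++i) {
        if (!defined[i]) seen_undefined = true;
        else if (seen_undefined) defined[i] = false;
    }
}
```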

`static int Compare(WordContext* context, const String& a, const String& b)'
     Compare *a* and *b* in the Berkeley DB fashion.  *a* and *b*
     are packed keys. The semantics of the returned int is as of strcmp
     and is driven by the key description found in `WordKeyInfo'.
     Returns a positive number if *a* is greater than *b*, zero if
     they are equal, a negative number if *a* is lower than *b*.

`static int Compare(WordContext* context, const unsigned char *a, int a_length, const unsigned char *b, int b_length)'
     Compare *a* and *b* in the Berkeley DB fashion.  *a* and *b*
     are packed keys. The semantics of the returned int is as of strcmp
     and is driven by the key description found in `WordKeyInfo'.
     Returns a positive number if *a* is greater than *b*, zero if
     they are equal, a negative number if *a* is lower than *b*.

`int Diff(const WordKey& other, int& position, int& lower)'
     Compare object defined fields with *other* key defined fields
     only, ignore fields that are not defined in object or *other*.
     Return 1 if different, 0 if equal.  If different, *position* is
     set to the number of the field that differs and *lower* is set to
     1 if Get(*position*) is lower than other.Get(*position*), otherwise
     *lower* is set to 0.

`int Write(FILE* f) const'
     Print object in ASCII form on *f * (uses `Get ' method).  See
     `ASCII FORMAT ' section.

`void Print() const'
     Print object in ASCII form on *stdout * (uses `Get ' method).  See
     `ASCII FORMAT ' section.
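
   The difference between the Equal and ExactEqual methods listed above
can be illustrated with a short sketch. It works on a hypothetical
`SketchKey' type rather than on WordKey, and it assumes that two fields
that are both undefined count as equal whatever value they hold.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical field and key types used only for this illustration.
struct SketchField {
    std::uint32_t value;
    bool defined;
};
typedef std::vector<SketchField> SketchKey;

// Compare only the fields that are defined in both keys.
bool SketchEqual(const SketchKey& a, const SketchKey& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].defined && b[i].defined && a[i].value != b[i].value)
            return false;
    }
    return true;
}

// Compare every field; a field defined in one key but not in the other
// makes the keys different.
bool SketchExactEqual(const SketchKey& a, const SketchKey& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].defined != b[i].defined) return false;
        if (a[i].defined && a[i].value != b[i].value) return false;
    }
    return true;
}
```

   With these definitions a key whose last field is undefined is
`Equal' to any key that agrees on the other fields, but it is only
`ExactEqual' to keys that leave the same field undefined.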

10.13 WordKeyInfo
=================

10.13.1 WordKeyInfo NAME
------------------------

information on the key structure of the inverted index.

10.13.2 WordKeyInfo SYNOPSIS
----------------------------


     Helper for the WordKey class.

10.13.3 WordKeyInfo DESCRIPTION
-------------------------------

Describe the structure of the index key ( `WordKey ').  The description
includes the layout of the packed version stored on disk.

10.13.4 WordKeyInfo CONFIGURATION
---------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_wordkey_description <desc> (no default)'
     Describe the structure of the inverted index key.  In the
     following explanation of the `<desc>' format, mandatory words are
     in bold and values that must be replaced are in italic.

     *Word* `bits'/`name' `bits'[/...]

     The `name' is an alphanumerical symbolic name for the key field.
     The `bits' is the number of bits required to store this field.
     Note that all values are stored in unsigned integers (unsigned
     int).  Example:
          Word 8/Document 16/Location 8
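
   To make the layout of such a description concrete, here is a small
sketch that splits it into (name, bits) pairs and adds up the bits. The
parsing rules are inferred from the example above; the
`SketchFieldInfo' type and both functions are hypothetical and do not
exist in mifluz.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One field of the key description: symbolic name and width in bits.
struct SketchFieldInfo {
    std::string name;
    int bits;
};

// Split a description such as "Word 8/Document 16/Location 8" into
// (name, bits) pairs. Returns an empty vector on malformed input.
std::vector<SketchFieldInfo> SketchParseDescription(const std::string& desc) {
    std::vector<SketchFieldInfo> fields;
    std::istringstream in(desc);
    std::string part;
    while (std::getline(in, part, '/')) {
        std::istringstream field(part);
        SketchFieldInfo info;
        info.bits = 0;
        if (!(field >> info.name >> info.bits) || info.bits <= 0)
            return std::vector<SketchFieldInfo>();
        fields.push_back(info);
    }
    return fields;
}

// Total number of bits needed by all the fields of the key.
int SketchTotalBits(const std::vector<SketchFieldInfo>& fields) {
    int total = 0;
    for (std::size_t i = 0; i < fields.size(); ++i) total += fields[i].bits;
    return total;
}
```

   For the example description the sketch yields three fields and a
total of 32 bits.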

10.14 WordType
==============

10.14.1 WordType NAME
---------------------

defines a word in terms of allowed characters, length etc.

10.14.2 WordType SYNOPSIS
-------------------------


     Only called thru WordContext::Initialize()

10.14.3 WordType DESCRIPTION
----------------------------

WordType defines an indexed word and operations to validate a word to
be indexed. All words inserted into the `mifluz' index are
*Normalize*d before insertion. The configuration options give some
control over the definition of a word.

10.14.4 WordType CONFIGURATION
------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_locale <locale> (default C)'
     Set the locale of the program to *locale *. See setlocale(3) for
     more information.

`wordlist_allow_numbers {true|false} (default false)'
     A digit is considered a valid character within a word if this
     configuration parameter is set to `true'; otherwise it is an error
     to insert a word containing digits.  See the *Normalize* method
     for more information.

`wordlist_minimum_word_length <number> (default 3)'
     The minimum length of a word.  See the *Normalize * method for
     more information.

`wordlist_maximum_word_length <number> (default 25)'
     The maximum length of a word.  See the *Normalize * method for
     more information.

`wordlist_truncate {true|false} (default true)'
     If a word is too long according to the
     `wordlist_maximum_word_length' parameter, it is truncated if this
     configuration parameter is `true'; otherwise it is considered an
     invalid word.

`wordlist_lowercase {true|false} (default true)'
     If a word contains upper case letters it is converted to lowercase
     if this configuration parameter is true, otherwise it is left
     untouched.

`wordlist_valid_punctuation [characters] (default none)'
     A list of punctuation characters that may appear in a word.  These
     characters will be removed from the word before insertion in the
     index.

10.14.5 WordType METHODS
------------------------

`int Normalize(String &s) const'
     Normalize a word according to configuration specifications and
     builtin transformations.  *Every* word inserted in the inverted
     index goes through this function. If a word is rejected (return value
     has WORD_NORMALIZE_NOTOK bit set) it will not be inserted in the
     index. If a word is accepted (return value has WORD_NORMALIZE_OK
     bit set) it will be inserted in the index. In addition to these
     two bits, informational values are stored that give information on
     the processing done on the word.  The bit field values and their
     meanings are as follows:
    `WORD_NORMALIZE_TOOLONG'
          the word length exceeds the value of the
          `wordlist_maximum_word_length' configuration parameter.

    `WORD_NORMALIZE_TOOSHORT'
          the word length is smaller than the value of the
          `wordlist_minimum_word_length' configuration parameter.

    `WORD_NORMALIZE_CAPITAL'
          the word contained capital letters and has been converted
          to lowercase. This bit is only set if the
          `wordlist_lowercase' configuration parameter is true.

    `WORD_NORMALIZE_NUMBER'
          the word contains digits and the configuration parameter
          `wordlist_allow_numbers' is set to false.

    `WORD_NORMALIZE_CONTROL'
          the word contains control characters.

    `WORD_NORMALIZE_BAD'
          the word is listed in the file pointed to by the
          `wordlist_bad_word_list' configuration parameter.

    `WORD_NORMALIZE_NULL'
          the word is a zero length string.

    `WORD_NORMALIZE_PUNCTUATION'
          at least one character listed in the
          `wordlist_valid_punctuation' attribute was removed from
          the word.

    `WORD_NORMALIZE_NOALPHA'
          the word does not contain any alphanumerical character.

`static String NormalizeStatus(int flags)'
     Returns a string explaining the return flags of the Normalize
     method.
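
   The way the configuration parameters and the status bits fit together
can be sketched as follows. This is a simplified illustration, not the
mifluz code: the numeric flag values, the `SketchConfig' structure and
the order in which the checks are applied are all assumptions, and only
a subset of the bits is modelled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Flag values are made up for this sketch; the real constants come
// from the mifluz headers.
enum {
    SK_TOOLONG = 1 << 0, SK_TOOSHORT = 1 << 1, SK_CAPITAL = 1 << 2,
    SK_NUMBER = 1 << 3, SK_PUNCTUATION = 1 << 4, SK_NULL = 1 << 5,
    SK_OK = 1 << 6, SK_NOTOK = 1 << 7
};

// Hypothetical mirror of the configuration parameters listed above.
struct SketchConfig {
    std::size_t minimum_word_length;
    std::size_t maximum_word_length;
    bool allow_numbers;
    bool truncate;
    bool lowercase;
    std::string valid_punctuation;
};

// Normalize `s` in place and return a bit field describing what happened.
int SketchNormalize(std::string& s, const SketchConfig& cfg) {
    if (s.empty()) return SK_NULL | SK_NOTOK;
    int flags = 0;
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (cfg.valid_punctuation.find(s[i]) != std::string::npos) {
            flags |= SK_PUNCTUATION;  // allowed punctuation is dropped
            continue;
        }
        if (cfg.lowercase && std::isupper(c)) {
            flags |= SK_CAPITAL;
            c = static_cast<unsigned char>(std::tolower(c));
        }
        if (std::isdigit(c) && !cfg.allow_numbers)
            flags |= SK_NUMBER | SK_NOTOK;
        out += static_cast<char>(c);
    }
    if (out.size() > cfg.maximum_word_length) {
        flags |= SK_TOOLONG;
        if (cfg.truncate) out.resize(cfg.maximum_word_length);
        else flags |= SK_NOTOK;
    }
    if (out.size() < cfg.minimum_word_length)
        flags |= SK_TOOSHORT | SK_NOTOK;
    s = out;
    if (!(flags & SK_NOTOK)) flags |= SK_OK;
    return flags;
}
```

   A caller of this sketch tests the OK or NOTOK bit first and then
looks at the informational bits to find out why a word was changed or
rejected.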

10.15 WordDBInfo
================

10.15.1 WordDBInfo NAME
-----------------------

inverted index usage environment.

10.15.2 WordDBInfo SYNOPSIS
---------------------------


     Only called thru WordContext::Initialize()

10.15.3 WordDBInfo DESCRIPTION
------------------------------

The inverted indexes may be shared among processes/threads and provide
the appropriate locking to prevent mistakes. In addition the memory
cache used by `WordList ' objects may be shared by processes/threads,
greatly reducing the memory needs in multi-process applications.  For
more information about the shared environment, check the Berkeley DB
documentation.

10.15.4 WordDBInfo CONFIGURATION
--------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_env_skip {true,false} (default false)'
     If true no environment is created at all. This must never be used
     if a `WordList ' object is created. It may be useful if only
     `WordKey ' objects are used, for instance.

`wordlist_env_share {true,false} (default false)'
     If true a sharable environment is opened, or created if none exists.

`wordlist_env_dir <directory> (default .)'
     Only valid if `wordlist_env_share' is set to `true'. Specify the
     directory in which the sharable environment will be created. All
     inverted indexes specified with a non-absolute pathname will be
     created relative to this directory.

10.16 WordRecordInfo
====================

10.16.1 WordRecordInfo NAME
---------------------------

information on the record structure of the inverted index.

10.16.2 WordRecordInfo SYNOPSIS
-------------------------------


     Only called thru WordContext::Initialize()

10.16.3 WordRecordInfo DESCRIPTION
----------------------------------

The structure of a record is very limited. It can contain a single
integer value or a string.

10.16.4 WordRecordInfo CONFIGURATION
------------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
`wordlist_wordrecord_description {NONE|DATA|STR} (no default)'
     NONE: the record is empty

     DATA: the record contains an integer (unsigned int)

     STR: the record contains a string (String)

10.17 WordRecord
================

10.17.1 WordRecord NAME
-----------------------

inverted index record.

10.17.2 WordRecord SYNOPSIS
---------------------------


     #include <WordRecord.h>

     WordContext* context;
     WordRecord* record = context->Record();
     if(record->DefaultType() == WORD_RECORD_DATA) {
       record->info.data = 120;
     } else if(record->DefaultType() == WORD_RECORD_STR) {
       record->info.str = "foobar";
     }
     delete record;

10.17.3 WordRecord DESCRIPTION
------------------------------

The record can contain an integer, if the default record type (see
CONFIGURATION in `WordRecordInfo') is set to `DATA', or a string if set
to `STR'. If the type is set to `NONE' the record does not contain any
usable information.

   Although constructors may be used, the preferred way to create a
WordRecord object is by using the *WordContext::Record* method.

10.17.4 WordRecord ASCII FORMAT
-------------------------------

If default type is `DATA ' it is the decimal representation of an
integer. If default type is `NONE ' it is the empty string.
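
   A record and its ASCII form can be pictured with the sketch below.
The `SketchRecord' type and its constants are hypothetical. The text
above does not say how a `STR' record is written, so returning the
string unchanged is an assumption of the sketch.

```cpp
#include <string>

// Made-up constants standing in for WORD_RECORD_{NONE,DATA,STR}.
enum SketchRecordType { SK_RECORD_NONE, SK_RECORD_DATA, SK_RECORD_STR };

// Hypothetical record: a type tag plus the two possible payloads.
struct SketchRecord {
    SketchRecordType type;
    unsigned int data;
    std::string str;
};

// Build the ASCII description of a record.
std::string SketchRecordToAscii(const SketchRecord& record) {
    switch (record.type) {
    case SK_RECORD_DATA:
        return std::to_string(record.data);  // decimal representation
    case SK_RECORD_STR:
        return record.str;                   // assumption, see lead-in
    default:
        return std::string();                // NONE: empty string
    }
}
```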

10.17.5 WordRecord METHODS
--------------------------

`inline WordRecord(WordContext* ncontext)'
     Constructor. Build an empty record.  The *ncontext * argument must
     be a pointer to a valid WordContext object.

`inline void Clear()'
     Reset to empty and set the type to the default specified in the
     configuration.

`inline int DefaultType()'
     Return the default type WORD_RECORD_{DATA,STR,NONE}

`inline int Pack(String& packed) const'
     Convert the object to a representation for disk storage written in
     the *packed * string.  Return OK on success, NOTOK otherwise.

`inline int Unpack(const char* string, int length)'
     Alias for Unpack(String(string, length))

`inline int Unpack(const String& packed)'
     Read the object from a representation for disk storage contained
     in the *packed * argument.  Return OK on success, NOTOK otherwise.

`int Set(const String& bufferin)'
     Set the whole structure from ASCII string description stored in the
     *bufferin * argument.  Return OK on success, NOTOK otherwise.

`int Get(String& bufferout) const'
     Convert the whole structure to an ASCII string description and
     return it in the *bufferout * argument.  Return OK on success,
     NOTOK otherwise.

`String Get() const'
     Convert the whole structure to an ASCII string description and
     return it.

`inline WordContext* GetContext()'
     Return a pointer to the WordContext object used to create this
     instance.

`inline const WordContext* GetContext() const'
     Return a pointer to the WordContext object used to create this
     instance as a const.

`int Write(FILE* f) const'
     Print object in ASCII form on descriptor *f * using the Get method.

10.18 WordReference
===================

10.18.1 WordReference NAME
--------------------------

inverted index occurrence.

10.18.2 WordReference SYNOPSIS
------------------------------


     #include <WordReference.h>

     WordContext* context;
     WordReference* word = context->Word("word");
     WordReference* word = context->Word();
     WordReference* word = context->Word(WordKey("key 1 2"), WordRecord());

     WordKey key = word->Key();
     WordRecord record = word->Record();

     word->Clear();

     delete word;

10.18.3 WordReference DESCRIPTION
---------------------------------

A `WordReference' object is an aggregate of a `WordKey' object and a
`WordRecord' object.

   Although constructors may be used, the preferred way to create a
WordReference object is by using the *WordContext::Word* method.

10.18.4 WordReference ASCII FORMAT
----------------------------------

The ASCII description is a string with fields separated by tabs or
white space. It is made of the ASCII description of a `WordKey ' object
immediately followed by the ASCII description of a `WordRecord '
object.  See the corresponding manual pages for more information.

10.18.5 WordReference METHODS
-----------------------------

`WordReference(WordContext* ncontext)'
     Constructor. Build an object with empty key and empty record.  The
     *ncontext* argument must be a pointer to a valid WordContext
     object.

`WordReference(WordContext* ncontext, const String& key0, const String& record0)'
     Constructor. Build an object from the disk representation of
     *key0* and *record0*.  The *ncontext* argument must be a pointer
     to a valid WordContext object.

`WordReference(WordContext* ncontext, const String& word)'
     Constructor. Build an object whose key word is set to *word*; the
     rest of the key and the record are empty.  The *ncontext* argument
     must be a pointer to a valid WordContext object.

`void Clear()'
     Reset to empty key and record

`inline WordContext* GetContext()'
     Return a pointer to the WordContext object used to create this
     instance.

`inline const WordContext* GetContext() const'
     Return a pointer to the WordContext object used to create this
     instance as a const.

`inline String& GetWord()'
     Return the *word * data member.

`inline const String& GetWord() const'
     Return the *word * data member as a const.

`inline void SetWord(const String& nword)'
     Set the *word * data member from the *nword * argument.

`WordKey& Key()'
     Return the key object.

`const WordKey& Key() const'
     Return the key object as const.

`WordRecord& Record()'
     Return the record object.

`const WordRecord& Record() const'
     Return the record object as const.

`void Key(const WordKey& arg)'
     Copy *arg * in the key part of the object.

`int KeyUnpack(const String& packed)'
     Set key structure from disk storage format as found in *packed*
     string.  Return OK if successful, NOTOK otherwise.

`String KeyPack() const'
     Convert key object into disk storage format and return the
     resulting string.

`int KeyPack(String& packed) const'
     Convert key object into disk storage format and place the result
     in the *packed* string.  Return OK if successful, NOTOK otherwise.

`void Record(const WordRecord& arg)'
     Copy *arg* in the record part of the object.

`int RecordUnpack(const String& packed)'
     Set record structure from disk storage format as found in
     *packed* string.  Return OK if successful, NOTOK otherwise.

`String RecordPack() const'
     Convert record object into disk storage format and return the
     resulting string.

`int RecordPack(String& packed) const'
     Convert record object into disk storage format and place the
     result in the *packed* string.  Return OK if successful, NOTOK
     otherwise.

`inline int Pack(String& ckey, String& crecord) const'
     Short hand for KeyPack( *ckey *) RecordPack( *crecord *).

`int Unpack(const String& ckey, const String& crecord)'
     Short hand for KeyUnpack( *ckey *) RecordUnpack( *crecord *).

`int Merge(const WordReference& other)'
     Merge key with other.Key() using the `WordKey::Merge ' method:
     key.Merge(other.Key()).  See the corresponding manual page for
     details. Copy other.record into the record part of the object.

`static WordReference Merge(const WordReference& master, const WordReference& slave)'
     Copy *master*, merge *slave* into the copy with Merge( *slave*)
     and return the copy. Prevents alteration of *master*.

`int Set(const String& bufferin)'
     Set the whole structure from ASCII string in *bufferin*.  See
     `ASCII FORMAT' section.  Return OK if successful, NOTOK
     otherwise.

`int Get(String& bufferout) const'
     Convert the whole structure to an ASCII string description in
     *bufferout*. See `ASCII FORMAT' section.  Return OK if
     successful, NOTOK otherwise.

`String Get() const'
     Convert the whole structure to an ASCII string description and
     return it.  See `ASCII FORMAT ' section.

`int Write(FILE* f) const'
     Print object in ASCII form on *f * (uses `Get ' method).  See
     `ASCII FORMAT ' section.

`void Print() const'
     Print object in ASCII form on *stdout * (uses `Get ' method).  See
     `ASCII FORMAT ' section.

10.19 WordCursor
================

10.19.1 WordCursor NAME
-----------------------

abstract class to search and retrieve entries in a WordList object.

10.19.2 WordCursor SYNOPSIS
---------------------------


     #include <WordList.h>

     int callback(WordList *, WordDBCursor& , const WordReference *, Object &)
     {
        ...
     }

     Object* data = ...

     WordList *words = ...;

     WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), HTDIG_WORDLIST_COLLECTOR);

     if(search->Walk() == NOTOK) bark;
     List* results = search->GetResults();

     WordCursor *search = words->Cursor(callback, data);
     WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"));
     WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), callback, data);
     WordCursor *search = words->Cursor(WordKey());

     search->WalkInit();
     if(search->WalkNext() == OK)
       dosomething(search->GetFound());
     search->WalkFinish();

10.19.3 WordCursor DESCRIPTION
------------------------------

WordCursor is an iterator on an inverted index. It is created by
calling the `Cursor' method of a `WordList' object. There is no other
way to create a WordCursor object.  When the `Walk*' methods return, the
WordCursor object contains the result of the search and status
information that indicates if it reached the end of the list (IsAtEnd()
method).

   The *callback * function that is called each time a match is found
takes the following arguments:
     WordList* words pointer to the inverted index handle.
     WordDBCursor& cursor to call Del() and delete the current match
     WordReference* wordRef is the match
     Object& data is the user data provided by the caller when
                  search began.

   The `WordKey ' object that specifies the search criterion may be
used as follows (assuming word is followed by DOCID and LOCATION):

   Ex1: *WordKey()* walk the entire list of occurrences.

   Ex2: *WordKey("word <UNDEF> <UNDEF>")* find all occurrences of
`word'.

   Ex3: *WordKey("meet <UNDEF> 1") * find all occurrences of `meet '
that occur at LOCATION 1 in any DOCID. This can be inefficient since
the search has to scan all occurrences of `meet ' to find the ones that
occur at LOCATION 1.

   Ex4: *WordKey("meet 2 <UNDEF>") * find all occurrences of `meet '
that occur in DOCID 2, at any location.

   WordCursor is an abstract class and cannot be instantiated.  See the
WordCursorOne manual page for an actual implementation of a WordCursor
object.
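
   The matching rule behind Ex1 to Ex4 can be illustrated with a
stand-alone sketch in which an undefined field of the criterion matches
any value. The types and functions are hypothetical, and the sketch
scans every entry: it does not model the way a real cursor seeks
directly to a key prefix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: word followed by DOCID and LOCATION.
struct SketchEntry {
    std::string word;
    unsigned docid;
    unsigned location;
};

// Search criterion: each field carries a flag telling if it is defined.
struct SketchCriterion {
    bool has_word;
    std::string word;
    bool has_docid;
    unsigned docid;
    bool has_location;
    unsigned location;
};

// An entry matches when every defined field of the criterion is equal.
bool SketchMatches(const SketchEntry& e, const SketchCriterion& c) {
    if (c.has_word && e.word != c.word) return false;
    if (c.has_docid && e.docid != c.docid) return false;
    if (c.has_location && e.location != c.location) return false;
    return true;
}

// Collect all matching entries with a plain linear scan.
std::vector<SketchEntry> SketchSearch(const std::vector<SketchEntry>& index,
                                      const SketchCriterion& c) {
    std::vector<SketchEntry> found;
    for (std::size_t i = 0; i < index.size(); ++i)
        if (SketchMatches(index[i], c)) found.push_back(index[i]);
    return found;
}
```

   A criterion with no defined field corresponds to Ex1 and returns the
whole list; defining the word and the location but not the document
corresponds to Ex3.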

10.19.4 WordCursor METHODS
--------------------------

`virtual void Clear() = 0'
     Clear all data in object, set *GetResults()* data to NULL but do
     not delete it (the application is responsible for that).

`virtual inline int IsA() const'
     Returns the type of the object. May be overloaded by derived
     classes to differentiate them at runtime.  Returns WORD_CURSOR.

`virtual inline int Optimize()'
     Optimize the cursor before starting a Walk.  Returns OK on
     success, NOTOK otherwise.

`virtual int ContextSave(String& buffer) const = 0'
     Save in *buffer* all the information necessary to resume the walk
     at the point where it stopped. The ASCII representation of the
     last key found (GetFound()) is written in *buffer* using the
     WordKey::Get method.

`virtual int ContextRestore(const String& buffer) = 0'
     Restore from *buffer* all the information necessary to resume the
     walk at the point where it stopped. The *buffer* is expected to
     contain an ASCII representation of a WordKey (see WordKey::Set
     method). A *Seek* is done on the key and the object is prepared
     to jump to the next occurrence when *WalkNext* is called (the
     cursor_get_flags is set to `DB_NEXT').

`virtual int Walk() = 0'
     Walk and collect data from the index.  Returns OK on success,
     NOTOK otherwise.

`virtual int WalkInit() = 0'
     Must be called before other Walk methods are used.  Fill internal
     state according to input parameters and move before the first
     matching entry.  Returns OK on success, NOTOK otherwise.

`virtual int WalkRewind() = 0'
     Move before the first index matching entry.  Returns OK on
     success, NOTOK otherwise.

`virtual int WalkNext() = 0'
     Move to the next matching entry.  Returns OK on success, NOTOK on
     failure and WORD_WALK_ATEND at the end of the list. When OK is
     returned, the GetFound() method returns the matched entry.  When
     WORD_WALK_ATEND is returned, the GetFound() method returns an
     empty object if the end of the index was reached, or the entry
     that was found and that is greater than the specified search
     criterion.

`virtual int WalkNextStep() = 0'
     Advance the cursor one step. The entry pointed to by the cursor may
     or may not match the requirements.  Returns OK if the entry pointed
     to by the cursor matches the requirements.  Returns NOTOK on
     failure. Returns WORD_WALK_NOMATCH_FAILED if the current entry does
     not match the requirements; in that case it is safe to call
     WalkNextStep again until either OK or NOTOK is returned.

`virtual int WalkNextExclude(const WordKey& key)'
     Return 0 if this key must not be returned by WalkNext as a valid
     match. The WalkNextStep method calls this virtual method
     immediately after jumping to the next entry in the database. This
     may be used, for instance, to skip entries that were selected by a
     previous search.

`virtual int WalkFinish() = 0'
     Terminate Walk, free allocated resources.  Returns OK on success,
     NOTOK otherwise.

`virtual int Seek(const WordKey& patch) = 0'
     Move before the inverted index position specified in *patch*.
     May only be called after a successful call to the `WalkNext' or
     `WalkNextStep' method.  Copy defined fields from *patch* into a
     copy of the `found' data member and initialize internal state so
     that `WalkNext' jumps to this key next time it's called
     (cursor_get_flag set to DB_SET_RANGE).  Returns OK if successful,
     NOTOK otherwise.

`virtual inline int IsAtEnd() const'
     Returns true if cursor is positioned after the last possible
     match, false otherwise.

`virtual inline int IsNoMatch() const'
     Returns true if cursor hit a value that does not match search
     criterion.

`inline WordKey& GetSearch()'
     Returns the search criterion.

`inline int GetAction() const'
     Returns the type of action when a matching entry is found.

`inline List *GetResults()'
     Returns the list of WordReference found. The application is
     responsible for deallocation of the list. If the *action * input
     flag bit HTDIG_WORDLIST_COLLECTOR is not set, return a NULL
     pointer.

`inline List *GetTraces()'
     For debugging purposes. Returns the list of WordReference hit
     during the search process. Some of them match the searched key,
     some don't.  The application is responsible for deallocation of
     the list.

`inline void SetTraces(List* traceRes_arg)'
     For debugging purposes. Set the list of WordReference hit during
     the search process.

`inline const WordReference& GetFound()'
     Returns the last entry hit by the search. Only contains a valid
     value if the last `WalkNext' or `WalkNextStep' call was
     successful (i.e. returned OK).

`inline int GetStatus() const'
     Returns the status of the cursor which may be OK or
     WORD_WALK_ATEND.

`virtual int Get(String& bufferout) const = 0'
     Convert the whole structure to an ASCII string description.
     Returns OK if successful, NOTOK otherwise.

`inline String Get() const'
     Convert the whole structure to an ASCII string description and
     return it.

`virtual int Initialize(WordList *nwords, const WordKey &nsearchKey, wordlist_walk_callback_t ncallback, Object * ncallback_data, int naction) = 0'
     Protected method. Derived classes should use this function to
     initialize the object if they do not call a WordCursor constructor
     in their own constructor. Initialization may occur after the
     object is created and must occur before a *Walk** method is
     called. See the DESCRIPTION section for the semantics of the
     arguments.  Returns OK on success, NOTOK on error.

`WordKey searchKey'
     Input data. The key to be searched, see DESCRIPTION for more
     information.

`WordReference found'
     Output data. Last match found. Use GetFound() to retrieve it.

`int status'
     Output data. WORD_WALK_ATEND if cursor is past last match, OK
     otherwise. Use GetStatus() to retrieve it.

`WordList *words'
     The inverted index used by this cursor.
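
   The return-code protocol of `WalkNextStep' and `WalkNext' can be
pictured with a small mock. This is only a sketch under stated
assumptions: `MockCursor', `CountMatches' and the numeric values given
to the status constants are hypothetical and chosen for illustration;
they are not taken from the mifluz headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical values: the real constants are defined by the mifluz headers.
enum { OK = 0, NOTOK = -1, WORD_WALK_ATEND = 1, WORD_WALK_NOMATCH_FAILED = 2 };

// Stand-in for a cursor over a list of words (not the real WordCursor).
class MockCursor {
public:
  MockCursor(const std::vector<std::string>& entries, const std::string& wanted)
    : entries_(entries), wanted_(wanted), pos_(0) {}

  // Move to the next entry and report whether it matches the search.
  int WalkNextStep() {
    if (pos_ >= entries_.size()) return WORD_WALK_ATEND;
    found_ = entries_[pos_++];
    return found_ == wanted_ ? OK : WORD_WALK_NOMATCH_FAILED;
  }

  // Keep stepping while the current entry merely fails to match.
  int WalkNext() {
    int ret;
    while ((ret = WalkNextStep()) == WORD_WALK_NOMATCH_FAILED) {}
    return ret;
  }

  // Last entry visited; only meaningful after a call that returned OK.
  const std::string& GetFound() const { return found_; }

private:
  std::vector<std::string> entries_;
  std::string wanted_;
  std::size_t pos_;
  std::string found_;
};

// Count the matches by calling WalkNext until it stops returning OK.
inline int CountMatches(MockCursor& cursor) {
  int count = 0;
  while (cursor.WalkNext() == OK) ++count;
  return count;
}
```

   In this mock a caller that wants to inspect every entry, matching or
not, drives `WalkNextStep' directly; a caller interested only in matches
uses `WalkNext', which hides the WORD_WALK_NOMATCH_FAILED retries.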

10.20 WordCursorOne
===================

10.20.1 WordCursorOne NAME
--------------------------

search and retrieve entries in a WordListOne object.

10.20.2 WordCursorOne SYNOPSIS
------------------------------


     #include <WordList.h>

     int callback(WordList *, WordDBCursor& , const WordReference *, Object &)
     {
        ...
     }

     Object* data = ...

     WordList *words = ...;

     WordCursor *search = words->Cursor(callback, data);
     WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"));
     WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), callback, data);
     WordCursor *search = words->Cursor(WordKey());

     ...

     if(search->Walk() == NOTOK) bark;
     List* results = search->GetResults();

     search->WalkInit();
     if(search->WalkNext() == OK)
       dosomething(search->GetFound());
     search->WalkFinish();

10.20.3 WordCursorOne DESCRIPTION
---------------------------------

WordCursorOne is a WordCursor derived class that implements search in a
WordListOne object. It is currently the only class derived from
WordCursor. Most of its behaviour is described in the WordCursor
manual page; only the behaviour specific to WordCursorOne is documented
here.

10.20.4 WordCursorOne METHODS
-----------------------------

`WordCursorOne(WordList *words)'
     Private constructor. Creator of the object must then call
     Initialize() prior to using any other methods.

`WordCursorOne(WordList *words, wordlist_walk_callback_t callback, Object * callback_data)'
     Private constructor. See WordList::Cursor method with same
     prototype for description.

`WordCursorOne(WordList *words, const WordKey &searchKey, int action = HTDIG_WORDLIST_WALKER)'
     Private constructor. See WordList::Cursor method with same
     prototype for description.

`WordCursorOne(WordList *words, const WordKey &searchKey, wordlist_walk_callback_t callback, Object * callback_data)'
     Private constructor. See WordList::Cursor method with same
     prototype for description.

10.21 WordMonitor
=================

10.21.1 WordMonitor NAME
------------------------

monitoring classes activity.

10.21.2 WordMonitor SYNOPSIS
----------------------------


     Only called through WordContext::Initialize()

10.21.3 WordMonitor DESCRIPTION
-------------------------------

The test directory contains a `benchmark-report' script used to
generate and archive graphs from the output of `WordMonitor'.

10.21.4 WordMonitor CONFIGURATION
---------------------------------

For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.

`wordlist_monitor_period <sec> (default 0)'
     If the value *sec* is a positive integer, set a timer to print
     reports every *sec* seconds. The timer is set using the ALRM
     signal and will fail if the calling application already has a
     handler on that signal.

`wordlist_monitor_output <file>[,{rrd,readable}] (default stderr)'
     Print reports on *file* instead of the default *stderr*.  If
     the type after the comma is *rrd*, the output is fit for the
     `benchmark-report' script. Otherwise it is a (hardly :-) readable
     string.
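
   Because the periodic report relies on the ALRM signal, an
application may want to check whether it already owns a handler for
that signal before enabling `wordlist_monitor_period'. The following
is a minimal POSIX sketch of such a check; `AlarmHandlerInstalled' and
`ExampleAlarmHandler' are hypothetical names, and the way mifluz itself
detects the conflict is not shown here.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical application handler, used only to demonstrate the check.
extern "C" void ExampleAlarmHandler(int) {}

// Return true if the process already has a handler function on SIGALRM.
// Passing a null new action makes sigaction() a pure query (POSIX).
inline bool AlarmHandlerInstalled() {
  struct sigaction current;
  if (sigaction(SIGALRM, 0, &current) != 0) return false;
  // The default and "ignore" dispositions are not handler functions.
  return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}
```

   An application could call `AlarmHandlerInstalled()' before
WordContext::Initialize() and leave the monitor period at 0 when it
returns true.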

10.22 Configuration
===================

10.22.1 Configuration NAME
--------------------------

reads the configuration file and manages it in memory.

10.22.2 Configuration SYNOPSIS
------------------------------


     #include <Configuration.h>

     Configuration config;

     ConfigDefaults config_defaults[] = {
       { "verbose", "true" },
       { 0, 0 }
     };

     config.Defaults(config_defaults);

     config.Read("/spare2/myconfig") ;

     config.Add("sync", "false");

     if(config["sync"]) ...
     if(config.Value("rate") < 50) ...
     if(config.Boolean("sync")) ...

10.22.3 Configuration DESCRIPTION
---------------------------------

The primary purpose of the *Configuration* class is to parse a
configuration file and allow the application to modify the internal
data structure produced. All values are strings and are converted by the
appropriate accessors. For instance the *Boolean* method will return
numerical true (not zero) if the string either contains a number that
is different from zero or the string `true'.

   The `ConfigDefaults' type is a structure of two char pointers: the
name of the configuration attribute and its value. The end of the
array is the first entry that contains a null pointer instead of the
attribute name. Numerical values must be in strings. For instance:
     ConfigDefaults config_defaults[] = {
       { "wordlist_compress", "true" },
       { "wordlist_page_size", "8192" },
       { 0, 0 }
     };
   The additional fields of the *ConfigDefaults* are purely informative.
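
   The null-terminated array convention can be sketched as follows.
`ConfigDefaultsSketch' and `LoadDefaults' are hypothetical stand-ins
that model only the two documented members; they are not the library's
own declarations.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for the ConfigDefaults structure: only the name and
// value members are modelled, the informative fields are left out.
struct ConfigDefaultsSketch {
  const char* name;
  const char* value;
};

// Copy every entry into a map, stopping at the first null name.
inline std::map<std::string, std::string>
LoadDefaults(const ConfigDefaultsSketch* array) {
  std::map<std::string, std::string> attributes;
  for (const ConfigDefaultsSketch* p = array; p->name != 0; ++p)
    attributes[p->name] = p->value ? p->value : "";
  return attributes;
}

static const ConfigDefaultsSketch config_defaults[] = {
  { "wordlist_compress", "true" },
  { "wordlist_page_size", "8192" },  // numerical values are given as strings
  { 0, 0 }                           // terminator: null attribute name
};
```

   Calling `LoadDefaults(config_defaults)' yields two attributes; the
terminating entry is never stored.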

10.22.4 Configuration FILE FORMAT
---------------------------------

The configuration file is a plain ASCII text file. Each line in the
file is either a comment or an attribute.  Comment lines are blank
lines or lines that start with a '#'.  Attributes consist of a variable
name and an associated value:
     <name>:<whitespace><value><newline>

   The <name> contains any alphanumeric character or underline (_). The
<value> can include any character except newline. It also cannot start
with spaces or tabs since those are considered part of the whitespace
after the colon. It is important to keep in mind that any trailing
spaces or tabs will be included.

   It is possible to split the <value> across several lines of the
configuration file by ending each line with a backslash (\). The effect
on the value is that a space is added where the line split occurs.

   A configuration file can include another file, by using the special
<name>, `include'. The <value> is taken as the file name of another
configuration file to be read in at this point. If the given file name
is not fully qualified, it is taken relative to the directory in which
the current configuration file is found. Variable expansion is
permitted in the file name.  Multiple include statements, and nested
includes are also permitted.
     include: common.conf
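
   The rules above (comments, the colon separator, kept trailing
whitespace and backslash continuation) can be sketched with a few lines
of standard C++. `ParseConfig' is a hypothetical helper, not the
*Read* method: it does not validate the characters of <name>, and it
ignores `include' and variable expansion.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parse "<name>:<whitespace><value>" lines from a stream.
// Assumptions of this sketch: no `include', no variable expansion, no
// validation of <name>; lines without a colon are silently skipped.
inline std::map<std::string, std::string> ParseConfig(std::istream& in) {
  std::map<std::string, std::string> attributes;
  std::string line;
  while (std::getline(in, line)) {
    // Join continuation lines: a trailing backslash becomes one space.
    while (!line.empty() && line[line.size() - 1] == '\\') {
      line[line.size() - 1] = ' ';
      std::string next;
      if (!std::getline(in, next)) break;
      line += next;
    }
    // Blank lines and lines starting with '#' are comments.
    if (line.empty() || line[0] == '#') continue;
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos || colon == 0) continue;
    // Whitespace after the colon is a separator, not part of the value.
    std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
    std::string value = (start == std::string::npos) ? "" : line.substr(start);
    // Trailing spaces or tabs are deliberately kept in the value.
    attributes[line.substr(0, colon)] = value;
  }
  return attributes;
}
```

   Feeding the sketch the line `wordlist_page_size: 8192' followed by
two trailing spaces stores the value with both spaces, which is the
pitfall the text warns about.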

10.22.5 Configuration METHODS
-----------------------------

`Configuration()'
     Constructor

`~Configuration()'
     Destructor

`void Add(const String& str)'
     Add configuration item *str* to the configuration. The value
     associated with it is undefined.

`void Add(const String& name, const String& value)'
     Add configuration item *name* to the configuration and associate
     it with *value*.

`int Remove(const String& name)'
     Remove the *name* from the configuration.

`void NameValueSeparators(const String& s)'
     Let the Configuration know how to parse name value pairs.  Each
     character of string *s* is a valid separator between the `name'
     and the `value'.

`virtual int Read(const String& filename)'
     Read name/value configuration pairs from the file *filename*.

`const String Find(const String& name) const'
     Return the value of configuration attribute *name* as a `String'.

`const String operator[](const String& name) const'
     Alias to the *Find* method.

`int Value(const String& name, int default_value = 0) const'
     Return the value associated with the configuration attribute
     *name*, converted to integer using the atoi(3) function.  If the
     attribute is not found in the configuration and a *default_value*
     is provided, return it.

`double Double(const String& name, double default_value = 0) const'
     Return the value associated with the configuration attribute
     *name*, converted to double using the atof(3) function.  If the
     attribute is not found in the configuration and a *default_value*
     is provided, return it.

`int Boolean(const String& name, int default_value = 0) const'
     Return 1 if the value associated with *name* is *1*, *yes* or
     *true*.  Return 0 if the value associated with *name* is *0*,
     *no* or *false*.

`void Defaults(const ConfigDefaults *array)'
     Load configuration attributes from the `name' and `value'
     members of the *array* argument.
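
   The string-to-number conversions performed by the accessors can be
sketched over a plain std::map. `Attributes', `Value' and `Boolean'
below are free functions written for illustration, not the member
functions of *Configuration*. Two points are assumptions of the sketch:
the comparison is case sensitive, and a missing or unrecognised value
makes `Boolean' return the default.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

typedef std::map<std::string, std::string> Attributes;

// Integer accessor: atoi(3) on the stored string, or the default if absent.
inline int Value(const Attributes& a, const std::string& name,
                 int default_value = 0) {
  Attributes::const_iterator it = a.find(name);
  return it == a.end() ? default_value : std::atoi(it->second.c_str());
}

// Boolean accessor: 1 for "1", "yes" or "true"; 0 for "0", "no" or "false".
// Assumption: anything else, including a missing attribute, yields the default.
inline int Boolean(const Attributes& a, const std::string& name,
                   int default_value = 0) {
  Attributes::const_iterator it = a.find(name);
  if (it == a.end()) return default_value;
  const std::string& v = it->second;
  if (v == "1" || v == "yes" || v == "true") return 1;
  if (v == "0" || v == "no" || v == "false") return 0;
  return default_value;
}
```

   Keeping every value as a string and converting on access means the
same attribute can be read through more than one accessor, at the cost
of repeating the conversion on each call.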

10.23 mifluz
============

10.23.1 mifluz NAME
-------------------

C++ library to use and manage inverted indexes

10.23.2 mifluz SYNOPSIS
-----------------------

     #include <mifluz.h>

     int main()
     {
        Configuration* config = WordContext::Initialize();

        WordList* words = new WordList(*config);

        ...

        delete words;

        WordContext::Finish();
     }

10.23.3 mifluz DESCRIPTION
--------------------------

The purpose of `mifluz' is to provide a C++ library to build and query
a full text inverted index. It is dynamically updatable, scalable (up to
1 TB indexes), uses a controlled amount of memory, shares index files
and memory cache among processes or threads and compresses index files
to 50% of the raw data. The structure of the index is configurable at
runtime and allows inclusion of relevance ranking information. The
query functions do not require loading all the occurrences of a
searched term.  They consume very few resources and many searches can
be run in parallel.

   The file management library used in mifluz is a modified Berkeley DB
(www.sleepycat.com) version 3.1.14.

10.23.4 mifluz CLASSES AND COMMANDS
-----------------------------------

`Configuration'
     reads the configuration file and manages it in memory.

`WordContext'
     read configuration and setup mifluz context.

`WordCursor'
     abstract class to search and retrieve entries in a WordList object.

`WordCursorOne'
     search and retrieve entries in a WordListOne object.

`WordDBInfo'
     inverted index usage environment.

`WordDict'
     manage and use an inverted index dictionary.

`WordKey'
     inverted index key.

`WordKeyInfo'
     information on the key structure of the inverted index.

`WordList'
     abstract class to manage and use an inverted index file.

`WordListOne'
     manage and use an inverted index file.

`WordMonitor'
     monitoring classes activity.

`WordRecord'
     inverted index record.

`WordRecordInfo'
     information on the record structure of the inverted index.

`WordReference'
     inverted index occurrence.

`WordType'
     defines a word in terms of allowed characters, length etc.

`htdb_dump'
     dump the content of an inverted index in Berkeley DB fashion

`htdb_load'
     load the content of an inverted index in Berkeley DB fashion

`htdb_stat'
     displays statistics for Berkeley DB environments.

`mifluzdict'
     dump the dictionary of an inverted index.

`mifluzdump'
     dump the content of an inverted index.

`mifluzload'
     load the content of an inverted index.

`mifluzsearch'
     search the content of an inverted index.

10.23.5 mifluz CONFIGURATION
----------------------------

The format of the configuration file read by WordContext::Initialize is:
     keyword: value
   Comments may be added on lines starting with a #. The default
configuration file is read from the file pointed to by the
*MIFLUZ_CONFIG* environment variable or *~/.mifluz* or
*/etc/mifluz.conf*, in this order. If no configuration file is
available, builtin defaults are used.  Here is an example configuration
file:
     wordlist_extend: true
     wordlist_cache_size: 10485760
     wordlist_page_size: 32768
     wordlist_compress: 1
     wordlist_wordrecord_description: NONE
     wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
     wordlist_monitor: true
     wordlist_monitor_period: 30
     wordlist_monitor_output: monitor.out,rrd
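
   As a rough sizing aid for the `wordlist_cache_size' and
`wordlist_compress' values used in this example, the rule stated under
`wordlist_cache_size' (a cache of at least 2% of the data size, with
the data size taken as eight times the file size when compression is
on) reduces to simple arithmetic. `MinimumCacheBytes' is a hypothetical
helper written for this manual page, not a mifluz function.

```cpp
#include <cassert>

// Minimum cache size in bytes for an index file of the given size.
// With compression, the data size is taken to be eight times the file size.
inline unsigned long long MinimumCacheBytes(unsigned long long file_bytes,
                                            bool compressed) {
  unsigned long long data_bytes = compressed ? file_bytes * 8ULL : file_bytes;
  return data_bytes / 50ULL;  // 2% is 1/50
}
```

   Under this rule a compressed index file of 1000000000 bytes calls
for a cache of at least 160000000 bytes, and the 10485760 byte cache of
the example above covers a compressed file of up to 65536000 bytes.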

`wordlist_allow_numbers {true|false} <number> (default false)'
     A digit is considered a valid character within a word if this
     configuration parameter is set to `true'; otherwise it is an error
     to insert a word containing digits.  See the *Normalize* method
     for more information.

`wordlist_cache_inserts {true|false} (default false)'
     If true, all *Insert* calls are cached in memory. When the
     WordList object is closed or a different access method is called,
     the cached entries are flushed to the inverted index.

`wordlist_cache_max <bytes> (default 0)'
     Maximum size of the cumulated cache files generated when doing bulk
     insertion with the *BatchStart()* function. When this limit is
     reached, the cache files are all merged into the inverted index.
     The value 0 means infinite size allowed.  See WordList(3) for the
     rationale behind cache file handling.

`wordlist_cache_size <bytes> (default 500K)'
     Berkeley DB cache size (see Berkeley DB documentation). Cache makes
     a huge difference in performance. It must be at least 2% of the
     expected total data size. Note that if compression is activated
     the data size is eight times larger than the actual file size. In
     this case the cache must be scaled to 2% of the data size, not 2%
     of the file size. See *Cache tuning* in the mifluz guide for more
     hints.  See WordList(3) for the rationale behind cache file
     handling.

`wordlist_compress {true|false} (default false)'
     Activate compression of the index. The resulting index is eight
     times smaller than the uncompressed index.

`wordlist_env_dir <directory> (default .)'
     Only valid if `wordlist_env_share' is set to `true'.  Specify the
     directory in which the sharable environment will be created. All
     inverted indexes specified with a non-absolute pathname will be
     created relative to this directory.

`wordlist_env_share {true|false} (default false)'
     If true, a sharable environment is opened, or created if none
     exists.

`wordlist_env_skip {true|false} (default false)'
     If true, no environment is created at all. This must never be used
     if a `WordList' object is created. It may be useful if only
     `WordKey' objects are used, for instance.

`wordlist_extend {true|false} (default false)'
     If *true*, maintain a reference count of unique words. The
     *Noccurrence* method gives access to this count.

`wordlist_locale <locale> (default C)'
     Set the locale of the program to *locale*. See setlocale(3) for
     more information.

`wordlist_lowercase {true|false} <number> (default true)'
     If a word contains upper case letters it is converted to lowercase
     if this configuration parameter is true, otherwise it is left
     untouched.

`wordlist_maximum_word_length <number> (default 25)'
     The maximum length of a word.  See the *Normalize* method for
     more information.

`wordlist_minimum_word_length <number> (default 3)'
     The minimum length of a word.  See the *Normalize* method for
     more information.

`wordlist_monitor {true|false} (default false)'
     If true, create a `WordMonitor' instance to gather statistics and
     build reports.

`wordlist_monitor_output <file>[,{rrd,readable}] (default stderr)'
     Print reports on *file* instead of the default *stderr*.  If
     the type after the comma is *rrd*, the output is fit for the
     `benchmark-report' script. Otherwise it is a (hardly :-) readable
     string.

`wordlist_monitor_period <sec> (default 0)'
     If the value *sec* is a positive integer, set a timer to print
     reports every *sec* seconds. The timer is set using the ALRM
     signal and will fail if the calling application already has a
     handler on that signal.

`wordlist_page_size <bytes> (default 8192)'
     Berkeley DB page size (see Berkeley DB documentation)

`wordlist_truncate {true|false} <number> (default true)'
     If a word is too long according to the
     `wordlist_maximum_word_length' it is truncated if this
     configuration parameter is `true'; otherwise it is considered an
     invalid word.

`wordlist_valid_punctuation [characters] (default none)'
     A list of punctuation characters that may appear in a word.  These
     characters will be removed from the word before insertion in the
     index.

`wordlist_verbose <number> (default 0)'
     Set the verbosity level of the WordList class.

     1 walk logic

     2 walk logic details

     3 walk logic lots of details

`wordlist_wordkey_description <desc> (no default)'
     Describe the structure of the inverted index key.  In the
     following explanation of the `<desc>' format, mandatory words are
     in bold and values that must be replaced in italic.

     *Word* `bits'/`name' `bits'[/...]

     The `name' is an alphanumerical symbolic name for the key field.
     The `bits' is the number of bits required to store this field.
     Note that all values are stored in unsigned integers (unsigned
     int).  Example:
          Word 8/Document 16/Location 8

`wordlist_wordkey_document [field ...] (default none)'
     A white space separated list of field numbers that define a
     document.  The field number list must not contain gaps. For
     instance 1 2 3 is valid but 1 3 4 is not valid.  This
     configuration parameter is not used by the mifluz library but may
     be used by a query application to define the semantic of a
     document. In response to a query, the application will return a
     list of results in which only distinct documents will be shown.

`wordlist_wordkey_location field (default none)'
     A single field number that contains the position of a word in a
     given document.  This configuration parameter is not used by the
     mifluz library but may be used by a query application.

`wordlist_wordrecord_description {NONE|DATA|STR} (no default)'
     NONE: the record is empty

     DATA: the record contains an integer (unsigned int)

     STR: the record contains a string (String)
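
   The `<desc>' format of `wordlist_wordkey_description' lends itself
to a short parsing sketch. `KeyField', `ParseKeyDescription' and
`TotalBits' are hypothetical names, and the code follows the `name
bits' form of the example given in that entry (Word 8/Document
16/Location 8); it is not the parser used by the library.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One field of the key: a symbolic name and its width in bits.
struct KeyField {
  std::string name;
  int bits;
};

// Split "Word 8/Document 16/Location 8" into (name, bits) pairs.
// Returns false if a field is not of the form "<name> <bits>".
inline bool ParseKeyDescription(const std::string& desc,
                                std::vector<KeyField>& fields) {
  fields.clear();
  std::istringstream all(desc);
  std::string part;
  while (std::getline(all, part, '/')) {
    std::istringstream one(part);
    KeyField field;
    field.bits = 0;
    if (!(one >> field.name >> field.bits) || field.bits <= 0) return false;
    fields.push_back(field);
  }
  return !fields.empty();
}

// Sum of the field widths, i.e. the number of bits used by the key fields.
inline int TotalBits(const std::vector<KeyField>& fields) {
  int total = 0;
  for (std::size_t i = 0; i < fields.size(); ++i) total += fields[i].bits;
  return total;
}
```

   For the example description the sketch reports three fields and 32
bits in total. A description whose first field carries no bit count,
such as the one in the sample configuration file above, is rejected by
this sketch.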

10.23.6 mifluz ENVIRONMENT
--------------------------

*MIFLUZ_CONFIG* file name of configuration file read by
WordContext(3). Defaults to *~/.mifluz* or */usr/etc/mifluz.conf*.
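
   The lookup order can be sketched as below. `ConfigCandidates' and
`FindConfigFile' are hypothetical helpers. The sketch follows the order
given in the CONFIGURATION section (MIFLUZ_CONFIG, then ~/.mifluz, then
/etc/mifluz.conf) and makes two assumptions: `~' is resolved through the
HOME environment variable, and a candidate that cannot be opened is
skipped in favour of the next one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Candidate paths in the documented order. Unset or empty environment
// variables are skipped.
inline std::vector<std::string> ConfigCandidates() {
  std::vector<std::string> paths;
  const char* env = std::getenv("MIFLUZ_CONFIG");
  if (env && *env) paths.push_back(env);
  const char* home = std::getenv("HOME");
  if (home && *home) paths.push_back(std::string(home) + "/.mifluz");
  paths.push_back("/etc/mifluz.conf");
  return paths;
}

// Return the first candidate that can be opened, or "" when none can,
// in which case the built-in defaults would apply.
inline std::string FindConfigFile() {
  std::vector<std::string> paths = ConfigCandidates();
  for (std::size_t i = 0; i < paths.size(); ++i) {
    std::ifstream probe(paths[i].c_str());
    if (probe) return paths[i];
  }
  return "";
}
```

   Printing the result of `ConfigCandidates()' is a quick way to see
which files a given environment would make the library consider.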

Index of Concepts
*****************

