Bogofilter v1.2.5

Bogofilter FAQ

Official Versions: In English or French or Italian or Bulgarian
Current maintainer since 2013: Matthias Andree <m-a@users.sf.net>
Previous maintainer of ten years: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions about bogofilter.

Typographic conventions

  • If we show example commands that start with a dollar sign ($), this means that these commands should be executed by an unprivileged user, NOT the root user.
  • If we show example commands that start with a hash mark (#), this means that these commands need to be executed by the root user.

Frequently asked questions and their answers


What is bogofilter?

Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article A Plan For Spam. bogofilter uses Gary Robinson's geometric-mean algorithm with the Fisher's method modification to classify email as spam or non-spam.

The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.

Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.

The NEWS file describes bogofilter's version history starting with version 1.0.0. Older news (before release 1.0.0) is in the NEWS.0 file.


Bogo-what?

Bogofilter is some kind of a bogometer or bogon filter, i.e., it tries to identify bogus mail by measuring the bogosity.


How does bogofilter work?

See the man page's THEORY OF OPERATION section for an introduction. The main source for understanding this is Gary Robinson's Linux Journal article "A Statistical Approach to the Spam Problem".

After you read all this you might ask some questions. The first could be "Is bogofilter really a Bayesian spam filter?" Bogofilter is based on Bayes' theorem and uses it in the initial calculations and other statistical methods later. Without doubt it is a statistical spam filter with a Bayesian flavor.

Other questions might concern the basic assumptions of Bayes' theory. Two short answers are: "No, they are not satisfied" and "We don't care as long as it works". A longer answer will mention that the basic assumption that "an e-mail is a random collection of words, each independent of the others" is violated. There are several places where practice doesn't follow theory. Some are always present, and some depend on the way you use bogofilter:

  • Words in an e-mail are by no means independent. In all languages, the opposite is true.
  • The words used are not random, though some spammers include random words.
  • Full training, or training with a random sample, follows Bayes' principles. Choosing messages to use for training violates the assumption that the training messages are a random sample of the messages received. This principle is also violated by bogofilter's auto-update function (with the thresh_update parameter), training on error, or anything similar to these approaches.
  • The same applies if you train with the same message more than once.
  • Other problems arise if you modify your database by removing tokens (like using bogoutil with -a or -c).
  • Undoubtedly there are more.

As the man page explains, bogofilter tries to understand how badly the null hypothesis fails. Some people argue that "those departures from reality usually work in our favor" (from Gary's article). Some argue that, even then, we should not violate too much. Nobody really knows. Just keep in mind that problems might occur if you push too hard. The key to bogofilter's approach is: What matters most is simply what works in the real world.

Now that you have been warned, have fun and use bogofilter as suits you best.


Mailing Lists

There are currently four mailing lists for bogofilter:

List Address | Links | Description
bogofilter-announce@bogofilter.org | [subscribe] [archives: mailman] | An announcement-only list where new versions are announced.
bogofilter@bogofilter.org | [subscribe] [archives: mailman] | A discussion list where any conversation about bogofilter may take place.
bogofilter-dev@bogofilter.org | [subscribe] [archives: mailman] | A list for sharing patches, development, and technical discussions.
bogofilter-cvs@lists.sourceforge.net | [subscribe] [archive] | Mailing list for announcing code changes to the SVN archive. (The CVS name is a leftover from before the migration for our users' convenience.)

The bogofilter-announce list is moderated and is used only for important announcements (e.g. new versions). It is low traffic. If you have subscribed to the users' list or the developers' list, you don't need to subscribe to the announce list. Messages posted to the announce list are also distributed to the others.


How do I start my bogofilter training?

To classify messages as ham (non-spam) or spam, bogofilter needs to learn from your mail. To start with, it is best to have collections (as large as possible) of messages you know for sure are ham or spam. (Errors here will cause problems later, so try hard;-). Warning: Only use your mail; using other collections (like a spam collection found on the web) might cause bogofilter to draw the wrong conclusions. After all, you want it to understand your mail.

Once you have the spam and ham collections, you have basically four choices. In all cases it works better if your training base (the above collections) is bigger, rather than smaller. The smaller your training collection is, the higher the number of errors bogofilter will make in production. Let's assume your collection is two mbox files: ham.mbox and spam.mbox.

  • Method 1) Full training. Train bogofilter with all your messages. In our example:

        bogofilter -s < spam.mbox
        bogofilter -n < ham.mbox

Note: Bogofilter's contrib directory includes two scripts that both use a train-on-error technique. This technique scores each message and adds to the database only those messages that were scored incorrectly (messages scored as uncertain, ham scored as spam, or spam scored as ham). The goal is to build a database of those words needed to correctly classify messages. The resulting database is smaller than the one built using full training.

  • Method 2) Use the script bogominitrain.pl (in the contrib directory). It checks the messages in the same order as your mailbox files. You can use the -f option which will repeat this until all messages in your training collection are classified correctly (you can even adjust the level of certainty). Since the script makes sure the database understands your training collection "exactly" (with your chosen precision), it works very well. You can use -o to create a security margin around your spam_cutoff. Assuming spam_cutoff=0.6 you might want to score all ham in your collection below 0.3 and all spam above 0.9. Our example is:

        bogominitrain.pl -fnv ~/.bogofilter ham.mbox spam.mbox '-o 0.9,0.3'
  • Method 3) Use the script randomtrain (in the contrib directory). The script generates a list of all the messages in the mailboxes, randomly shuffles the list, and then scores each message, with training as needed. In our example:

        randomtrain -s spam.mbox -n ham.mbox

    As with method 4, it works better if you start with full training using several thousand messages. This will give a database that is more comprehensive and significantly bigger.

  • Method 4) If you have enough spams and non-spams in your training collection, separate out some 10,000 spams and 10,000 non-spams into separate mbox files, and train as in method 1. Then use bogofilter to classify the remaining spams and non-spams. Take any messages that it classifies as unsure or classifies incorrectly, and train with those. Here are two little scripts you can use to classify the train-on-error messages:

        #! /bin/sh
        #  class3 -- classify one message as bad, good or unsure
        #  bogofilter exit codes: 0 = spam, 1 = non-spam, 2 = unsure
        cat >msg.$$
        bogofilter "$@" <msg.$$
        res=$?
        if [ $res = 0 ]; then
            cat msg.$$ >>corpus.bad
        elif [ $res = 1 ]; then
            cat msg.$$ >>corpus.good
        elif [ $res = 2 ]; then
            cat msg.$$ >>corpus.unsure
        fi
        rm msg.$$

        #! /bin/sh
        # classify -- put all messages in mbox through class3
        # (class3 must be executable and on your PATH)
        src=$1
        shift
        formail -s class3 "$@" <"$src"

    In our example (after the initial full training):

        classify spam.mbox [bogofilter options]
        bogofilter -s < corpus.good
        rm -f corpus.*
        classify ham.mbox [bogofilter options]
        bogofilter -n < corpus.bad
        rm -f corpus.*

Comparing these methods

It is important to understand the consequences of the methods just described. Doing full training as in methods 1 and 4 produces a larger database than does training with methods 2 or 3. If your database size needs to be small (for example due to quota limitations), use methods 2 or 3.

Full training with method 1 is fastest. Training on error (as in methods 2, 3 and 4) is effective, but the initial training takes longer.


How do I train using maildirs?

Initial training from mbox:

    bogofilter -M -s -I ~/mail/Spam
    bogofilter -M -n -I ~/mail/NonSpam

Initial training from maildir:

    bogofilter -s -B ~/Maildir/.Spam
    bogofilter -n -B ~/Maildir/.NonSpam

Corrective training from mbox:

    bogofilter -M -Ns -I ~/mail/Missed_Spam
    bogofilter -M -Sn -I ~/mail/False_Spam

Corrective training from maildir:

    bogofilter -Ns -B ~/Maildir/.Missed_Spam
    bogofilter -Sn -B ~/Maildir/.False_Spam

How can I keep the scoring accuracy high?

Bogofilter will make mistakes once in a while. So ongoing training is important. There are two main methodologies for doing this. First, you can train with every incoming message (using the -u option). Second, you can train on error only.

Since you might want to rebuild your database at some point, for example when a major new feature is implemented in bogofilter, it can be very useful to update your training collection continuously.

Bogofilter always does the best it can with the information available to it. However, it will make mistakes, i.e., classify ham as spam (false positives) or spam as ham (false negatives). To reduce the likelihood of repeating the mistake, it is necessary to train bogofilter with the errant message. If a message is incorrectly classified as spam, use switch -n to train with it as ham. Use switch -s to train with a spam message.

Bogofilter has a -u switch that automatically updates the wordlists after scoring each message. As bogofilter sometimes misclassifies a message, monitoring is necessary to correct any mistakes. Corrections can be done using -Sn to change a message's classification from spam to non-spam and -Ns to change it from non-spam to spam.

Correcting a misclassified message may affect the classification of other messages. The smaller your database is, the higher the likelihood that a training error will cause a misclassification.

Using a method like #2 or #3 (above) can compensate for this effect. Repeat the training with your complete training collection (including all the new messages added since the earlier training). This adds the messages that show the adverse effect, on both the spam and the ham side, until a new equilibrium is reached.

An alternative strategy, based on method 4 in the previous section, is the following: Periodically take blocks of messages and use the scripts in method 4 above to classify them. Then manually review the good, bad and unsure files, correct any errors, and split the unsures into spam and non-spam. Until you have accumulated some 10,000 spam and 10,000 non-spam in your training database, train with the good, the bad, and the separated errors and unsures; thereafter, train with only the separated errors and unsures, discarding the messages that bogofilter already classifies correctly.


What mailbox (file) formats does bogofilter understand?

Bogofilter understands the traditional Unix mbox format as well as the Maildir and MH formats. Note, though, that bogofilter does not support subfolders: in MH or Maildir++ folders you will have to list them explicitly, giving the full path to each subfolder.
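
Since subfolders have to be listed explicitly, a small helper can collect them. The following is a sketch only: the function name list_maildir_folders is made up, and it assumes the Maildir++ layout in which subfolders are dot-prefixed directories directly inside the top-level maildir, each with its own cur directory.

```sh
#!/bin/sh
# list_maildir_folders TOP
#   Print TOP itself and every Maildir++ subfolder directly beneath it.
#   A directory counts as a folder here when it contains a "cur" directory.
list_maildir_folders() {
    top=$1
    if [ -d "$top/cur" ]; then
        printf '%s\n' "$top"
    fi
    for sub in "$top"/.[!.]*; do
        if [ -d "$sub/cur" ]; then
            printf '%s\n' "$sub"
        fi
    done
    return 0
}
```

Each printed path can then be handed to bogofilter individually, for example with the -B option shown in the maildir training section.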

For unsupported formats, you will have to convert the mailbox to a format bogofilter understands. Mbox is often convenient because it can be piped into bogofilter.

For example, to convert UW-IMAP/PINE mbx format to mbox:

    mailtool copy /full/path/to/mail.mbox '#driver.unix//full/path/to/mbox'

or, to convert a maildir to mbox:

    for MSG in /full/path/to/maildir/* ; do 
        formail -I Status: < "$MSG" >> /full/path/to/mbox
    done

What does bogofilter's verbose output mean?

Bogofilter can be instructed to display information on the scoring of a message by running it with flags "-v", "-vv", "-vvv", or "-R".

  • Using "-v" causes bogofilter to generate the "X-Bogosity:" header line, e.g.
        X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
  • Using "-vv" causes bogofilter to generate a histogram, e.g.
        X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
          int  cnt    prob   spamicity  histogram
         0.00   29  0.000209  0.000052  #############################
         0.10    2  0.179065  0.003425  ##
         0.20    2  0.276880  0.008870  ##
         0.30   18  0.363295  0.069245  ##################
         0.40    0  0.000000  0.069245
         0.50    0  0.000000  0.069245
         0.60   37  0.667823  0.257307  #####################################
         0.70    5  0.767436  0.278892  #####
         0.80   13  0.836789  0.334980  #############
         0.90   32  0.984903  0.499835  ################################

    Each row shows an interval, the count of tokens with scores in that interval, the average spam probability for those tokens, the message's spamicity score (for those tokens and all lesser valued tokens), and a bar graph corresponding to the token count.

    In the above histogram there are a lot of low scoring tokens and a lot of high scoring tokens. They "balance" one another to give the spamicity score of 0.5000.

  • Using "-vvv" produces a list of all the tokens in the message with information on each one, e.g.
        X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
                              n    pgood     pbad      fw     U
        "which"              10  0.208333  0.000000  0.000041 +
        "own"                 7  0.145833  0.000000  0.000059 +
        "having"              6  0.125000  0.000000  0.000069 +
        ...
        "unsubscribe.asp"     2  0.000000  0.095238  0.999708 +
        "million"             4  0.000000  0.190476  0.999854 +
        "copy"                5  0.000000  0.238095  0.999883 +
        N_P_Q_S_s_x_md      138  0.00e+00  0.00e+00  5.00e-01
                                 1.00e-03  4.15e-01  0.100
    The columns printed contain the following information:
    "…"
        the token in question
    n
        number of times this token was encountered in training
    pgood
        proportion of good messages that contained this token
    pbad
        proportion of spam messages that contained this token
    fw
        Robinson's weighted index, which combines pgood and pbad to give a value that will be close to zero if a message containing this token is likely to be non-spam and close to one if it's likely to be spam
    U
        '+' if this token contributes to the final bogosity value, '-' otherwise. A token is excluded when its score is closer to 0.5 than min_dev.

    The final lines show:

    • The cumulative results of the columns
    • The values of Robinson's s and x parameters and of min_dev
  • Using "-R" produces the "-vvv" output described above plus two additional columns:
    invfwlog
        logarithm of (1-fw)
    fwlog
        logarithm of fw

    The "-R" output is formatted for use with the R language for statistical computing. More information is available at The R Project for Statistical Computing.
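
The fw column of the "-vvv" sample can be reproduced by hand. The sketch below assumes the formula from Gary Robinson's article, f(w) = (s*x + n*p) / (s + n) with p = pbad / (pgood + pbad), and takes s = 1.00e-03 and x = 4.15e-01 from the final line of the sample output. The function name fw is made up for this example, and bogofilter's own code may differ in details such as rounding.

```sh
#!/bin/sh
# fw N PGOOD PBAD S X
#   Robinson's f(w) = (s*x + n*p) / (s + n), where p = pbad / (pgood + pbad).
#   When a token has never been seen (pgood + pbad = 0), p is set to x,
#   so that f(w) falls back to x.
fw() {
    awk -v n="$1" -v pgood="$2" -v pbad="$3" -v s="$4" -v x="$5" 'BEGIN {
        total = pgood + pbad
        p = (total > 0) ? pbad / total : x
        printf "%.6f\n", (s * x + n * p) / (s + n)
    }'
}
```

With the values from the sample, "fw 10 0.208333 0.000000 0.001 0.415" prints 0.000041 (the row for "which") and "fw 5 0.000000 0.238095 0.001 0.415" prints 0.999883 (the row for "copy"). Both tokens lie much further than min_dev (0.100) from 0.5, which is why they are marked '+'.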


What is Unsure mode?

Bogofilter's default configuration will classify a message as spam or non-spam. The SPAM_CUTOFF parameter is used for this. Messages with scores greater than or equal to SPAM_CUTOFF are classified as spam. Other messages are classified as ham.

There is also a HAM_CUTOFF parameter. When used, messages must have scores less than or equal to HAM_CUTOFF to be classified as ham. Messages with scores between HAM_CUTOFF and SPAM_CUTOFF are classified as unsure. If you look in bogofilter.cf, you will see the following lines:

    #### CUTOFF Values
    #
    #    both ham_cutoff and spam_cutoff are allowed.
    #    setting ham_cutoff to a non-zero value will
    #    enable tri-state results (Spam/Ham/Unsure).
    #
    #ham_cutoff  = 0.45
    #spam_cutoff = 0.99
    #
    #    for two-state classification:
    #
    ## ham_cutoff = 0.00
    ## spam_cutoff= 0.99

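To turn on Spam/Ham/Unsure classification, remove the leading # from the first pair of lines (ham_cutoff = 0.45 and spam_cutoff = 0.99). The last two lines select two-state classification.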

If you'd rather use the labels Yes/No/Unsure instead of Spam/Ham/Unsure, also remove the #'s from the following bogofilter.cf line:

    ## spamicity_tags = Yes, No, Unsure

Once that's done, you may want to set the filtering rules for your mail program to include rules like:

    if header contains "X-Bogosity: Spam", put in Spam folder
    if header contains "X-Bogosity: Unsure", put in Unsure folder

Alternatively, bogofilter.cf has directives for modifying the Subject: line, e.g.

    #### SPAM_SUBJECT_TAG
    #
    #    tag added to "Subject: " line for identifying spam or unsure
    #    default is to add nothing.
    #
    ##spam_subject_tag=***SPAM***
    ##unsure_subject_tag=???UNSURE???

With these subject tags, the filtering rules would look like:

    if subject contains "***SPAM***", put in Spam folder
    if subject contains "???UNSURE???", put in Unsure folder
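
The cutoff rules described in this section can be summarized in a few lines. This is a sketch of the decision rule as stated above, not bogofilter's actual code; the function name classify_score is made up, and a ham cutoff of zero is treated as two-state mode, as the bogofilter.cf comments describe.

```sh
#!/bin/sh
# classify_score SCORE HAM_CUTOFF SPAM_CUTOFF
#   score >= spam_cutoff               -> Spam
#   ham_cutoff is 0 (two-state mode)   -> Ham for everything else
#   score <= ham_cutoff                -> Ham
#   otherwise                          -> Unsure
classify_score() {
    awk -v score="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (score >= spam)                 print "Spam"
        else if (ham == 0 || score <= ham) print "Ham"
        else                               print "Unsure"
    }'
}
```

For example, "classify_score 0.500000 0.45 0.99" prints Unsure, while the same score with the two-state settings, "classify_score 0.500000 0.00 0.99", prints Ham.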

What are "training on error" and "training to exhaustion"?

"Training on error" involves scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train; if you train bogofilter on the hard messages, it learns to handle obvious spam and non-spam too.

This method can be enhanced by using a "security margin". By increasing the spam cutoff value and decreasing the ham cutoff value, messages which are close to a cutoff will be used for training. Using security margins improves results when training on error. In general, greater margins help more (although too much also isn't optimal). As a rule of thumb spam cutoff +/- 0.3 gives good results. For tristate mode, you might try the middle of the unsure interval +/- 0.3 for training.

Repeating training on error on the same message corpus can improve accuracy. The idea is that messages which were rated correctly in the first place might after some more training be rated wrongly which will then be corrected.

"Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain. This method can also be improved with security margins. See Gary Robinson's Rants on this topic for more details.

Note: bogominitrain.pl has a -f option to do "training to exhaustion". Using -fn avoids repeated training for each message.
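
The security margin can be read as a simple test applied to each message of known class. The sketch below is an illustration only (the function name needs_training is made up): a spam message is used for training unless it scores at least cutoff + margin, and a ham message is used unless it scores at most cutoff - margin.

```sh
#!/bin/sh
# needs_training CLASS SCORE CUTOFF MARGIN
#   CLASS is the known class of the message, "spam" or "ham".
#   Prints "train" when the score is not safely on the right side of the
#   cutoff, "skip" otherwise.
needs_training() {
    awk -v class="$1" -v score="$2" -v cutoff="$3" -v margin="$4" 'BEGIN {
        if (class == "spam")      train = (score < cutoff + margin)
        else if (class == "ham")  train = (score > cutoff - margin)
        else                      exit 2
        print (train ? "train" : "skip")
    }'
}
```

With spam_cutoff=0.6 and a margin of 0.3, a spam scoring 0.75 is classified correctly but still lies inside the margin, so "needs_training spam 0.75 0.6 0.3" prints train, whereas a spam scoring 0.95 gives skip.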


What does the '-u' (autoupdate) switch do?

The "-u" switch (autoupdate) is used to automatically expand the wordlist. When this switch is used and bogofilter classifies a message as Spam or Ham, the message's tokens are added to the wordlist with a ham/spam tag (as appropriate).

As an example, suppose a new "Refinance now - best Mortgage rates" message comes in. It will have some words that bogofilter has seen and (probably) some new ones as well. Using '-u' the new words will be added to the wordlist so that bogofilter can better recognize the next, related message.

If/when you choose to use '-u', you need to be on the lookout for classification errors and retrain bogofilter with any messages that have been classified incorrectly. An incorrectly classified message that is auto-updated _may_ cause bogofilter to make additional classification errors in the future. This is the same problem as when you (the sys admin) incorrectly register a ham message as spam (or vice versa).


How can I use SpamAssassin to train Bogofilter?

If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:

    BOGOFILTER     = "/usr/bin/bogofilter"
    BOGOFILTER_DIR = "training"
    SPAMASSASSIN  = "/usr/bin/spamassassin"

    :0 HBc
    * ? $SPAMASSASSIN -e
    #spam yields non-zero
    #non-spam yields zero
    | $BOGOFILTER -n -d $BOGOFILTER_DIR
    #else (E)
    :0Ec
    | $BOGOFILTER -s -d $BOGOFILTER_DIR

    :0fw
    | $BOGOFILTER -p -e

    :0:
    * ^X-Bogosity:.Spam
    spam

    :0:
    * ^X-Bogosity:.Ham
    non-spam

What can I do about Asian spam?

Many people get unsolicited email using Asian language charsets. Since they don't know the languages and don't know people there, they assume it's spam.

The good news is that bogofilter does detect them quite successfully. The bad news is that this can be expensive. You have basically two choices:

  • You can simply let bogofilter handle it. Just train bogofilter with the Asian language messages identified as spam. Bogofilter will parse the messages as best it can and will add tokens to the spam wordlist. The wordlist will contain many tokens which don't make sense to you (since the charset cannot be displayed), but bogofilter can work with them and successfully identify Asian spam.

    A second method is to use the "replace_nonascii_characters" config file option. This will replace high-bit characters, i.e. those between 0x80 and 0xFF, with question marks, '?'. This keeps the database much smaller. Unfortunately this conflicts with European languages, which have many accented vowels and consonants in the high-bit range.

  • If you are sure you will not receive any legitimate messages in those languages, you can kill them right away. This will keep the database smaller. You can do this with an MDA script.

    Here's a procmail recipe that will sideline messages written with Asian charsets:

        ## Silently drop all Asian language mail
        UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
        :0:
        * 1^0 $ ^Subject:.*=\?($UNREADABLE)
        * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
        spam-unreadable
    
        :0:
        * ^Content-Type:.*multipart
        * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
        spam-unreadable

    With the above recipe, bogofilter will never see the message.
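
If your MDA is a shell script rather than procmail, a similar check can be made with grep. This is a rough approximation of the recipe above and not an exact translation: the function name has_asian_charset is made up, folded header lines are not handled, and the whole message (headers and body) is searched, which loosely covers the multipart case.

```sh
#!/bin/sh
# has_asian_charset FILE
#   Succeeds (exit 0) when FILE has a Subject: line with an encoded word,
#   or a Content-Type: line with a charset parameter, naming one of the
#   charsets from the procmail recipe. Matching is case-insensitive.
has_asian_charset() {
    charsets='big5|iso-2022-jp|iso-2022-kr|euc-kr|gb2312|ks_c_5601-1987'
    grep -Eiq "^(Subject:.*=\?|Content-Type:.*charset=\"?)($charsets)" "$1"
}
```

A delivery script could call "has_asian_charset msgfile" first and file matching messages into the spam-unreadable folder without running bogofilter at all.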


How can I compact my database?

You can periodically compact the database so it occupies a minimum of disk space. Assuming your wordlist is in directory ~/.bogofilter, for bogofilter 0.93.0 (or newer) use:

    bf_compact ~/.bogofilter wordlist.db

For bogofilter older than 0.93.0, use:

    cd ~/.bogofilter
    bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
    mv wordlist.db wordlist.db.prv
    mv wordlist.db.new wordlist.db

The bf_compact script is needed to duplicate your database environment (in order to support BerkeleyDB transaction processing). Your original directory will be renamed to ~/.bogofilter.old and ~/.bogofilter will contain the new database environment.

Since older versions of bogofilter don't use Berkeley DB transactions, the database is just a single file (wordlist.db) and it isn't necessary to use the script. The commands shown above create a new compact database and rename the original file to wordlist.db.prv

Note: it's O.K. to use the bf_compact script with old versions of bogofilter.


How do I manually query the database?

To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR/wordlist.db example.com" gives the good and bad counts for "example.com".

If you want the spam score in addition to the spam and ham counts for a token (word) use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR/wordlist.db example.com" gives the good and bad counts and the spam score for "example.com".

To find out how many messages are in your wordlists query the special token .MSG_COUNT, i.e., run command "bogoutil -w $BOGOFILTER_DIR/wordlist.db .MSG_COUNT" to see the counts for the spam and ham wordlists.

To tell how many tokens are in your wordlists pipe the output of bogoutil's dump command to command "wc", i.e. use "bogoutil -d $BOGOFILTER_DIR/wordlist.db | wc -l " to display the count.
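
The message and token counts can also be extracted in one pass over a dump. The sketch below assumes that each dump line has the form "token spam-count ham-count date"; check a few lines of your own "bogoutil -d" output before relying on the column order. The function name summarize_dump is made up for this example.

```sh
#!/bin/sh
# summarize_dump
#   Reads dump text on stdin (assumed format: token spam ham date) and
#   prints the number of ordinary tokens plus the two message counts
#   stored under the special token .MSG_COUNT.
summarize_dump() {
    awk '
        $1 == ".MSG_COUNT" { spam_msgs = $2; ham_msgs = $3; next }
        { tokens++ }
        END { printf "tokens=%d spam_msgs=%d ham_msgs=%d\n", tokens, spam_msgs, ham_msgs }
    '
}
```

It would be used as "bogoutil -d $BOGOFILTER_DIR/wordlist.db | summarize_dump". Unlike the plain "wc -l" count, .MSG_COUNT itself is not counted as a token here.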


Can I use multiple wordlists?

Yes. Bogofilter can be run with multiple wordlists. For example, if you have both user and system wordlists, bogofilter can be instructed to check the user list and, if the word isn't there, then check the system list. Alternatively, it can be instructed to add together the information from the two lists.

Following are the config file options and some examples:

A wordlist has several attributes, notably type, name, filename, and precedence.

  • Type: 'R' and 'I' (for "regular" and "ignore"). Current wordlists are of type 'R'. Type 'I' means "don't score the token if found in the ignore list".
  • Name: a short identifying symbol used when printing error messages. Examples are "global", "user", and "ignore", but you can use any identifier you want.
  • Filename: the name (path) of the file. When opening the wordlist, if the name is fully qualified (with a leading '/' or '~'), that name is used. If the name isn't fully qualified, bogofilter will prepend the directory, using the usual search order, i.e. $BOGOFILTER_DIR, $BOGODIR, $HOME.
  • Precedence: an integer like 1, 2, 3, ... Wordlists are searched in ascending order for the token. If the search token is found, lists with the same precedence number will be checked (and counts added together). Lists with higher precedence numbers will not be checked.

Example 1 - merge user and system lists:

    wordlist R,user,~/wordlist.db,1
    wordlist R,system,/var/spool/bogofilter/wordlist.db,1

Example 2 - prefer user to system list:

    wordlist R,user,~/wordlist.db,2
    wordlist R,system,/var/spool/bogofilter/wordlist.db,3

Example 3 - prefer system to user list:

    wordlist R,user,~/wordlist.db,5
    wordlist R,system,/var/spool/bogofilter/wordlist.db,4

Note 1: bogofilter's registration flags ('-s', '-n', '-u', '-S', '-N' ) will apply to the lowest numbered list.

Note 2: having lists of types 'R' and 'I' of the same precedence won't be allowed because the types are contradictory.
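
The precedence rule can be tried out with plain text files standing in for the wordlists. The following simulation is a sketch, not bogofilter's lookup code: the function name lookup_token, the "precedence:file" argument form, and the stand-in file format (lines of "token spam-count ham-count") are all assumptions made for this example.

```sh
#!/bin/sh
# lookup_token TOKEN SPEC...
#   Each SPEC is "precedence:file". Lists are searched in ascending
#   precedence order. Once the token is found, counts from lists with the
#   same precedence are added together and higher-numbered lists are not
#   consulted. Prints "spam-count ham-count".
lookup_token() {
    token=$1
    shift
    printf '%s\n' "$@" | sort -t: -k1,1n | {
        found_level=""
        spam=0
        ham=0
        while IFS=: read -r level file; do
            # Stop at the first level beyond the one where the token was found.
            if [ -n "$found_level" ] && [ "$level" != "$found_level" ]; then
                break
            fi
            counts=$(awk -v t="$token" '$1 == t { print $2, $3; exit }' "$file")
            if [ -n "$counts" ]; then
                found_level=$level
                spam=$((spam + ${counts% *}))
                ham=$((ham + ${counts#* }))
            fi
        done
        echo "$spam $ham"
    }
}
```

With a user file containing "viagra 5 0" and a system file containing "viagra 7 1", giving both lists precedence 1 (as in Example 1) yields "12 1", while precedences 2 and 3 (as in Example 2) yield "5 0" because the system list is never consulted.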


Can I tell bogofilter to ignore certain tokens?

Through the use of an ignore list, bogofilter will ignore the listed tokens when scoring the message.

Example:

    wordlist I,ignore,~/ignorelist.db,7
    wordlist R,system,/var/spool/bogofilter/wordlist.db,8

Because ignorelist.db has a lower index (7) than wordlist.db (8), bogofilter will stop looking when it finds a token in ignorelist.db.

Note: Technically, bogofilter gives a score of ROBX to the tokens and expects the min_dev parameter to drop them from the scoring.

There are two main methods for building/maintaining an ignore list.

First, a text file can be created and maintained using any text editor. Bogoutil can convert the text file to database format, e.g. "bogoutil -l ignorelist.db < ignorelist.txt".

Alternatively, echo ... | bogoutil ... can be used to add a single token, for example "ignore.me", as in:

  echo ignore.me | bogoutil -l ~/ignorelist.db

How do I upgrade from separate word databases to the combined wordlist format?

Run script bogoupgrade. For more info, run "bogoupgrade -h" to see its help message or run "man bogoupgrade" and read its man page.


How can I tell if my wordlists are corrupted?

NOTE: some distributors rename all the db_ utilities given below by inserting or appending the version number, with or without dot, for instance db4.1_verify or db_verify-4.2. There is no standard on the renaming of these utilities.

If you think your wordlists are hosed, you can see what BerkeleyDB thinks by running:

    db_verify wordlist.db

You may be able to recover some (or all) of the tokens and their counts with the following commands:

    bogoutil -d wordlist.db | bogoutil -l wordlist.new.db

or - if there has been more damage to the token list - with

    db_dump -r wordlist.db > wordlist.txt
    db_load wordlist.new.db < wordlist.txt

You can also use a text file instead of a pipe, as in:

    bogoutil -d wordlist.db > wordlist.txt
    bogoutil -l wordlist.db.new < wordlist.txt

How can I convert my wordlist to/from unicode?

Wordlists can be converted from raw storage to unicode using:

    bogoutil -d wordlist.db > wordlist.raw.txt
    iconv -f iso-8859-1 -t utf-8 < wordlist.raw.txt > wordlist.utf8.txt
    bogoutil -l wordlist.db.new < wordlist.utf8.txt

or:

    bogoutil --unicode=yes -m wordlist.db

Wordlists can be converted from unicode to raw storage using:

    bogoutil -d wordlist.db > wordlist.utf8.txt
    iconv -f utf-8  -t iso-8859-1 < wordlist.utf8.txt > wordlist.raw.txt
    bogoutil -l wordlist.db.new < wordlist.raw.txt

or:

    bogoutil --unicode=no -m wordlist.db

The above methods work best when the wordlist is based on the iso-8859-1 charset. If your wordlist is based on a different charset, for example CP866 or KOI8-R, use that charset in the above commands.

For a wordlist containing tokens from multiple languages, particularly non-European languages, the conversion methods described above may not work well. Building a new wordlist (from scratch) will likely work better as the new wordlist will be based solely on unicode.


How can I switch from non-transaction to transaction mode?

How to do this is fully documented in file doc/README.db section 2.2.1. We suggest you read the whole section.

In brief, use these commands:

    cd ~/.bogofilter
    bogoutil -d wordlist.db > wordlist.txt
    mv wordlist.db wordlist.db.old
    bogoutil --db-transaction=yes -l wordlist.db < wordlist.txt

If everything went well, you can remove the backup files:

    rm wordlist.db.old wordlist.txt

How can I switch from transaction to non-transaction mode?

How to do this is fully documented in file doc/README.db section 2.2.2. We suggest you read the whole section.

In brief, you can use bogoutil to dump/load the wordlist, for example:

    cd ~/.bogofilter
    bogoutil -d wordlist.db > wordlist.txt
    mv wordlist.db wordlist.db.old
    rm -f log.?????????? __db.???
    bogoutil --db-transaction=no -l wordlist.db < wordlist.txt

Why does bogofilter die after printing "Lock table is out of available locks" or "Lock table is out of available object entries"?

The transactional and concurrent modes of BerkeleyDB require a lock table that corresponds to the database in size. See the README.db file for a detailed explanation and a remedy.

The size of the lock table can be set in bogofilter.cf or in DB_CONFIG. Bogofilter.cf uses the db_lk_max_locks and db_lk_max_objects directives, while DB_CONFIG uses the set_lk_max_objects and set_lk_max_locks directives.

After changing these values in DB_CONFIG, run command

  bogoutil --db-recover /your/bogofilter/directory

to rebuild the lock tables.


Why does bogofilter crash with "File size limit exceeded"?

Some mail transfer agents, such as Postfix, impose file size limits. For instance, when bogofilter is executed by Postfix's local(8), Postfix's mailbox_size_limit is enforced on all files, including bogofilter's database. When the database reaches that limit, bogofilter may crash with SIGXFSZ "File size limit exceeded".

Why am I getting DB_PAGE_NOTFOUND messages?

You have a problem with your BerkeleyDB database. There are two likely causes: either you've hit a max size limit or the database is corrupt.

Some mail transfer agents, such as Postfix, impose file size limits. When bogofilter's database reaches that limit, write problems will occur.

To show the database size use:

    ls -lh $BOGOFILTER_DIR/wordlist.db

To show the postfix setting:

    postconf | grep mailbox_size_limit

To set the limit to 73MB (or whatever size is right for you):

    postconf -e mailbox_size_limit=73000000
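
The comparison between the two numbers can be scripted. The following is only a sketch: the helper name check_db_size and the 90% warning threshold are assumptions made for illustration. A limit of 0 is treated as "no limit", which is how Postfix documents mailbox_size_limit.

```shell
# Compare a database size with a file size limit (both in bytes).
# Usage: check_db_size DB_BYTES LIMIT_BYTES
# Prints "ok", "warn" (at or above 90% of the limit) or "over".
check_db_size() (
    db=$(($1))        # arithmetic expansion also strips stray whitespace
    limit=$(($2))
    if [ "$limit" -eq 0 ]; then
        echo ok       # 0 means no limit is enforced
    elif [ "$db" -ge "$limit" ]; then
        echo over
    elif [ $((db * 10)) -ge $((limit * 9)) ]; then
        echo warn
    else
        echo ok
    fi
)
```

A possible invocation, using the same paths as above:

    check_db_size "$(wc -c < $BOGOFILTER_DIR/wordlist.db)" "$(postconf -h mailbox_size_limit)"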

If you think your database may be corrupt, read the FAQ entry "How can I tell if my wordlists are corrupted?".


Why am I getting "Berkeley DB library configured to support only DB_PRIVATE environments" or
"Berkeley DB library configured to support only private environments"?

Some distributors (for instance the Fedora Project) package Berkeley DB with support for POSIX threading and hence POSIX mutexes, but your system does not support POSIX mutexes (whether it does depends on the kernel version and the exact processor type).

To work around this problem:

  1. download, compile and install Berkeley DB yourself:
    1. cd build_unix
    2. ../dist/configure --enable-cxx
    3. make
    4. make install
  2. recompile and install bogofilter:
    1. ./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.3 (substitute your Berkeley DB version number)
    2. make && make check
    3. make install (if space is at a premium, use make install-strip)

Can bogofilter be used in a multi-user environment?

Yes, it can. There are multiple, distinct strategies for doing this. The two extremes are:

  • Having a bogofilter administrator who maintains a global wordlist that everybody uses.
  • Having each user maintain his/her own wordlist.

As a middle ground, the bogofilter administrator can create and maintain the global wordlists and each user can be given the choice of using the global wordlist or a private wordlist. An MDA, such as procmail, can be programmed to first apply the global wordlist (with a very stringent spam cutoff) and then (if necessary) apply the user's wordlist.


Can I share wordlists over NFS?

If you're just reading from them, there are no problems. When you're updating them, you need to use the correct file locking to avoid data corruption. When you compile bogofilter, you will need to verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. Popular UNIX operating systems will all support this. If you are running an unusual, or an older version of an operating system, make sure it supports fcntl(). If your system does not support fcntl(), then you will not be able to share wordlist files over NFS without the risk of data corruption.
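
The config.h check can be done with grep. This is a sketch only; the helper name has_fcntl is made up for this example, and it assumes config.h is in the directory where you ran configure.

```shell
# Succeed if the given config.h defines HAVE_FCNTL to 1.
# Usage: has_fcntl path/to/config.h
has_fcntl() {
    grep -Eq '^#define[[:space:]]+HAVE_FCNTL[[:space:]]+1' "$1"
}
```

For example, "has_fcntl config.h && echo fcntl locking available" prints the message only when the define is present.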

Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.

For shared directories (NFS directories used by multiple machines, for instance, Sparc/Itanium/Alpha and x86), the architecture-specific parts can be installed separately by giving a different --exec-prefix (it will default to --prefix).


Why does bogofilter give return codes like 0 and 256 when it's run from inside a program?

Likely the return codes are being reformatted by waitpid(2). In C use WEXITSTATUS(status) in sys/wait.h, or comparable macro, to get the correct value. In Perl you can just use 'system("bogofilter $input") >> 8'. If you want more info, run "man waitpid".
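
In a shell script, $? already holds the decoded exit code, so no conversion is needed there. If a raw wait status reaches your script from some other program, the arithmetic behind the Perl idiom can be sketched as follows. The function name is hypothetical, and the sketch assumes the traditional Unix layout in which the exit code sits in bits 8 to 15 of the status word.

```shell
# Extract the exit code from a raw waitpid-style status word.
# Usage: exit_code_from_wait_status STATUS
exit_code_from_wait_status() {
    echo $(( ($1 >> 8) & 255 ))
}
```

With this, a raw status of 256 decodes to exit code 1, and 0 stays 0.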


Now that I've upgraded why are my scripts broken?

Over time bogofilter accumulated a large number of features and options. Some of those were discontinued or changed. Please read the NEWS file for details.


Now that I've upgraded why is bogofilter working less well?

The lexer, i.e., that part of bogofilter which extracts tokens from a message, evolves. This results in different readings of messages with the consequence that some tokens in the database can no longer be used.

If you encounter this problem, you are strongly advised to rebuild your database. If this is not an option for you, you might want to use version 0.15.13 and read the documentation which comes with it for how to migrate your database.


How can I delete all the spam (or non-spam) tokens?

Bogoutil lets you dump a wordlist and load the tokens into a new wordlist. With the added use of awk and grep, counts can be zeroed and tokens with zero counts for both spam and non-spam can be deleted.

The following commands will delete the tokens from spam messages:

    bogoutil -d wordlist.db | \
    awk '{print $1 " " $2 " 0"}' | grep -v " 0 0" | \
    bogoutil -l wordlist.new.db

The following commands will delete the tokens from non-spam messages:

    bogoutil -d wordlist.db | \
    awk '{print $1 " 0 " $3}' | grep -v " 0 0" | \
    bogoutil -l wordlist.new.db
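
The two pipelines differ only in which count column they zero, so the filter step can be written once. The sketch below is an illustration, not part of bogofilter: zero_count is a made-up helper, and it assumes each dump line has the form "token count count date" with whitespace-separated fields. Unlike the commands above, it keeps the fourth field and tests the two counts numerically instead of matching the text " 0 0".

```shell
# Zero one count column of a bogoutil dump read from stdin and drop
# tokens whose two counts are both zero afterwards.
# Usage: zero_count COLUMN   (COLUMN is 2 or 3, as in the awk commands above)
zero_count() {
    awk -v col="$1" '{ $col = 0; if ($2 + $3 > 0) print }'
}
```

It would be used in place of the awk and grep steps, for example:

    bogoutil -d wordlist.db | zero_count 3 | bogoutil -l wordlist.new.db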

How do I get bogofilter working on Solaris, BSD, etc?

If you don't already have a v3.0+ version of BerkeleyDB, then download it (take one of the 4.4.X, 4.3.X or 4.2.X versions), unpack it, and run these commands in the db directory:

    $ cd build_unix
    $ sh ../dist/configure
    $ make
    # make install

Next, download a portable version of bogofilter.

On Solaris

Be sure that your PATH environment variable begins with /usr/xpg6/bin:/usr/xpg4/bin:/usr/ccs/bin (/usr/xpg6/bin is only present on Solaris 10 and can be omitted on Solaris 9 and older versions). That is required for POSIX compliance.

Unpack it, and then do:

    $ ./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.4
    $ make
    # make install-strip

You will either want to put a symlink to libdb.so in /usr/lib, or use a modified LD_LIBRARY_PATH environment variable before you start bogofilter. On newer systems, the most convenient way is probably to use the crle(1) tool to set the path permanently so BerkeleyDB is available to all applications.

    $ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.4
    $ export LD_LIBRARY_PATH

Note that some "make" versions shipped with older Solaris versions break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).

If your Solaris GCC complains with "ld: fatal: file values-Xa.o: open failed: No such file or directory", install the SUNWarc package.

On FreeBSD

The FreeBSD ports collection carries the latest stable versions of bogofilter to be compiled from source. The bogofilter ports are also auto-built and provided as binary packages for you to install.

The binary packages approach uses default installed software. To install bogofilter from binary package, type, as the privileged user:

    pkg install -y bogofilter

The ports from-source approach uses the highly recommended portmaster and portsnap software packages. To install portmaster, type (you need to do this only once), as root:

    pkg install -y portmaster

To install or upgrade bogofilter, just upgrade your ports tree using portsnap, then type, as root:

    portmaster mail/bogofilter

Note: This assumes you are root. If not, read through the remainder of this FreeBSD section and then see how you can build if you haven't got root privileges.

On NetBSD and other systems that use "pkgsrc"

pkgsrc should be offering a reasonably recent stable bogofilter release. See http://www.pkgsrc.org/ for information on pkgsrc.

On HP-UX

See the file doc/programmer/README.hp-ux in the source distribution.


Can I use the make command on my operating system?

Bogofilter has been successfully built on many operating systems using GNU make and the native make commands. However, bogofilter's Makefile doesn't work with some make commands.

GNU make is recommended for building bogofilter because we know it works. We cannot support less capable make commands. If your non-GNU make command can successfully build bogofilter, that's great. If you encounter problems, the right thing to do is install GNU make. If your non-GNU make can't build bogofilter, we're sorry but you're on your own. If it takes just a minor and clean patch to make it compatible, we might take it.


How do I build bogofilter as non-root user or for a non-standard installation prefix?

To install bogofilter to a non-standard path (as a non-root user you don't have write permission for the standard paths), you need to provide the installation prefix when you run ./configure.

After downloading and unpacking the source code, run ./configure --prefix=PATH where PATH is the installation prefix for the generated files (binaries, man pages etc.). Then run the usual build commands: make && make check && make install.


How do I build bogofilter with patches?

If you need to apply patches, get the source code and unpack it using tar -xzf or gunzip | tar -xf - (as appropriate). Change to the source directory and run ./configure --prefix=PATH where PATH is the installation prefix for the generated files (binaries, man pages etc.). Apply your patches, then run make && make install.


How do I make the executables smaller?

When space is tight, you can use make install-strip instead of make install. Doing this will save space, but crashes can't be debugged unless more information on reproducing the bug is provided to the developers.


datastore_db.c does not compile!

If you are configuring a database path, for instance with --with-libdb-prefix or via CPPFLAGS and LIBS, be sure to pass in an absolute path (with a leading slash); a relative path will not work. Example: use --with-libdb-prefix=/usr/local/BerkeleyDB.4.2, but not --with-libdb-prefix=../BerkeleyDB.4.2.
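
If you call configure from a build script, a small guard can catch a relative path before the compile fails. This is a sketch; is_absolute is a hypothetical helper name.

```shell
# Succeed only if the argument is an absolute path (starts with a slash).
# Usage: is_absolute PATH
is_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}
```

For example, "is_absolute /usr/local/BerkeleyDB.4.2" succeeds, while "is_absolute ../BerkeleyDB.4.2" fails.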


With which mail programs does bogofilter work?

Bogofilter is known to work with kmail, mozilla-mail, mutt, alpine, sylpheed-claws. A google search will help you find more information on using bogofilter with the mail program you use.


How do I use bogofilter with mutt?

Use a mail filter (procmail, maildrop, etc.) to filter mail into different folders based on bogofilter's return code and set mutt key bindings to train bogofilter on errors:

    macro index S "|bogofilter -s\ns=junkmail"  "Learn as spam and save to junk"
    macro pager S "|bogofilter -s\ns=junkmail"  "Learn as spam and save to junk"
    macro index H "|bogofilter -n\ns="          "Learn as ham and save"
    macro pager H "|bogofilter -n\ns="          "Learn as ham and save"

These will pipe the selected message through bogofilter, training a false-ham as spam or vice versa, then offer to save the message to a different folder.
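
If you filter with your own shell script instead of procmail or maildrop, the mapping from return code to folder could look like the sketch below. It relies on the exit codes given in the bogofilter man page (0 for spam, 1 for non-spam, 2 for unsure, 3 for errors). The function name and the folder names inbox and unsure are assumptions; junkmail is the folder used in the macros above.

```shell
# Map a bogofilter exit code to a folder name.
# Usage: folder_for_status EXIT_CODE
folder_for_status() {
    case "$1" in
        0) echo junkmail ;;           # spam
        1) echo inbox ;;              # non-spam
        2) echo unsure ;;             # unsure
        *) echo inbox; return 1 ;;    # error: keep the mail, signal failure
    esac
}
```

A delivery script would run bogofilter on the message and then call folder_for_status $? to decide where to store it.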


How do I use bogofilter with Sylpheed Claws?

Add a filtering rule to run bogofilter on incoming messages and an action to perform if it's spam:

    condition:
    * test "bogofilter < %F"
    action:
    * move "#mh/YOUR_SPAM_BOX"

Note: this assumes that bogofilter is in your path!

Create two Claws actions - one for marking messages as spam and one for marking messages as ham. Use the "Mark As Spam" action for messages incorrectly classified as ham and use the "Mark As Ham" action for messages incorrectly classified as spam.

    Mark as ham / spam:
    * bogofilter -n -v -B "%f" (mark ham)
    * bogofilter -s -v -B "%f" (mark spam)

Another approach is to save incorrectly classified messages in a folder (or folders) and run a script like:

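    #!/bin/sh
    # Train on every message saved to the spam folders since the last run.
    CONFIGDIR=~/.bogofilter
    SPAMDIRS="$CONFIGDIR/spamdirs"      # file listing the spam folders
    MARKFILE="$CONFIGDIR/lastbogorun"   # its timestamp records the last run
    for D in `cat "$SPAMDIRS"`; do
        find "$D" -type f -newer "$MARKFILE" -not -name ".sylpheed*"
    # -b: read file names from stdin; -N -s: unregister as ham, register as spam
    done|bogofilter -bNsv
    touch "$MARKFILE"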

This script can be used as an action and/or made into a toolbar button. It will register as spam the messages in ${SPAMDIRS} that are newer than ${MARKFILE}.

Additional information is available at the Sylpheed-Claws wiki.


Another approach is to run bogofilter from procmail, maildrop, etc and have Claws check the X-Bogosity header and filter messages into Spam and Unsure folders, e.g.:

    Condition:
        header "X-Bogosity" matchcase "Spam"
    Action:
        move "#mh/Mailbox/Spam"
    Condition:
        header "X-Bogosity" matchcase "Unsure"
    Action:
        move "#mh/Mailbox/Unsure"
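
The same header test can be done in a script, for instance to check what a message was tagged with. This is a sketch under assumptions: bogosity_class is a made-up name, and the header is assumed to look like "X-Bogosity: Spam, ..." with the classification as the first word of its value.

```shell
# Print the classification word from the X-Bogosity header of a message
# read from stdin. Only the header block (up to the first empty line)
# is inspected; nothing is printed if the header is missing.
bogosity_class() {
    awk '
        /^$/ { exit }
        /^X-Bogosity:/ {
            sub(/^X-Bogosity:[ \t]*/, "")   # drop the header name
            sub(/[^A-Za-z].*/, "")          # keep only the first word
            print
            exit
        }'
}
```

For example, "bogosity_class < message" prints Spam or Unsure for messages that the rules above would move.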

Any messages in the Unsure folder should be used for training, as should messages incorrectly classified as ham or spam. The actions below will handle these cases:

    Register Spam:
        bogofilter -s < "%f"

    Register Ham:
        bogofilter -n < "%f"

    Unregister Spam:
        bogofilter -S < "%f"

    Unregister Ham:
        bogofilter -N < "%f"

To look inside the bogofilter scoring mechanism, the following diagnostics are useful:

    BogoTest -vv:
        bogofilter -vv < "%f"

    BogoTest -vvv:
        bogofilter -vvv < "%f"

Additional information on this approach is available here.


How do I use bogofilter with VM (an Emacs Mail tool)?

You need to put the separate file vm-bogofilter.el in your Emacs load path. The file is included in bogofilter's contrib directory; the latest version is at http://www.cis.upenn.edu/~bjornk/bogofilter/vm-bogofilter.el.

Then, just add in your ~/.vm configuration file:

;; load bogofilter capabilities (spam)
;;
(require 'vm-bogofilter)

;; short-key for bogofilter
;; K (shift-k) means spam message
;; C (shift-c) means clean (ham) message
(define-key vm-mode-map "K" 'vm-bogofilter-is-spam)
(define-key vm-mode-map "C" 'vm-bogofilter-is-clean)

All the messages are filtered by bogofilter each time you check newly arrived e-mail. When you change the status of an e-mail, the bogofilter header is changed (X-Bogosity: header).

There is a limit: you cannot change multiple message headers at one time in VM; you have to do it message by message.


How do I use bogofilter with MH-E (the Emacs interface to the MH mail system)?

The default setting of the 'mh-junk-program' option is 'Auto-detect' which means that MH-E will automatically choose one of SpamAssassin, Bogofilter, or SpamProbe in that order. If, for example, you have both SpamAssassin and Bogofilter installed and you want to use Bogofilter, then you can set this option to 'Bogofilter'.

The 'J b' ('mh-junk-blacklist') command trains the spam program in use with the content of the range and then handles the message(s) as specified by the 'mh-junk-disposition' option. By default, this option is set to 'Delete Spam' but you can also specify the name of the folder which is useful for building a corpus of spam for training purposes.

In contrast, the 'J w' ('mh-junk-whitelist') command reclassifies a range of messages as ham if they were incorrectly classified as spam. It then refiles the messages into the '+inbox' folder.

For more information, see the MH-E home page.




Site designed by
www.nkstudios.net