Bogofilter FAQ
Official versions: English, French, Italian, or Bulgarian
Current maintainer since 2013: Matthias Andree <m-a@users.sf.net>
Previous maintainer of ten years: David Relson <relson@osagesoftware.com>
This document is intended to answer frequently asked
questions about bogofilter.
Typographic conventions
- If we show example commands that start with a
dollar sign ($), this means that these commands should be executed
by an unprivileged user, NOT the root user.
- If we show example commands that start with a hash mark (#),
this means that these commands need to be executed by the root
user.
Frequently asked questions and their answers
- General Information
- Operational Questions
- Database Questions
- Berkeley DB Questions
- Technical problems
- Build and Portability Problems
- Using Bogofilter with different mail programs
What is bogofilter?
Bogofilter is a fast Bayesian spam filter along the lines
suggested by Paul Graham
in his article
A Plan For Spam.
bogofilter uses
Gary Robinson's
geometric-mean algorithm with the
Fisher's method modification
to classify email as spam or non-spam.
The bogofilter
home page at
SourceForge is the central clearinghouse for bogofilter
resources.
Bogofilter was started by
Eric S. Raymond on August 19,
2002. It gained popularity in September 2002, and a number of other
authors have started to contribute to the project.
The NEWS file
describes bogofilter's version history starting with version 1.0.0.
Older news (before release 1.0.0) is in the NEWS.0 file.
Bogo-what?
Bogofilter is some kind of a
bogometer or
bogon filter,
i.e., it tries to identify
bogus
mail by measuring the
bogosity.
How does bogofilter work?
See the man page's
THEORY OF OPERATION
section for an introduction. The main source for understanding
this is Gary Robinson's Linux Journal article
"A Statistical Approach to the Spam Problem".
After you read all this you might ask some questions. The first
could be "Is bogofilter really a Bayesian spam filter?"
Bogofilter is based on Bayes' theorem and uses it in the initial
calculations and other statistical methods later. Without doubt it
is a statistical spam filter with a Bayesian flavor.
Other questions might concern the basic
assumptions of Bayes' theory. Two short answers are: "No, they are
not satisfied" and "We don't care as long as it works". A longer
answer will mention that the basic assumption that "an e-mail is a
random collection of words, each independent of the others" is
violated. There are several places where practice doesn't follow
theory. Some are always present, and some depend on
the way you use bogofilter:
- Words in an e-mail are by no means independent. In all languages,
the opposite is true.
- The words used are not random, though some spammers include random
words.
- Full training, or training with a random sample, follows Bayes'
principles. Choosing which messages to train with violates
the assumption that the training messages are a random sample of
the messages received. This principle is also violated by
bogofilter's auto-update function (with the thresh_update
parameter), by training on error, and by
anything similar to these approaches.
- The same applies if you train with the same message more than once.
- Other problems arise if you modify your database by removing tokens
(like using bogoutil with -a or -c).
- Undoubtedly there are more.
As the man page explains, bogofilter tries to understand how
badly the null hypothesis fails. Some people argue that "those
departures from reality usually work in our favor" (from Gary's
article). Others argue that, even so, we should not violate the
assumptions too much. Nobody really knows. Just keep in mind that
problems might occur if you push too hard. The key to bogofilter's
approach is: What matters most is simply what works in the real
world.
Now that you have been warned, have fun and use bogofilter as
suits you best.
Mailing Lists
There are currently four mailing lists for bogofilter:
List Address | Links | Description
bogofilter-announce@bogofilter.org | [subscribe] [archives: mailman] | An announcement-only list where new versions are announced.
bogofilter@bogofilter.org | [subscribe] [archives: mailman] | A discussion list where any conversation about bogofilter may take place.
bogofilter-dev@bogofilter.org | [subscribe] [archives: mailman] | A list for sharing patches, development, and technical discussions.
bogofilter-cvs@lists.sourceforge.net | [subscribe] [archive] | Mailing list for announcing code changes to the SVN archive. (The CVS name is a leftover from before the migration, kept for our users' convenience.)
The bogofilter-announce list is moderated and is used only for
important announcements (e.g., new versions). It is low traffic.
If you have subscribed to the user's list or the developer's list,
you don't need to subscribe to the announce list. Messages posted
to the announce list are also distributed to the others.
How do I start my bogofilter training?
To classify messages as ham (non-spam) or spam, bogofilter
needs to learn from your mail. To start with, it is best to have
collections (as large as possible) of messages you know
for sure are ham or spam. Errors here will cause problems later,
so try hard ;-)
Warning: Only use your own mail; using other
collections (like a spam collection found on the web) might cause
bogofilter to draw the wrong conclusions. After all, you want it to
understand your mail.
Once you have the spam and ham collections, you have basically
four choices. In all cases it works better if your training base
(the above collections) is bigger, rather than smaller. The
smaller your training collection is, the higher the number of
errors bogofilter will make in production. Let's assume your
collection is two mbox files: ham.mbox and spam.mbox.
Method 1) Full training. Train bogofilter with every message in
both collections, registering spam.mbox as spam (switch -s) and
ham.mbox as non-spam (switch -n).
Note: Bogofilter's contrib directory includes two scripts that
both use a train-on-error technique. This technique scores each
message and adds to the database only those messages that were
scored incorrectly (messages scored as uncertain, ham scored as
spam, or spam scored as ham). The goal is to build a database of
those words needed to correctly classify messages. The
resulting database is smaller than the one built using full
training.
Method 2) Use the script bogominitrain.pl (in the contrib
directory). It checks the messages in the same order as your
mailbox files. You can use the -f option, which repeats the
process until all messages in your training collection are
classified correctly (you can even adjust the level of
certainty). Since the script makes sure the database understands
your training collection "exactly" (with your chosen
precision), it works very well. You can use -o to
create a security margin around your spam_cutoff. Assuming
spam_cutoff=0.6, you might want to score all ham in your
collection below 0.3 and all spam above 0.9. Our example is:
bogominitrain.pl -fnv ~/.bogofilter ham.mbox spam.mbox '-o 0.9,0.3'
Method 3) Use the script randomtrain (in the contrib
directory). The script generates a list of all the messages in the
mailboxes, randomly shuffles the list, and then scores each
message, with training as needed. In our example:
randomtrain -s spam.mbox -n ham.mbox
As with method 4, it works better if you start with full
training using several thousand messages. This will give a
database that is more comprehensive and significantly
bigger.
Method 4) If you have enough spams and non-spams in your
training collection, separate out some 10,000 spams and 10,000
non-spams into separate mbox files, and train as in method 1. Then
use bogofilter to classify the remaining spams and non-spams. Take
any messages that it classifies as unsure or classifies
incorrectly, and train with those. Here are two little scripts you
can use to classify the train-on-error messages:
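#! /bin/sh
# class3 -- classify one message as bad, good or unsure
# bogofilter exit codes used here: 0 = spam, 1 = non-spam, 2 = unsure
msg=msg.$$
# save the message from stdin so it can be scored and then filed
cat >"$msg"
bogofilter "$@" <"$msg"
res=$?
if [ "$res" -eq 0 ]; then
    cat "$msg" >>corpus.bad
elif [ "$res" -eq 1 ]; then
    cat "$msg" >>corpus.good
elif [ "$res" -eq 2 ]; then
    cat "$msg" >>corpus.unsure
fi
rm -f "$msg"
#! /bin/sh
# classify -- put all messages in mbox through class3
# usage: classify MBOX [bogofilter options]
src=$1
shift
# formail -s splits the mbox and runs class3 once per message
formail -s class3 "$@" <"$src"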
In our example (after the initial full training):
classify spam.mbox [bogofilter options]
bogofilter -s < corpus.good
rm -f corpus.*
classify ham.mbox [bogofilter options]
bogofilter -n < corpus.bad
rm -f corpus.*
Comparing these methods
It is important to understand the consequences of the methods
just described. Doing full training as in methods 1 and 4 produces
a larger database than does training with methods 2 or 3. If your
database size needs to be small (for example due to quota
limitations), use methods 2 or 3.
Full training with method 1 is fastest. Training on error (as
in methods 2, 3 and 4) is effective, but the initial training takes
longer.
How do I train using maildirs?
Initial training from mbox:
bogofilter -M -s -I ~/mail/Spam
bogofilter -M -n -I ~/mail/NonSpam
Initial training from maildir:
bogofilter -s -B ~/Maildir/.Spam
bogofilter -n -B ~/Maildir/.NonSpam
Corrective training from mbox:
bogofilter -M -Ns -I ~/mail/Missed_Spam
bogofilter -M -Sn -I ~/mail/False_Spam
Corrective training from maildir:
bogofilter -s -B ~/Maildir/.Missed_Spam
bogofilter -n -B ~/Maildir/.False_Spam
How can I keep the scoring accuracy high?
Bogofilter will make mistakes once in a while. So ongoing
training is important. There are two main methodologies for doing this.
First, you can train with every incoming message (using the -u
option). Second, you can train on error only.
Since you might want to rebuild your database at some point,
for example when a major new feature is implemented in bogofilter,
it can be very useful to update your training collection
continuously.
Bogofilter always does the best it can with the information
available to it. However, it will make mistakes, i.e., classify
ham as spam (false positives) or spam as ham (false negatives). To
reduce the likelihood of repeating the mistake, it is necessary to
train bogofilter with the errant message. If a message is
incorrectly classified as spam, use switch -n
to
train with it as ham. Use switch -s
to train with a
spam message.
Bogofilter has a -u
switch that automatically
updates the wordlists after scoring each message. As bogofilter
sometimes misclassifies a message, monitoring is necessary to
correct any mistakes. Corrections can be done using
-Sn
to change a message's classification from spam to
non-spam and -Ns
to change it from non-spam to spam.
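The two correction switches can be wrapped in a small helper. This is a minimal sketch, assuming each message is stored in its own file and was previously registered with the wrong class (as happens with -u); the function name correct_message is made up for illustration.

```shell
#!/bin/sh
# correct_message -- hypothetical helper: re-register one misclassified message.
# Usage: correct_message spam|ham FILE   (first argument is the TRUE class)
correct_message() {
    true_class=$1
    file=$2
    case $true_class in
        spam) bogofilter -Ns < "$file" ;;   # registered as non-spam, really spam
        ham)  bogofilter -Sn < "$file" ;;   # registered as spam, really non-spam
        *)    echo "usage: correct_message spam|ham FILE" >&2
              return 2 ;;
    esac
}
```

For example, "correct_message ham saved-message.txt" undoes a false positive. If the message was never registered in the first place, plain -s or -n is the switch to use instead.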
Correcting a misclassified message may affect the classification of
other messages. The smaller your database is, the higher the
likelihood that a training error will cause a misclassification.
Using a method like #2 or #3 (above) can compensate for this
effect. Repeat the training with your complete training
collection (including all the new messages added since the earlier
training). This adds the messages affected by that adverse effect,
on both the spam and the ham side, until a new equilibrium is reached.
An alternative strategy, based on method 4 in the previous
section, is the following: Periodically take blocks of messages
and use the scripts in method 4 above to classify them. Then
manually review the good, bad and unsure files, correct any
errors, and split the unsures into spam and non-spam. Until you
have accumulated some 10,000 spam and 10,000 non-spam in your
training database, train with the good, the bad, and the separated
errors and unsures; thereafter, train with only the separated errors
and unsures, discarding the messages that bogofilter already
classifies correctly.
Bogofilter understands the traditional Unix mbox format and the
Maildir and MH formats. Note, though, that bogofilter does not descend
into subfolders; in MH or Maildir++ folders you have to list them
explicitly by giving the full path to each subfolder.
For unsupported formats, you will have to convert the mailbox to
a format bogofilter understands. Mbox is often convenient because it can
be piped into bogofilter.
For example, to convert UW-IMAP/PINE mbx format to mbox:
mailtool copy /full/path/to/mail.mbox '#driver.unix//full/path/to/mbox'
or, to convert a maildir to mbox:
for MSG in /full/path/to/maildir/* ; do
formail -I Status: < "$MSG" >> /full/path/to/mbox
done
What does bogofilter's verbose output mean?
Bogofilter can be instructed to display information on the
scoring of a message by running it with flags "-v", "-vv",
"-vvv", or "-R".
-
Using "-v" causes bogofilter to generate the "X-Bogosity:"
header line, e.g.
X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
-
Using "-vv" causes bogofilter to generate a histogram, e.g.
X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
int cnt prob spamicity histogram
0.00 29 0.000209 0.000052 #############################
0.10 2 0.179065 0.003425 ##
0.20 2 0.276880 0.008870 ##
0.30 18 0.363295 0.069245 ##################
0.40 0 0.000000 0.069245
0.50 0 0.000000 0.069245
0.60 37 0.667823 0.257307 #####################################
0.70 5 0.767436 0.278892 #####
0.80 13 0.836789 0.334980 #############
0.90 32 0.984903 0.499835 ################################
Each row shows an interval, the count of tokens with
scores in that interval, the average spam probability for
those tokens, the message's spamicity score (for those
tokens and all lesser valued tokens), and a bar graph
corresponding to the token count.
In the above histogram there are a lot of low scoring
tokens and a lot of high scoring tokens. They "balance" one
another to give the spamicity score of 0.5000.
-
Using "-vvv" produces a list of all the tokens in
the message with information on each one, e.g.
X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
n pgood pbad fw U
"which" 10 0.208333 0.000000 0.000041 +
"own" 7 0.145833 0.000000 0.000059 +
"having" 6 0.125000 0.000000 0.000069 +
...
"unsubscribe.asp" 2 0.000000 0.095238 0.999708 +
"million" 4 0.000000 0.190476 0.999854 +
"copy" 5 0.000000 0.238095 0.999883 +
N_P_Q_S_s_x_md 138 0.00e+00 0.00e+00 5.00e-01
1.00e-03 4.15e-01 0.100
The columns printed contain the following information:
- "…": the token in question
- n: number of times this token was encountered in training
- pgood: proportion of good messages that contained this token
- pbad: proportion of spam messages that contained this token
- fw: Robinson's weighted index, which combines pgood and
pbad to give a value that will be close to zero if a
message containing this token is likely to be non-spam
and close to one if it's likely to be spam
- U: '+' if this token contributes to the final
bogosity value, '-' otherwise. A token is excluded
when its score is closer to 0.5 than min_dev.
The final lines show:
- The cumulative results of the columns
- The values of Robinson's s and x
parameters and of min_dev
-
Using "-R" produces the "-vvv" output described above plus
two additional columns:
- invfwlog: logarithm of fw
- fwlog: logarithm of (1-fw)
The "-R" output is formatted for use with the R language
for statistical computing. More information is available at
The R Project for
Statistical Computing.
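When the "X-Bogosity:" line produced by "-v" is used in a script, it helps to split it into its classification and score. The following is a minimal sketch that assumes the header has the layout shown in the examples above; the function name parse_bogosity is made up for illustration.

```shell
#!/bin/sh
# parse_bogosity -- print "CLASS SPAMICITY" for one X-Bogosity header line.
# Prints nothing if the argument does not look like such a header.
parse_bogosity() {
    printf '%s\n' "$1" |
        sed -n 's/^X-Bogosity: \([A-Za-z]*\),.*spamicity=\([0-9.]*\).*$/\1 \2/p'
}
```

Called as parse_bogosity 'X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000' it prints "Ham 0.500000". The header could be captured first, for instance with header=$(bogofilter -v < message.txt), assuming your bogofilter version prints only that line.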
What is Unsure mode?
Bogofilter's default configuration will classify a message as
spam or non-spam. The SPAM_CUTOFF parameter is used for this.
Messages with scores greater than or equal to SPAM_CUTOFF are
classified as spam. Other messages are classified as ham.
There is also a HAM_CUTOFF parameter. When used, messages must
have scores less than or equal to HAM_CUTOFF to be classified as
ham. Messages with scores between HAM_CUTOFF and SPAM_CUTOFF are
classified as unsure. If you look in bogofilter.cf, you will see
the following lines:
#### CUTOFF Values
#
# both ham_cutoff and spam_cutoff are allowed.
# setting ham_cutoff to a non-zero value will
# enable tri-state results (Spam/Ham/Unsure).
#
#ham_cutoff = 0.45
#spam_cutoff = 0.99
#
# for two-state classification:
#
## ham_cutoff = 0.00
## spam_cutoff= 0.99
To turn on Spam/Ham/Unsure classification, remove the # from the
"ham_cutoff = 0.45" and "spam_cutoff = 0.99" lines.
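The cutoff rules can be written out as a small function. This is only a sketch of the rules as described in this answer, not a copy of bogofilter's own code; the function name classify_score is made up, and a ham cutoff of 0 is taken to mean two-state mode, as the config file comment says.

```shell
#!/bin/sh
# classify_score SCORE HAM_CUTOFF SPAM_CUTOFF -- print Spam, Ham or Unsure.
classify_score() {
    awk -v s="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (s + 0 >= spam + 0) {
            print "Spam"                      # at or above spam_cutoff
        } else if (ham + 0 == 0 || s + 0 <= ham + 0) {
            print "Ham"                       # two-state mode, or at/below ham_cutoff
        } else {
            print "Unsure"                    # between the two cutoffs
        }
    }'
}
```

With the tri-state values from the config excerpt, "classify_score 0.5 0.45 0.99" prints Unsure, while with the two-state values "classify_score 0.5 0.00 0.99" prints Ham.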
Alternatively, if you'd rather use labels Yes/No/Unsure
instead of Spam/Ham/Unsure, remove the #'s from the following
bogofilter.cf line:
## spamicity_tags = Yes, No, Unsure
Once that's done, you may want to set the filtering rules for your mail
program to include rules like:
if header contains "X-Bogosity: Spam", put in Spam folder
if header contains "X-Bogosity: Unsure", put in Unsure folder
Alternatively, bogofilter.cf has directives for modifying the
Subject: line, e.g.
#### SPAM_SUBJECT_TAG
#
# tag added to "Subject: " line for identifying spam or unsure
# default is to add nothing.
#
##spam_subject_tag=***SPAM***
##unsure_subject_tag=???UNSURE???
With these subject tags, the filtering rules would look like:
if subject contains "***SPAM***", put in Spam folder
if subject contains "???UNSURE???", put in Unsure folder
What are "training on error" and "training to exhaustion"?
"Training on error" involves scanning a corpus of known spam
and non-spam messages; only those that are misclassified, or
classed as unsure, get registered in the training database. It's
been found that sampling just messages prone to misclassification
is an effective way to train; if you train bogofilter on the hard
messages, it learns to handle obvious spam and non-spam too.
This method can be enhanced by using a "security margin". By
increasing the spam cutoff value and decreasing the ham cutoff
value, messages which are close to a cutoff will be used for
training. Using security margins improves results when training
on error. In general, greater margins help more (although too
much also isn't optimal). As a rule of thumb spam cutoff +/- 0.3 gives good
results. For tristate mode, you might try the middle of the unsure
interval +/- 0.3 for training.
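The rule of thumb is simple arithmetic, sketched below. The function name training_margins is made up, and clamping the results to the range 0 to 1 is an assumption added here, not something bogofilter requires.

```shell
#!/bin/sh
# training_margins CUTOFF MARGIN -- print "UPPER,LOWER" training thresholds.
# Spam should score above UPPER and ham below LOWER during training.
training_margins() {
    awk -v c="$1" -v m="$2" 'BEGIN {
        hi = c + m; if (hi > 1) hi = 1    # keep within the valid score range
        lo = c - m; if (lo < 0) lo = 0
        printf "%g,%g\n", hi, lo
    }'
}
```

For spam_cutoff=0.6 and a margin of 0.3, "training_margins 0.6 0.3" prints "0.9,0.3", the same pair used with the -o option in the bogominitrain.pl example earlier in this FAQ.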
Repeating training on error on the same message corpus can
improve accuracy. The idea is that messages which were rated
correctly in the first place might after some more training be
rated wrongly which will then be corrected.
"Training to exhaustion" is repeating training on error, with
the same message corpus, until no errors remain. This method
can also be improved with security margins. See
Gary Robinson's Rants
on this topic for more details.
Note: bogominitrain.pl has a -f option
to do "training to exhaustion". Using -fn avoids
repeated training for each message.
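The loop behind "training to exhaustion" can be sketched in plain shell. This is an illustration only, not how bogominitrain.pl is implemented. It assumes one message per file in two directories (a made-up layout), uses the exit codes from the class3 script earlier in this FAQ (0 = spam, 1 = non-spam, 2 = unsure), and treats any other exit code as a failure. The function names are hypothetical.

```shell
#!/bin/sh
# train_dir DIR FLAG EXPECTED -- register with FLAG (-s or -n) every message
# in DIR whose bogofilter exit code differs from EXPECTED; print the count.
train_dir() {
    count=0
    for msg in "$1"/*; do
        [ -f "$msg" ] || continue
        rc=0
        bogofilter < "$msg" > /dev/null || rc=$?
        [ "$rc" -le 2 ] || return 1          # not 0, 1 or 2: give up
        if [ "$rc" -ne "$3" ]; then          # misclassified or unsure
            bogofilter "$2" < "$msg" || return 1
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# train_to_exhaustion SPAM_DIR HAM_DIR [MAX_PASSES]
# Repeats train-on-error passes until a pass registers nothing.
# Returns 0 on success, 1 if MAX_PASSES was reached, 2 on a bogofilter failure.
train_to_exhaustion() {
    max=${3:-10}
    pass=1
    while [ "$pass" -le "$max" ]; do
        s=$(train_dir "$1" -s 0) || return 2
        h=$(train_dir "$2" -n 1) || return 2
        if [ $((s + h)) -eq 0 ]; then
            return 0                         # every message scored correctly
        fi
        pass=$((pass + 1))
    done
    return 1
}
```

A call such as "train_to_exhaustion ~/corpus/spam ~/corpus/ham 20" stops as soon as one complete pass needs no training. The pass limit is a safety net for corpora that never converge.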
What does the '-u' (autoupdate) switch do?
The "-u" switch (autoupdate) is used to automatically expand the
wordlist. When this switch is used and bogofilter classifies a message
as Spam or Ham, the message's tokens are added to the wordlist with a
ham/spam tag (as appropriate).
As an example, suppose a new "Refinance now - best Mortgage rates"
message comes in. It will have some words that bogofilter has seen and
(probably) some new ones as well. Using '-u' the new words will be
added to the wordlist so that bogofilter can better recognize the next,
related message.
If/when you choose to use '-u', you need to be on the lookout for
classification errors and retrain bogofilter with any messages that have
been classified incorrectly. An incorrectly classified message that is
auto-updated _may_ cause bogofilter to make additional classification
errors in the future. This is the same problem as when you (the sys
admin) incorrectly register a ham message as spam (or vice versa).
How can I use SpamAssassin to train Bogofilter?
If you have a working SpamAssassin installation (or care to
create one), you can use its return codes to train bogofilter.
The easiest way is to create a script for your MDA that runs
SpamAssassin, tests the spam/non-spam return code, and runs
bogofilter to register the message as spam (or non-spam). The
sample procmail recipe below shows one way to do this:
BOGOFILTER = "/usr/bin/bogofilter"
BOGOFILTER_DIR = "training"
SPAMASSASSIN = "/usr/bin/spamassassin"
:0 HBc
* ? $SPAMASSASSIN -e
#spam yields non-zero
#non-spam yields zero
| $BOGOFILTER -n -d $BOGOFILTER_DIR
#else (E)
:0Ec
| $BOGOFILTER -s -d $BOGOFILTER_DIR
:0fw
| $BOGOFILTER -p -e
:0:
* ^X-Bogosity:.Spam
spam
:0:
* ^X-Bogosity:.Ham
non-spam
What can I do about Asian spam?
Many people get unsolicited email using Asian language
charsets. Since they don't know the languages and don't know
people there, they assume it's spam.
The good news is that bogofilter does detect them quite
successfully. The bad news is that this can be expensive. You
have basically two choices:
-
You can simply let bogofilter handle it. Just train
bogofilter with the Asian language messages identified as
spam. Bogofilter will parse the messages as best it can and
will add tokens to the spam wordlist. The wordlist will
contain many tokens which don't make sense to you (since
the charset cannot be displayed), but bogofilter can work
with them and successfully identify Asian spam.
A second method is to use the
"replace_nonascii_characters" config file option. This
replaces high-bit characters, i.e. those between 0x80 and
0xFF, with question marks, '?'. This keeps the database
much smaller. Unfortunately this conflicts with European
languages, which have many accented vowels and consonants in
the high-bit range.
-
If you are sure you will not receive any legitimate
messages in those languages, you can kill them right away.
This will keep the database smaller. You can do this with
an MDA script.
Here's a procmail recipe that will sideline messages
written with Asian charsets:
## Silently drop all Asian language mail
UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
:0:
* 1^0 $ ^Subject:.*=\?($UNREADABLE)
* 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
spam-unreadable
:0:
* ^Content-Type:.*multipart
* B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
spam-unreadable
With the above recipe, bogofilter will never
see the message.
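If your MDA script is written in shell rather than procmail, a comparable header test can be sketched with grep. This is a simplified stand-in for the recipe above, not an exact translation: it only looks at Subject: and Content-Type: lines given on stdin, ignores folded header lines and MIME parts in the body, and the function name has_unreadable_charset is made up.

```shell
#!/bin/sh
# Charset names taken from the procmail recipe above (matched case-insensitively).
UNREADABLE='big5|iso-2022-jp|iso-2022-kr|euc-kr|gb2312|ks_c_5601-1987'

# has_unreadable_charset -- succeed if the header lines on stdin name one of
# the charsets in $UNREADABLE.
has_unreadable_charset() {
    grep -Eiq "^(Subject:.*=\?($UNREADABLE)\?|Content-Type:.*charset=\"?($UNREADABLE))"
}
```

A script might feed it just the header block, for example with sed '/^$/q' message.txt | has_unreadable_charset, and file the message away when the test succeeds.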
How can I compact my database?
You can periodically compact the database so it occupies a
minimum of disk space. Assuming your wordlist is in directory
~/.bogofilter, for bogofilter 0.93.0 (or newer) use:
bf_compact ~/.bogofilter wordlist.db
For bogofilter older than 0.93.0, use:
cd ~/.bogofilter
bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
mv wordlist.db wordlist.db.prv
mv wordlist.db.new wordlist.db
The bf_compact script is needed to duplicate your database environment (in
order to support BerkeleyDB transaction processing). Your
original directory will be renamed to ~/.bogofilter.old and
~/.bogofilter will contain the new database environment.
Since older versions of bogofilter don't use Berkeley DB
transactions, the database is just a single file (wordlist.db) and
it isn't necessary to use the script. The commands shown above
create a new compact database and rename the original file to
wordlist.db.prv.
Note: it's O.K. to use the script with old versions of
bogofilter.
How do I manually query the database?
To find the spam and ham counts for a token (word) use
bogoutil's '-w' option. For example, "bogoutil -w
$BOGOFILTER_DIR/wordlist.db example.com" gives the good and bad
counts for "example.com".
If you want the spam score in addition to the spam and ham
counts for a token (word) use bogoutil's '-p' option. For example,
"bogoutil -p $BOGOFILTER_DIR/wordlist.db example.com" gives the
good and bad counts and the spam score for "example.com".
To find out how many messages are in your wordlists query the
special token .MSG_COUNT, i.e., run command "bogoutil -w
$BOGOFILTER_DIR/wordlist.db .MSG_COUNT" to see the counts for the
spam and ham wordlists.
To tell how many tokens are in your wordlists pipe the output
of bogoutil's dump command to command "wc", i.e. use "bogoutil -d
$BOGOFILTER_DIR/wordlist.db | wc -l " to display the count.
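The message and token queries can be combined into one helper. This is a minimal sketch using only the bogoutil options shown above; the function name wordlist_summary is made up, and the token count is simply the number of lines in the dump, so it includes any special entries the dump contains.

```shell
#!/bin/sh
# wordlist_summary WORDLIST -- show message counts and the number of tokens.
wordlist_summary() {
    wordlist=$1
    echo "Message counts (.MSG_COUNT):"
    bogoutil -w "$wordlist" .MSG_COUNT
    # tr strips the padding some wc implementations add
    echo "Tokens: $(bogoutil -d "$wordlist" | wc -l | tr -d ' ')"
}
```

It would be called as wordlist_summary "$BOGOFILTER_DIR/wordlist.db".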
Can I use multiple wordlists?
Yes. Bogofilter can be run with multiple wordlists. For
example, if you have both user and system
wordlists, bogofilter can be instructed to
check the user list and, if the word isn't there, then check the
system list. Alternatively, it can be instructed to add together
the information from the two lists.
Following are the config file options and some examples:
A wordlist has several attributes, notably type, name,
filename, and precedence.
- Type: 'R' and 'I' (for "regular" and "ignore"). Current
wordlists are of type 'R'. Type 'I' means "don't score the token
if found in the ignore list".
- Name: a short identifying symbol used when printing error
messages. Examples are "global", "user", and "ignore", but you
can use any identifier you want.
- Filename: the name (path) of the file. When opening the
wordlist, if the name is fully qualified (with a leading '/'
or '~'), that name is used. If the name isn't fully qualified,
bogofilter will prepend the directory, using the usual search
order, i.e. $BOGOFILTER_DIR, $BOGODIR, $HOME.
- Precedence: an integer like 1, 2, 3, ... Wordlists are
searched in ascending order for the token. If the search token
is found, lists with the same precedence number will be checked
(and counts added together). Lists with higher precedence
numbers will not be checked.
Example 1 - merge user and system lists:
wordlist R,user,~/wordlist.db,1
wordlist R,system,/var/spool/bogofilter/wordlist.db,1
Example 2 - prefer user to system list:
wordlist R,user,~/wordlist.db,2
wordlist R,system,/var/spool/bogofilter/wordlist.db,3
Example 3 - prefer system to user list:
wordlist R,user,~/wordlist.db,5
wordlist R,system,/var/spool/bogofilter/wordlist.db,4
Note 1: bogofilter's registration flags ('-s', '-n', '-u',
'-S', '-N' ) will apply to the lowest numbered list.
Note 2: having lists of types 'R' and 'I' of the same
precedence won't be allowed because the types are
contradictory.
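The precedence rule can be illustrated with a toy model. This sketch does not read real wordlists: it takes made-up text lines of the form "PRECEDENCE TOKEN SPAMCOUNT HAMCOUNT" on stdin, one per list entry, and applies the search order described above. The function name lookup_token is hypothetical.

```shell
#!/bin/sh
# lookup_token TOKEN -- toy model of the precedence search.
# Reads lines "PRECEDENCE TOKEN SPAM HAM" on stdin and prints "SPAM HAM"
# summed over the lowest precedence level that contains TOKEN.
lookup_token() {
    sort -n | awk -v tok="$1" '
        # rows arrive in ascending precedence; once the token is found,
        # only rows with the same precedence are added
        $2 == tok && (!found || $1 == prec) {
            found = 1; prec = $1; spam += $3; ham += $4
        }
        END { if (found) print spam, ham; else print "not found" }'
}
```

With two lists at precedence 1 (as in Example 1), the input lines "1 free 3 4" and "1 free 5 6" give "8 10". With precedences 2 and 3 (as in Example 2), only the entry at precedence 2 is used.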
Can I tell bogofilter to ignore certain tokens?
Through the use of an ignore list, bogofilter will ignore the
listed tokens when scoring the message.
Example:
wordlist I,ignore,~/ignorelist.db,7
wordlist R,system,/var/spool/bogofilter/wordlist.db,8
Because ignorelist.db has a lower index (7) than
wordlist.db (8), bogofilter will stop looking when it
finds a token in ignorelist.db.
Note: Technically, bogofilter gives a score of ROBX to the
tokens and expects the min_dev parameter to drop them from the
scoring.
There are two main methods for building/maintaining an ignore list.
First, a text file can be created and maintained using any text
editor. Bogoutil can convert the text file to database format,
e.g. "bogoutil -l ignorelist.db < ignorelist.txt".
Alternatively, echo ... | bogoutil ... can be used
to add a single token, for example "ignore.me", as in:
echo ignore.me | bogoutil -l ~/ignorelist.db
How do I upgrade from separate word databases to
the combined wordlist format?
Run script bogoupgrade. For more info, run "bogoupgrade -h" to
see its help message or run "man bogoupgrade" and read its man
page.
How can I tell if my wordlists are corrupted?
NOTE: some distributors rename all the db_
utilities given below by inserting or appending the version number,
with or without dot, for instance db4.1_verify or db_verify-4.2.
There is no standard on the renaming of these utilities.
If you think your wordlists are hosed, you can see what
BerkeleyDB thinks by running:
db_verify wordlist.db
You may be able to recover some (or all) of the tokens and
their counts with the following commands:
bogoutil -d wordlist.db | bogoutil -l wordlist.new.db
or - if there has been more damage to the token list - with
db_dump -r wordlist.db > wordlist.txt
db_load wordlist.new.db < wordlist.txt
You can also use a text file instead of a pipe, as in:
bogoutil -d wordlist.db > wordlist.txt
bogoutil -l wordlist.db.new < wordlist.txt
How can I convert my wordlist to/from unicode?
Wordlists can be converted from raw storage to unicode using:
bogoutil -d wordlist.db > wordlist.raw.txt
iconv -f iso-8859-1 -t utf-8 < wordlist.raw.txt > wordlist.utf8.txt
bogoutil -l wordlist.db.new < wordlist.utf8.txt
or:
bogoutil --unicode=yes -m wordlist.db
Wordlists can be converted from unicode to raw storage using:
bogoutil -d wordlist.db > wordlist.utf8.txt
iconv -f utf-8 -t iso-8859-1 < wordlist.utf8.txt > wordlist.raw.txt
bogoutil -l wordlist.db.new < wordlist.raw.txt
or:
bogoutil --unicode=no -m wordlist.db
The above methods work best when the wordlist is based on the
iso-8859-1 charset. If your wordlist is based on a different
charset, for example CP866 or KOI8-R, use that charset in the
above commands.
For a wordlist containing tokens from multiple languages,
particularly non-european languages, the conversion methods
described above may not work well. Building a new wordlist (from
scratch) will likely work better as the new wordlist will be based
solely on unicode.
How can I switch from non-transaction
to transaction mode?
How to do this is fully documented in file doc/README.db section
2.2.1. We suggest you read the whole section.
In brief, use these commands:
cd ~/.bogofilter
bogoutil -d wordlist.db > wordlist.txt
mv wordlist.db wordlist.db.old
bogoutil --db-transaction=yes -l wordlist.db < wordlist.txt
If everything went well, you can remove the backup files:
rm wordlist.db.old wordlist.txt
How can I switch from transaction to
non-transaction mode?
How to do this is fully documented in file doc/README.db section
2.2.2. We suggest you read the whole section.
In brief, you can use bogoutil to dump/load the wordlist, for example:
cd ~/.bogofilter
bogoutil -d wordlist.db > wordlist.txt
mv wordlist.db wordlist.db.old
rm -f log.?????????? __db.???
bogoutil --db-transaction=no -l wordlist.db < wordlist.txt
Why does bogofilter die after printing
"Lock table is out of available locks" or
"Lock table is out of available object entries"
The transactional and concurrent modes of BerkeleyDB require a
lock table that corresponds to the database in size. See the
README.db file for a detailed explanation and a
remedy.
The size of the lock table can be set in bogofilter.cf or in
DB_CONFIG. Bogofilter.cf uses the db_lk_max_locks and
db_lk_max_objects directives, while DB_CONFIG uses the
set_lk_max_objects and set_lk_max_locks directives.
After changing these values in DB_CONFIG, run command
bogoutil --db-recover /your/bogofilter/directory
to rebuild the lock tables.
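Editing DB_CONFIG by hand works fine; for repeated use, the change can be scripted. The following is a minimal sketch that writes the two directives named above into DB_CONFIG in the bogofilter directory while keeping any other settings. The function name set_lock_limits is made up, and the numbers in the usage example are placeholders, not recommendations; README.db explains how to size them.

```shell
#!/bin/sh
# set_lock_limits DIR LOCKS OBJECTS -- write lock table sizes to DIR/DB_CONFIG,
# replacing earlier set_lk_max_locks/set_lk_max_objects lines.
set_lock_limits() {
    cfg=$1/DB_CONFIG
    tmp=$cfg.tmp.$$
    {
        if [ -f "$cfg" ]; then
            # keep every existing line except the two being replaced
            grep -v -E '^set_lk_max_(locks|objects)[[:space:]]' "$cfg" || true
        fi
        echo "set_lk_max_locks $2"
        echo "set_lk_max_objects $3"
    } > "$tmp" && mv "$tmp" "$cfg"
}
```

A typical sequence would be "set_lock_limits ~/.bogofilter 32768 32768" followed by "bogoutil --db-recover ~/.bogofilter".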
Why does bogofilter crash with "File size limit exceeded"?
Some mail transfer agents, such as Postfix, impose file size
limits. For instance, when bogofilter is executed by Postfix's
local(8), Postfix's mailbox_size_limit is enforced on all files,
including bogofilter's database. When the database reaches that
limit, bogofilter may crash with SIGXFSZ ("File size limit exceeded").
Why am I getting DB_PAGE_NOTFOUND messages?
You have a problem with your BerkeleyDB database. There are
two likely causes: either you've hit a max size limit or the
database is corrupt.
Some mail transfer agents, such as Postfix, impose file size
limits. When bogofilter's database reaches that limit, write
problems will occur.
To show the database size use:
ls -lh $BOGOFILTER_DIR/wordlist.db
To show the postfix setting:
postconf | grep mailbox_size_limit
To set the limit to 73MB (or whatever size is right for you):
postconf -e mailbox_size_limit=73000000
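To notice the problem before writes start failing, the database size can be compared with the limit from a periodic job. This is a minimal sketch; the function name db_near_limit and the 90 percent default are made up for illustration, and a limit of 0 is treated as "no limit", which is an assumption about how your MTA interprets the setting.

```shell
#!/bin/sh
# db_near_limit FILE LIMIT_BYTES [PERCENT]
# Succeeds if FILE has reached PERCENT (default 90) of LIMIT_BYTES.
db_near_limit() {
    [ "$2" -gt 0 ] || return 1               # 0 taken to mean "no limit"
    size=$(wc -c < "$1" | tr -d ' ')
    [ $((size * 100)) -ge $(( $2 * ${3:-90} )) ]
}
```

For example: db_near_limit "$BOGOFILTER_DIR/wordlist.db" 73000000 && echo "wordlist.db is close to the size limit".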
If you think your database may be corrupt, read the
"How can I tell if my wordlists are corrupted?"
FAQ entry.
Why am I getting "Berkeley DB
library configured to support only DB_PRIVATE
environments" or
"Berkeley DB library configured to support only
private environments"?
Some distributors (for instance the Fedora Project) package
Berkeley DB with support for POSIX threading and hence POSIX
mutexes, but your system does not support POSIX mutexes
(whether it
does, depends on the kernel version and exact processor
type).
To work around this problem:
- Download, compile, and install Berkeley DB on your own:
  - cd build_unix
  - ../dist/configure --enable-cxx
  - make
  - make install
- Then reconfigure, recompile, and install bogofilter:
  - ./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.3
    (substitute your Berkeley DB version number)
  - make && make check
  - make install (if space is at a premium, use make install-strip)
Can bogofilter be used in a multi-user environment?
Yes, it can. There are multiple, distinct strategies for doing
this. The two extremes are:
- Having a bogofilter administrator who maintains a global
wordlist that everybody uses.
- Having each user maintain his/her own wordlist.
As a middle ground, the bogofilter administrator can create and
maintain the global wordlists and each user can be given the
choice of using the global wordlist or a private wordlist. An
MDA, such as procmail, can be programmed to first apply the global
wordlist (with a very stringent spam cutoff) and then (if
necessary) apply the user's wordlist.
Can I share wordlists over NFS?
If you're just reading from them, there are no problems.
When you're updating them, you need to use the correct file
locking to avoid data corruption. When you compile bogofilter, you
will need to verify that the configure script has set "#define
HAVE_FCNTL 1" in your config.h file. Popular UNIX operating
systems all support this. If you are running an unusual operating
system, or an older version of one, make sure it supports
fcntl(). If your system does not
support fcntl(), then you will not be able to share wordlist
files over NFS without the risk of data corruption.
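The config.h check described above can be done with grep. This is a minimal sketch under the assumption that config.h sits in the directory where you ran configure; the function name has_fcntl is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given config.h contains an active
# "#define HAVE_FCNTL 1" line (optional whitespace around '#' is allowed).
has_fcntl() {
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+HAVE_FCNTL[[:space:]]+1([[:space:]]|$)' "$1"
}

# Example use, from the build directory:
#   has_fcntl config.h && echo "fcntl() locking is enabled"
```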
Next, make sure you have NFS set up properly, with "lockd"
running. Refer to your NFS documentation for more information
about running "lockd" or "rpc.lockd". Most operating systems
with NFS turn this on by default.
For shared directories (NFS directories used by multiple
machines, for instance, Sparc/Itanium/Alpha and x86), the
architecture-specific parts can be installed separately by giving
a different --exec-prefix (it defaults to --prefix).
Why does bogofilter give return codes
like 0 and 256 when it's run from inside a program?
Likely the return codes are being reformatted by waitpid(2).
In C, use the WEXITSTATUS(status) macro from sys/wait.h, or a comparable
macro, to get the correct value. In Perl you can just use
'system("bogofilter $input") >> 8'. If you want more info, run
"man waitpid".
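To illustrate the arithmetic, here is a small sketch with made-up function names. It assumes the common Unix encoding in which a normal exit stores the exit code in bits 8 to 15 of the raw wait status, and it uses the exit codes documented in the bogofilter man page (0 for spam, 1 for non-spam, 2 for unsure, 3 for errors).

```shell
#!/bin/sh
# Extract the exit code from a raw wait status, assuming the common Unix
# encoding (exit code in bits 8-15). In C, WEXITSTATUS() is the portable way.
exit_code_from_wait_status() {
    echo $(( ($1 >> 8) & 255 ))
}

# Map a bogofilter exit code to a label.
describe_exit_code() {
    case $1 in
        0) echo spam ;;
        1) echo ham ;;
        2) echo unsure ;;
        *) echo error ;;
    esac
}

describe_exit_code "$(exit_code_from_wait_status 256)"   # prints "ham"
describe_exit_code "$(exit_code_from_wait_status 0)"     # prints "spam"
```

In a shell script no shifting is needed, because $? already holds the decoded exit code.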
Now that I've upgraded why are
my scripts broken?
Over time bogofilter accumulated a large number of functions.
Some of those were discontinued or changed. Please read the
NEWS file
for details.
Now that I've upgraded why is
bogofilter working less well?
The lexer, i.e., that part of bogofilter which extracts tokens
from a message, evolves. This results in different readings of messages
with the consequence that some tokens in the database can no longer be
used.
If you encounter this problem, you are strongly advised to rebuild your
database. If this is not an option for you, you might want to use version
0.15.13
and read the documentation which comes with it for how to migrate your
database.
How can I
delete all the spam (or non-spam) tokens?
Bogoutil lets you dump a wordlist and load the tokens into a
new wordlist. With the added use of awk and grep, counts can be
zeroed and tokens with zero counts for both spam and non-spam can be
deleted.
The following commands will delete the tokens from spam messages:
bogoutil -d wordlist.db | \
awk '{print $1 " " $2 " 0"}' | grep -v " 0 0" | \
bogoutil -l wordlist.new.db
The following commands will delete the tokens from non-spam messages:
bogoutil -d wordlist.db | \
awk '{print $1 " 0 " $3}' | grep -v " 0 0" | \
bogoutil -l wordlist.new.db
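To see what the awk and grep stage does without touching a real wordlist, you can feed it a few lines by hand. This sketch uses made-up tokens in a "token count count date" layout; the function name zero_third_count is only for illustration. It mirrors the first pipeline above: field 3 is set to zero, the date field is not carried over, and lines whose two counts are both zero are dropped.

```shell
#!/bin/sh
# Same awk/grep stage as the first pipeline above, wrapped in a function.
zero_third_count() {
    awk '{print $1 " " $2 " 0"}' | grep -v " 0 0"
}

# Made-up sample lines, not real bogoutil output.
printf '%s\n' 'alpha 3 7 20240101' 'beta 0 5 20240101' 'gamma 10 0 20240101' |
    zero_third_count
# prints:
#   alpha 3 0
#   gamma 10 0
```

The "beta" line disappears because both of its counts are zero after the rewrite, while "gamma 10 0" survives because the grep pattern requires a space before the first zero.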
How do I get bogofilter working on Solaris, BSD, etc?
If you don't already have a v3.0+ version of
BerkeleyDB, then
download it (take
one of the 4.4.X, 4.3.X or 4.2.X versions),
unpack it, and run these commands in the db directory:
$ cd build_unix
$ sh ../dist/configure
$ make
# make install
Next, download a
portable version
of bogofilter.
On Solaris
Be sure that your PATH environment variable begins with
/usr/xpg6/bin:/usr/xpg4/bin:/usr/ccs/bin (/usr/xpg6/bin is only
present on Solaris 10 and can be omitted on Solaris 9 and older
versions). That is required for POSIX compliance.
Unpack the bogofilter source, and then run:
$ ./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.4
$ make
# make install-strip
You will either want to put a symlink to libdb.so in
/usr/lib, or use a modified LD_LIBRARY_PATH environment
variable before you start bogofilter. On newer systems, the most
convenient way is probably to use the crle(1) tool to set the path
permanently so BerkeleyDB is available to all applications.
$ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.4/lib
$ export LD_LIBRARY_PATH
Note that some "make" versions shipped with older Solaris versions
break when you try to build bogofilter outside of its source
directory. Either build in the source directory (as suggested
above) or use GNU make (gmake).
If your Solaris GCC complains with "ld: fatal: file values-Xa.o:
open failed: No such file or directory", install the SUNWarc
package.
On FreeBSD
The FreeBSD ports collection carries the latest stable versions of
bogofilter to be compiled from source. The bogofilter ports are also auto-built and provided as binary packages for you to install.
The binary package approach uses only software that is installed by
default. To install bogofilter from a binary package, type, as the
privileged user:
pkg install -y bogofilter
The ports from-source approach uses the highly recommended
portmaster and portsnap software packages. To install portmaster,
type (you need to do this only once), as root:
pkg install -y portmaster
To install or upgrade bogofilter, just upgrade your ports tree
using portsnap, then type, as root:
portmaster mail/bogofilter
Note: This assumes you are root. If not, read through
the remainder of this FreeBSD section and then see how you can
build if you haven't got root privileges.
On NetBSD and other systems that use "pkgsrc"
pkgsrc should be offering a reasonably recent stable bogofilter
release. See http://www.pkgsrc.org/ for
information on pkgsrc.
On HP-UX
See the file
doc/programmer/README.hp-ux
in the source distribution.
Can I use the make command on my operating system?
Bogofilter has been successfully built on many operating
systems using GNU make and the native make commands. However,
bogofilter's Makefile doesn't work with some make commands.
GNU make is recommended for building bogofilter because we
know it works. We cannot support less capable make commands. If
your non-GNU make command can successfully build bogofilter,
that's great. If you encounter problems, the right thing to do is
install GNU make. If your non-GNU make can't build bogofilter,
we're sorry but you're on your own. If it takes just a minor and clean
patch to make it compatible, we might take it.
How do I build bogofilter as non-root user or for
a non-standard installation prefix?
To install bogofilter to a non-standard path (as a non-root user
you don't have write permission for the normal paths), you need to
provide the installation prefix when you run ./configure.
After downloading and unpacking the
source code, run ./configure --prefix=PATH
where
PATH is the installation prefix for the generated files (binaries,
man pages etc.). Then run the usual build commands:
make && make check && make install
How do I build bogofilter with patches?
If you need to apply patches, get the source
code and unpack it using tar -xzf or gunzip | tar -xf -
(as appropriate). Change to the
source directory and run ./configure --prefix=PATH
where PATH is the installation prefix for the generated files
(binaries, man pages etc.). Apply your patches, then run
make && make install
How do I make the executables smaller?
When space is tight, you can use make install-strip
instead of make install. Doing
this will save space, but crashes can't be debugged unless more
information on reproducing the bug is provided to the
developers.
datastore_db.c does not compile!
If you are configuring a database path, for instance with
--with-libdb-prefix or via CPPFLAGS and LIBS, be sure to pass in an
absolute path (with leading slash); a relative path will
not work. Example: use
--with-libdb-prefix=/usr/local/BerkeleyDB.4.2, but
not --with-libdb-prefix=../BerkeleyDB.4.2.
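If you drive configure from a wrapper script, a quick guard can catch a relative prefix before the build starts. This is a minimal sketch; the function name is_absolute_path and the DB_PREFIX variable in the usage comment are assumptions, not part of bogofilter's build system.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument starts with a slash.
is_absolute_path() {
    case $1 in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Example use in a build wrapper (DB_PREFIX is a made-up variable name):
#   is_absolute_path "$DB_PREFIX" || { echo "prefix must be absolute" >&2; exit 1; }
#   ./configure --with-libdb-prefix="$DB_PREFIX"
```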
With which mail programs does bogofilter work?
Bogofilter is known to work with kmail, mozilla-mail, mutt,
alpine, sylpheed-claws. A google search will help you
find more information on using bogofilter with the mail program
you use.
How do I use bogofilter with mutt?
Use a mail filter (procmail, maildrop, etc.) to filter mail
into different folders based on bogofilter's return code and set
mutt key bindings to train bogofilter on errors:
macro index S "|bogofilter -s\ns=junkmail" "Learn as spam and save to junk"
macro pager S "|bogofilter -s\ns=junkmail" "Learn as spam and save to junk"
macro index H "|bogofilter -n\ns=" "Learn as ham and save"
macro pager H "|bogofilter -n\ns=" "Learn as ham and save"
These will pipe the selected message through bogofilter,
training a false-ham as spam or vice versa, then offer to save the
message to a different folder.
How do I use bogofilter with Sylpheed Claws?
Add a filtering rule to run bogofilter on incoming messages
and an action to perform if it's spam:
condition:
* test "bogofilter < %F"
action:
* move "#mh/YOUR_SPAM_BOX"
Note: this assumes that bogofilter is in your path!
Create two Claws actions - one for marking messages as spam
and one for marking messages as ham. Use the "Mark As Spam"
action for messages incorrectly classified as ham and use the "Mark As Ham"
action for messages incorrectly classified as spam.
Mark as ham / spam:
* bogofilter -n -v -B "%f" (mark ham)
* bogofilter -s -v -B "%f" (mark spam)
Another approach is to save incorrectly classified messages in
a folder (or folders) and run a script like:
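#!/bin/sh
# Register as spam every message that is newer than the marker file.
CONFIGDIR=~/.bogofilter
SPAMDIRS="$CONFIGDIR/spamdirs"       # file listing the spam folders
MARKFILE="$CONFIGDIR/lastbogorun"    # its timestamp records the previous run
for D in `cat "$SPAMDIRS"`; do
find "$D" -type f -newer "$MARKFILE" -not -name ".sylpheed*"
done|bogofilter -bNsv                # -b: read message file names from stdin
touch "$MARKFILE"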
This script can be used as an action and/or made into a toolbar
button. It will register as spam the messages in ${SPAMDIRS} that
are newer than ${MARKFILE}.
Additional information is available at the
Sylpheed-Claws's wiki.
Another approach is to run bogofilter from procmail, maildrop,
etc., and have Claws check the X-Bogosity header and filter messages
into Spam and Unsure folders, e.g.:
Condition:
header "X-Bogosity" matchcase "Spam"
Action:
move "#mh/Mailbox/Spam"
Condition:
header "X-Bogosity" matchcase "Unsure"
Action:
move "#mh/Mailbox/Unsure"
Any messages in the Unsure folder should be used for training,
as should messages incorrectly classified as ham or spam. The
actions below will handle these cases:
Register Spam:
bogofilter -s < "%f"
Register Ham:
bogofilter -n < "%f"
Unregister Spam:
bogofilter -S < "%f"
Unregister Ham:
bogofilter -N < "%f"
To look inside the bogofilter scoring mechanism, the following
diagnostics are useful:
BogoTest -vv:
bogofilter -vv < "%f"
BogoTest -vvv:
bogofilter -vvv < "%f"
Additional information on this approach is available here.
How do I use bogofilter with VM (an Emacs Mail
tool)?
You need to put the separate file vm-bogofilter.el
in your Emacs load path. It is included in bogofilter's contrib
directory; the latest version of the file is at
http://www.cis.upenn.edu/~bjornk/bogofilter/vm-bogofilter.el.
Then, just add in your ~/.vm configuration file:
;; load bogofilter capabilities (spam)
;;
(require 'vm-bogofilter)
;; key bindings for bogofilter
;; K (shift-k) marks a message as spam
;; C (shift-c) marks a message as clean (ham)
(define-key vm-mode-map "K" 'vm-bogofilter-is-spam)
(define-key vm-mode-map "C" 'vm-bogofilter-is-clean)
All the messages are filtered by bogofilter each time you check
newly arrived e-mail. When you change the status of an e-mail,
the bogofilter header is changed (X-Bogosity: header).
There is a limit: you cannot change multiple message headers at
one time in VM; you have to do it message by message.
How do I use bogofilter with MH-E (the Emacs
interface to the MH mail system)?
The default setting of the 'mh-junk-program' option is
'Auto-detect' which means that MH-E will automatically choose one
of SpamAssassin, Bogofilter, or SpamProbe in that order. If, for
example, you have both SpamAssassin and Bogofilter installed and
you want to use Bogofilter, then you can set this option to
'Bogofilter'.
The 'J b' ('mh-junk-blacklist') command trains the spam program
in use with the content of the range and then handles the
message(s) as specified by the 'mh-junk-disposition' option. By
default, this option is set to 'Delete Spam' but you can also
specify the name of a folder, which is useful for building a
corpus of spam for training purposes.
In contrast, the 'J w' ('mh-junk-whitelist') command
reclassifies a range of messages as ham if they were incorrectly
classified as spam. It then refiles the messages into the '+inbox'
folder.
For more information, see the MH-E home page.