Examples:
Related:
Our goal for this week is to develop a very simple example of applying Bayes’ Theorem to text analysis. The example we will use is modeled after Paul Graham’s “A Plan for Spam.” This example will be demonstrated through storing Word objects in a Hash Table, instead of a Binary Search Tree (as with last week’s concordance example). We can use a Hash Table here because the order of the words is not relevant to this problem.
Hash Tables
Last week, we discovered that we could improve on the efficiency of searching through an ordered list of words via a binary tree data structure. This structure provided O(logN) look-up time, a great improvement over O(N) look-up that a simple ArrayList or LinkedList provided. We wrote our own Tree class, however, we could have simplified our lives by using the Java TreeMap.
Wouldn’t it be nice if we could search for a word in a data structure and have constant time (i.e. O(1)) look-up for any word? This is similar to the speed at which we can look-up any element in an array by accessing it via its index. We only need one computation cycle! In fact, a hash table does exactly this! Although we could build our own as we did with the Tree, this time we’ll use the existing Java HashMap class.
A hash table provides constant-time look-up of any value via its key. In our case, the key and the value can essentially be the same (the word we are searching for). The key is transformed into a numerical value via a hash function. That number is then used as an index in an array to check if the key exists and/or retrieve the value associated with that key. Consider a room full of approximately 50 ITP students that we want to put into a Hash Table. We need to come up with a function that takes the person and converts them into a numerical value. Let’s try a few different solutions:
Which of the above functions are good hash functions and which are bad? Birthday year, for example, is a terrible hash function. If a large number of the students are born in the same year (which they likely will be) then we will end up with the same hashcode for many of the people. This is what’s known as a “Collision.” The HashTable can handle collisions by performing a linear search, however, the more collisions, the more our algorithm’s efficiency will deterioriate. Imagine if everyone had the same hashcode, we’d be in no different shape then we would with a LinkedList!
The last two digits of NYU ID, however, would be a much better hash function since we would expect this to be fairly randomized and would give us 100 possibilities (00 – 99) with relatively few collisions.
Fortunately for us, we don’t have to worry about writing our own Hash Functions (though this would be an interesting avenue to pursue), we can simply use the HashMap class. Here’s how we use a HashMap:
Create an empty one:
HashMap words = new HashMap();
Add some words to it (using a String as both the key and value)
String text = "This example demonstrates using a HashMap";
// Add every word to the hashmap
String[] tokens = text.split("\\W");
for (int i = 0; i < tokens.length; i++) {
String word = tokens[i];
// Note we are using the same String for both Key and Value
words.put(word,word);
}
Now we can use the containsKey() method to search for a word:
// Search the hash table for various words
if (words.containsKey("example")) {
System.out.println("I found the word example!");
} else {
System.out.println("I did not find the word example!");
}
full code: HashMapTest.java
p(A|B) = ( p(B|A)*p(A) ) / ( p(B|A)*p(A) + p(B|~A)*p(~A) )
Bayes’ Rule
Consider the following scenario:
You have received a positive TID, what is the likelihood you have ITPosis?
Strangely enough, there is a very precise answer to this question and it’s not what you would expect. Bayesian reasoning is counter-intuitive and takes quite a bit of getting used to. In fact, when given a similar question related to breast cancer and mammograms, only 15% of doctors get the answer correct.
The answer — 15.3% — is calculated via Bayes’ Theorem. Let’s look at it again with this scenario:
The problem our brains run into are those rascally 90% and 95% numbers. 90% of students who test positive have the disease and 95% who don’t test negative, if I test positive, I should probably have it, right?!! The important thing to remember is that only 1% of students actually have the disease. Sure testing positive increases the likelihood, but because 5% of students without the disease receive a false positive, it only increases the chances to 15%. All of this is explained in incredibly thorough and wonderful detail in Eliezer Yudkowsky’s article An Intuitive Explanation of Bayesian Reasoning. My explanation is simply adapated from his.
By the way, we could have calculated it as follows:
P (ITPosis | Positive TID) = (90%*1%) / (90%*1% + 5%*99%).
This reads as “the probability that a positive TID means you have ITPosis” equals. . . .
So why do we care? This type of reasoning can be applied quite nicely to text analysis. The example we’ll look at is Spam Filtering. If we know the probability that a spam e-mail contains certain words and that non-spam e-mails contain certain words, we can calculate the likelihood that an e-mail is spam based on what words it contains.
Spam Filtering using Bayesian Analysis
Before we attempt any of this we should read Paul Graham’s A Plan for Spam. And perhaps we might take a look at Better Bayesian Filtering. Now, we can start putting together some code. Note that the code is not a perfect Spam filter by any means. There are lots of issues with it (it only considers certain types of words, it doesn’t know anything about the number of e-mails received, it may run into memory problems if very large training files are used, etc.) Nevertheless, the example outlines the steps one might take to apply Bayesian Filtering to text. We would also use similar strategies to perform Naive Bayes Classification of text, much like is done in the Nora Project.
Fancy Word Class
The first thing we need to do is develop a class to describe any individual word. Not only do we want to know what the String is itself, but we want to know how many times that word appears in spam e-mails, how many times in good e-mails, and ultimately the probability that an e-mail is Spam based on the appearance of that word.
public class Word {
private String word; // The String itself
private int countBad; // The total times it appears in "bad" messages
private int countGood; // The total times it appears in "good" messages
private float rBad; // bad count / total bad words
private float rGood; // good count / total good words
private float pSpam; // probability this word is Spam
// Create a word, initialize all vars to 0
public Word(String s) {
word = s;
countBad = 0;
countGood = 0;
rBad = 0.0f;
rGood = 0.0f;
pSpam = 0.0f;
}
}
That’s a good start and the above code encompasses all we need to store the necessary data for each Word object. We’ll also need to give Word objects some functionality, such as increasing the the count for “bad” occurences, calculating the probability it occurs in spam messages, and more. We’ll add some methods to the above class as follows (this is just a partial sample):
// Increment bad counter
public void countBad() {
countBad++;
}
// Implement bayes rules to compute how likely this word is "spam"
public void finalizeProb() {
if (rGood + rBad > 0) pSpam = rBad / (rBad + rGood);
if (pSpam < 0.01f) pSpam = 0.01f;
else if (pSpam > 0.99f) pSpam = 0.99f;
}
// The “interesting†rating for a word is
// How different from 0.5 it is
public float interesting() {
return Math.abs(0.5f - pSpam);
}
Note that it will become important later to determine which words are “interesting.” Our interesting rating is defined as how different the spam probability is from 0.5 (i.e. 50/50 is as boring as it gets).
Once we have the Word object taken care of, we can create a HashMap and start dumping words in it. Where do we get these words? We’ll train the filter via known spam e-mails and known “good” e-mails. Every time we encounter a word, we will check if it already exists in the Hash Table. If it does not, we will add it, if it does, we will simply increase the counter for “bad” or “good” (depending on whether it’s found in a good or bad e-mail.)
HashMap words = new HashMap();
// Read the content and break up into words
A2ZFileReader fr = new A2ZFileReader(file);
String content = fr.getContent();
String[] tokens = content.split("\\W");
int spamTotal = tokens.length; // How many words total
// For every word token
for (int i = 0; i < tokens.length; i++) {
String word = tokens[i].toLowerCase();
// If it exists in the HashMap already
if (words.containsKey(word)) {
Word w = (Word) words.get(word);
w.countBad();
// Otherwise it's a new word so add it
} else {
Word w = new Word(word);
w.countBad();
words.put(word,w);
}
}
The above steps would then be repeated for any good e-mails (using the same HashMap, but calling countGood() instead of countBad().) Once all the "training" e-mails are read, the probability of an e-mail containing a certain word is Spam, can be calculated for every word. This is done by creating an Iterator object from the HashMap. The Iterator allows us to visit every single value in the HashMap and call a method on it, i.e.:
Iterator iterator = words.values().iterator();
while (iterator.hasNext()) {
Word word = (Word) iterator.next();
word.finalizeProb();
}
Now, our spam filter is trained! For every word in the HashMap, we know the probability that an e-mail containing it is spam. Now, all that is left to do is take a new e-mail, find the 15 most interesting words, and compute the total probability for the e-mail according to the formula specified in Graham’s essay.
There are a number of different ways we can find the most interesting 15 words. The following algorithm is an Insertion Sort algorithm where an ArrayList is created one element at a time, inserting the elements the right spot to keep a descending sorted order. If the list grows above 15, we just delete the last element to keep the top 15 words.
// Create an arraylist of 15 most "interesting" words
// Words are most interesting based on how different their Spam probability is from 0.5
ArrayList interesting = new ArrayList();
// For every word in the String to be analyzed
String[] tokens = stuff.split(splitregex);
for (int i = 0; i < tokens.length; i++) {
String s = tokens[i].toLowerCase();
Matcher m = wordregex.matcher(s);
if (m.matches()) {
Word w;
// If the String is in our HashMap get the word out
if (words.containsKey(s)) {
w = (Word) words.get(s);
// Otherwise, make a new word with a Spam probability of 0.4;
} else {
w = new Word(s);
w.setPSpam(0.4f);
}
// We will limit ourselves to the 15 most interesting word
int limit = 15;
// If this list is empty, then add this word in!
if (interesting.isEmpty()) {
interesting.add(w);
// Otherwise, add it in sorted order by interesting level
} else {
for (int i = 0; i < interesting.size(); i++) {
// For every word in the list already
Word nw = (Word) interesting.get(i);
// If it's the same word, don't bother
if (w.getWord().equals(nw.getWord())) {
break;
// If it's more interesting stick it in the list
} else if (w.interesting() > nw.interesting()) {
interesting.add(i,w);
break;
// If we get to the end, just tack it on there
} else if (i == interesting.size()-1) {
interesting.add(w);
}
}
}
// If the list is bigger than the limit, delete entries
// at the end (the more "interesting" ones are at the
// start of the list
while (interesting.size() > limit) interesting.remove(interesting.size()-1);
}
}
We’re almost there. . . now that we’ve got the probabilities for each word, we can move on to the combined probability. (more here)
// Apply Bayes' rule (via Graham)
float pposproduct = 1.0f;
float pnegproduct = 1.0f;
// For every word, multiply Spam probabilities ("Pspam") together
// (As well as 1 - Pspam)
for (int i = 0; i < interesting.size(); i++) {
Word w = (Word) interesting.get(i);
//System.out.println(w.getWord() + " " + w.getPSpam());
pposproduct *= w.getPSpam();
pnegproduct *= (1.0f - w.getPSpam());
}
// Apply formula
float pSpam = pposproduct / (pposproduct + pnegproduct);
. . .and if pSpam is greater than 0.9, we've found a spam e-mail, if not, we're ok!!

Hi,
I want C (C) program to process spam.
how can i store tokenized data ?
how can i develop classifier ?
Is it bayes filter is a package that is available ?
Can you suggest the logic ?
Thanks
Ram,
I want C (C) program to process spam.
how can i store tokenized data ?
how can i develop classifier ?
Is it bayes filter is a package that is available ?
Can you suggest the logic ?