Examples:

  • Note: updated and extended examples on the A2Z CVS!
  • Full example: concordance.zip
  • Individual source files: Concordance.java, Tree.java, A2ZFileReader.java.
  • Simplified example using TreeMap instead of custom binary search tree: concordanceTreeMap.zip, Concordance.java, Word.java, A2ZFileReader.java
  • Related:

  • Algorithmic Efficiency — Beating a Dead Horse Faster
  •  
    Our goal for this week is to develop a very simple example of a text concordance. This program will read in a text file and process a count for every word that appears in that file, producing a list of words with their corresponding counts in alphabetical order. We might instinctively begin to develop this example by storing words in an array or ArrayList, however, we’re going to use a more advanced data structure to store our concordance information — a Binary Tree.

    Binary Tree

    A Binary Tree consists of a branching network of nodes, where each node has 0, 1, or 2 children. In our tree (technically a Binary Search Tree, each node will represent a word, i.e. String). Each node’s left subnode will contain words alphabetically preceding that node’s word, and each node’s right subnode will contain words that alphabetically follow that node’s word. Confused? The applet to the right demos a simple binary search tree of characters. (My goodness, shocking, are we looking at graphics in this class? Yowsa!)

    Here’s a better example of an applet that visualizes a binary tree of integers.

    Why is this better? Consider a data structure to keep track of 1000 words. If our task is to search and find any given word and our data structure is an array, the amount of cycles it might take to find any word would be equal to the amount of words, i.e. the efficiency of the algorithm would be linear. For N words, N cycles are required. This is known as Big O notation. Examples of Big O notation are:

    c: constant
    log N: logarithmic
    log^2 N: log-squared
    N: linear
    N log N: n * log(n)
    N^2: quadratic
    N^3: cubic
    2^N: exponential
    

    It turns out that (balanced) binary search trees are logN in big O notation. You can quickly see why they are more efficient. Instead of traversing the list one word at a time, we can maneuver left and right between nodes based on the alphabetical position of the word we are searching for. It turns out that to find any word, we only have to pass through each level once. In a perfectly balanced tree this amounts approximately to the equation: 2^n = totalwords (where ‘n’ is the number of steps we have to take to find any given word.) In logarithm-speak, we can solve for n, i.e. n = log2(totalwords).

    We really should create a generic binary tree, one that can hold any type of data that can be compared (i.e. implements Java’s Comparable interface.) Nevertheless, to keep things simple we are going to create a binary search tree that holds onto Strings (which can be compared alphabetically). (Diagram above thanks to: Br. David Carlson’s page.) The first thing we need to do is write a class to describe any given “Node” in the system.

    What type of data must a Node keep track of?

  • Word (String)
  • Left child (Node)
  • Right child (Node)
  •  
    Ok, so our Node class will look something like this to start:

    private class Node
    {
      public String word;
      public Node left;
      public Node right;
    
      public Node(String s) {
        word = s;
        left = null;
        right = null;
      }
    }
    

    Each node starts with a word and empty slots for left and right “subnodes.”

    Why is this class private? Certainly, there are different ways to implement this Tree structure. In our example, the Node class is private because it is an inner class that lives inside a Tree class. The node data is only managed internally and thus does not need to be accessed by any other classes. Here is how this works:

    public class Tree
    {
       private Node root;
    
       public Tree()
       {
          root = null;
       }
    
      private class Node
      {
        public String word;
        public Node left;
        public Node right;
        public int count;
    
        public Node(String s) {
          word = s;
          left = null;
          right = null;
        }
      }
    }
    

    Each Tree object starts with a root Node, which contains references to two subnodes, which each contain references to two subnodes, etc. When a Tree object is created, the root is set to null. So what happens when we add our first word to the tree? Well, we’ll have to implement an “insert” method inside the Tree object.

    public void insert(String s)
    {
      Node newNode = new Node(s);
      if (root == null) root = newNode;
      else root.insertNode(newNode);
    }
    

    In other words, if the root is null (i.e. the tree is empty) this new node becomes the root. If not, well then we call the “insertNode” method on the existing root node. Oy, but what is this insertNode() method? It turns out that we will need two insert methods. The above method lives in the Tree object and deals with inserting a node into the first spot in our tree. In all other cases, the Tree must be recursively traversed in order to find the right spot to insert the node. This is accomplished via an insert method written into the node class itself:

    // Inserts a new node as a descendant of this node
    // If both spots are full, keep on going. . .
    public void insertNode(Node newNode)
    {
       // If new word is alphabetically before
       if (newNode.word.compareTo(word) < 0)
       {
          // if the spot is empty (null) insert here
          // otherwise keep going!
          if (left == null) left = newNode;
          else left.insertNode(newNode);
       }
       // if new word is alphabeticall after
       else if (newNode.word.compareTo(word) > 0)
       {
          // if the spot is empty (null) insert here
          // otherwise keep going!
          if (right == null) right = newNode;
          else right.insertNode(newNode);
       // if new word is the same
       } else {
          // We'd do something here if we wanted to when the Strings are equal
          // For example, increment a counter
       }
    }
    

    This idea of recursively traversing the tree can be pretty confusing at first. It’s important to try to get a handle on it since all other operations we might want to perform on the data (searching, printing, etc.) will require this same methodology.

    This nice friendly applet might help.

    Following is the code for printing the information in the tree. Note how simple it is: only three lines! However, conceptually it may take some time to get used to this idea. We’ll be going over this in detail in class.

    public void printNodes()
    {
      if (left != null) left.printNodes();
      System.out.println(word + " " + count);
      if (right != null) right.printNodes();
    }
    

    Thinking of it in pseudo-code might help:

    1 -- Arrive at a node.
    2 -- If the left child is not empty, start step 1 for the left child node.
    3 -- Once we're back here, print the word.
    4 -- If the right child is not empty, start step 1 for the right child node.
    

    What would happen if we did it this way?

    1 -- Arrive at a node.
    2 -- If the right child is not empty, start step 1 for the right child node.
    3 -- Once we're back here, print the word.
    4 -- If the left child is not empty, start step 1 for the left child node.
    

    The full source for our Tree class is as follows (partially based on the Horstmann example from Big Java.

    /* Daniel Shiffman               */
    /* Binary Tree Class             */
    /* Programming from A to Z       */
    /* ITP, Spring 2006              */
    
    // THIS TREE CLASS IS BASED ON THE HORSTMANN EXAMPLE
    // SIMPLIFIED TO WORK WITH JUST STRINGS
    // DOES NOT 'INSERT' A DUPLICATE WORD TO THE TREE
    // INCLUDES A COUNTER FOR EACH WORD
    
    public class Tree
    {
    
       // We only need to keep track of the root
       // Each node will hold references to the other nodes
       private Node root;
    
       // Construct an empty tree
       public Tree()
       {
          root = null;
       }
    
       // Insert a new node
       public void insert(String s)
       {
          Node newNode = new Node(s);
          // If root is empty, then this node is the root
          if (root == null) root = newNode;
          // otherwise begin the recursive process
          else root.insertNode(newNode);
       }
    
       // Start the recursive traversing of the tree
       public void print()
       {
          if (root != null)
             root.printNodes();
       }
    
       // A node of a tree stores a String as well as references to
       // the child nodes to the left and to the right.
       // Note the use of an inner class
       private class Node
       {
          public String word;
          public Node left;
          public Node right;
          public int count;
    
          // Construct a node
          public Node(String s) {
             word = s;
             left = null;
             right = null;
             count = 1;
          }
    
          // Inserts a new node as a descendant of this node
          // If both spots are full, keep on going. . .
          public void insertNode(Node newNode)
          {
    
             // If new word is alphabetically before
             if (newNode.word.compareTo(word) < 0)
             {
                // if the spot is empty (null) insert here
                // otherwise keep going!
                if (left == null) left = newNode;
                else left.insertNode(newNode);
             }
             // if new word is alphabeticall after
             else if (newNode.word.compareTo(word) > 0)
             {
                // if the spot is empty (null) insert here
                // otherwise keep going!
                if (right == null) right = newNode;
                else right.insertNode(newNode);
             // if new word is the same
             } else {
                count++;
             }
          }
    
          // Recursively print words (with counts) in sorted order
          public void printNodes()
          {
             if (left != null) left.printNodes();
             System.out.println(word + " " + count);
             if (right != null) right.printNodes();
          }
       }
    }
    

    Breaking out File I/O operations into a separate class

    Now that we’ve completed the Tree class, it’s fairly simple to implement a main “driver” program to read an input file, create an empty tree, use the StringTokenizer to split it up into “words”, and insert all the words into the Tree. We’ll examine these pieces step-by-step in a moment, but first let’s look at cleaning up our code a bit for reading from a text file.

    In earlier examples, all of the file input code could be found in the main program and it’s probably a good idea to break it out into a separate class that manages this work for us. The following class stores two Strings, “filename” and “content.” The constructor is called with a filename, and a process is then set in motion that reads the content from the file into the “content” String. The content can then be accessed via the “getContent()” method. It also should be noticed that I’ve set this example up to be part of an “a2z” package.

    /* Daniel Shiffman               */
    /* Class to read an input file   */
    /* and return a String           */
    
    package a2z;
    
    import java.nio.*;
    import java.io.*;
    import java.nio.channels.*;
    
    public class A2ZFileReader
    {
      private String filename;
      private String content;
    
      public A2ZFileReader(String name) throws IOException {
        filename = name;
        readContent();
      }
    
      public void readContent() throws IOException {
        // Create an input stream and file channel
        // Using first arguemnt as file name to read in
        FileInputStream fis = new FileInputStream(filename);
        FileChannel fc = fis.getChannel();
    
        // Read the contents of a file into a ByteBuffer
        ByteBuffer bb = ByteBuffer.allocate((int)fc.size());
        fc.read(bb);
        fc.close();
    
        // Convert ByteBuffer to one long String
        content = new String(bb.array());
      }
    
      public String getContent() {
        return content;
      }
    }
    

    The Concordance

    Once this is complete, the code for our main program is fairly simple and just needs to implement several steps.

    STEP 1: READ IN THE DATA FILE

    String path   = args[0];
    A2ZFileReader fr = new A2ZFileReader(path);
    String content = fr.getContent();
    

    STEP 2: CREATE AN EMPTY TREE:

    Tree tree = new Tree();
    

    STEP 3: SPLIT CONTENT UP INTO “WORDS” AND INSERT INTO TREE

    String regex = "\\b";  // We could come up with a better regex
    String words[] = content.split(regex);
    // For every word
    for (int i = 0; i < words.length; i++)
    {
      tree.insert(words[i]);
    }
    

    STEP 4: DISPLAY TREE CONTENTS

    System.out.println("Here are the contents of your tree:");
    System.out.println();
    tree.print();
    

    There are plenty of improvements that could be made (error handling, eliminating tokens that aren’t really words, etc.) and some of these are incorporated into the full source for the example this week, available at the top of this page. (Or gee, I guess I might as well put a link to it here too.) Others are suggested as exercisess for this week’s programming assignment.

    It should also be noted that we could have implemented a concordance using one of Java’s built-in data structures, such as TreeMap. My example: concordanceTreeMap.zip or also: Chapter 12.5 of this online Java textbook.


    3 Responses to “Concordance”  

    1. 1 Mongie

      Can u please show the method leave count using C++.

    1. 1 Word Counting • Blog Archive • benyee
    2. 2 Rest of You » Blog Archive » Word Counting


    Leave a Reply