Mining

This week’s topics:

  • HTML
  • XML
  • APIs
  • Examples:

  • A2ZUrlReader.java, WeatherGrabberHTML.java
  • A2ZXMLHelper.java, XMLWeatherGrab.java, XMLTraverse.java
  • YahooNameSearch.java
  • DeliciousTest.java
  • FlickrHelper.java, FlickrProcessing.java
  • For more on Processing in Eclipse: http://www.mostpixelsever.com/tutorial/eclipse/

    Exercises (optional):

  • Combine these examples with one of the text analysis programs. Try using an RSS news feed to train a Bayesian filter.
  • Develop some visualization / analysis driven by Yahoo searches. Here’s an interesting example.

    So far we’ve looked at several techniques for analyzing and manipulating text, dealing mostly with locally stored text files. Last week, we tasted the delicious nectar of mining the web by developing an example that grabbed the source from a URL. This week, we’ll look further at the possibilities of grabbing textual content from the network, exploring parsing HTML / XML, as well as looking at three available APIs: Yahoo, del.icio.us, and Flickr.

    Parsing HTML

    Grabbing data from an HTML page can be an uncomfortable experience. All that HTML whoozeemawhatsit is related to the visual appearance of a page, making the data appear rather disorganized. Nevertheless, there’s a lot of useful junk to be found out there in the world of HTML and with a little perseverance, it’s all there for free. There are different ways we might approach the problem of pulling information from a webpage. We can return to some of the basic String class functionality of yesteryear or use regular expressions.

    Let’s say we want to pull out the number of apples from the text below.

    String stuff = "Number of apples: 62.  Boy, do I like apples or what!";

    Our algorithm would be as follows:

    1 — Find the end of the substring “apples: ” Call it start.
    2 — Find the period after “apples: ” Call it end.
    3 — Make a substring of the stuff between start and end.
    4 — Convert the string to a number (if we want to use it as such)

    In code, this would look like this:

    int start = stuff.indexOf("apples: ") + 8;  // STEP 1
    int end   = stuff.indexOf(".", start);      // STEP 2
    String apples = stuff.substring(start,end); // STEP 3
    int apple_no = Integer.parseInt(apples);    // STEP 4

    The above code will do the trick, but we should be a bit more careful to make sure we don’t run into any errors if we don’t find the substrings. We can add some error checking and generalize the code into a function:

    // A function that returns a substring between two substrings
    String giveMeTextBetween(String s, String startTag, String endTag) {
      int startIndex = s.indexOf(startTag);         // Find the index of the beginning tag
      if (startIndex == -1) return "";              // If we don't find anything, return an empty String
      startIndex += startTag.length();              // Move to the end of the beginning tag
      int endIndex = s.indexOf(endTag, startIndex); // Find the index of the end tag
      if (endIndex == -1) return "";                // If we don't find the end tag, return an empty String
      return s.substring(startIndex,endIndex);      // Return the text in between
    }
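
    As a quick check, here is a minimal sketch that runs the function on the apples example. The class name TextBetweenDemo and the helper parseCount are made up for illustration; parseCount simply assumes that -1 is an acceptable "not a number" result.

```java
class TextBetweenDemo {
  // Same logic as giveMeTextBetween above, made static so main can call it
  static String giveMeTextBetween(String s, String startTag, String endTag) {
    int startIndex = s.indexOf(startTag);
    if (startIndex == -1) return "";
    startIndex += startTag.length();
    int endIndex = s.indexOf(endTag, startIndex);
    if (endIndex == -1) return "";
    return s.substring(startIndex, endIndex);
  }

  // Hypothetical helper: convert the extracted text to an int, or -1 if it isn't a number
  static int parseCount(String text) {
    try {
      return Integer.parseInt(text.trim());
    } catch (NumberFormatException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    String stuff = "Number of apples: 62.  Boy, do I like apples or what!";
    // The start tag is found, so we get the text up to the next period
    System.out.println(parseCount(giveMeTextBetween(stuff, "apples: ", "."))); // prints 62
    // The start tag is missing, so we get "" and parseCount falls back to -1
    System.out.println(parseCount(giveMeTextBetween(stuff, "pears: ", ".")));  // prints -1
  }
}
```

    Returning an empty String keeps the caller from crashing, but it also means the caller has to decide what "nothing found" should mean, which is what parseCount does here.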

    We could also rewrite the function using regular expressions:

    // A function to pull out the text between two regular expression Strings
    public static String regexTextBetween(String s, String startRegex, String endRegex) {
      // Create one large regex from start and end with anything in between
      // Note how we use parentheses to capture the in-between stuff
      String bigRegex = startRegex + "(.*?)" + endRegex;
      // Compile the pattern and create the matcher
      Pattern p = Pattern.compile(bigRegex);
      Matcher m = p.matcher(s);
      // If we find what we are looking for, return the captured group
      if (m.find()) {
        return m.group(1);
      } else {
        return "oops!";
      }
    }

    Let’s say we wanted to grab the high and low temperatures off of this Yahoo weather page. If we visit that page and view source, we can see that the temperatures are embedded in the following html:

    <font size="2" face="Arial">High:</font> <b><font size="3" face="Arial">37°</font></b>
    <font size="2" face="Arial">Low:</font> <b><font size="3" face="Arial">25°</font></b>

    To grab the data we want, we simply call the above function with regular expressions matching the patterns that precede and follow “37” (as well as “25”).

    A2ZUrlReader urlreader = new A2ZUrlReader("http://weather.yahoo.com/forecast/USNY0996.html");
    String html = urlreader.getContent();
     
    // Grab high temperature
    String high = regexTextBetween(html,"High:</font> <b><font size=\"3\" face=\"Arial\">","°");
    System.out.println("The high for today is: " + high + " degrees.");
     
    // Grab low temperature
    String low = regexTextBetween(html,"Low:</font> <b><font size=\"3\" face=\"Arial\">","°");
    System.out.println("The low for today is: " + low + " degrees.");
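
    One thing to watch out for: the start and end strings are treated as regular expressions, so characters like "." or "(" in the surrounding HTML would take on their special regex meaning. Below is a minimal sketch of a variant that matches the delimiters literally using Pattern.quote. It runs on a copy of the two HTML lines above rather than the live page, and the names WeatherRegexDemo and literalTextBetween are made up for this example. Unlike the function above, it returns an empty String when nothing is found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeatherRegexDemo {
  // Like regexTextBetween, but both delimiters are quoted so they match literally
  static String literalTextBetween(String s, String start, String end) {
    Pattern p = Pattern.compile(Pattern.quote(start) + "(.*?)" + Pattern.quote(end));
    Matcher m = p.matcher(s);
    return m.find() ? m.group(1) : "";
  }

  public static void main(String[] args) {
    // A local copy of the HTML shown above (\u00B0 is the degree sign)
    String html =
        "<font size=\"2\" face=\"Arial\">High:</font> <b><font size=\"3\" face=\"Arial\">37\u00B0</font></b>\n"
      + "<font size=\"2\" face=\"Arial\">Low:</font> <b><font size=\"3\" face=\"Arial\">25\u00B0</font></b>";
    String afterLabel = "</font> <b><font size=\"3\" face=\"Arial\">";

    String high = literalTextBetween(html, "High:" + afterLabel, "\u00B0");
    String low  = literalTextBetween(html, "Low:" + afterLabel, "\u00B0");
    System.out.println("The high for today is: " + high + " degrees."); // high is "37"
    System.out.println("The low for today is: " + low + " degrees.");   // low is "25"
  }
}
```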

    XML

    So, yes, HTML is an ugly, scary place with inconsistently formatted pages that are difficult to reverse engineer and parse effectively. Fortunately for you, there is the world of XML (Extensible Markup Language). XML is designed to facilitate the sharing of data across different systems and its format won’t cause your hair to go gray as fast. Let’s examine how we might grab exactly the same weather information from Yahoo’s RSS feed.

    XML organizes information in a tree structure. The code we’ll write to search and traverse an XML document is somewhat similar to the work we did in building a binary search tree and will involve recursion. Let’s look at the XML for Yahoo weather’s RSS feed (this is only part of the source in order to simplify the discussion.)

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    <rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
     <channel>
       <item>
         <title>Conditions for New York, NY at 3:51 pm EST</title>
         <geo:lat>40.67</geo:lat>
         <geo:long>-73.94</geo:long>
         <link>
           http://us.rd.yahoo.com/dailynews/rss/weather/New_York__NY/*http://xml.weather.yahoo.com/forecast/USNY0996_f.html
         </link>
         <pubDate>Mon, 20 Feb 2006 3:51 pm EST</pubDate>
         <yweather:condition text="Fair" code="34" temp="35" date="Mon, 20 Feb 2006 3:51 pm EST"/>
         <yweather:forecast day="Mon" date="20 Feb 2006" low="25" high="37" text="Clear" code="31"/>
         <guid isPermaLink="false">USNY0996_2006_02_20_15_51_EST</guid>
       </item>
    </channel>
    </rss>

    With the exception of the first line (which simply indicates that the document is XML formatted), this XML document contains a nested structure consisting of elements, each with a start tag, e.g. <channel>, and an end tag, e.g. </channel>. Some of these elements have content between the tags, e.g.:

    <title>Conditions for New York, NY at 3:51 pm EST</title>

    and some have attributes:

    <yweather:forecast day="Mon" date="20 Feb 2006" low="25" high="37" text="Clear" code="31"/>

    It should be fairly obvious how searching for information, such as the title of the page or the high temperature, will be significantly less painful than with the tragically arbitrary process of parsing HTML. Java provides us with a nice API for processing XML. The three packages we’ll need are:

  • org.w3c.dom, which contains the Document class, representing the entire XML document. Other noteworthy types are Element, Attr, Node, and NodeList.
  • javax.xml.parsers, which contains DocumentBuilder. This does exactly what you might think: it builds a Document instance out of an XML file.
  • org.xml.sax, where “SAX” stands for Simple API for XML. This package contains other useful classes and interfaces.

    Ok, we can begin to write some code. Let’s just build a Document object from an XML page. (Note that this code is incomplete; it just demonstrates the first few steps of an XML reading program.)

    // Import the necessary libraries
    import java.io.*;
    import java.net.*;
    import org.w3c.dom.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
     
    public class XMLWeatherGrab {
      public static void main (String argv []){
        try {
          // Create a URL object and open an InputStream
          URL url = new URL("http://xml.weather.yahoo.com/forecastrss?p=10003");
          InputStream is = url.openStream();
     
          // Build a document from that InputStream
          DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
          DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
          Document doc = docBuilder.parse(is);

    Next, we might choose to grab the root element from the XML tree:

    Element root = doc.getDocumentElement();
    System.out.println(root.toString());

    (With the parser used for these examples, the above code displays the entire XML document; some parser implementations print only the element’s name instead. Remember, XML is a nested structure and this root element contains all the subelements of the XML document! So instead, we might begin to poke around the children of the root.)

    NodeList children = root.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      System.out.println(n.getNodeName());
    }

    If we run the above code, which grabs a list of children and prints out their corresponding tag names, the result isn’t terribly interesting. All we get is “channel.” This is because the root element “rss” has only one child element, “channel.” So, what’s next? We’ll need to look at the children of “channel”! Oh, and then we might need to look at the children of the children of “channel” (if there are any)! Indeed, this is where our recursive tree traversal code will come in handy. The following function searches for a specific element (in our case, we’ll want to look for “yweather:forecast”). As long as it is still searching, it will traverse each node’s child nodes. It’s a bit confusing as we have to deal with both Node objects and Element objects. The code checks each Node to see if it is of the type Element and if so, converts it by casting the Node reference.

    public Element findElement(Element currElement, String elementName)
    {
      // So far we've found nothing, i.e. null
      Element found = null;
      // If the current element is what we want, hooray we are done!
      if (currElement.getTagName().equals(elementName)) {
        found = currElement;
      // Otherwise, if there are children, let's check those
      } else if (currElement.hasChildNodes()) {
        // Get the list of children
        NodeList children = currElement.getChildNodes();
        // As long as we haven't found it (i.e. found is still null)
        for (int i = 0; i < children.getLength() && found == null; i++) {
          // Search each Element type node
          Node n = children.item(i);
          if (n.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) n;
            found = findElement(e, elementName);
          }
        }
      }
      return found;
    }
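
    To see the search in action without touching the network, here is a minimal sketch that parses a trimmed-down copy of the feed from a String and reads the forecast attributes with getAttribute. The class name ForecastFinder and the helper parseRoot are made up for this example, and findElement is a compact static version of the function above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ForecastFinder {
  // Depth-first search for the first element with the given tag name (null if none)
  static Element findElement(Element currElement, String elementName) {
    if (currElement.getTagName().equals(elementName)) return currElement;
    NodeList children = currElement.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n.getNodeType() == Node.ELEMENT_NODE) {
        Element found = findElement((Element) n, elementName);
        if (found != null) return found;
      }
    }
    return null;
  }

  // Hypothetical helper: parse an XML String and return its root element
  static Element parseRoot(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement();
    } catch (Exception e) {
      throw new RuntimeException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    // A shortened copy of the feed shown earlier
    String xml =
        "<rss version=\"2.0\" xmlns:yweather=\"http://xml.weather.yahoo.com/ns/rss/1.0\">"
      + "<channel><item>"
      + "<title>Conditions for New York, NY at 3:51 pm EST</title>"
      + "<yweather:forecast day=\"Mon\" date=\"20 Feb 2006\" low=\"25\" high=\"37\" text=\"Clear\" code=\"31\"/>"
      + "</item></channel></rss>";

    Element forecast = findElement(parseRoot(xml), "yweather:forecast");
    if (forecast != null) {
      System.out.println("High: " + forecast.getAttribute("high")); // prints High: 37
      System.out.println("Low: " + forecast.getAttribute("low"));   // prints Low: 25
    }
  }
}
```

    Note that the attribute values come back as Strings, so they would still need Integer.parseInt if we wanted to do arithmetic with them.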

    We might also just want to traverse the XML document and display all the info, which we can do in a similar fashion:

    // This method recursively traverses the XML Tree (starting from currNode)
    public void traverseXML (Node currNode)
    {
      // If it's an Element, spit out the Name
      if (currNode.getNodeType() == Node.ELEMENT_NODE) {
        System.out.print(currNode.getNodeName() + ": ");
      // If it's a "Text" node, take the value
      // These will come one after another and therefore appear in the right order
      } else if (currNode.getNodeType() == Node.TEXT_NODE) {
        System.out.println(currNode.getNodeValue().trim());
      }
     
      // Display any attributes
      if (currNode.hasAttributes()) {
        NamedNodeMap attributes = currNode.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
          Node attr = attributes.item(i);
          System.out.println("  " + attr.getNodeName() + ": " + attr.getNodeValue());
        }
      }
     
      // Check any children
      if (currNode.hasChildNodes()) {
        // Get the list of children
        NodeList children = currNode.getChildNodes();
        // Go through all the children
        for (int i = 0; i < children.getLength(); i++) {
          // Search each Node
          Node n = children.item(i);
          traverseXML(n);
        }
      }
    }

    Of course, the inevitable question now comes up. Gosh, that seems awfully complicated. Parsing the HTML was, well, simpler! Indeed, getting up to speed with using XML can be a steep climb. Nevertheless, the rewards are greater. And since I've packaged the above code into its own class, A2ZXMLHelper.java, you should feel free to simply use its functionality. This base class should provide you with a framework for getting started parsing XML documents. See XMLWeatherGrab.java for the example pulling weather information from Yahoo's RSS feed.

    Another thing you might notice about XML is how "object-oriented" it seems. If you take a look at an RSS feed, for example, you might notice that the tree structure contains a list of "item" elements (each item has a date, title, description, link, etc.). Below is a simplification:

    <item>
    	<title>Article Title</title>
    	<link>http://link</link>
    	<pubDate>Mon, 25 Feb 2008 02:05:12 -0600</pubDate>
    	<description>
    		Article Description blah blah blah
    	</description>
    </item>

    We could easily then design a Java class to store the information with a similar structure:

    public class Post {
     
      private String title;
      private String content;
      private String link;
      private String date;

    Adding functions to set the values of each variable inside the class:

      void setTitle(String t) {
        title = t;
      }
     
    Full code: Post.java (http://www.shiffman.net/itp/classes/a2z/week06/Post.java)
     
    Now that we have this Post class, we can read the XML document and create Post objects for each "item" element. We first use the A2ZXMLHelper.java class (http://www.shiffman.net/itp/classes/a2z/week06/A2ZXMLHelper), which includes a function to fill an ArrayList with all XML Elements with a specific name.

    // Use the helper class to create an XML document
    A2ZXMLHelper xml = new A2ZXMLHelper("http://feeds.boingboing.net/boingboing/iBag");
     
    // Make an ArrayList to put all "item" elements from the XML doc
    ArrayList items = new ArrayList();
    xml.fillArrayList(xml.getRoot(), "item", items);

    Once we have all the XML Elements, we make a new ArrayList to fill with Post objects. . .

    // Now, for every "item" in the XML doc, we will make a "Post" object
    // and store in an ArrayList
    ArrayList posts = new ArrayList();

    . . .and walk through each XML Element, filling the details of each Post object.

    // Loop through all items
    for (int i = 0; i < items.size(); i++) {
      // Make an empty Post object
      Post p = new Post();
      // Look at each "item" element
      Element e = (Element) items.get(i);
      // Get each item's children
      NodeList children = e.getChildNodes();
     
      for (int j = 0; j < children.getLength(); j++) {
        Node n = children.item(j);
        // Figure out which Node this is and set the node's text content in the object
        String tag = n.getNodeName();
        if (tag.equals("title")) {
          p.setTitle(n.getTextContent());
        } 
        else if (tag.equals("pubDate")) {
          p.setDate(n.getTextContent());
        } 
        else if (tag.equals("link")) {
          p.setLink(n.getTextContent());
        } 
        else if (tag.equals("description")) {
          p.setContent(n.getTextContent());
        }
      }
      posts.add(p);
    }

    Full code: BlogReader.java
    Full code: Post.java
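
    If you'd rather not depend on the helper class, the DOM itself offers getElementsByTagName, which collects every element with a given name. Here is a minimal sketch along those lines. It is not the code from BlogReader.java: the class FeedSketch, the method readPosts, and the two-field Post stand-in are assumptions made for this example, and the feed is a hard-coded String instead of a live URL.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class FeedSketch {
  // Minimal stand-in for Post.java with only two of its fields
  static class Post {
    String title = "";
    String link = "";
  }

  // Build one Post per "item" element found anywhere in the document
  static List<Post> readPosts(String xml) {
    List<Post> posts = new ArrayList<>();
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList items = doc.getElementsByTagName("item");
      for (int i = 0; i < items.getLength(); i++) {
        Post p = new Post();
        NodeList children = items.item(i).getChildNodes();
        for (int j = 0; j < children.getLength(); j++) {
          Node n = children.item(j);
          String tag = n.getNodeName();
          if (tag.equals("title")) {
            p.title = n.getTextContent().trim();
          } else if (tag.equals("link")) {
            p.link = n.getTextContent().trim();
          }
        }
        posts.add(p);
      }
    } catch (Exception e) {
      throw new RuntimeException("Could not parse feed", e);
    }
    return posts;
  }

  public static void main(String[] args) {
    String xml = "<rss><channel>"
      + "<item><title>Article Title</title><link>http://link</link></item>"
      + "<item><title>Another Title</title><link>http://link2</link></item>"
      + "</channel></rss>";
    for (Post p : readPosts(xml)) {
      System.out.println(p.title + " -> " + p.link);
    }
  }
}
```

    Using List<Post> instead of a raw ArrayList also removes the need for casts when reading the posts back out.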

    APIs

    Sure, we can parse HTML. Sure, we can wade through nicely formatted XML. But wouldn't it be nicer if we didn't have to do either and could simply access the data of a web site via some sort of API? In fact, with many sites we can. For starters: The Programmable Web: APIs. Woohoo!

    Let's look at three examples: Yahoo search, del.icio.us, and Flickr. To use any of these APIs in Eclipse, you first have to add the appropriate library files (usually in the form of a JAR) to your Eclipse "Build Path". This is done with the following steps:

    1) Import the JAR file into your project. There are many ways to do this, one way is to use: FILE –> IMPORT –> GENERAL –> FILE SYSTEM. (Technically, you don't need to import the file, you can point to it in another directory. However, for the purpose of keeping everything together in one project, it's sometimes convenient to copy the JAR into your project.) The following screenshot shows importing the Processing JAR "core.jar."

    2) Click “finish.” You should now see “core.jar” listed under your project in the package explorer.

    3) Right-click on the file and select BUILD PATH –> ADD TO BUILD PATH. The “core.jar” file should now appear with a jar icon.

    For the Yahoo API, you'll need the JAR "yahoo_search-2.0.1.jar" as well as an API key (so Yahoo can track your use of the API).

  • Yahoo Developer Site
  • Get A Developer Key
  • Download the API (which includes "yahoo_search-2.0.1.jar")
  • Try this example.

    The code for dealing with the API is pretty simple. First, import the library:

    import com.yahoo.search.*;

    Second, create a SearchClient object and pass it your license key.

    SearchClient client = new SearchClient("YOUR KEY HERE!");

    Afterwards, you can ask Yahoo to perform a search and retrieve the resulting metadata (total results, time the search took, URLs, titles, descriptions, etc.). The example below retrieves the total number of results for a Yahoo search for "itp." Full documentation is available in the API download.

      WebSearchRequest request = new WebSearchRequest("itp");
      WebSearchResults results = client.webSearch(request);
      int count = results.getTotalResultsAvailable().intValue();

    There's also a nice Java del.icio.us API available on SourceForge. It works in a similar way to the Yahoo API and is available as a JAR download. The JavaDocs are available online as well.

    It's very simple to use. All you need to do is create a Delicious object instance by passing in your username and password. (I would suggest pulling these from the args array or properties file so that they don't appear in your code like below.) You can then retrieve Tags, Posts, etc. The example below grabs all of the tags associated with an account into a List and prints them.

    import del.icio.us.*;
    import del.icio.us.beans.*;
    import java.util.*;
     
    public class DeliciousTest {
        public static void main(String[] args) {
            System.out.println("Testing del.icio.us API");
            Delicious del = new Delicious("itpa2z","interact");
            List tags = del.getTags();
            Iterator i = tags.iterator();
            while (i.hasNext()) {
                Tag tag = (Tag) i.next();
                int count = tag.getCount();
                String word = tag.getTag();
                System.out.println(word + " " + count);
            }
        }
    }
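
    As suggested above, it's better to keep the username and password out of the source. Here is a minimal sketch of one way to do that with java.util.Properties. The class CredentialsSketch, the method readCredentials, and the key names "username" and "password" are assumptions made for this example; the sketch reads from a StringReader so it runs on its own, where a real program would pass in a FileReader for a properties file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

class CredentialsSketch {
  // Reads the (assumed) keys "username" and "password"; missing keys become ""
  static String[] readCredentials(Reader source) {
    Properties props = new Properties();
    try {
      props.load(source);
    } catch (IOException e) {
      throw new RuntimeException("Could not read credentials", e);
    }
    return new String[] {
      props.getProperty("username", ""),
      props.getProperty("password", "")
    };
  }

  public static void main(String[] args) {
    // Stand-in for the contents of a properties file kept outside the code
    Reader source = new StringReader("username=yourname\npassword=yourpassword\n");
    String[] creds = readCredentials(source);
    System.out.println("Logging in as " + creds[0]);
    // With the del.icio.us JAR on the build path, these would be passed along:
    // Delicious del = new Delicious(creds[0], creds[1]);
  }
}
```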

    Of course, you don't need this API to access del.icio.us since you can just as easily grab data via RSS.

    Finally, here's some quick information to get you started using the Flickr API (which again could also be accessed via RSS feeds).

    Flickr Services API: http://www.flickr.com/services/api/
    FlickrJ: http://sourceforge.net/projects/flickrj/
    FlickrJ docs: http://flickrj.sourceforge.net/api/index.html

    To get started using the Flickr API, you must first get an API key: http://www.flickr.com/services/api/keys/. To use the Flickr API, you simply create a Flickr object with the key.

    Flickr f = new Flickr("YOUR API KEY!");

    Then, you can call various functions to get information related to people, photos, comments, groups, photosets, geocoding info, etc. The API is quite large and we won't cover it in great detail here. However, here is a quick example to show you how you might grab an array of images tagged with "itp."

    // Interface with Flickr photos
    PhotosInterface photos = f.getPhotosInterface();
    // Create a search parameters object to control the search
    SearchParameters sp = new SearchParameters();
    // Simple example, just looking for a single tag
    sp.setTags(new String[] {"itp"});
    // We're looking for n images (n is an int you set beforehand), starting at "page" 0
    PhotoList list = photos.search(sp, n, 0);
    // Grab all the image paths and store in String array
    String[] smallURLs = new String[list.size()];
    for (int i = 0; i < list.size(); i++) {
      Photo p = (Photo) list.get(i);
      smallURLs[i] = p.getSmallUrl();
    }

    The full example takes these photos and loads them as PImage objects in a Processing sketch.

    FlickrHelper.java
    FlickrProcessing.java

    • George Profenza

      Flickr API problem,
      I’m trying to do a simple search and either using the FlickrHelper class, either using the sample code
      up above I get the following exception:

      “com.aetrion.flickr.FlickrException: 96: Invalid signature
      at com.aetrion.flickr.photos.PhotosInterface.search(PhotosInterface.java:1082)
      at flicrkfetchr.FlicrkFetchr.setup(FlicrkFetchr.java:28)
      at processing.core.PApplet.handleDraw(PApplet.java:1383)
      at processing.core.PApplet.run(PApplet.java:1311)
      at java.lang.Thread.run(Thread.java:613)”

      I’ve logged on to flickr and got a new API key for this test, but either key(old or new), makes
      no difference. What am I missing ? What should I check for ? I’m using flickrapi-1.1.jar.

      Thanks,
      George

    • http://www.shiffman.net Daniel

      The code above was made to work with flickrapi-1.0.jar unfortunately so I’m guessing that might be the problem? I’ll have to update this page at some point, but in the mean time try reverting.