Examples:
For more on Processing in Eclipse: http://www.mostpixelsever.com/tutorial/eclipse/
Exercises (optional):
So far we’ve looked at several techniques for analyzing and manipulating text, dealing mostly with locally stored text files. Last week, we tasted the delicious nectar of mining the web by developing an example that grabbed the source from a URL. This week, we’ll look further at the possibilities of grabbing textual content from the network, exploring parsing HTML / XML, as well as looking at three available APIs: Yahoo, del.icio.us, and Flickr.
Parsing HTML
Grabbing data from an HTML page can be an uncomfortable experience. All that HTML whoozeemawhatsit is related to the visual appearance of a page, making the data appear rather disorganized. Nevertheless, there’s a lot of useful junk to be found out there in the world of HTML and with a little perseverance, it’s all there for free. There are different ways we might approach the problem of pulling information from a webpage. We can return to some of the basic String class functionality of yesteryear or use regular expressions.
Let’s say we want to pull out the number of apples from the text below.
String stuff = "Number of apples: 62. Boy, do I like apples or what!";
Our algorithm would be as follows:
1) Find the end of the substring “apples: ”. Call it start.
2) Find the period after “apples: ”. Call it end.
3) Make a substring of the stuff between start and end.
4) Convert the string to a number (if we want to use it as such).
In code, this would look like this:
int start = stuff.indexOf("apples: ") + 8;    // STEP 1
int end = stuff.indexOf(".", start);          // STEP 2
String apples = stuff.substring(start, end);  // STEP 3
int apple_no = Integer.parseInt(apples);      // STEP 4
The above code will do the trick, but we should be a bit more careful to make sure we don’t run into any errors if we don’t find the substrings. We can add some error checking and generalize the code into a function:
// A function that returns a substring between two substrings
String giveMeTextBetween(String s, String startTag, String endTag) {
  // Find the index of the beginning tag
  int startIndex = s.indexOf(startTag);
  // If we don't find anything, return an empty String
  if (startIndex == -1) return "";
  // Move to the end of the beginning tag
  startIndex += startTag.length();
  // Find the index of the end tag
  int endIndex = s.indexOf(endTag, startIndex);
  // If we don't find the end tag, return an empty String
  if (endIndex == -1) return "";
  // Return the text in between
  return s.substring(startIndex, endIndex);
}
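To see the function at work on the apples example, here is a minimal, self-contained sketch. The class name TextBetweenDemo and the extra "oranges" lookup are assumptions added for illustration; the function body follows the same steps as above.

```java
// A minimal sketch; the class name TextBetweenDemo is an assumption.
class TextBetweenDemo {

    // Same logic as giveMeTextBetween: return the text between two markers,
    // or an empty String if either marker is missing.
    static String giveMeTextBetween(String s, String startTag, String endTag) {
        int startIndex = s.indexOf(startTag);
        if (startIndex == -1) return "";
        startIndex += startTag.length();
        int endIndex = s.indexOf(endTag, startIndex);
        if (endIndex == -1) return "";
        return s.substring(startIndex, endIndex);
    }

    public static void main(String[] args) {
        String stuff = "Number of apples: 62. Boy, do I like apples or what!";
        String apples = giveMeTextBetween(stuff, "apples: ", ".");
        // Guard the conversion: an empty result means a marker was not found
        if (!apples.isEmpty()) {
            int appleCount = Integer.parseInt(apples);
            System.out.println("Apples: " + appleCount);
        }
        // A missing start marker yields an empty String rather than an exception
        System.out.println("[" + giveMeTextBetween(stuff, "oranges: ", ".") + "]");
    }
}
```

Checking for the empty String before calling Integer.parseInt matters, since parsing an empty String throws a NumberFormatException.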
We could also rewrite the function using regular expressions:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// A function to pull out the text between two regular expression Strings
public static String regexTextBetween(String s, String startRegex, String endRegex) {
  // Create one large regex from start and end with anything in between
  // Note how we use parentheses to capture the in-between stuff
  String bigRegex = startRegex + "(.*?)" + endRegex;
  // Compile the pattern and create the matcher
  Pattern p = Pattern.compile(bigRegex);
  Matcher m = p.matcher(s);
  // If we find what we are looking for, return the captured group
  if (m.find()) {
    return m.group(1);
  } else {
    return "oops!";
  }
}
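One thing to watch with the regex version: the start and end arguments are treated as regular expressions, so markers containing characters like parentheses or square brackets won't match literally. Here is a hedged variation (not part of the original examples) that uses Pattern.quote to treat both markers as plain text. The names LiteralBetweenDemo and literalTextBetween are made up for this sketch, and it returns null instead of "oops!" when nothing is found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of a literal-marker variant; all names here are assumptions.
class LiteralBetweenDemo {

    // Returns the text between two literal markers, or null if not found
    static String literalTextBetween(String s, String startText, String endText) {
        // Pattern.quote escapes any regex metacharacters in the markers
        String bigRegex = Pattern.quote(startText) + "(.*?)" + Pattern.quote(endText);
        // DOTALL lets the captured text span line breaks
        Matcher m = Pattern.compile(bigRegex, Pattern.DOTALL).matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "Total (USD): 42 [approx]";
        // Unquoted, "(USD)" would be read as a group and "[approx]" as a character class
        System.out.println(literalTextBetween(s, "Total (USD): ", " [approx]"));
    }
}
```

Returning null makes a failed search easy to tell apart from real content, at the cost of a null check at the call site.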
Let’s say we wanted to grab the high and low temperatures off of this Yahoo weather page. If we visit that page and view source, we can see that the temperatures are embedded in the following HTML:
<font size="2" face="Arial">High:</font>&nbsp;<b><font size="3" face="Arial">37&deg;</font></b>
<font size="2" face="Arial">Low:</font>&nbsp;<b><font size="3" face="Arial">25&deg;</font></b>
To grab the data we want, we simply call the above function with regular expressions matching the patterns that precede and follow “37” (as well as “25”).
A2ZUrlReader urlreader = new A2ZUrlReader("http://weather.yahoo.com/forecast/USNY0996.html");
String html = urlreader.getContent();
// Grab high temperature
String high = regexTextBetween(html, "High:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">", "&deg;");
System.out.println("The high for today is: " + high + " degrees.");
// Grab low temperature
String low = regexTextBetween(html, "Low:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">", "&deg;");
System.out.println("The low for today is: " + low + " degrees.");
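Since a live page can change or disappear, it can help to try the same calls against a saved copy of the HTML first. The sketch below does that with the fragment shown above stored in a String. The class name WeatherSnippetDemo and the helper names high and low are assumptions; the regex arguments are the same ones used in the network example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An offline sketch; WeatherSnippetDemo, high, and low are assumed names.
class WeatherSnippetDemo {

    // The HTML fragment from the weather page, stored locally (no network needed)
    static final String HTML =
        "<font size=\"2\" face=\"Arial\">High:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">37&deg;</font></b> "
      + "<font size=\"2\" face=\"Arial\">Low:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">25&deg;</font></b>";

    // Same idea as regexTextBetween: capture whatever sits between two patterns
    static String regexTextBetween(String s, String startRegex, String endRegex) {
        Matcher m = Pattern.compile(startRegex + "(.*?)" + endRegex).matcher(s);
        return m.find() ? m.group(1) : "oops!";
    }

    static String high(String html) {
        return regexTextBetween(html, "High:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">", "&deg;");
    }

    static String low(String html) {
        return regexTextBetween(html, "Low:</font>&nbsp;<b><font size=\"3\" face=\"Arial\">", "&deg;");
    }

    public static void main(String[] args) {
        System.out.println("The high for today is: " + high(HTML) + " degrees.");
        System.out.println("The low for today is: " + low(HTML) + " degrees.");
    }
}
```

Note how tightly the patterns are tied to the page's markup: if the site changes a font size or attribute order, the match fails and we get "oops!" back.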
XML
So, yes, HTML is an ugly, scary place with inconsistently formatted pages that are difficult to reverse engineer and parse effectively. Fortunately for you, there is the world of XML (Extensible Markup Language). XML is designed to facilitate the sharing of data across different systems and its format won’t cause your hair to go gray as fast. Let’s examine how we might grab exactly the same weather information from Yahoo’s RSS feed.
XML organizes information in a tree structure. The code we’ll write to search and traverse an XML document is somewhat similar to the work we did in building a binary search tree and will involve recursion. Let’s look at the XML for Yahoo weather’s RSS feed (this is only part of the source in order to simplify the discussion.)
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <item>
      <title>Conditions for New York, NY at 3:51 pm EST</title>
      <geo:lat>40.67</geo:lat>
      <geo:long>-73.94</geo:long>
      <link> http://us.rd.yahoo.com/dailynews/rss/weather/New_York__NY/*http://xml.weather.yahoo.com/forecast/USNY0996_f.html </link>
      <pubDate>Mon, 20 Feb 2006 3:51 pm EST</pubDate>
      <yweather:condition text="Fair" code="34" temp="35" date="Mon, 20 Feb 2006 3:51 pm EST"/>
      <yweather:forecast day="Mon" date="20 Feb 2006" low="25" high="37" text="Clear" code="31"/>
      <guid isPermaLink="false">USNY0996_2006_02_20_15_51_EST</guid>
    </item>
  </channel>
</rss>
With the exception of the first line (which simply indicates that the document is XML formatted), this XML document contains a nested structure consisting of elements, each with a start tag, e.g. <channel>, and an end tag, e.g. </channel>. Some of these elements have content between the tags, e.g.:
<title>Conditions for New York, NY at 3:51 pm EST</title>
and some have attributes:
<yweather:forecast day="Mon" date="20 Feb 2006" low="25" high="37" text="Clear" code="31"/>
It should be fairly obvious how searching for information, such as the title of the page or the high temperature, will be significantly less painful than with the tragically arbitrary process of parsing HTML. Java provides us with a nice API for processing XML. The three packages we’ll need are org.w3c.dom, javax.xml.parsers, and org.xml.sax.
Ok, we can begin to write some code. Let’s just build a Document object from an XML page. (Note that this code is incomplete; it just demonstrates the first few steps of an XML reading program.)
// Import the necessary libraries
import java.io.*;
import java.net.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
public class XMLWeatherGrab {
  public static void main (String argv []){
    try {
      // Create a URL object and open an InputStream
      URL url = new URL("http://xml.weather.yahoo.com/forecastrss?p=10003");
      InputStream is = url.openStream();
      // Build a document from that InputStream
      DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
      Document doc = docBuilder.parse(is);
Next, we might choose to grab the root element from the XML tree:
Element root = doc.getDocumentElement();
System.out.println(root.toString());
(Depending on the parser implementation, the above code may display the entire XML document. Remember, XML is a nested structure and this root element contains all the subelements of the XML document! So instead, we might begin to poke around the children of the root.)
NodeList children = root.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
  Node n = children.item(i);
  System.out.println(n.getNodeName());
}
If we run the above code, which grabs a list of children and prints out their corresponding tag names, the result isn’t terribly interesting. All we get is “channel.” This is because the root element “rss” has only one child, “channel.” So, what’s next? We’ll need to look at the children of “channel”! Oh, and then we might need to look at the children of the children of “channel” (if there are any)! Indeed, this is where our recursive tree traversal code will come in handy. The following function searches for a specific element (in our case, we’ll want to look for “yweather:forecast”). As long as it is still searching, it will traverse each node’s child nodes. It’s a bit confusing as we have to deal with both Node objects and Element objects. The code checks each Node to see if it is of the type Element and if so, converts it by casting the Node reference.
public Element findElement(Element currElement, String elementName) {
  // So far we've found nothing, i.e. null
  Element found = null;
  // If the current element is what we want, hooray we are done!
  if (currElement.getTagName().equals(elementName)) {
    found = currElement;
  // Otherwise, if there are children, let's check those
  } else if (currElement.hasChildNodes()) {
    // Get the list of children
    NodeList children = currElement.getChildNodes();
    // As long as we haven't found it (i.e. found is still null)
    for (int i = 0; i < children.getLength() && found == null; i++) {
      // Search each Element type node
      Node n = children.item(i);
      if (n.getNodeType() == Node.ELEMENT_NODE) {
        Element e = (Element) n;
        found = findElement(e, elementName);
      }
    }
  }
  return found;
}
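To tie the pieces together, here is a self-contained sketch that parses a trimmed copy of the weather XML from a String (so it runs without a network connection), finds “yweather:forecast” with the same recursive search, and reads its high and low attributes. The class name FindElementDemo and the helpers parseRoot and forecastAttribute are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// A sketch only; FindElementDemo, parseRoot, and forecastAttribute are assumed names.
class FindElementDemo {

    // A shortened copy of the RSS sample, kept in memory instead of fetched
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<rss version=\"2.0\" xmlns:yweather=\"http://xml.weather.yahoo.com/ns/rss/1.0\">"
      + "<channel><item>"
      + "<title>Conditions for New York, NY at 3:51 pm EST</title>"
      + "<yweather:forecast day=\"Mon\" date=\"20 Feb 2006\" low=\"25\" high=\"37\" text=\"Clear\" code=\"31\"/>"
      + "</item></channel></rss>";

    // Depth-first search for the first element with the given tag name
    static Element findElement(Element curr, String name) {
        if (curr.getTagName().equals(name)) return curr;
        NodeList children = curr.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element found = findElement((Element) n, name);
                if (found != null) return found;
            }
        }
        return null;
    }

    // Build a Document from a String and hand back its root element
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Read one numeric attribute (such as "high") from the forecast element
    static int forecastAttribute(String xml, String attribute) {
        Element forecast = findElement(parseRoot(xml), "yweather:forecast");
        if (forecast == null) throw new IllegalStateException("no yweather:forecast element");
        return Integer.parseInt(forecast.getAttribute(attribute));
    }

    public static void main(String[] args) {
        System.out.println("High: " + forecastAttribute(XML, "high"));
        System.out.println("Low: " + forecastAttribute(XML, "low"));
    }
}
```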
We might also just want to traverse the XML document and display all the info, which we can do in a similar fashion:
// This method recursively traverses the XML Tree (starting from currNode)
public void traverseXML (Node currNode) {
  // If it's an Element, spit out the Name
  if (currNode.getNodeType() == Node.ELEMENT_NODE) {
    System.out.print(currNode.getNodeName() + ": ");
  // If it's a "Text" node, take the value
  // These will come one after another and therefore appear in the right order
  } else if (currNode.getNodeType() == Node.TEXT_NODE) {
    System.out.println(currNode.getNodeValue().trim());
  }
  // Display any attributes
  if (currNode.hasAttributes()) {
    NamedNodeMap attributes = currNode.getAttributes();
    for (int i = 0; i < attributes.getLength(); i++) {
      Node attr = attributes.item(i);
      System.out.println(" " + attr.getNodeName() + ": " + attr.getNodeValue());
    }
  }
  // Check any children
  if (currNode.hasChildNodes()) {
    // Get the list of children
    NodeList children = currNode.getChildNodes();
    // Go through all the children
    for (int i = 0; i < children.getLength(); i++) {
      // Search each Node
      Node n = children.item(i);
      traverseXML(n);
    }
  }
}
Of course, the inevitable question now comes up. Gosh, that seems awfully complicated. Parsing the HTML was, well, simpler! Indeed, getting up to speed with using XML can be a steep climb. Nevertheless, the rewards are greater. And since I've packaged the above code into its own class, A2ZXMLHelper.java, you should feel free to simply use its functionality. This base class should provide you with a framework for getting started parsing XML documents. See XMLWeatherGrab.java for the example pulling weather information from Yahoo's RSS.
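If the recursion feels like a lot of machinery for a simple lookup, the standard DOM API also has a built-in search, getElementsByTagName, which returns every descendant element with a given tag name. The sketch below is an alternative to the hand-written traversal, not a description of what A2ZXMLHelper does internally; the class name TagNameLookupDemo and the helper firstAttribute are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// A sketch of the built-in lookup; TagNameLookupDemo and firstAttribute are assumed names.
class TagNameLookupDemo {

    // A tiny stand-in for the weather feed
    static final String XML =
        "<rss xmlns:yweather=\"http://xml.weather.yahoo.com/ns/rss/1.0\"><channel><item>"
      + "<yweather:condition text=\"Fair\" code=\"34\" temp=\"35\"/>"
      + "</item></channel></rss>";

    // Returns the named attribute of the first matching element, or null if no element matches
    static String firstAttribute(String xml, String tagName, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Searches the whole document, in document order
            NodeList matches = doc.getElementsByTagName(tagName);
            if (matches.getLength() == 0) return null;
            return ((Element) matches.item(0)).getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Current temp: " + firstAttribute(XML, "yweather:condition", "temp"));
    }
}
```

Writing the traversal by hand is still worth doing once, since it shows what the tree actually looks like.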
Another thing you might notice about XML is how "object-oriented" it seems. If you take a look at an RSS feed, for example, you might notice that the tree structure contains a list of "item" elements (each item has a date, title, description, link, etc.). Below is a simplification:
<item>
  <title>Article Title</title>
  <link>http://link</link>
  <pubDate>Mon, 25 Feb 2008 02:05:12 -0600</pubDate>
  <description> Article Description blah blah blah </description>
</item>
We could easily then design a Java class to store the information with a similar structure:
public class Post {
  private String title;
  private String content;
  private String link;
  private String date;
Adding functions to set the values of each variable inside the class:
void setTitle(String t) {
  title = t;
}
Full code: Post.java (http://www.shiffman.net/itp/classes/a2z/week06/Post.java)
Now that we have this Post class, we can read the XML document and create Post objects for each "item" element. We first use the A2ZXMLHelper.java class (http://www.shiffman.net/itp/classes/a2z/week06/A2ZXMLHelper), which includes a function to fill an ArrayList with all XML Elements with a specific name.
// Use the helper class to create an XML document
A2ZXMLHelper xml = new A2ZXMLHelper("http://feeds.boingboing.net/boingboing/iBag");
// Make an ArrayList to put all "item" elements from the XML doc
ArrayList items = new ArrayList();
xml.fillArrayList(xml.getRoot(), "item", items);
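The source of A2ZXMLHelper isn't reproduced in these notes, so what follows is only a guess at what a function like fillArrayList might do: walk the tree recursively and add every element with a matching tag name to a list. The names CollectElementsDemo, collectElements, and itemsIn are hypothetical stand-ins, not the helper's actual API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; this is NOT the actual A2ZXMLHelper code.
class CollectElementsDemo {

    // Add every element named `name` under (and including) `curr` to `out`
    static void collectElements(Element curr, String name, List<Element> out) {
        if (curr.getTagName().equals(name)) out.add(curr);
        NodeList children = curr.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            // Skip text and comment nodes; only elements can match or have element children
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                collectElements((Element) n, name, out);
            }
        }
    }

    // Parse a String of XML and gather all of its "item" elements
    static List<Element> itemsIn(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<Element> items = new ArrayList<>();
            collectElements(root, "item", items);
            return items;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><item><title>A</title></item><item><title>B</title></item></channel></rss>";
        System.out.println("Found " + itemsIn(xml).size() + " items");
    }
}
```

Unlike the single-element search, this version keeps going after a match, so it picks up every "item" in the feed.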
Once we have all the XML Elements, we make a new ArrayList to fill with Post objects. . .
// Now, for every "item" in the XML doc, we will make a "Post" object
// and store in an ArrayList
ArrayList posts = new ArrayList();
. . .and walk through each XML Element, filling the details of each Post object.
// Loop through all items
for (int i = 0; i < items.size(); i++) {
  // Make an empty Post object
  Post p = new Post();
  // Look at each "item" element
  Element e = (Element) items.get(i);
  // Get each item's children
  NodeList children = e.getChildNodes();
  for (int j = 0; j < children.getLength(); j++) {
    Node n = children.item(j);
    // Figure out which Node this is and set the node's text content in the object
    String tag = n.getNodeName();
    if (tag.equals("title")) {
      p.setTitle(n.getTextContent());
    } else if (tag.equals("pubDate")) {
      p.setDate(n.getTextContent());
    } else if (tag.equals("link")) {
      p.setLink(n.getTextContent());
    } else if (tag.equals("description")) {
      p.setContent(n.getTextContent());
    }
  }
  posts.add(p);
}
Full code: BlogReader.java
Full code: Post.java
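For a version you can run without the helper class or a live feed, here is a compact sketch of the same item-to-object mapping. It parses the simplified "item" XML from a String and uses a typed List so no casts are needed when reading the posts back. The class name BlogReaderDemo, the method readPosts, and the stripped-down nested Post class (plain fields instead of setters) are assumptions for illustration, not the code in BlogReader.java.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// A sketch only; BlogReaderDemo, readPosts, and this simplified Post are assumed names.
class BlogReaderDemo {

    // Simplified stand-in for the Post class, with plain fields
    static final class Post {
        String title, link, date, content;
    }

    static final String SAMPLE =
        "<rss><channel><item>"
      + "<title>Article Title</title>"
      + "<link>http://link</link>"
      + "<pubDate>Mon, 25 Feb 2008 02:05:12 -0600</pubDate>"
      + "<description> Article Description blah blah blah </description>"
      + "</item></channel></rss>";

    // Turn every "item" element into a Post
    static List<Post> readPosts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            List<Post> posts = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Post p = new Post();
                NodeList children = items.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node n = children.item(j);
                    // Trim surrounding whitespace from the element's text
                    String text = n.getTextContent().trim();
                    switch (n.getNodeName()) {
                        case "title":       p.title = text;   break;
                        case "link":        p.link = text;    break;
                        case "pubDate":     p.date = text;    break;
                        case "description": p.content = text; break;
                        default:            break; // ignore anything else
                    }
                }
                posts.add(p);
            }
            return posts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (Post p : readPosts(SAMPLE)) {
            System.out.println(p.title + " (" + p.date + "): " + p.link);
        }
    }
}
```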
APIs
Sure, we can parse HTML. Sure, we can wade through nicely formatted XML. But wouldn't it be nicer if we didn't have to do either and could simply access the data of a web site via some sort of API? In fact, with many sites we can. For starters: The Programmable Web: APIs. Woohoo!
Let's look at three examples: Yahoo search, del.icio.us, and Flickr. To use any of these APIs in Eclipse, you first have to add the appropriate library files (usually in the form of a JAR) to your Eclipse "Build Path". This is done with the following steps:
1) Import the JAR file into your project. There are many ways to do this; one way is to use: FILE –> IMPORT –> GENERAL –> FILE SYSTEM. (Technically, you don't need to import the file, you can point to it in another directory. However, for the purpose of keeping everything together in one project, it's sometimes convenient to copy the JAR into your project.) The example here imports the Processing JAR "core.jar."
2) Click “Finish.” You should now see “core.jar” listed under your project in the package explorer.
3) Right-click on the file and select BUILD PATH –> ADD TO BUILD PATH. The “core.jar” file should now appear with a jar icon.
For the Yahoo API, you'll need the JAR "yahoo_search-2.0.1.jar" as well as an API key (so Yahoo can track your use of the API).
The code for dealing with the API is pretty simple. Import the library:
import com.yahoo.search.*;
Second, create a SearchClient object and pass it your license key.
SearchClient client = new SearchClient("YOUR KEY HERE!");
Afterwards, you can ask Yahoo to perform a search and retrieve the resulting metadata (total results, time the search took, URLs, titles, descriptions, etc.). The example below displays the total number of results for a Yahoo search for "itp." Full documentation is available in the API download.
WebSearchRequest request = new WebSearchRequest("itp");
WebSearchResults results = client.webSearch(request);
int count = results.getTotalResultsAvailable().intValue();
There's also a nice Java del.icio.us API available on sourceforge. It works in a similar way as the Yahoo API, and is available as a JAR download. The JavaDocs are available online as well.
It's very simple to use. All you need to do is create a Delicious object instance by passing in your username and password. (I would suggest pulling these from the args array or properties file so that they don't appear in your code like below.) You can then retrieve Tags, Posts, etc. The example below grabs all of the tags associated with an account into a List and prints them.
import del.icio.us.*;
import del.icio.us.beans.*;
import java.util.*;
public class DeliciousTest {
  public static void main(String[] args) {
    System.out.println("Testing del.icio.us API");
    Delicious del = new Delicious("itpa2z","interact");
    List tags = del.getTags();
    Iterator i = tags.iterator();
    while (i.hasNext()) {
      Tag tag = (Tag) i.next();
      int count = tag.getCount();
      String word = tag.getTag();
      System.out.println(word + " " + count);
    }
  }
}
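As suggested above, it's better to keep the username and password out of the source. One way to do that, sketched here with the standard java.util.Properties class, is to read them from a small "key=value" file whose path is passed in the args array. The class name CredentialsDemo, the property keys "username" and "password", and the fallback sample values are all assumptions for illustration.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// A sketch only; CredentialsDemo and the property keys are assumed names.
class CredentialsDemo {

    // Parse "key=value" lines from any Reader (a file, or a String for testing)
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties", e);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props;
        if (args.length > 0) {
            // First argument: path to a properties file kept outside the source tree
            try (Reader in = new FileReader(args[0])) {
                props = load(in);
            }
        } else {
            // No file given: fall back to made-up in-memory values for demonstration
            props = load(new StringReader("username=someuser\npassword=secret\n"));
        }
        String username = props.getProperty("username");
        boolean hasPassword = props.getProperty("password") != null;
        // The two values would then be handed to the Delicious constructor
        System.out.println("Loaded credentials for " + username + " (password set: " + hasPassword + ")");
    }
}
```

The properties file itself should stay out of anything you share or check in, otherwise the password has only moved, not been hidden.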
Of course, you don't need this API to access del.icio.us since you can just as easily grab data via RSS.
Finally, here's some quick information to get you started using the Flickr API (which again could also be accessed via RSS feeds).
Flickr Services API: http://www.flickr.com/services/api/
FlickrJ: http://sourceforge.net/projects/flickrj/
FlickrJ docs: http://flickrj.sourceforge.net/api/index.html
To get started using the Flickr API, you must first get an API key: http://www.flickr.com/services/api/keys/. To use the Flickr API, you simply create a Flickr object with the key.
Flickr f = new Flickr("YOUR API KEY!");
Then, you can call various functions to get information related to people, photos, comments, groups, photosets, geocoding info, etc. The API is quite large and we won't cover it in great detail here. However, here is a quick example to show you how you might grab an array of images tagged with "itp."
// Interface with Flickr photos
PhotosInterface photos = f.getPhotosInterface();
// Create a search parameters object to control the search
SearchParameters sp = new SearchParameters();
// Simple example, just looking for a single tag
sp.setTags(new String[] {"itp"});
// We're looking for n images, starting at "page" 0
// (n is assumed to be an int defined elsewhere in the program)
PhotoList list = photos.search(sp, n, 0);
// Grab all the image paths and store in String array
String[] smallURLs = new String[list.size()];
for (int i = 0; i < list.size(); i++) {
  Photo p = (Photo) list.get(i);
  smallURLs[i] = p.getSmallUrl();
}
The full example takes these photos and loads them as PImage objects in a Processing sketch.