Crawling
Back to Programming from A to Z
Examples:
Related:
Exercises (optional):
Grabbing a URL
So far, we’ve only looked at how to retrieve textual information from a local file. Of course, there’s also that wacky world wide web thing and it’s not terribly difficult for us to expand the functionality of our examples to include scraping information from a URL. Here, we’ll need to introduce the java.net package, which will allow us to open an InputStream from a URL object.
String urlpath = "http://www.mycoolwebsite.com"; URL url = new URL(urlpath); InputStream stream = url.openStream();
Once we have the InputStream, we can use a BufferedReader to read the stream (i.e. the source from the URL) line by line, appending it all to one big StringBuffer.
StringBuffer stuff = new StringBuffer();
//Create a BufferedReader from the InputStream
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
// Read the stream line by line and append to StringBuffer
String line;
while (( line = reader.readLine()) != null) {
stuff.append(line + "\\n");
}
// Close the reader
reader.close();
Finally, we can simply convert the StringBuffer into a good old fashioned String and have our way with it.
String urlContent = stuff.toString();
Source for URL reading class: A2ZUrlReader.java
Making a list and keeping it twice
Sure, visiting one URL and grabbing data is super duper fun. But there is more joy out there and it comes in the form of crawling the web via a queue of URLs to visit. To accomplish this, we’ll dive into the linked list data structure. Sure, we could use an ArrayList here to keep track of multiple URLs in order, but a linked list will be more efficient because we only need some very basic functionality. We need to:
(Note how we do not need to traverse the list itself. We only are required to access the first element, as well as place elements on the end.)
It’s pretty simple to use the java LinkedList class.
LinkedList urlsToVisit = new LinkedList(); // A queue of URLs to visit
urlsToVisit.add("http://www.yahoo.com"); // Add a URL (as a String) to the list
urlsToVisit.add("http://www.google.com");
urlsToVisit.add("http://itp.nyu.edu");
String urlpath = (String) urlsToVisit.removeFirst(); // Get the first URL (as a String)
We could do something fancier, such as remove all the elements from the list one by one (as long as the list isn’t empty):
while (!urlsToVisit.isEmpty()) {
String urlpath = (String) urlsToVisit.removeFirst();
// Presumably do something here!
}
For our web crawler project, however, one list will not do. For example, the following code would be somewhat tragic:
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
urlsToVisit.add("http://www.sadlittleurl.com");
We shouldn’t be visiting the same URL over and over again. This is why we’ll need a second list, one that keeps track of the URLs previously visited. Ideally, a data structure that had constant time look-up — O(1) — would be great. All we want to figure out is — here’s a URL, are you in the list or not? Our friendly neighborhood hash table from last week will do just fine.
LinkedList urlsToVisit = new LinkedList(); // A queue of URLs to visit
HashMap urlsVisited = new HashMap(); // A hash table of visited URLs
String url = "http://www.abcdefg.com";
urlsVisited.put(url,url);
urlsToVisit.add("url");
For each URL, we place it in both lists. (Note that the HashMap requires a key and a value, whereas the LinkedList requires only a value.) This way, when it comes time to look at a new URL, we can check and see if it’s been visited first before we waste our time and look at it again.
String url = "http://www.iamtiredofmakingupurls.com";
if (!urlsToVisit.containsKey(url)) {
urlsVisited.put(url,url);
urlsToVisit.add("url");
}
We can now see how we are well on the way to writing a web crawler. We start with one URL, grab it, look for other URLs, check if we’ve already visited them, if not add them to the list, and start all over with the next URL!
Being Polite
Ok, so we’re on our way to writing a web crawler. Web crawlers, sadly, can be somewhat rude. They can start poking their noses in places they shouldn’t. They can hog a lot of resources. As humble citizens of the web, we should do our best to follow these guidelines as well as the the rules for robot exclusion. The examples included on this page, for the sake of simplicity, do not live up to these standards and may be construed as rather impolite. If you plan on using them extensively, you will likely want to consider making some significant improvements.
Here is the pseudo-code for our simple crawler:
So far, we’ve covered the necessary code for every step except #4.
Finding URLs
Remembering regular expressions, we can create a Pattern object to match any reference to a URL.
URL references on web pages generally look like this:
<a href = "http://www.someurl.com">
For simplicity’s sake, we’ll ignore any references to links specified by a relative path and assume that the URL has to start with “http”. We can begin to build a regular expression. . .
href # match href \s* # match optional infinite whitespace = # match equal sign \s* # match optional infinite whitespace " # match quote (http[^"\s]+) # capture the URL itself, http followed by one or more non whitespace/quote " # end with a quote
Sure, this could be improved a number of ways, but it will do as a start. In Java, it will look like this:
Pattern href;
// Match URLs
// Note using Pattern.COMMENTS flag which ignores white spaces and anything after '#' in a regex
href = Pattern.compile( "href # match href \\n" +
"\\\\s*=\\\\s*" # 0 or more spaces, =, 0 ore more spaces, quote \n" +
"(http[^\"\\\\s]*) # capture the URL itself, http followed by no spaces and no quotes \n" +
"\" # ending with a quote \\n",
Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
Now, as we grab some html source code, we can look for URLs and add them if appropriate:
// Grab the URL content
A2ZUrlReader urlr = new A2ZUrlReader(urlpath);
String stuff = urlr.getContent();
// Match the URL pattern to the content
Matcher m = href.matcher(stuff);
// While there are URLs
while (m.find()) {
// Grab the captured part of the regex (the URLPath itself)
String newurl = m.group(1);
// If it hasn't already been visited
if (!urlsVisited.containsKey(newurl)) {
// Add it to both the LinkedList and the HashMap
urlsToVisit.add(newurl);
urlsVisited.put(newurl,newurl);
}
}
Now, not all URLs are treated equally. If our goal, for example, is to analyze text on the web, we can certainly ignore references to media files, such as jpgs, movs, etc. We’ll use a second regular expression for this:
// We will ignore URLs ending with certain extensions String ignore = ".*(mov|jpg|gif|pdf)$";
This allows us to not only check if the URL has already been visited, but confirm that it is not one of the listed excluded types.
if (!newurl.matches(ignore) && !urlsVisited.containsKey(newurl)) {
urlsToVisit.add(newurl);
urlsVisited.put(newurl,newurl);
}
A Crawler Class
The following Crawler class encompasses all of the basic elements described above. Its data is of a queue of URLs, a hash table of visited URLs, a regex for matching URL references, and a regex for URLs to ignore. It has methods to add a new URL, crawl to the next URL, read source of a URL (called in the ‘crawl’ method), and check if the queue is empty.
/* Daniel Shiffman */
/* Programming from A to Z */
/* Spring 2006 */
/* http://www.shiffman.net */
/* daniel.shiffman@nyu.edu */
// Simple example of a web crawler
// URL queue: linked list
// Sites already visited: hash table
// Needs to be updated to comply with ROBOTS.TXT!
package a2z;
import java.util.*;
import java.util.regex.*;
public class Crawler {
private LinkedList urlsToVisit; // A queue of URLs to visit
private HashMap urlsVisited; // A table of already visited URLs
private Pattern href; // A Pattern to match an href tag
private String ignore; // To be used as a regex for ignoring media files (JPG,MOV, etc.)
public Crawler() {
urlsToVisit = new LinkedList();
urlsVisited = new HashMap();
// Match URLs
// Note using Pattern.COMMENTS flag which ignores white spaces and anything after '#' in a regex
href = Pattern.compile( "href # match href \n" +
"\\s*=\\s*" # 0 or more spaces, =, 0 ore more spaces, quote \n" +
"(http[^"\\s]*) # capture the URL itself, http followed by no spaces and no quotes \n" +
"" # ending with a quote \n",
Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
// We will ignore URLs ending with certain extensions
ignore = ".*(mov|jpg|gif|pdf)$";
}
public void addUrl(String urlpath) {
// Add it to both the LinkedList and the HashMap
urlsToVisit.add(urlpath);
urlsVisited.put(urlpath,urlpath);
}
public boolean queueEmpty() {
return urlsToVisit.isEmpty();
}
public void crawl() {
String urlpath = (String) urlsToVisit.removeFirst();
read(urlpath);
}
private void read(String urlpath) {
System.out.println(urlsVisited.size() + " " + urlsToVisit.size() + " " + urlpath);
try {
// Grab the URL content
A2ZUrlReader urlr = new A2ZUrlReader(urlpath);
String stuff = urlr.getContent();
// OK I COULD DO SOMETHING WITH ALL OF THE CONTENT HERE!!
// Match the URL pattern to the content
Matcher m = href.matcher(stuff);
// While there are URLs
while (m.find()) {
// Grab the captured part of the regex (the URLPath itself)
String newurl = m.group(1);
// If it hasn't already been visited (or if it matches the ignore pattern)
if (!newurl.matches(ignore) && !urlsVisited.containsKey(newurl)) {
addUrl(newurl);
}
}
} catch (Exception e) {
// System.out.println("Problem reading from " + urlpath + " " + e);
// e.printStackTrace();
}
}
}
It is important to note that this crawler does not loop on its own. The driver program must check and see if the queue has elements and call the “crawl” method in a loop, i.e.:
// Create a crawler object
Crawler crawler = new Crawler();
// Put the URL into the crawler object
crawler.addUrl("http://www.blahlbah.com");
while (!crawler.queueEmpty()) {
crawler.crawl();
}
In the full example, I’ve built in a little safety net so that the crawler doesn’t run rampant (although it might be fun to let it do so!)
// Since this crawler isn't particularly polite: http://www.robotstxt.org/wc/guidelines.html
// I'm limited it to viewing 100 url requests
int count = 0; int limit = 100;
while (!crawler.queueEmpty()) {
crawler.crawl();
count++;
if (count > limit) break;
}
Writing one’s own crawler is a useful and productive exercise, however, if all you need is basic crawling functionality, another option is to use an existing crawling library, such as WebSphinx. To use WebSphinx, simply download download websphinx.jar and import it into you Eclipse project (make sure to add it to the “build path” as well.) You can then create your own Crawler class that extends Crawler. By overriding the visit() method, you can have customize what you want the Crawler to do each time it visits a page.
// What to do when we visit the page
public void visit(Page page) {
System.out.println("Visiting: " + page.getTitle());
String content = page.getContent();
// Do something with the content String!
// Since we don't need to retain the pages
// This code helps with memory management
page.getOrigin().setPage(null);
page.discardContent();
}
Here is an example of a basic Crawler implemented via WebSphinx. For other possibilities, take a look at the WebSphinx API.
You can use the Norbert library to check robots.txt for permissions, it’s Open Source and easy to implement. See http://www.osjava.org/norbert/HowTo.html
Hi,
I went through your website, interesting stuff using websphinx. I would appreciate if u could help me. I am doing a project of mine at LSE, where in I am trying to using websphinx for pattern matching in the database. Can u help to understand how to implement the websphinx pattern matching capability for DB.
Thanks in advance
Rg,
Prashant