
9.6 The Crawler/Indexer Service

The application needs a way to dynamically follow the links from a given URL and the links from those pages, ad infinitum, in order to create the full domain of searchable pages. Just thinking about writing all of the web-related code to do that work gives me the screaming heebie-jeebies. We would have to write methods to post web requests, listen for responses, parse those responses looking for links, and so on.

In light of the "keep it simple" chapter, it seems we are immediately faced with a buy-it-or-build-it question. This functionality must exist already; the question is, where? It turns out we already have a library at our disposal that contains everything we need: HTTPUnit. Because HTTPUnit's purpose in life is to imitate a browser, it can be used to make HTTP requests, examine the HTML results, and follow the links contained therein.

Using HTTPUnit to do the work for us is a fairly nonstandard approach. HTTPUnit is considered a testing framework, not an application development framework. However, since it accomplishes exactly what we need to do with regard to navigating web sites, it would be a waste of effort and resources to attempt to recreate that functionality on our own.
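Before diving into the service itself, here is a minimal sketch of the HTTPUnit calls we will be leaning on: fetch a page through a WebConversation, then ask the response for the links it contains. The URL here is purely illustrative.

  import com.meterware.httpunit.WebConversation;
  import com.meterware.httpunit.WebLink;
  import com.meterware.httpunit.WebResponse;

  public class LinkWalkSketch {
    public static void main(String[] args) throws Exception {
      WebConversation conversation = new WebConversation( );
      // fetch a page exactly as a browser would (illustrative URL)
      WebResponse response = conversation.getResponse("http://www.example.com/");
      // enumerate every link found on that page
      WebLink[] links = response.getLinks( );
      for (int i = 0; i < links.length; i++) {
        System.out.println(links[i].getRequest( ).getURL( ));
      }
    }
  }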

Our main entry point to the crawler/indexer service is IndexLinks. This class establishes the entry point for the indexable domain and all of the configuration settings for controlling the overall result set. The constructor for the class should accept as much of the configuration information as possible:

  public IndexLinks(String indexPath, int maxLinks, 
                    String skippedLinksOutputFileName) throws IOException 
  {
    this.maxLinks = maxLinks;
    this.linksNotFollowedOutputFileName = skippedLinksOutputFileName;
    writer = new IndexWriter(indexPath, new StandardAnalyzer( ), true);
  }

The writer is an instance of org.apache.lucene.index.IndexWriter, which is initialized to point to the path where a new index should be created.
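As a point of reference, here is that initialization sketched on its own; the index path is illustrative, and the final true argument tells Lucene to create a fresh index at that location rather than append to an existing one.

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class IndexWriterSketch {
    public static void main(String[] args) throws IOException {
      // create (or overwrite) an index at this path; the path is illustrative
      IndexWriter writer = new IndexWriter("/tmp/site-index",
                                           new StandardAnalyzer( ), true);
      writer.close( );
    }
  }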

Our instance requires a series of collections to manage our links. Those collections are:

  Set linksAlreadyFollowed = new HashSet( );
  Set linksNotFollowed = new HashSet( );
  Set linkPrefixesToFollow = new HashSet( );
  Set linkPrefixesToAvoid = new HashSet( );

The first two are used to store the links as we discover and categorize them. The next two are configuration settings used to determine if we should follow the link based on its prefix. These settings allow us to eliminate subsites or certain external sites from the search set, thus giving us the ability to prevent the crawler from running all over the Internet, indexing everything.

The other object we need is a com.meterware.httpunit.WebConversation. HTTPUnit uses this class to model a browser-server session. It provides methods for making requests to web servers, retrieving responses, and manipulating the HTTP messages that result. We'll use it to retrieve our indexable pages.

  WebConversation conversation = new WebConversation( );

We must provide setter methods so the users of the indexer/crawler can add prefixes to these two collections:

  public void setFollowPrefixes(String[] prefixesToFollow) 
     throws MalformedURLException {
    for (int i = 0; i < prefixesToFollow.length; i++) {
      String s = prefixesToFollow[i];
      linkPrefixesToFollow.add(new URL(s));
    }
  }
  public void setAvoidPrefixes(String[] prefixesToAvoid) 
     throws MalformedURLException {
    for (int i = 0; i < prefixesToAvoid.length; i++) {
      String s = prefixesToAvoid[i];
      linkPrefixesToAvoid.add(new URL(s));
    }
  }

To allow users of the application maximum flexibility, we also provide a way to load lists of common prefixes to follow or avoid from system properties:

  public void initFollowPrefixesFromSystemProperties( ) throws MalformedURLException {
    String followPrefixes = System.getProperty("com.relevance.ss.FollowLinks");
    if (followPrefixes == null || followPrefixes.length( ) == 0) return;
    String[] prefixes = followPrefixes.split(" ");
    if (prefixes != null && prefixes.length != 0) {
      setFollowPrefixes(prefixes);
    }
  }

  public void initAvoidPrefixesFromSystemProperties( ) throws MalformedURLException {
    String avoidPrefixes = System.getProperty("com.relevance.ss.AvoidLinks");
    if (avoidPrefixes == null || avoidPrefixes.length( ) == 0) return;
    String[] prefixes = avoidPrefixes.split(" ");
    if (prefixes != null && prefixes.length != 0) {
      setAvoidPrefixes(prefixes);
    }
  }
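Each property holds a space-delimited list of URL prefixes, matching the split(" ") call above. The following sketch shows one way the configuration might be supplied; the property values, index path, and limits are all illustrative, and in practice the properties would more likely arrive as -D arguments on the JVM command line.

  public class PrefixConfigSketch {
    public static void main(String[] args) throws Exception {
      // illustrative values; normally passed as -Dcom.relevance.ss.FollowLinks=...
      // and -Dcom.relevance.ss.AvoidLinks=... when launching the JVM
      System.setProperty("com.relevance.ss.FollowLinks",
          "http://www.example.com/docs http://www.example.com/blog");
      System.setProperty("com.relevance.ss.AvoidLinks",
          "http://www.example.com/docs/private");

      IndexLinks indexer = new IndexLinks("/tmp/site-index", 500, "skippedLinks.txt");
      indexer.initFollowPrefixesFromSystemProperties( );
      indexer.initAvoidPrefixesFromSystemProperties( );
    }
  }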

As links are considered for inclusion in the index, we'll be executing the same code against each to determine its worth to the index. We need a few helper methods to make those determinations:

  boolean shouldFollowLink(URL newLink) {
    for (Iterator iterator = linkPrefixesToFollow.iterator( ); iterator.hasNext( );) {
      URL u = (URL) iterator.next( );
      if (matchesDownToPathPrefix(u, newLink)) {
        return true;
      }
    }
    return false;
  }

  boolean shouldNotFollowLink(URL newLink) {
    for (Iterator iterator = linkPrefixesToAvoid.iterator( ); iterator.hasNext( );) {
      URL u = (URL) iterator.next( );
      if (matchesDownToPathPrefix(u, newLink)) {
        return true;
      }
    }
    return false;
  }

  private boolean matchesDownToPathPrefix(URL matchBase, URL newLink) {
    return matchBase.getHost( ).equals(newLink.getHost( )) &&
       matchBase.getPort( ) == newLink.getPort( ) &&
       matchBase.getProtocol( ).equals(newLink.getProtocol( )) &&
       newLink.getPath( ).startsWith(matchBase.getPath( ));
  }

The first two methods, shouldFollowLink and shouldNotFollowLink, compare the URL against their respective prefix collections. The third, matchesDownToPathPrefix, compares the link to a URL from the collection, making sure the host, port, and protocol all match and that the link's path begins with the prefix's path.
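A quick illustration of the matching rule (all of the URLs here are made up): given a base prefix of http://www.example.com/docs/, a link matches only when its protocol, host, and port agree and its path starts with /docs/.

  import java.net.URL;

  public class PrefixMatchSketch {
    public static void main(String[] args) throws Exception {
      URL base = new URL("http://www.example.com/docs/");

      // same protocol, host, and port, and the path starts with /docs/ -- matches
      URL inside  = new URL("http://www.example.com/docs/intro/setup.html");
      // same site, different path prefix -- no match
      URL outside = new URL("http://www.example.com/blog/entry.html");
      // different protocol -- no match
      URL secure  = new URL("https://www.example.com/docs/intro.html");

      System.out.println(matches(base, inside));   // true
      System.out.println(matches(base, outside));  // false
      System.out.println(matches(base, secure));   // false
    }

    // mirrors the matchesDownToPathPrefix logic shown above
    static boolean matches(URL matchBase, URL newLink) {
      return matchBase.getHost( ).equals(newLink.getHost( )) &&
             matchBase.getPort( ) == newLink.getPort( ) &&
             matchBase.getProtocol( ).equals(newLink.getProtocol( )) &&
             newLink.getPath( ).startsWith(matchBase.getPath( ));
    }
  }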

The service needs a way to consider a link for inclusion in the index. It must accept the new link to consider and the page that contained the link (for record-keeping):

  void considerNewLink(String linkFrom, WebLink newLink) throws MalformedURLException {
    URL url = newLink.getRequest( ).getURL( );
    if (shouldFollowLink(url)) {
      if (linksAlreadyFollowed.add(url.toExternalForm( ))) {
        if (linksAlreadyFollowed.size( ) > maxLinks) {
          linksAlreadyFollowed.remove(url.toExternalForm( ));
          throw new Error("Max links exceeded " + maxLinks);
        }
        if (shouldNotFollowLink(url)) {
          IndexLink.log.info("Not following " + url.toExternalForm( ) 
                              + " from " + linkFrom);
        } else {
          IndexLink.log.info("Following " + url.toExternalForm( ) 
                              + " from " + linkFrom);
          addLink(new IndexLink(url.toString( ),conversation, this));
        }
      }
    } else {
      ignoreLink(url, linkFrom);
    }
  }

newLink is an instance of com.meterware.httpunit.WebLink, which represents a single link found on a page in a web conversation. The method starts by determining whether the new URL is in our list of approved prefixes; if it isn't, the method calls the helper ignoreLink (which we'll see in a minute). If the URL is approved, we test whether we have already followed this link; if we have, we simply move on to the next link. Note that we verify whether the link has already been followed by attempting to add it to the linksAlreadyFollowed set: if the value already exists in the set, add returns false; otherwise, it returns true and the value is added to the set.
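If that idiom is unfamiliar, this tiny sketch shows the Set.add behavior the code relies on:

  import java.util.HashSet;
  import java.util.Set;

  public class SetAddSketch {
    public static void main(String[] args) {
      Set followed = new HashSet( );
      System.out.println(followed.add("http://www.example.com/")); // true: newly added
      System.out.println(followed.add("http://www.example.com/")); // false: already present
    }
  }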

We also determine whether adding the link has caused the linksAlreadyFollowed set to grow past our configured maximum number of links. If it has, we remove the link we just added and throw an error.

Finally, the method checks to make sure the current URL is not in the collection of proscribed prefixes. If it isn't, we call the helper method addLink in order to add the link to the index:

private void ignoreLink(URL url, String linkFrom) {
  String status = "Ignoring " + url.toExternalForm( ) + " from " + linkFrom;
  linksNotFollowed.add(status);
  IndexLink.log.fine(status);
}

public void addLink(IndexLink link) {
  try {
    link.checkLink( );
  } catch (Exception ex) {
    // handle the error; a single failed link should not abort the crawl
  }
}

Finally, we need an entry point to kick off the whole process. This method should take the root page of our site to index and begin processing URLs based on our configuration criteria:

public void setInitialLink(String initialLink) throws MalformedURLException {
  if ((initialLink == null) || (initialLink.length( ) == 0)) {
    throw new Error("Must specify a non-null initialLink");
  }
  linkPrefixesToFollow.add(new URL(initialLink));
  this.initialLink = initialLink;
  addLink(new IndexLink(initialLink,conversation,this));
}

Next, we define a class to model the links themselves and give us access to their textual representations for inclusion in the index: IndexLink. It needs three declarations:

private WebConversation conversation;
private IndexLinks suite;
private String name;

The WebConversation instance again provides the HTTPUnit framework's model of a browser-server session. The IndexLinks suite is the parent IndexLinks instance managing this indexing session. The name variable stores the current link's full URL as a String.

Creating an instance of the IndexLink class should provide values for all three of these variables:

public IndexLink(String name, WebConversation conversation, IndexLinks suite) {
  if ((name == null) || (conversation == null) || (suite == null)) {
    throw new IllegalArgumentException(
      "IndexLink constructor requires non-null args");
  }
  this.name = name;
  this.conversation = conversation;
  this.suite = suite;
}

Each IndexLink exposes a method, checkLink, that navigates to the endpoint specified by the URL and checks whether the result is an HTML page or other indexable text. If the page is indexable, it is added to the parent suite's index. Finally, we examine the results for links to other pages; for each such link, the process starts over:

public void checkLink( ) throws Exception {
  WebResponse response = null;
  try {
    response = conversation.getResponse(this.name);
  } catch (HttpNotFoundException hnfe) {
    // the link is broken; record nothing and move on
    return;
  }
  if (!isIndexable(response)) {
    return;
  }
  addToIndex(response);
  WebLink[] links = response.getLinks( );
  for (int i = 0; i < links.length; i++) {
    WebLink link = links[i];
    suite.considerNewLink(this.name, link);
  }
}

The isIndexable method simply verifies the content type of the returned result:

private boolean isIndexable(WebResponse response) {
  return response.getContentType( ).equals("text/html") ||
         response.getContentType( ).equals("text/ascii");
}

whereas the addToIndex method actually retrieves the full textual result from the URL and adds it to the suite's index:

private void addToIndex(WebResponse response) throws SAXException, IOException,
    InterruptedException {
  Document d = new Document( );
  HTMLParser parser = new HTMLParser(response.getInputStream( ));
  d.add(Field.UnIndexed("url", response.getURL( ).toExternalForm( )));
  d.add(Field.UnIndexed("summary", parser.getSummary( )));
  d.add(Field.Text("title", parser.getTitle( )));
  d.add(Field.Text("contents", parser.getReader( )));
  suite.addToIndex(d);
}

The parser is an instance of org.apache.lucene.demo.html.HTMLParser, a freely available component from the Lucene team that takes an HTML document and supplies a collection-based interface to its constituent components. Note the final call to suite.addToIndex, a method on our IndexLinks class that takes the Document and adds it to the central index:

// note: method of IndexLinks
public void addToIndex(Document d) {
  try {
    writer.addDocument(d);
  } catch (Exception ex) {
    // swallow the failure; a single bad document should not halt indexing
  }
}

That's it. Together, these two classes provide a single entry point for starting a crawling/indexing session. They ignore the concept of scheduling an indexing event; that task is left to the user interface layers. We only have two classes, making the model extremely simple to maintain. And we chose to take advantage of an unusual library (HTTPUnit) to keep us from writing code outside our problem domain (namely, web request/response processing).
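To make the workflow concrete, here is a rough sketch of how a caller might drive the service from end to end. It assumes the two classes exactly as shown above; the index path, link limit, output file, and URL are all illustrative.

  public class CrawlDriverSketch {
    public static void main(String[] args) throws Exception {
      // illustrative values: index location, maximum links, skipped-links report
      IndexLinks indexer = new IndexLinks("/tmp/site-index", 500, "skippedLinks.txt");

      // pick up any FollowLinks/AvoidLinks system properties that were supplied
      indexer.initFollowPrefixesFromSystemProperties( );
      indexer.initAvoidPrefixesFromSystemProperties( );

      // the root page; its prefix is automatically added to the follow list,
      // and crawling and indexing proceed from here
      indexer.setInitialLink("http://www.example.com/");
    }
  }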

9.6.1 Principles in Action

  • Keep it simple: choose HTTPUnit for web navigation code, minimal performance enhancements (maxLinks, linkPrefixesToAvoid collection)

  • Choose the right tools: JUnit, HTTPUnit, Cactus,[1] Lucene

    [1] Unit tests elided for conciseness. Download the full version to see the tests.

  • Do one thing, and do it well: interface-free model, single entry-point to service, reliance on platform's scheduler; we also ignored this principle in deference to simplicity by combining the crawler and indexer

  • Strive for transparency: none

  • Allow for extension: configuration settings for links to ignore
