
10.5 Making Use of the Configuration Service

If we jump straight in and start using the search as it's currently configured, we'll notice a problem. Our searches return far more results than the number of products in the database makes possible. A search for "dog", for example, returns over 20 results, even though there are only 6 dogs in the database.

This happens because of the brute-force nature of the crawling service. Without extra help, the crawler finds every link on every page, follows it, and adds the result to the index. The problem is that the catalog pages contain more than just links for browsing animals: there are links that add animals to the shopping cart, links that remove items from the cart, a link to a sign-in page (which, by default in jPetStore, loads with real credentials stored in the textboxes), and a live "Login" link, which the crawler will happily follow, generating an entirely new set of links with a session ID attached to each of them.

We need to make sure our crawler doesn't get suckered into following all these extraneous links, generating more results than are helpful for our users. In the first part of Chapter 9, we talked about the three major problems that turn up in a naïve approach to crawling a site:


Infinite loops

Once a link has been followed, the crawler must ignore it.


Off-site jumps

Since we are looking at http://localhost:8080/jpetstore, we don't want links to external resources to be indexed: that would lead to indexing the entire Internet (or, at least, blowing up the application due to memory problems after hours of trying).


Pages that shouldn't be indexed

In this case, that's pages like the sign-in page, any page with a session ID attached to it, and so on.

Our crawler/indexer service handles the first two issues for us automatically. Let's go back and look at the code. The IndexLinks class has three collections it consults every time it considers a new link:

Set linksAlreadyFollowed = new HashSet();
Set linkPrefixesToFollow = new HashSet();
Set linkPrefixesToAvoid = new HashSet();

Every time a link is followed, it gets added to linksAlreadyFollowed. The crawler never revisits a link stored here. The other two collections are a list of link prefixes that are allowed and a list of the ones that are denied. When we call IndexLinks.setInitialLink, we add the root link to the linkPrefixesToFollow set:

linkPrefixesToFollow.add(new URL(initialLink));
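
We don't need the whole method here, but judging from the calls we've seen, a minimal sketch of setInitialLink might look like the following (the firstLink field and the validation are assumptions for illustration, not the class's actual code):

// Sketch only: fields other than linkPrefixesToFollow are assumed.
public void setInitialLink(String initialLink) throws MalformedURLException {
  if (initialLink == null || initialLink.length() == 0) {
    throw new MalformedURLException("initial link must not be empty");
  }
  // Remember where the crawl starts...
  this.firstLink = new URL(initialLink);
  // ...and treat that same URL as the only allowed prefix, so the
  // crawler stays inside the site we asked it to index.
  linkPrefixesToFollow.add(new URL(initialLink));
}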

IndexLinks also exposes a method, initAvoidPrefixesFromSystemProperties, which tells the IndexLinks bean to read the configured system properties in order to initialize the list:

  public void initAvoidPrefixesFromSystemProperties() throws MalformedURLException {
    // The property holds a space-separated list of URL prefixes to skip.
    String avoidPrefixes = System.getProperty("com.relevance.ss.AvoidLinks");
    if (avoidPrefixes == null || avoidPrefixes.length() == 0) return;
    // String.split never returns null, so we only check for emptiness.
    String[] prefixes = avoidPrefixes.split(" ");
    if (prefixes.length != 0) {
      setAvoidPrefixes(prefixes);
    }
  }
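
To exercise this method outside the full application, a hypothetical snippet like this would do (assuming IndexLinks has a no-argument constructor; in the running application, ConsoleSearch supplies the property from com.relevance.ss.properties rather than setting it in code):

IndexLinks indexer = new IndexLinks();
indexer.setInitialLink("http://localhost:8080/jpetstore");

// Set by hand purely for illustration; normally the value comes
// from the properties file.
System.setProperty("com.relevance.ss.AvoidLinks",
    "http://localhost:8080/jpetstore/shop/signonForm.do "
    + "http://localhost:8080/jpetstore/shop/viewCart.do");

indexer.initAvoidPrefixesFromSystemProperties();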

First, the logic for considering a link checks that the new link matches one of the prefixes in linkPrefixesToFollow. For us, the only value stored there is http://localhost:8080/jpetstore. If the link is a subpage of that prefix, we then make sure it doesn't match any of the prefixes in linkPrefixesToAvoid.
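
Put together, the acceptance test amounts to something like the following sketch (the method name shouldFollowLink and the string-prefix comparison are assumptions; the real class may organize this differently):

boolean shouldFollowLink(String link) {
  // Rule 1: never revisit a link (prevents infinite loops).
  if (linksAlreadyFollowed.contains(link)) return false;
  // Rule 2: the link must live under an allowed prefix (no off-site jumps).
  boolean allowed = false;
  for (Iterator i = linkPrefixesToFollow.iterator(); i.hasNext();) {
    if (link.startsWith(i.next().toString())) {
      allowed = true;
      break;
    }
  }
  if (!allowed) return false;
  // Rule 3: the link must not match any prefix we've been told to avoid.
  for (Iterator i = linkPrefixesToAvoid.iterator(); i.hasNext();) {
    if (link.startsWith(i.next().toString())) return false;
  }
  return true;
}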

A special side note: good code documentation is an important part of maintainability and flexibility. Notice the rather severe lack of comments in the code for the Simple Spider. On the other hand, it has lengthy, descriptive method and type names (like initAvoidPrefixesFromSystemProperties), which make comments largely redundant, since the names clearly describe the entity at hand. Good naming, not strict commenting discipline, is often the key to code readability.

All we need to do is populate the linkPrefixesToAvoid collection. ConsoleSearch already calls initAvoidPrefixesFromSystemProperties for us, so all we have to do is add the necessary values to the com.relevance.ss.properties file:

AvoidLinks=http://localhost:8080/jpetstore/shop/signonForm.do \
  http://localhost:8080/jpetstore/shop/viewCart.do \
  http://localhost:8080/jpetstore/shop/searchProducts.do \
  http://localhost:8080/jpetstore/shop/viewCategory.do;jsessionid= \
  http://localhost:8080/jpetstore/shop/addItemToCart.do \
  http://localhost:8080/jpetstore/shop/removeItemFromCart.do

These prefixes represent, in order: the application's sign-on form, any links that show the current user's cart, the results of another search, any pages that are the result of a successful logon (which carry a session ID), pages that add items to a user's cart, and pages that remove items from a user's cart.
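
How does the file's AvoidLinks entry become the com.relevance.ss.AvoidLinks system property that initAvoidPrefixesFromSystemProperties reads? One plausible bootstrap, using java.util.Properties and java.io.InputStream (a sketch; the actual code in ConsoleSearch may differ), is to load the file from the classpath and copy the entry across:

// Sketch: load the properties file and promote its AvoidLinks entry
// to a system property under the com.relevance.ss prefix.
Properties props = new Properties();
InputStream in = ConsoleSearch.class.getResourceAsStream("/com.relevance.ss.properties");
if (in != null) {
  try {
    props.load(in);
  } finally {
    in.close();
  }
  String avoid = props.getProperty("AvoidLinks");
  if (avoid != null) {
    System.setProperty("com.relevance.ss.AvoidLinks", avoid);
  }
}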

10.5.1 Principles in Action

  • Keep it simple: use existing Properties tools, not XML

  • Choose the right tools: java.util.Properties

  • Do one thing, and do it well: the service worries about following provided links; the configuration files worry about deciding what links can be followed

  • Strive for transparency: the service doesn't know ahead of time what kinds of links will be acceptable; configuration files make that decision transparent to the service

  • Allow for extension: expandable list of allowable link types
