
9.2 Examining the Requirements

The requirements for the Simple Spider leave a wide variety of design decisions open. Possible solutions might be based on hosted EJB solutions with XML-configurable indexing schedules, SOAP-encrusted web services with pass-through security, and any number of other combinations of buzz words, golden hammers, and time-wasting complexities. The first step in designing the Spider was to eliminate complexity and focus on the problem at hand. In this section, we will go through the decision-making steps together. The mantra for this part of the process: ignore what you think you need and examine what you know you need.

9.2.1 Breaking It Down

The first two services described by the requirements are the crawler and the indexer. They are listed as separate services in the requirements, but in examining the overall picture, we see no current need to separate them. No other service relies on the crawler absent the indexer, and it doesn't make sense to run the indexer unless the crawler has provided a fresh look at the search domain. Therefore, for the sake of simplicity, let's collapse the requirements into a single service that both crawls and indexes a web site.
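A minimal sketch of what such a combined service might look like follows; the class and method names (SpiderIndexer, crawlAndIndex) are illustrative only, not the Spider's actual API.

// Hypothetical sketch of the combined service; the names are illustrative,
// not the Spider's actual API.
public class SpiderIndexer {

    private final String startUrl;

    public SpiderIndexer(String startUrl) {
        this.startUrl = startUrl;
    }

    // One entry point does both jobs: walk the links, then index each page found.
    public void crawlAndIndex() {
        for (String url : crawl(startUrl)) {
            index(url);
        }
    }

    private java.util.List<String> crawl(String start) {
        // Follow links within the search domain, collecting page URLs.
        return java.util.Collections.emptyList(); // placeholder
    }

    private void index(String url) {
        // Add the page's text to the search index.
    }
}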

The requirements next state that the crawler needs to ignore links to image files, since it would be meaningless to index them for textual search and doing so would take up valuable resources. This is a good place to apply the Inventor's Paradox. Think for a second about the Web: there are more kinds of links to ignore than just image files and, over time, the list is likely to grow. Let's allow for a configuration file that specifies what types of links to ignore.

After the link-type requirement comes a requirement for configuring the maximum number of links to follow. Since we have just decided to include a configuration option of some kind, this requirement fits our needs and we can leave it as-is.
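As a sketch, both options could live in a simple properties file read at startup. The file name and property names used below (spider.properties, skip.extensions, max.links) are assumptions for illustration, not the Spider's actual configuration format.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical configuration loader; property names and defaults are assumptions.
public class SpiderConfig {

    private final List<String> skippedExtensions;
    private final int maxLinks;

    public SpiderConfig(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        // e.g. skip.extensions=jpg,gif,png,pdf
        skippedExtensions = Arrays.asList(
                props.getProperty("skip.extensions", "jpg,gif,png").split(","));
        // e.g. max.links=500
        maxLinks = Integer.parseInt(props.getProperty("max.links", "100"));
    }

    // True if the link points to a resource type we were told to ignore.
    public boolean shouldSkip(String url) {
        String lower = url.toLowerCase();
        for (String ext : skippedExtensions) {
            if (lower.endsWith("." + ext.trim())) {
                return true;
            }
        }
        return false;
    }

    public int getMaxLinks() {
        return maxLinks;
    }
}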

Next, we have a requirement for making the indexer schedulable. Creating a scheduling service involves implementing a long-running process that sits dormant most of the time, waking up at specified intervals to fire up the indexing service. Writing such a process is not overly complex, but it is redundant and well outside the primary problem domain. In the spirit of choosing the right tools and doing one thing well, we can eliminate this entire requirement by relying on the deployment platform's own scheduling services: on Linux and Unix we have cron, and on Windows we have at. In order to hook into these system services, we need only provide an entry point to the Spider that can be used to fire off the indexing service. System administrators can then configure their schedulers to perform the task at whatever intervals are required.
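That entry point can be as plain as a main method. The sketch below reuses the hypothetical SpiderIndexer from earlier; the class name, jar name, and cron schedule shown in the comment are illustrative assumptions.

// Hypothetical command-line entry point; a system scheduler can invoke it
// directly, for example with a cron entry such as:
//   0 2 * * *  java -cp spider.jar SpiderMain http://www.example.com
public class SpiderMain {

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java SpiderMain <start-url>");
            System.exit(1);
        }
        // Reuses the hypothetical SpiderIndexer sketched earlier.
        new SpiderIndexer(args[0]).crawlAndIndex();
    }
}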

The final service requirement is the search service. Even though the requirements don't specify it as an individual service, it must be invoked independently of the indexer (we wouldn't want to re-run the indexer every time we searched for something), so it needs to be a separate service within the application. Unfortunately, the search service must remain somewhat coupled to the indexing service, because it has to understand the format of the index the indexer produces. No global standard API currently exists for text index file formats. If and when such a standard comes into being, we'll upgrade the Spider to take advantage of it and decouple the searching and indexing services completely.
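One way to keep that coupling contained is to hide the index format behind a small search interface, so only the implementation knows how the index is stored. The interface and result type below are a hypothetical sketch, not the Spider's actual API.

import java.util.List;

// Hypothetical search-service interface. Only the implementation needs to know
// the on-disk format of the index the crawler/indexer produced.
public interface SearchService {

    // Returns ranked hits for a word or phrase.
    List<Hit> search(String query);

    // A single result: path to the matching file plus a relative rank.
    class Hit {
        public final String path;
        public final double rank;

        public Hit(String path, double rank) {
            this.path = path;
            this.rank = rank;
        }
    }
}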

As for the user interfaces, a console interface is a fairly straightforward choice. However, the mere mention of web services often sends people into paroxysms of standards exuberance. Because of the voluminous and increasingly complex web services standards stack, actually implementing a web service is becoming more and more difficult. Looking at our requirements, however, we see that we can cut through most of the extraneous standards. Our service only needs to launch a search and return an XML result set. The default implementation of an Axis web service can provide those capabilities without requiring us to mess around with either socket-level programming or high-level standards implementations.
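With Axis's drop-in JWS deployment, the web service can be little more than a plain class with one public method: copy it into the Axis web application and Axis takes care of the SOAP plumbing and WSDL generation. The sketch below assumes the hypothetical SearchService from earlier; the class name and XML layout are illustrative only.

// Hypothetical SearchSpider.jws: with Axis's drop-in JWS deployment, a plain
// class like this exposes its public methods as SOAP operations, and Axis
// handles the WSDL and the wire format for us.
public class SearchSpider {

    // Runs a search and returns the results as a simple XML string.
    public String search(String query) {
        StringBuilder xml = new StringBuilder("<results>");
        for (SearchService.Hit hit : lookupSearchService().search(query)) {
            xml.append("<result rank=\"").append(hit.rank).append("\">")
               .append(hit.path).append("</result>");
        }
        xml.append("</results>");
        return xml.toString();
    }

    // Placeholder: obtain whatever SearchService implementation is configured.
    private SearchService lookupSearchService() {
        throw new UnsupportedOperationException("wire in a real SearchService");
    }
}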

9.2.2 Refining the Requirements

We can greatly improve on the initial requirements. Using the Inventor's Paradox, common sense, and available tools, we can combine some of them and eliminate others entirely. Given this analysis, our new requirements are:

  1. Provide a service to crawl and index a web site.

    1. Allow the user to pass a starting point for the search domain.

    2. Let the user configure the service to ignore certain types of links.

    3. Let the user configure the service to only follow a maximum number of links.

    4. Expose an invocation mechanism to both an existing scheduler and human users.

  2. Provide a search service over the results of the crawler/indexer.

    1. The search should accept a search word or phrase.

    2. Search results should include a full path to the file containing the search term.

    3. Search results should contain a relative rank for each result. The actual algorithm for determining the rank is unimportant.

  3. Provide a console-based interface for invoking the indexer/crawler and search service.

  4. Provide a web service interface for invoking the indexer/crawler and the search service. The web service interface does not need to explicitly provide authentication or authorization.

These requirements represent a cleaner design that allows future extensibility and focuses development on tasks that are essential to the problem domain. This is exactly what we need from requirements. They should provide a clear roadmap to success. If you get lost, take a deep breath. It's okay to ask for directions and clarify requirements with a customer.
