Java Theory and Practice: Querying a database without a database


I recently worked on a project that involves a fair amount of Web crawling. As the crawler traverses a Web site, it builds up a database recording the sites and pages it has crawled, the links each page contains, and the analysis results for each page. The end result is a set of reports detailing which sites and pages were crawled, which links are live, which links are broken, which pages have errors, computed page metrics, and so on. At the outset, no one knew exactly what reports were needed or what form they should take, only that there was something worth reporting on. This means that report development is an iterative process, with rounds of feedback, revision, and perhaps attempts at entirely different structures. The only fixed requirement is that reports be rendered as XML or HTML. The process of developing and revising reports therefore has to be lightweight, because the reporting requirements are discovered dynamically rather than specified up front.

No database required

The "obvious" solution to this problem is to put everything in a SQL database: pages, links, metrics, HTTP result codes, timing results, and other metadata. The data lends itself well to a relational representation, especially since this approach does not require storing the content of visited pages, only their structure and metadata.

So far, this looks like a typical database application, and there is no shortage of persistence strategies to choose from. But it may be possible to avoid the complexity of database-backed persistence altogether: this crawler visits only tens of thousands of pages. That number is small enough that the entire data set can live in memory, and when the data needs to be persisted, serialization will do the job. (Yes, the load and save operations take a while, but they are performed infrequently.) Laziness pays off here: not having to deal with persistence dramatically reduces the time it takes to develop the application, and thus significantly reduces the development effort. Building and manipulating in-memory data structures is much easier than going through a database every time you add, extract, or analyze data, and regardless of which persistence model is chosen, the amount of code that touches the stored form of the data stays small.
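
As a rough sketch of what "persist by serialization" might look like, the following helper writes and reads an entire object graph with standard Java serialization. The SnapshotStore name is an invention here, and the sketch assumes the crawler's data classes implement java.io.Serializable, which the article's listings do not state:

```java
import java.io.*;

// Hypothetical persistence helper (SnapshotStore is not part of the
// original article): saves and loads the crawler's entire in-memory
// data set with standard Java serialization.
class SnapshotStore {
    // Write the whole object graph to a file.
    public static void save(Serializable data, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(data);
        }
    }

    // Read the object graph back; the caller supplies the expected type.
    @SuppressWarnings("unchecked")
    public static <T> T load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (T) in.readObject();
        }
    }
}
```

Because the whole graph is written in one shot, a save is an all-or-nothing snapshot, which is exactly the coarse-grained, infrequent persistence the project calls for.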

The in-memory data structure is a tree, sketched in Listing 1, rooted at the home page of each crawled Web site, which makes the Visitor pattern ideal for traversing the pages and extracting data from them. (It is not hard to build a basic Visitor that remembers which pages it has already seen, to avoid falling into link cycles such as A links to B, B links to C, and C links back to A.)

Listing 1. A simplified schema for the Web crawler

public class Site {
   Page homepage;
   Collection<Page> pages;
   Collection<Link> links;
}
public class Page {
   String url;
   Site site;
   PageMetrics metrics;
}
public class Link {
   Page linkFrom;
   Page linkTo;
   String anchorText;
}

The crawler application contains more than ten Visitors, which do things like select pages for further analysis, find pages with no outbound links, list the most linked-to pages, and so on. Because all of these operations are simple, and because the data structure fits in memory, even an exhaustive traversal is cheap, so the Visitor pattern (shown in Listing 2) works well:

Listing 2. Visitor interface for the Web crawler's data set

public interface Visitor {
   public void visitSite(Site site);
   public void visitLink(Link link);
}
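
The article does not show how a Visitor is driven over this structure. Here is a minimal sketch of a cycle-safe traversal, under the assumption (not stated in the original) that a driver walks the Site's link collection breadth-first from the home page. The Traversal class and its accept method are illustrative names, and the Listing 1 classes are repeated in reduced form so the sketch stands alone:

```java
import java.util.*;

// Minimal stand-ins for the Listing 1 classes, so the sketch compiles alone.
class Page { String url; Site site; Page(String url) { this.url = url; } }
class Link { Page linkFrom, linkTo; }
class Site {
    Page homepage;
    Collection<Page> pages = new ArrayList<Page>();
    Collection<Link> links = new ArrayList<Link>();
}
interface Visitor {
    void visitSite(Site site);
    void visitLink(Link link);
}

// Hypothetical driver (not in the original article): walks the link graph
// breadth-first from the home page, using a visited set so that cycles
// such as A -> B -> C -> A are traversed only once.
class Traversal {
    public static void accept(Site site, Visitor visitor) {
        visitor.visitSite(site);
        Set<Page> visited = new HashSet<Page>();
        Deque<Page> work = new ArrayDeque<Page>();
        visited.add(site.homepage);
        work.add(site.homepage);
        while (!work.isEmpty()) {
            Page page = work.remove();
            for (Link link : site.links) {        // Site holds all links
                if (link.linkFrom == page) {
                    visitor.visitLink(link);
                    if (visited.add(link.linkTo)) // true if not yet seen
                        work.add(link.linkTo);
                }
            }
        }
    }
}
```

The visited set is what keeps the traversal from looping forever on cyclic link structures, which is the concern raised above.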

Oh, right, the reports

The Visitor strategy does a fine job of accessing the data, as long as you never have to run the reports. One of the benefits of using a database for persistence is that SQL really shines when it is time to generate a report: the database can do almost anything you ask of it. It is even easy to prototype a report in SQL: run the prototype query, and if the results are not what you need, modify the query or write a new one and try again. The edit-compile-run cycle is fast when all you are changing is a SQL query, and if the SQL is not embedded in the program, you can skip the compile step entirely, making report prototyping faster still. Once you have settled on the reports you need, building them into the application is easy.

So while the in-memory data structure performs well for adding new results, finding specific results, and ad hoc traversals, it is poorly suited to reporting. For any report whose shape differs from the shape of the data structure, a Visitor must build an entirely new data structure to hold the report data. Each report type therefore needs its own report-specific intermediate data structure to hold the results, a visitor to populate that intermediate structure, and post-processing code to turn the intermediate form into the final report. That is a lot of work, especially given that most prototype reports will be thrown away. For example, suppose you want a report listing all pages on other sites that a given site links to, along with the pages on that site which link to each of them, sorted so that the most linked-to pages appear first. This report essentially turns the data structure inside out. To implement the transformation with a Visitor, you collect the external pages reachable from the given site and group them by the internal pages that link to them, as shown in Listing 3:

Listing 3. Visitor that lists the most linked-to pages and the pages that link to them

public class InvertLinksVisitor implements Visitor {
   private final Site targetSite;
   // Maps each external page to the set of pages on targetSite that link to it
   public Map<Page, Set<Page>> map = new HashMap<Page, Set<Page>>();

   public InvertLinksVisitor(Site targetSite) {
     this.targetSite = targetSite;
   }

   public void visitSite(Site site) { }

   public void visitLink(Link link) {
     // Keep only links that leave the target site
     if (link.linkFrom.site.equals(targetSite)
       && !link.linkTo.site.equals(targetSite)) {
       if (!map.containsKey(link.linkTo))
         map.put(link.linkTo, new HashSet<Page>());
       map.get(link.linkTo).add(link.linkFrom);
     }
   }
}

The Visitor in Listing 3 produces a Map that associates each external page with the set of internal pages that link to it. To finish the report, you must still sort the entries by the size of each associated page set and then render the output. None of these steps is hard, but each report requires a substantial amount of report-specific code, so rapid report prototyping, the very thing this project needs most since its reporting requirements are not known up front, costs more than it should. Many reports also require multiple passes over the data to select, summarize, and sort it.
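
The sorting step just described can itself be sketched as a small piece of post-processing. The ReportSort class below is hypothetical, written generically so that the map stands in for the Map<Page, Set<Page>> that Listing 3 builds:

```java
import java.util.*;

// Hypothetical post-processing step (not shown in the original): order
// the inverted-link map so the most linked-to pages come first. Written
// generically; the key type K stands in for Page.
class ReportSort {
    public static <K, V> List<Map.Entry<K, Set<V>>> byLinkCount(
            Map<K, Set<V>> map) {
        List<Map.Entry<K, Set<V>>> entries =
            new ArrayList<Map.Entry<K, Set<V>>>(map.entrySet());
        // Descending by how many pages link to each key
        entries.sort((a, b) ->
            Integer.compare(b.getValue().size(), a.getValue().size()));
        return entries;
    }
}
```

Even this tiny helper illustrates the point: it is one more piece of report-specific code that exists only to reshape the data the Visitor already collected.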
