Java Crawler Learning

Source: Internet
Author: User

The most widely used framework in the Java crawler field is jsoup: it can fetch and parse a given URL directly (that is, parse the corresponding HTML) and provides a powerful set of APIs for extracting and manipulating data via the DOM, CSS selectors, or jQuery-like methods. Its main functions are:

    • Get the HTML from a given URL, file, or string.
    • Then use the DOM or CSS selectors (jQuery-like) to find and fetch data: first locate the HTML element, then read its attributes, text, and so on.

API Initial Learning:

There are three ways to obtain an HTML document (from a string, from a URL, or from a file), each producing a jsoup Document object:

1. From a string: String html = "Hello"; Document doc = Jsoup.parse(html); (At this point jsoup places "Hello" inside the body of the Document object. If the string is a complete HTML document, the Document object will follow the HTML structure of the string.)
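The string-parsing behavior described above can be sketched as follows (class name is illustrative; assumes jsoup is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringDemo {
    public static void main(String[] args) {
        // A bare fragment: jsoup wraps it in <html><head></head><body>...</body></html>
        String html = "Hello";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.body().text());   // Hello

        // A complete document keeps its own structure
        String page = "<html><head><title>Demo</title></head>"
                + "<body><p>Hi</p></body></html>";
        Document doc2 = Jsoup.parse(page);
        System.out.println(doc2.title());        // Demo
    }
}
```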

2. From a URL (note: with this approach, the request is actually sent only when the get or post method is called):

Connect via the URL to obtain a Connection object: Connection conn = Jsoup.connect("http://www.baidu.com"); most of its configuration methods return the Connection itself, so calls can be chained, e.g. conn.data("query", "Java") to set a request parameter, or conn.timeout(3000) to set the connection timeout. Sending the request then yields a Document object: Document doc = conn.get(); or Document doc = conn.post();
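Put together, the connect-then-fetch flow looks like this (a minimal sketch; the URL and User-Agent string are illustrative, and the request only goes out at the final get()/post() call):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectDemo {
    public static void main(String[] args) throws Exception {
        // Build and configure the request; nothing is sent yet.
        Connection conn = Jsoup.connect("http://www.baidu.com")
                .data("query", "Java")      // request parameter
                .userAgent("Mozilla/5.0")   // present a browser User-Agent
                .timeout(3000);             // timeout in milliseconds

        // The HTTP request is actually sent only here:
        Document doc = conn.get();          // GET
        // Document doc = conn.post();      // or POST
        System.out.println(doc.title());
    }
}
```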

Get element:

1. Through the DOM (the same as in plain JavaScript):

Get document-level information, such as: String title = doc.title(); get a single HTML element, such as <div id="content"></div>: Element content = doc.getElementById("content"); get multiple elements, such as <a href="http://www.qunyh.cn"></a> <a href="http://cn.bing.com"></a>: Elements links = doc.getElementsByTag("a");
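The DOM-style calls above, applied to a small document built from the same sample elements (class name and sample text are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
                + "<div id=\"content\">hello</div>"
                + "<a href=\"http://www.qunyh.cn\">a1</a>"
                + "<a href=\"http://cn.bing.com\">a2</a>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        String title = doc.title();                       // document-level info
        Element content = doc.getElementById("content");  // single element by id
        Elements links = doc.getElementsByTag("a");       // all <a> elements

        System.out.println(title);                        // Demo
        System.out.println(content.text());               // hello
        System.out.println(links.size());                 // 2
        System.out.println(links.first().attr("href"));   // http://www.qunyh.cn
    }
}
```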

2. In a jQuery-like way, with the select method taking the place of $:

The parameter of select is similar to a jQuery selector: Elements allP = doc.select("p"); Element firstP = allP.first(); Element oneP = allP.get(1); // index starts from 0. Operating on an element is also similar to jQuery, e.g. String text = p.text();
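The selector-style calls above can be combined into a runnable sketch (class name and sample paragraphs are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<body><p>one</p><p>two</p><p>three</p></body>");

        Elements allP = doc.select("p");  // CSS selector, like jQuery's $("p")
        Element firstP = allP.first();    // first match
        Element oneP = allP.get(1);       // index starts from 0, so this is "two"

        for (Element p : allP) {          // iterate over matches, jQuery-style
            System.out.println(p.text()); // one / two / three
        }
    }
}
```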

Of course, jsoup only mimics the convenience of jQuery and does not have all of jQuery's features; jQuery plugins, for example, are definitely not available in jsoup. Therefore, if your JavaScript is strong, Node.js + MongoDB is a comparatively advantageous stack (the pieces support each other well). If you are more familiar with Java, jsoup + MySQL + Quartz is also very practical (an all-Java implementation, convenient and worry-free); paired with the Java scheduler Quartz, you can build a complete crawler.

Crawler Framework Gecco:

Gecco is a Java-based, lightweight, topic-oriented crawler, unlike Nutch, which is a general-purpose crawler for search engines:

    • General-purpose crawlers typically focus on three problems: downloading, ranking, and indexing.
    • Topic crawlers focus on: downloading, content extraction, and flexible business-logic processing.

Gecco's goal is to provide a complete topic-crawler framework: simplify downloading and content extraction, and use the pipeline-filter pattern to provide flexible content cleaning and persistence. Development can therefore focus on the business logic of the topic and on content processing.

To learn a framework, you first need to know what it can do:

    • Easy to use; elements are extracted with jQuery-style selectors.
    • Supports asynchronous Ajax requests in the page.
    • Supports extracting JavaScript variables from the page.
    • Supports distributed crawling with Redis; see gecco-redis.
    • Supports random User-Agent selection when downloading.
    • Supports random selection of download proxy servers.
    • Supports developing business logic together with Spring; see gecco-spring.
    • Supports the HtmlUnit extension; see gecco-htmlunit.
    • Supports a plug-in extension mechanism.

You can write a simple crawler in one minute:
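As a hedged sketch of what such a one-minute Gecco crawler looks like (modeled on the pattern in Gecco's own documentation; the CSS paths, pipeline name, and package names are assumptions and would need to match a real target page):

```java
import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// A bean annotated with @Gecco describes which URLs it handles and
// which pipeline receives the extracted content.
@Gecco(matchUrl = "https://github.com/xtuhcy/gecco", pipelines = "consolePipeline")
public class MyGithub implements HtmlBean {

    // Fields annotated with @HtmlField are filled in via CSS selectors.
    @Text
    @HtmlField(cssPath = ".repository-meta-content")
    private String title;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public static void main(String[] args) {
        // The engine scans the classpath for @Gecco beans and starts crawling.
        GeccoEngine.create()
                .classpath("com.geccocrawler.gecco.demo") // package holding the beans
                .start("https://github.com/xtuhcy/gecco") // seed URL
                .interval(2000)                           // politeness delay, ms
                .run();
    }
}
```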

A new, promising crawler framework found while looking for information (to try later):
Gecco Crawler:
Gecco is a lightweight, easy-to-use web crawler developed in Java; unlike Nutch, a general-purpose crawler for search engines, Gecco is a topic-oriented crawler.
https://github.com/xtuhcy/gecco
https://xtuhcy.gitbooks.io/geccocrawler/content/index.html
https://my.oschina.net/u/2336761/blog/688534
Template code generator: Java EE Templates (developed with jsoup; can be used to learn how a crawler is developed):
https://www.oschina.net/p/jeetemp
