The most popular framework in the Java crawler field is jsoup: it can fetch and parse a given URL's HTML directly, and it provides a powerful set of APIs for extracting and manipulating data through the DOM, CSS selectors, or jQuery-like methods. Its main functions are:
- Fetch the HTML from a given URL, file, or string.
- Find and extract data using the DOM or CSS selectors (jQuery-like): first locate the HTML element, then read its attributes, text, and so on.
Getting started with the API:
There are three ways to obtain an HTML document (i.e. a jsoup Document object) -- from a string, from a URL, or from a file; the first two are shown below:
1. From a string: String html = "Hello"; Document doc = Jsoup.parse(html); At this point jsoup places "Hello" in the body of the Document object. If the string is a complete HTML document, the Document object will follow the HTML structure of the string.
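The string-parsing case above can be sketched as a small runnable program (assumes the jsoup jar is on the classpath; the class name is mine):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromString {
    public static void main(String[] args) {
        // A bare fragment: jsoup wraps it in <html><body>...</body></html>
        Document doc = Jsoup.parse("Hello");
        System.out.println(doc.body().text()); // Hello
    }
}
```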
2. From a URL (note: with this approach, the request is actually sent only when the get() or post() method is called):
Connect to the URL to obtain a Connection object: Connection conn = Jsoup.connect("http://www.baidu.com"); The main configuration methods, most of which return the Connection itself so they can be chained, include conn.data("query", "Java") to set a request parameter and a method to set the connection timeout. Then send the request to obtain the HTML document as a Document object: Document doc = conn.get(); or Document doc = conn.post();
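A minimal sketch of that lazy-connection workflow (the query parameter and timeout values are illustrative, not required by the API):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class LazyConnection {
    public static void main(String[] args) {
        // Configuration methods return the Connection itself, so they chain
        Connection conn = Jsoup.connect("http://www.baidu.com")
                .data("query", "Java")  // set a request parameter
                .timeout(3000);         // connection timeout in milliseconds
        // Nothing has gone over the wire yet; the request is sent only here:
        // Document doc = conn.get();   // sends a GET request
        // Document doc = conn.post();  // sends a POST request
        System.out.println("connection configured, request not yet sent");
    }
}
```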
Getting elements:
1. Through the DOM (same as in plain JavaScript):
Get document-level information, e.g.: String title = doc.title(); Get a single HTML element, e.g. for <div id="content"></div>: Element content = doc.getElementById("content"); Get multiple elements, e.g. for <a href="http://www.qunyh.cn"></a> <a href="http://cn.bing.com"></a>: Elements links = doc.getElementsByTag("a");
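The DOM-style calls above, put together into a self-contained program (the HTML snippet and class name are my own, chosen to match the examples in the text):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomAccess {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
                + "<div id=\"content\">hello</div>"
                + "<a href=\"http://www.qunyh.cn\"></a>"
                + "<a href=\"http://cn.bing.com\"></a>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();                      // document-level info
        Element content = doc.getElementById("content"); // one element by id
        Elements links = doc.getElementsByTag("a");      // all <a> elements
        System.out.println(title);          // Demo
        System.out.println(content.text()); // hello
        System.out.println(links.size());   // 2
    }
}
```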
2. In the style of jQuery, with the select() method taking the place of $:
The argument to select() is similar to a jQuery selector: Elements allP = doc.select("p"); Element firstP = allP.first(); Element oneP = allP.get(1); // index starts from 0. Operating on an element is also similar to jQuery, e.g.: String text = p.text();
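A runnable version of the select() example (the three-paragraph snippet is my own illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>one</p><p>two</p><p>three</p>");
        Elements allP = doc.select("p"); // CSS selector, like jQuery's $("p")
        Element firstP = allP.first();   // first match
        Element oneP = allP.get(1);      // index starts from 0, so this is "two"
        System.out.println(firstP.text()); // one
        System.out.println(oneP.text());   // two
        for (Element p : allP) {           // iterate over matches, jQuery-style
            System.out.println(p.text());
        }
    }
}
```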
Of course, jsoup only mimics the convenience of jQuery and does not have all of jQuery's features; jQuery plugins, for example, are definitely not available in jsoup. So if your JavaScript is strong, Node.js + MongoDB is the more advantageous combination (the pieces support each other well). If you are more familiar with Java, then jsoup + MySQL + Quartz is also very practical (a pure-Java implementation, convenient and low-maintenance); with the Java scheduler Quartz added, you can build a complete crawler.
Crawler framework Gecco:
Gecco is a Java-based, lightweight, topic-oriented crawler, unlike Nutch, which is a general-purpose crawler for search engines:
- General-purpose crawlers typically focus on three problems: downloading, sorting, and indexing.
- Topic crawlers focus on: downloading, content extraction, and flexible business-logic processing.
Gecco's goal is to provide a complete topic-crawler framework: simplify the development of downloading and content extraction, and use the pipeline-filter pattern to provide flexible content cleaning and persistence. Development can therefore focus on the topic's business logic and content processing.
To learn a framework, you first need to know what it offers:
- Easy to use: elements are extracted with jQuery-style selectors.
- Supports asynchronous Ajax requests in the page.
- Supports extraction of JavaScript variables in the page.
- Supports distributed crawling with Redis; see gecco-redis.
- Supports random User-Agent selection when downloading.
- Supports random selection of download proxy servers.
- Supports developing business logic together with Spring; see gecco-spring.
- Supports the HtmlUnit extension; see gecco-htmlunit.
- Supports a plug-in extension mechanism.
You can write a simple crawler in one minute:
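A sketch of such a one-minute crawler, adapted from Gecco's annotation-based demo style (the URL pattern, CSS path, and package name are illustrative and may not match the current GitHub markup or your project layout):

```java
import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// Match a URL pattern and send extracted beans to the built-in console pipeline
@Gecco(matchUrl = "https://github.com/{user}/{project}", pipelines = "consolePipeline")
public class MyGithub implements HtmlBean {

    // Extract the text content of the element matched by this CSS path
    @Text
    @HtmlField(cssPath = ".repository-meta-content")
    private String title;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public static void main(String[] args) {
        GeccoEngine.create()
                .classpath("com.geccocrawler.gecco.demo") // package scanned for @Gecco beans
                .start("https://github.com/xtuhcy/gecco") // seed URL
                .thread(1)
                .run();
    }
}
```

Running main starts the engine, downloads the seed page, fills the annotated fields, and prints the bean via the console pipeline.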
While searching for information, I found this new and promising crawler framework (to try out later):
Gecco Crawler:
Gecco is a lightweight, easy-to-use web crawler developed in Java; unlike Nutch, a general-purpose crawler for search engines, Gecco is a topic-oriented crawler.
https://github.com/xtuhcy/gecco
https://xtuhcy.gitbooks.io/geccocrawler/content/index.html
https://my.oschina.net/u/2336761/blog/688534
Template code generator: Java EE templates (developed with jsoup; useful for learning how to develop a crawler):
https://www.oschina.net/p/jeetemp
Java Crawler Learning