Recently the company has been rushing a project and I've been living the 996 life, with Sundays reserved for my wife, so there has been no quiet time to sit down and write. That is why the third part of the Swift 2048 series has been slow to start; sorry about that, and I'll try to find time for it this weekend.
First, a word on what a crawler (爬虫) is for: it is mainly used to fetch large amounts of data we need from websites. In essence, it is the process of simulating HTTP requests and then parsing the responses to extract the information we want.
Since there are already plenty of off-the-shelf crawler frameworks, I won't reinvent the wheel here. I'll first explain the principle, which you can use to try writing your own; for the concrete implementation I'll only give an example built on an existing framework, so that you can quickly write the crawler you need by following it.
The key to a crawler is analyzing the data we need: the more thorough the analysis, the more efficient the crawler we can write. For example, suppose we want to crawl all NetEase Cloud Music playlists with more than 7 million plays. First we have to find the NetEase Cloud Music playlist page, which looks like this:
The elements we need are the playlist name, the playlist link, and the playlist play count. As you can see, all of them are on this page, so we can fetch it directly with an HttpClient GET request to the URL http://music.163.com/discover/playlist. After fetching, the analysis comes down to two points (a minimal fetch-and-parse sketch follows the list):
- Each playlist area on the page is a single element; from that element we take out the play count, check whether it is greater than 7 million, and if so add the playlist to our result set.
- In the next-page button area, grab the URLs of the paging buttons and add them to our task queue; worker threads keep taking URLs from the queue to fetch and parse, repeating step 1.
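As a rough illustration of that fetch-and-parse step, here is a minimal sketch using Apache HttpClient 4.x. The regular expression used to pull out play counts is only a placeholder assumption; inspect the real page markup in the browser (some of the data may also be loaded via JavaScript) and adjust it, so treat this as a starting point rather than working extraction code.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaylistFetchSketch {

    public static void main(String[] args) throws Exception {
        // Fetch the playlist listing page with a plain GET request.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://music.163.com/discover/playlist");
            // Some sites reject requests that lack a browser-like User-Agent.
            get.setHeader("User-Agent", "Mozilla/5.0");
            try (CloseableHttpResponse response = client.execute(get)) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                // Placeholder pattern for the play-count element; the real markup
                // has to be checked with the browser's developer tools.
                Pattern playCount = Pattern.compile("data-res-action=\"play\"[^>]*>([0-9]+)");
                Matcher m = playCount.matcher(html);
                while (m.find()) {
                    long plays = Long.parseLong(m.group(1));
                    if (plays > 7_000_000) {
                        System.out.println("playlist with " + plays + " plays found");
                    }
                }
            }
        }
    }
}
```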
That completes the analysis; next let's look at the technical implementation. I won't write the code out in full here because, as said above, this article does not reinvent the wheel, and a complete example built on an existing framework comes later. For now, here is the general approach, which you can try yourself:
- First we need a task queue in which the URLs waiting to be processed are stored.
- Create a thread pool whose threads take URLs from the task queue and parse them, waiting if the queue is empty.
- After taking a URL, a thread issues the simulated request (HttpClient, jodd HttpRequest, etc., whichever you prefer) and receives the response data.
- Run regular-expression matches against the response; URLs that match and need further parsing are put into the queue, and waiting threads are woken up.
- Data that matches what we need is added to the result set; it can be written to a file, stored in a database, or printed directly, depending on your needs.
That is roughly the core of a crawler. Note that the fourth step needs some extra work (for example, de-duplicating URLs that have already been seen) to reduce the total number of requests and improve efficiency.
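To make the five steps above concrete, here is a minimal sketch of that skeleton in plain Java, without any framework. Every name in it is made up for illustration, fetch() is left as a stub, and both the link pattern and the "does this page match?" test are placeholders, so this is a shape to start from rather than a finished crawler.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal crawler skeleton: task queue + thread pool + regex extraction + de-duplication.
public class SimpleCrawler {

    private final BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();  // step 1: task queue
    private final Set<String> seen = ConcurrentHashMap.newKeySet();               // step 4: de-duplication
    private final Queue<String> results = new ConcurrentLinkedQueue<>();          // step 5: result set
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\""); // placeholder link pattern

    public void start(String seedUrl, int threads) throws InterruptedException {
        taskQueue.put(seedUrl);
        ExecutorService pool = Executors.newFixedThreadPool(threads);             // step 2: thread pool
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String url = taskQueue.take();     // blocks (waits) while the queue is empty
                        String html = fetch(url);          // step 3: issue the simulated request
                        Matcher m = LINK.matcher(html);    // step 4: regex-match the response
                        while (m.find()) {
                            String next = m.group(1);
                            if (seen.add(next)) {          // only enqueue URLs we have not seen yet
                                taskQueue.put(next);
                            }
                        }
                        if (html.contains("播放")) {       // placeholder "does this page match?" test
                            results.add(url);              // step 5: collect the hit
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private String fetch(String url) {
        // Placeholder: fetch the URL with HttpClient / jodd HttpRequest and return the body.
        return "";
    }
}
```

A BlockingQueue gives us the "wait if the queue is empty" behaviour for free, and the seen set is exactly the de-duplication work mentioned for step 4.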
Next, let's look at a concrete example built with a framework. The framework used here is webmagic, shared by Huang Yihua in China; its home page is http://webmagic.io/.
For the example, I want to grab the questions under the travel topic on Zhihu that have more than 30,000 followers and whose content contains the word "eat" (吃) more than 5 times. So first we open the travel topic page:
As you can see, there are a lot of questions on this page, but their follower counts are not shown here, so we open one of the questions to look at the question page itself:
As you can see, the red box on the right holds the follower count we need. Press F12 in Chrome to view the page source, analyze the DOM structure inside, and you arrive at the following approach (a small plain-Java sketch of these checks comes after the list):
- Request https://www.zhihu.com/topic/19551556/top-answers and analyze the page.
- Pages matching the regex https://www.zhihu.com/topic/19551556/top-answers\?page=\d+ are the listing pages of the topic's top answers, where page is the page number.
- Pages matching the regex https://www.zhihu.com/question/\d+ are the individual question pages.
- On a question page, read the follower count and check whether it is greater than 30,000; if so, take the question body, count how many times it contains "eat" (吃), and save the result if the count is more than 5.
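Before moving to the framework code, here is a tiny plain-Java sketch of those checks, assuming the question page's HTML and body text are already available as strings. The "人关注该问题" marker used to locate the follower count is my assumption about the page markup and should be verified against the real source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZhihuPageChecks {

    // The two URL patterns from the analysis above.
    static final Pattern LIST_PAGE =
            Pattern.compile("https://www\\.zhihu\\.com/topic/19551556/top-answers\\?page=\\d+");
    static final Pattern QUESTION_PAGE =
            Pattern.compile("https://www\\.zhihu\\.com/question/\\d+");

    // True when a question page has more than 30,000 followers and its body
    // contains the character "吃" ("eat") more than 5 times.
    static boolean isInteresting(String url, String html, String bodyText) {
        if (!QUESTION_PAGE.matcher(url).matches()) {
            return false;
        }
        // Assumed marker text around the follower count; adjust to the real markup.
        Matcher m = Pattern.compile("(\\d+)\\s*人关注该问题").matcher(html);
        if (!m.find() || Long.parseLong(m.group(1)) <= 30000) {
            return false;
        }
        int eats = 0;
        for (int i = bodyText.indexOf('吃'); i >= 0; i = bodyText.indexOf('吃', i + 1)) {
            eats++;
        }
        return eats > 5;
    }
}
```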
Next, let's look at the implementation step by step:
Step one: create a new Maven project and add the Maven dependencies
There is not much to explain in this step; the relevant part of pom.xml is pasted below. Of course, if you don't use Maven, you can download the ready-made jar packages from http://webmagic.io/.
```xml
<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <!-- the patch version was truncated in the original; any 1.7.x release works -->
        <version>1.7.12</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>19.0</version>
    </dependency>
</dependencies>
```
webmagic-core and webmagic-extension are the WebMagic framework itself, shared by Huang Yihua; the other two dependencies are external libraries that WebMagic relies on. Of course, if you use plain jar packages instead of Maven, add those two jars to your project as well.
Step two: write the concrete implementation
Because an existing framework is used, the implementation code is very simple: just one class.
```java
package org.white.spider;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * <p>Zhihu crawler</p>
 *
 * @author Mr White
 * @version v 0.1 ZhihuTravelProcessor.java
 * @date 2016/4/19 20:55
 */
public class ZhihuTravelProcessor implements PageProcessor {

    // Marker strings around the follower count on the question page
    // ("人关注该问题" means "people follow this question").
    private static final String FOCUS_BEGIN_STR = "</button>";
    private static final String FOCUS_END_STR = "人关注该问题";

    // The sleep time and timeout values were unreadable in the original paste;
    // the numbers below are reasonable placeholders.
    private Site site = Site.me()
            .setCycleRetryTimes(5)
            .setRetryTimes(5)
            .setSleepTime(500)
            .setTimeOut(3 * 60 * 1000)
            .setUserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0")
            .addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            .addHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3")
            .setCharset("UTF-8");

    private static Map<String, String> eatMap = new HashMap<String, String>();

    public void process(Page page) {
        // Queue up both the topic listing pages and the individual question pages.
        page.addTargetRequests(page.getHtml().links().regex("(https://www.zhihu.com/topic/19551556/top-answers\\?page=\\d+)").all());
        page.addTargetRequests(page.getHtml().links().regex("(https://www.zhihu.com/question/\\d+)").all());
        if (page.getUrl().regex("(https://www.zhihu.com/question/\\d+)").match()) {
            List<String> playCountList = page.getHtml().xpath("//div[@class='zm-side-section-inner zg-gray-normal']/html()").all();
            if (playCountList.size() == 1) {
                // Extract the follower count that sits between the two marker strings.
                String focusStr = playCountList.get(0);
                long focus = Long.parseLong(focusStr.substring(
                        focusStr.indexOf(FOCUS_BEGIN_STR) + FOCUS_BEGIN_STR.length(),
                        focusStr.indexOf(FOCUS_END_STR)));
                if (focus > 30000) {
                    // Count occurrences of "吃" ("eat") in the question body.
                    List<String> eatList = page.getHtml().xpath("//div[@class='zm-item-rich-text js-collapse-body']/html()").regex("吃").all();
                    List<String> titleList = page.getHtml().xpath("//title/html()").all();
                    if (eatList.size() > 5) {
                        eatMap.put(page.getUrl().toString(), titleList.get(0));
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        Spider.create(new ZhihuTravelProcessor())
                .addUrl("https://www.zhihu.com/topic/19551556/top-answers")
                .thread(5)
                .run();
        System.out.println("====================================total===================================");
        for (String s : eatMap.keySet()) {
            System.out.println("title:" + eatMap.get(s));
            System.out.println("href:" + s);
        }
    }

    public Site getSite() {
        return site;
    }
}
```
The above is the complete implementation. It is all quite simple, so I won't explain it line by line; you can write it yourself and run it to see. For everything about WebMagic, refer to the documentation at http://webmagic.io/docs/zh/, which is very detailed.
That's it for today; if you have any questions, feel free to ask.
My blog: blog.scarlettbai.com
My WeChat public account: Reading fitness program
Getting Started with Java Crawlers (NetEase Cloud Music and Zhihu examples)