Implementing a High-Performance Web Crawler from Scratch (I): Network Request Analysis and Code Implementation
Summary
This is the first tutorial in a series on implementing a high-performance web crawler from scratch. Later articles in the series will cover URL deduplication, anti-crawler measures, improving crawl efficiency, and distributed crawling.
I wrote a demo web crawler to accompany the explanations. It is on GitHub (https://github.com/wycm/zhihu-crawler); interested readers are welcome to star it.
Network request analysis is a key step in writing a web crawler. This article uses the Zhihu website as an example and goes from network request analysis to a code implementation in Java.
Purpose
Obtain the profile data of every user that a given user follows.
Request Analysis
- On most current webpages, most of the displayed data is generated directly by the website's backend (on some pages it is displayed only after processing by JavaScript code on the frontend, for example when the data is obfuscated or encrypted).
- Although many websites load data asynchronously with Ajax, each load is still an HTTP request. Once you identify the request that returns the data you are after, getting that data is straightforward. The following steps describe how to analyze HTTP requests.
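Once a request has been captured, a useful first step is to decode its query string so that the requested fields become readable. The sketch below decodes the include parameter of the followees request used in the code later in this article. The class name IncludeParamDemo is only illustrative, and using the standard-library URLDecoder is an assumption of this sketch rather than part of the original demo.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class IncludeParamDemo {
    // Percent-encoded value of the "include" query parameter, copied from the request URL.
    static final String ENCODED_INCLUDE =
            "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count"
            + "%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics";

    // Decode a percent-encoded query value as UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the comma-separated field list that the request asks the server for.
        System.out.println(decode(ENCODED_INCLUDE));
    }
}
```

Reading the decoded value makes it easier to see which fields (answer_count, follower_count, and so on) the response is expected to carry before writing any parsing code.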
Code Implementation
- The code uses Apache HttpClient 4.x for Java. I will not explain HttpClient 4.x in detail here; just note that its API differs significantly from the 3.x API.
package com.cnblogs.wycm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * Obtain the information of all users followed by wo-yan-chen-mo.
 * Print the user information without parsing it in detail (you can use
 * regular expressions, JSON libraries, or JsonPath to parse the detailed data).
 */
public class Demo {
    public static void main(String[] args) throws IOException {
        // create http client
        CloseableHttpClient httpClient = HttpClients.createDefault();

        String url = "https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";

        // create http request (GET)
        HttpGet request = new HttpGet(url);

        // set http request header
        request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e00000e20");
        // execute http request
        CloseableHttpResponse response = httpClient.execute(request);
        // print response
        String responseStr = EntityUtils.toString(response.getEntity());
        System.out.println(responseStr);

        String nextPageUrl = getNextPageUrl(responseStr);
        boolean isEnd = getIsEnd(responseStr);

        while (!isEnd && nextPageUrl != null) {
            // create http request (GET)
            request = new HttpGet(nextPageUrl);

            // set http request header
            request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e00000e20");
            response = httpClient.execute(request);
            // print response
            responseStr = EntityUtils.toString(response.getEntity());
            System.out.println(responseStr);
            nextPageUrl = getNextPageUrl(responseStr);
            isEnd = getIsEnd(responseStr);
        }
    }

    /**
     * Get next url
     * @param responseStr
     * @return
     */
    private static String getNextPageUrl(String responseStr) {
        JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
        jsonObject = (JSONObject) jsonObject.get("paging");
        return jsonObject.get("next").toString();
    }

    /**
     * Get is_end
     * @param responseStr
     * @return
     */
    private static boolean getIsEnd(String responseStr) {
        JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
        jsonObject = (JSONObject) jsonObject.get("paging");
        return (boolean) jsonObject.get("is_end");
    }
}
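The comment in the demo notes that regular expressions can also be used to pick values out of the response. As a rough sketch of that alternative, the class below extracts the is_end and next fields with java.util.regex instead of fastjson. The class name PagingRegexDemo and the sample response body are hypothetical: the sample only mirrors the paging, next, and is_end keys that the demo reads, not a real captured response.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PagingRegexDemo {
    // Matches "is_end": true|false, allowing optional whitespace around the colon.
    private static final Pattern IS_END =
            Pattern.compile("\"is_end\"\\s*:\\s*(true|false)");
    // Matches "next": "<url>" and captures the URL (assumes no escaped quotes inside it).
    private static final Pattern NEXT =
            Pattern.compile("\"next\"\\s*:\\s*\"([^\"]*)\"");

    // Returns true when is_end is true or the field is missing, so a paging loop stops.
    static boolean isEnd(String response) {
        Matcher m = IS_END.matcher(response);
        return !m.find() || Boolean.parseBoolean(m.group(1));
    }

    // Returns the next-page URL, or null if the response does not contain one.
    static String nextPageUrl(String response) {
        Matcher m = NEXT.matcher(response);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical response body, shaped after the keys the demo reads.
        String sample = "{\"paging\":{\"is_end\":false,\"next\":"
                + "\"https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?offset=20&limit=20\"},"
                + "\"data\":[]}";
        System.out.println(isEnd(sample));
        System.out.println(nextPageUrl(sample));
    }
}
```

A regular expression does not undo JSON escaping (for example a URL serialized with escaped slashes), so a JSON library such as the fastjson calls in the demo is the more robust choice when the response format is not under your control.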
Maven dependency
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5</version>
</dependency>

<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.31</version>
</dependency>