Implementing a High-Performance Web Crawler from Scratch (I): Network Request Analysis and Code Implementation

Source: Internet
Author: User
Tags oauth

Summary

This is the first tutorial in a series on implementing a high-performance web crawler from scratch. Later articles in the series will cover URL deduplication, anti-crawler measures, how to improve crawling efficiency, and distributed crawling.
I wrote a web crawler as a demo for this series. The code is on GitHub (https://github.com/wycm/zhihu-crawler); if you find it interesting, feel free to star it.
Network request analysis is a key step in writing a web crawler. This article uses the Zhihu website as an example and goes from network request analysis to a code implementation in Java.

Purpose

Obtain the profile data of every user that a given user follows.

Request Analysis
  • On most current webpages, the data displayed on the page is generated directly by the website's backend. (On some webpages, the data is displayed only after JavaScript code on the frontend has processed it, for example when the data is obfuscated or encrypted.)
  • Although many websites load data asynchronously with ajax, each load is still an ordinary HTTP request. Once you have identified the request that returns the data you want, getting that data is straightforward. The following steps describe how to analyze HTTP requests.
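One small part of request analysis is reading the query string of a captured request. The sketch below is an illustration rather than part of the original demo: it splits a URL's query string into decoded name/value pairs using only the JDK, which makes a percent-encoded parameter such as `include` readable. The class name `QueryStringInspector` and the method `parseQuery` are hypothetical names chosen for this example, and the `Charset` overload of `URLDecoder.decode` assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting a captured request URL (not part of the original demo).
final class QueryStringInspector {

    // Splits the query string of a URL into decoded name/value pairs, keeping their order.
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string present
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray "&&"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // URLDecoder.decode(String, Charset) requires Java 10+.
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees"
                + "?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender"
                + "&offset=0&limit=20";
        for (Map.Entry<String, String> e : parseQuery(url).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Decoded this way, the request appears to carry three parameters: `include` (the fields to return), `offset`, and `limit`. The meaning of each parameter is inferred from its name, not taken from any official API documentation.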

 

Code Implementation
  • The code uses Apache HttpClient 4.x for Java. I will not explain HttpClient 4.x in detail here; just note that its API differs greatly from the 3.x API.
package com.cnblogs.wycm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * Fetch the information of all users followed by wo-yan-chen-mo.
 * The responses are printed as-is, without detailed parsing
 * (regular expressions, a JSON library, or JsonPath can be used to parse the detailed data).
 */
public class Demo {
    public static void main(String[] args) throws IOException {
        // create the http client
        CloseableHttpClient httpClient = HttpClients.createDefault();

        String url = "https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";

        // create the http request (GET)
        HttpGet request = new HttpGet(url);

        // set the http request header
        request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e00000e20");
        // execute the http request
        CloseableHttpResponse response = httpClient.execute(request);
        // print the response
        String responseStr = EntityUtils.toString(response.getEntity());
        System.out.println(responseStr);

        String nextPageUrl = getNextPageUrl(responseStr);
        boolean isEnd = getIsEnd(responseStr);

        while (!isEnd && nextPageUrl != null) {
            // create the http request (GET)
            request = new HttpGet(nextPageUrl);

            // set the http request header
            request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e00000e20");
            response = httpClient.execute(request);
            // print the response
            responseStr = EntityUtils.toString(response.getEntity());
            System.out.println(responseStr);
            nextPageUrl = getNextPageUrl(responseStr);
            isEnd = getIsEnd(responseStr);
        }
    }

    /**
     * Get the URL of the next page.
     * @param responseStr the JSON response body
     * @return the value of paging.next
     */
    private static String getNextPageUrl(String responseStr) {
        JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
        jsonObject = (JSONObject) jsonObject.get("paging");
        return jsonObject.get("next").toString();
    }

    /**
     * Get the is_end flag.
     * @param responseStr the JSON response body
     * @return the value of paging.is_end
     */
    private static boolean getIsEnd(String responseStr) {
        JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
        jsonObject = (JSONObject) jsonObject.get("paging");
        return (boolean) jsonObject.get("is_end");
    }
}
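The comment at the top of the demo mentions that regular expressions can also be used to pull values out of the response. As a rough illustration of that option, the sketch below extracts the `is_end` and `next` values with `java.util.regex` instead of a JSON library. The class name `PagingRegexSketch` and the sample response are hypothetical; the response shape is only inferred from the `paging`, `next`, and `is_end` keys the demo reads. It matches the first occurrence of each key anywhere in the text and does not decode `\uXXXX` escapes, so a JSON library remains the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based alternative to the fastjson helpers in the demo.
final class PagingRegexSketch {

    // Matches: "is_end": true  or  "is_end": false
    private static final Pattern IS_END =
            Pattern.compile("\"is_end\"\\s*:\\s*(true|false)");

    // Matches: "next": "<string>", allowing backslash escapes inside the string
    private static final Pattern NEXT =
            Pattern.compile("\"next\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the is_end flag; a missing flag is treated as "end" so a crawl loop stops.
    static boolean isEnd(String json) {
        Matcher m = IS_END.matcher(json);
        return !m.find() || Boolean.parseBoolean(m.group(1));
    }

    // Returns the next-page URL, or null when the key is absent.
    static String nextPageUrl(String json) {
        Matcher m = NEXT.matcher(json);
        if (!m.find()) {
            return null;
        }
        // JSON may escape '/' as "\/"; only that escape is undone here.
        return m.group(1).replace("\\/", "/");
    }

    public static void main(String[] args) {
        // Assumed sample response, modelled on the keys used by the demo.
        String sample = "{\"paging\":{\"is_end\":false,"
                + "\"next\":\"https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?offset=20&limit=20\"},"
                + "\"data\":[]}";
        System.out.println(isEnd(sample));
        System.out.println(nextPageUrl(sample));
    }
}
```

Returning `null` for a missing `next` key fits the demo's loop condition, which already stops when `nextPageUrl` is `null`.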
Maven dependency
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.31</version>
    </dependency>

 
