Java Crawler Technology: HttpClient Learning Notes

Source: Internet
Author: User

Section 1: HttpClient

I. Introduction to HttpClient

The Hypertext Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today, and more and more Java applications need to access network resources directly over HTTP.

Although the JDK's java.net package provides basic functionality for accessing resources over HTTP, it lacks the features and flexibility that many applications need.

HttpClient, originally a subproject of Apache Jakarta Commons and now part of Apache HttpComponents, is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it supports the latest versions and recommendations of the protocol.

Official site: http://hc.apache.org/

Version used here (4.5.x): http://hc.apache.org/httpcomponents-client-4.5.x/

Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

II. Maven dependency

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

III. A HelloWorld with HttpClient

    package com.guo.httpclient;

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.ParseException;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HelloWorld {

        public static void main(String[] args) {
            // Create an HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create an HttpGet instance
            HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");
            CloseableHttpResponse response = null; // holds the returned response
            try {
                // Execute the HTTP GET request
                response = httpClient.execute(httpGet);
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Get the web page content
                System.out.println("Web page content: " + EntityUtils.toString(entity, "utf-8"));
            } catch (ClientProtocolException e) { // HTTP protocol exception
                e.printStackTrace();
            } catch (ParseException e) { // parsing exception
                e.printStackTrace();
            } catch (IOException e) { // I/O exception
                e.printStackTrace();
            }
            // Close the response only if the request actually produced one
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The code above sends a direct request without simulating a browser, so some sites cannot be crawled this way. The next section addresses this issue.

When crawling a domestic website, you can remove the try/catch blocks from the code above and simply declare the exception on the method with throws.
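The shorter form might look like the sketch below. It is one possible variant, not the original author's code: the class name HelloWorldThrows and the fetch helper are made up for illustration, and try-with-resources is used so that the response and the client are still closed when an exception propagates.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Hypothetical class name, used only for this sketch
final class HelloWorldThrows {

    // Fetches the page body; any I/O failure propagates to the caller.
    static String fetch(String url) throws IOException {
        // try-with-resources closes the response and the client automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Web page content: " + fetch("https://www.cnblogs.com/"));
    }
}
```

The trade-off is that a failed request now ends the program with a stack trace instead of being handled where it occurs, which is usually acceptable for a quick experiment.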

Section 2: Simulating a browser (using Firefox as the example)

Some sites have anti-crawling measures in place, so the direct request from the previous section fails.

To get around this, the request must simulate a browser.

I. Set the User-Agent request header to simulate a browser

1. Request headers

Open a website (www.tuicool.com is used as the example here), press F12, and click the Network tab.

Look at the request headers: they are what the browser sends to the target server, and the server uses them to identify the client.

How do I simulate a browser?

Call the setHeader method on the HttpGet instance to set the header that is sent to the target server.
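A minimal sketch of setting the header is shown here. The class name, the helper method, and the exact Firefox User-Agent string are assumptions for illustration; in practice, copy the User-Agent value your own browser shows in the F12 Network tab.

```java
import org.apache.http.client.methods.HttpGet;

// Hypothetical class name, used only for this sketch
final class BrowserRequest {

    // Example Firefox User-Agent (an assumption); replace it with the value from your browser
    static final String FIREFOX_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

    // Builds a GET request that identifies itself as Firefox
    static HttpGet newBrowserGet(String url) {
        HttpGet httpGet = new HttpGet(url);
        // setHeader replaces any existing header with the same name
        httpGet.setHeader("User-Agent", FIREFOX_UA);
        return httpGet;
    }

    public static void main(String[] args) {
        HttpGet httpGet = newBrowserGet("http://www.tuicool.com/");
        System.out.println(httpGet.getFirstHeader("User-Agent").getValue());
    }
}
```

The returned HttpGet can then be passed to httpClient.execute in the same way as in the HelloWorld example.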

II. Get the response content type (Content-Type)

Call the getContentType method on the HttpEntity. The result is a header, which is a name-value pair, so call getValue to get its value.

Here the line that prints the page content is commented out, so only the content type is output.
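As a rough sketch (the class name, the contentTypeOf helper, and the User-Agent string are assumptions, not the original code), reading the content type could look like this:

```java
import java.io.IOException;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Hypothetical class name, used only for this sketch
final class ContentTypeDemo {

    // Returns the Content-Type header value, or null if there is no entity or no header
    static String contentTypeOf(HttpEntity entity) {
        if (entity == null) {
            return null;
        }
        Header header = entity.getContentType();
        return header == null ? null : header.getValue();
    }

    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/");
        // Example User-Agent (an assumption); copy the real value from your browser
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            System.out.println("Content-Type: " + contentTypeOf(entity));
            // Page content output is left out here:
            // System.out.println(org.apache.http.util.EntityUtils.toString(entity, "utf-8"));
        }
    }
}
```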

In the run result, the content type here includes the encoding utf-8. Some responses include an encoding and some do not; this depends on how the server is configured.

Why get the response content type?

Because a crawl encounters a great many links, and the content type lets us filter out the irrelevant ones.
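One way such a filter might be written is sketched below, using only the JDK. The class and method names are hypothetical, and treating only text/html as relevant is an assumption; a real crawler would choose its own set of accepted types.

```java
import java.util.Locale;

// Hypothetical class name, used only for this sketch
final class ContentTypeFilter {

    // True when a Content-Type value such as "text/html; charset=utf-8" denotes an HTML page
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and compare the media type only
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=utf-8")); // prints true
        System.out.println(isHtml("image/jpeg"));               // prints false
    }
}
```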

III. Get the response status

200: OK; 403: forbidden; 404: page not found; 500: server error

The previous requests succeeded, so their response status was 200.

To get the response status, call the getStatusLine method on the CloseableHttpResponse instance.

Here we only care about whether the code is 200, so chain a getStatusCode call to get just the status code.
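The two calls can be sketched as follows. The class name, the isOk helper, and the User-Agent string are assumptions made for this example.

```java
import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Hypothetical class name, used only for this sketch
final class StatusDemo {

    // True when the response status code is 200
    static boolean isOk(HttpResponse response) {
        return response.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
    }

    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://www.tuicool.com/");
        // Example User-Agent (an assumption); copy the real value from your browser
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Full status line: protocol version, code, and reason phrase
            System.out.println("Status line: " + response.getStatusLine());
            // Just the numeric code
            System.out.println("Status code: " + response.getStatusLine().getStatusCode());
            System.out.println("OK? " + isOk(response));
        }
    }
}
```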

Section 3: Downloading images with HttpClient

First use getContentType to check the type of the resource.

For a picture, the result is image/jpeg, an image type.

Now save the image locally (it could also be saved to a server).

First check that the HttpEntity is not null, then call its getContent method, which returns an InputStream.

With the InputStream in hand, how do we copy the image to a local file?

The traditional way

(Omitted.)
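For reference, a JDK-only version might look like the sketch below. This is a reconstruction under assumptions, not the omitted original: the class and method names are hypothetical, and a byte array stands in for the stream that entity.getContent() would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical class name, used only for this sketch
final class JdkStreamCopy {

    // Copies the whole stream to target, replacing any existing file; returns the bytes written
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) { // make sure the stream is closed afterwards
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the crawler the stream would come from entity.getContent()
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
        Path target = Files.createTempFile("image", ".jpg");
        System.out.println(save(in, target) + " bytes written to " + target);
    }
}
```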

The simple way: Apache Commons IO

First add the Commons IO jar through Maven, then use it to copy the stream to a file.

In real development, how do you know the suffix is .jpg? Take the part after the last dot of an address such as http://xxx.com/xxx.xx, then append it to the name of the saved file.
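A sketch of the Commons IO approach follows, assuming the commons-io:commons-io artifact is on the classpath. The class name, the placeholder URL, and the target path are hypothetical. FileUtils.copyInputStreamToFile is used here; whether the original notes used this method or a similar one is not known.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Hypothetical class name, used only for this sketch
final class ImageDownloader {

    // Writes the stream to the target file, creating parent directories if needed
    static void save(InputStream in, File target) throws IOException {
        FileUtils.copyInputStreamToFile(in, target);
    }

    static void download(String imageUrl, File target) throws IOException {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet(imageUrl))) {
            HttpEntity entity = response.getEntity();
            if (entity != null) { // check the entity before asking for its content
                try (InputStream in = entity.getContent()) {
                    save(in, target);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and file name; substitute a real image address
        File target = new File(System.getProperty("java.io.tmpdir"), "xxx.jpg");
        download("http://xxx.com/xxx.jpg", target);
        System.out.println("Saved to " + target);
    }
}
```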
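The suffix extraction could be sketched like this with the JDK alone. The class and method names are hypothetical, and using java.net.URI to ignore a query string is a choice made for this example.

```java
import java.net.URI;

// Hypothetical class name, used only for this sketch
final class UrlSuffix {

    // Returns the text after the last dot of the URL's final path segment, or "" if there is none
    static String suffixOf(String url) {
        String path = URI.create(url).getPath(); // path only, without query or fragment
        if (path == null) {
            return "";
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        return dot < 0 ? "" : lastSegment.substring(dot + 1);
    }

    // Stitches a base name and the URL's suffix into a file name
    static String fileNameFor(String baseName, String url) {
        String suffix = suffixOf(url);
        return suffix.isEmpty() ? baseName : baseName + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("http://xxx.com/xxx.jpg"));             // prints jpg
        System.out.println(fileNameFor("image1", "http://xxx.com/a.png"));  // prints image1.png
    }
}
```

Note that URI.create throws IllegalArgumentException for malformed addresses, so a crawler would want to catch that for links scraped from pages.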

Section 4: Proxy IP

Use high-anonymity proxies. Search Baidu for "proxy IP" to find one, then call the setConfig method on the HttpGet instance.
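A minimal sketch of routing a request through a proxy is given below. The class and method names are hypothetical, and 203.0.113.10:8080 is a placeholder from a documentation address range, not a working proxy.

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;

// Hypothetical class name, used only for this sketch
final class ProxyRequest {

    // Builds a GET request that will be sent through the given proxy
    static HttpGet newProxiedGet(String url, String proxyIp, int proxyPort) {
        RequestConfig config = RequestConfig.custom()
                .setProxy(new HttpHost(proxyIp, proxyPort))
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }

    public static void main(String[] args) {
        // Placeholder proxy address; substitute a real high-anonymity proxy
        HttpGet httpGet = newProxiedGet("http://www.tuicool.com/", "203.0.113.10", 8080);
        System.out.println("Proxy: " + httpGet.getConfig().getProxy());
    }
}
```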

In a real project, write a small crawler that scrapes a proxy IP site, takes only the first 10 proxy IPs, and puts them in a queue.
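The queue part might be sketched as follows, using only the JDK. The names are hypothetical, the "ip:port" string format is an assumption, and rotating a used proxy back to the tail is one possible policy that the notes do not specify.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical class name, used only for this sketch
final class ProxyQueue {

    static final int MAX_PROXIES = 10;

    // Keeps only the first MAX_PROXIES scraped "ip:port" entries, in order
    static Queue<String> firstTen(List<String> scraped) {
        Queue<String> queue = new ArrayDeque<>();
        for (String proxy : scraped) {
            if (queue.size() == MAX_PROXIES) {
                break;
            }
            queue.add(proxy);
        }
        return queue;
    }

    // Takes the next proxy and puts it back at the tail so the queue rotates (an assumed policy)
    static String next(Queue<String> queue) {
        String proxy = queue.remove();
        queue.add(proxy);
        return proxy;
    }

    public static void main(String[] args) {
        // Placeholder addresses from a documentation range, standing in for scraped results
        Queue<String> queue = firstTen(Arrays.asList("203.0.113.1:8080", "203.0.113.2:8080"));
        System.out.println(next(queue)); // prints 203.0.113.1:8080
        System.out.println(next(queue)); // prints 203.0.113.2:8080
    }
}
```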

Section 5: Connection and read timeouts

I. HttpClient connection timeout

The connection time is how long it takes HttpClient to establish a connection to the host of the target URL. In theory, the closer the host, the shorter this time.

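By default, HttpClient 4.5 does not impose a connection timeout of its own (the RequestConfig default of -1 means the system default applies), so an attempt to reach an unresponsive host can block for a long time.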

If a URL never connects, the blocked request can hold up other threads, so a connection timeout should be set.

For example, set it to 10 seconds: if no connection is established within 10 seconds, an error is raised. Use log4j to record the relevant information.

II. HttpClient read timeout

The read time applies once HttpClient has connected to the target server and is fetching the content. Reading data is normally very fast.

If the amount of data is large, or the target server itself has problems (such as slow database reads or heavy concurrent load), the read time is affected.

A read timeout should be set as well, for example 10 seconds: if the read has not completed within 10 seconds, an error is raised.
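Both timeouts can be set through RequestConfig, as in the sketch below. The class and method names are hypothetical. Note that in HttpClient 4.5 the socket timeout limits the wait between two consecutive data packets rather than the total read time, so it only approximates the "10 seconds to finish reading" rule described above.

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;

// Hypothetical class name, used only for this sketch
final class TimeoutRequest {

    // Builds a GET request with the given connection and read timeouts, in milliseconds
    static HttpGet newGetWithTimeouts(String url, int connectMillis, int readMillis) {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(connectMillis) // maximum time to establish the connection
                .setSocketTimeout(readMillis)     // maximum wait for data between packets
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }

    public static void main(String[] args) {
        // 10 seconds for each timeout, as in the text
        HttpGet httpGet = newGetWithTimeouts("http://www.tuicool.com/", 10000, 10000);
        System.out.println("Connect timeout: " + httpGet.getConfig().getConnectTimeout() + " ms");
        System.out.println("Socket timeout: " + httpGet.getConfig().getSocketTimeout() + " ms");
    }
}
```

When either limit is exceeded, httpClient.execute throws an IOException subclass, which is the place to log the failure.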
