Pulling a serialized novel with HttpClient

Source: Internet
Author: User

I only started reading this novel in the morning, and by the afternoon I was trying to pull it from its website and turn it into an e-book. Just do it ~

[Packet capture]

This step matters more than anything else. If you cannot find the request that fetches the real content, nothing else you do will help.

I first tried downloading all the pages with the Thunder (Xunlei) download manager and processing them locally, but the saved pages contained only the page shell with no chapter text. Looking at the JavaScript, it turned out that on document ready the page sends an ajax POST to another URL to retrieve the content.

So the next step was to capture the traffic and inspect it. Packet capture tools turned out to be hard to use: I tried two or three without success and finally got it working with Firefox. Opening the page sends 50 requests in total, but only two of them are POST requests, so the content of the relevant HTTP exchange is quickly found.

 

[Writing the crawler]

The URL, request headers, and form data are all known, so what are you waiting for? Write the crawling code. I was worried that I would have to impersonate a browser and fill in cookies, but after debugging I found I had overthought it: posting the form straight to the URL is enough.
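As a rough illustration of "just post the form", the sketch below builds an application/x-www-form-urlencoded body and a bare POST request. It uses the JDK's built-in java.net.http classes (Java 11+) instead of Apache HttpClient so that it runs without extra jars. The URL and the form field names and values are hypothetical placeholders, not the real site's.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FormPostSketch {
    /** Encodes form fields as key=value pairs joined by '&', URL-encoding both sides. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Builds a POST carrying only a content type and the form body: no cookies, no browser headers. */
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("bookId", "123");  // hypothetical field name and value
        fields.put("chapter", "45");  // hypothetical field name and value
        HttpRequest request = buildPost("http://example.invalid/chapter", fields);
        // The request is only built here, not sent.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```

To actually send it, the request would be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray()); the byte-array handler keeps the body raw in case it arrives compressed.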

I learned HttpClient on the spot; the QuickStart.java example in the official distribution is clear enough. After that I used the debugger to inspect the request and the response.

For usage, see HttpPostCrawlOnePage(HttpRequestBase). I won't embarrass myself by posting MyFileWriter; it is just file I/O.

Problems encountered and how I solved them:

1. The body returned in the response is gzip-compressed, so it has to be decoded with the corresponding class (GzipDecompressingEntity);

2. Some serial numbers in the URL sequence have no page behind them; these can be detected from the status code in the response and skipped.
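The two fixes above can be sketched with JDK classes only. This is a minimal sketch, not the author's code: it uses java.util.zip.GZIPInputStream where the original uses HttpClient's GzipDecompressingEntity, it assumes the body is UTF-8 text, and the helper names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ResponseHandlingSketch {
    /** True when the status code is outside 2xx, i.e. the page number should be skipped. */
    static boolean shouldSkip(int status) {
        return status < 200 || status >= 300;
    }

    /** Decompresses a gzip-encoded body and decodes it as UTF-8 (the charset is an assumption). */
    static String gunzipToString(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Demo helper: gzips a string so the example runs without a network. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("chapter text");
        for (int status : new int[] {200, 404}) {
            if (shouldSkip(status)) {
                System.out.println("[Error] status " + status + ", skipping this page");
                continue;
            }
            System.out.println(gunzipToString(body));
        }
    }
}
```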

(That is the end of the article. The code follows.)

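    // Listing as recovered from the page. Extraction dropped the keywords and most
    // right-hand sides; "..." marks text that is missing from the source.
    public class MyCrawl {
        static CloseableHttpClient httpclient = ...;

        public static void main(String[] args) {
            int startChapter = ...;  // page serial number
            int endChapter = ...;    // page serial number
            Integer bookId = ...;    // bookid
            String outPattern = "c:\\book*.txt";
            MyFileWriter fw = ...;
            HttpPost httpPost = new HttpPost("url");
            List<NameValuePair> nvps = new ArrayList<NameValuePair>(2);
            nvps.add(new BasicNameValuePair("B", ...));
            nvps.add(new BasicNameValuePair("c", "placeholder"));
            for (Integer i = startChapter, j = 0; i <= endChapter; i++, j++) {
                nvps.set(1, new BasicNameValuePair("c", ...));
                httpPost.setEntity(...);
                String outStr = ...;
                if (outStr == null || ...) {
                    j--;  // missing page: keep the chapter-title index in step
                    ...
                }
                outStr = "=====" + MyCrawl.chapterArr[j] + "\r\n" + ...;
                ...
            }
            System.out.println("completed");
        }

        // Method header missing from the source.
        ... {
            CloseableHttpResponse resp = ...;
            int status = ...;
            if (status < 200 || status >= 300) {
                System.out.println("[Error]" + ... + "");
                ...
            }
            if (status != 200) {
                System.out.println("[Warn]" + ... + "");
            }
            HttpEntity entity = ...;
            if (entity ...) {
                GzipDecompressingEntity gEntity = ...;
                result = ...;
            }
            result = ...;
        }

        // Method header missing from the source.
        ... {
            if (txt == null || "" ...) ...
            int contentStart = txt.indexOf("content") + 10;
            int contentEnd = txt.indexOf("<br/>\",\"next");
            ...
            txt = txt.replace("<br/>", "\r\n");
            ...
        }

        static String[] chapterArr = new String[] {"Chapter 1", "Chapter 2"};
    }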
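The extraction step in the listing (the two indexOf calls and the replace) can be tried in isolation. The sketch below is an assumption-laden reconstruction: it supposes the response has the shape {"content":"...<br/>","next":...}, which is what the offsets in the listing suggest, and the class name, method name, and sample response are invented for illustration.

```java
final class ChapterExtractSketch {
    /** Returns the chapter text from the raw response, or null if the expected markers are missing. */
    static String extract(String txt) {
        if (txt == null || txt.isEmpty()) {
            return null;
        }
        int key = txt.indexOf("content");
        int end = txt.indexOf("<br/>\",\"next");
        if (key < 0 || end < 0) {
            return null;
        }
        // Skip the 7 letters of the key plus the 3 characters that follow it: quote, colon, quote.
        int start = key + 10;
        if (start > end) {
            return null;
        }
        // Turn HTML line breaks into Windows line endings for the text file.
        return txt.substring(start, end).replace("<br/>", "\r\n");
    }

    public static void main(String[] args) {
        // Made-up response in the assumed shape.
        String sample = "{\"content\":\"Line one<br/>Line two<br/>\",\"next\":2}";
        System.out.println(extract(sample));
    }
}
```

Plain string searching like this is brittle: it breaks if the word content appears earlier in the response or if the field order changes, so a JSON parser would be the sturdier choice for anything longer-lived.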
