Document capture: downloading images and files

Source: Internet
Author: User

My first task after joining the new company was to help the editorial department capture articles. The company had already developed an article-grabbing tool, so I used it directly. I could not get used to it, though: it was complicated to operate and the code logic was chaotic. So I made some changes to it, mainly hoping that the tool could be used on as many websites as possible. Although websites differ, after several days I found some things they have in common. By setting an XPath expression for each element, most websites can be crawled.

1. Extract the title links from the article list. The title link is usually an a tag inside a ul or div, and these ul and div elements usually have a class attribute, so for such a page you can set the XPath to: //div[@class='title']

2. The list page is paginated. You can set the number of the start page and of the last page, then loop over that range.

3. After obtaining the content link from the article links above, you can send an HTTP request to fetch the body. If the body is plain text, the problem is easier to solve; the most common requirement is to remove each website's advertisements. If you are lucky, you will meet some kind websites whose advertisement elements have obvious characteristics, such as an id or class attribute, so you can configure those nodes to be filtered out.

4. I am still thinking...
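
Steps 1 to 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original tool's code: it uses System.Xml from the .NET base library, which only parses well-formed markup (real pages usually need a tolerant HTML parser), and the names CrawlSketch, BuildPageUrls, ExtractLinks and RemoveNodes, as well as the URL template and the XPath values passed in, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class CrawlSketch
{
    // Step 2: build one URL per list page from a template containing "{0}".
    public static List<string> BuildPageUrls(string template, int firstPage, int lastPage)
    {
        var urls = new List<string>();
        for (int page = firstPage; page <= lastPage; page++)
        {
            urls.Add(string.Format(template, page));
        }
        return urls;
    }

    // Step 1: collect the href of every <a> below the nodes matched by listXPath.
    public static List<string> ExtractLinks(string xhtml, string listXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: the markup is well-formed XML
        var links = new List<string>();
        foreach (XmlNode a in doc.SelectNodes(listXPath + "//a[@href]"))
        {
            links.Add(a.Attributes["href"].Value);
        }
        return links;
    }

    // Step 3: delete every node matched by filterXPath (for example, ads)
    // and return the remaining markup.
    public static string RemoveNodes(string xhtml, string filterXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: the markup is well-formed XML
        // Copy the matches first so the document is not modified while iterating.
        var matches = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes(filterXPath))
        {
            matches.Add(node);
        }
        foreach (XmlNode node in matches)
        {
            node.ParentNode.RemoveChild(node);
        }
        return doc.DocumentElement.OuterXml;
    }
}
```

For example, CrawlSketch.ExtractLinks(listPageMarkup, "//div[@class='title']") would return the article links of one list page, and CrawlSketch.RemoveNodes(articleMarkup, "//div[@class='ad']") would strip nodes whose class is the (hypothetical) value ad.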

Finally, let's talk about how to download files (such as rar and zip archives), not just text. Originally the tool only supported downloading images, and I found that it used WebClient to download them. That works when the image is referenced by an absolute URL. If the image is generated dynamically from parameters, however, the download fails because the Uri format is reported as invalid.
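
One possible cause of an invalid Uri format error (an assumption; the original does not show the failing URLs) is that the src value is relative to the page instead of absolute. A minimal sketch of resolving it against the page URL before downloading is shown below; UrlSketch, ResolveUrl and the example.com addresses are hypothetical names used only for illustration.

```csharp
using System;

public static class UrlSketch
{
    // Returns an absolute URL for src, resolved against the page it appeared on,
    // or null if src cannot be turned into an http/https URL.
    public static string ResolveUrl(string pageUrl, string src)
    {
        var baseUri = new Uri(pageUrl); // assumption: the page URL itself is absolute
        Uri absolute;
        if (!Uri.TryCreate(baseUri, src, out absolute))
        {
            return null;
        }
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }
        return absolute.AbsoluteUri;
    }
}
```

For instance, resolving "../img/show.aspx?id=5&w=100" against "http://example.com/news/list.aspx" gives "http://example.com/img/show.aspx?id=5&w=100", which can then be passed to the downloader.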

The download implementation is roughly as follows:

    // Requires: using System; using System.IO;
    Stream stream = _response.GetResponseStream();
    FileStream fs = new FileStream(filePath + fileName, FileMode.Create);

    // Attempt 1: download in 1 KB chunks (this produced corrupt files).
    // Note that Write is given _buffer.Length instead of count, so a short
    // read also writes stale bytes left over from the previous chunk.
    // byte[] _buffer = new byte[1024];
    // int count = stream.Read(_buffer, 0, _buffer.Length);
    // while (count > 0)
    // {
    //     fs.Write(_buffer, 0, _buffer.Length);
    //     count = stream.Read(_buffer, 0, _buffer.Length);
    // }
    //
    // fs.Flush();
    // fs.Close();
    // stream.Close();

    // Attempt 2: download one byte at a time (slow, but the files open).
    int size;
    while ((size = stream.ReadByte()) != -1)
    {
        fs.WriteByte(Convert.ToByte(size));
    }
    stream.Close();
    fs.Flush();
    fs.Close();

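Currently, the download reads from the Stream one byte at a time, which is slow. I also tried reading 1 KB at a time, but then the images in the downloaded articles came out wrong (the original screenshot is not preserved) and the downloaded rar files could not be opened. The likely cause is visible in the commented-out code: it writes _buffer.Length bytes on every pass instead of the count that Read returned, so any short read pads the file with stale bytes.

Reading one byte at a time, everything is OK (the original screenshot is not preserved).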
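
A chunked copy that writes only the bytes actually read might look like the sketch below. It is an illustration, not the tool's code; CopySketch and CopyInChunks are hypothetical names, and the 1 KB buffer size is kept only to mirror the earlier attempt.

```csharp
using System;
using System.IO;

public static class CopySketch
{
    // Copies source to destination in 1 KB chunks and returns the total bytes copied.
    public static long CopyInChunks(Stream source, Stream destination)
    {
        byte[] buffer = new byte[1024];
        long total = 0;
        int count;
        // Read may return fewer bytes than requested, so write only `count` bytes.
        while ((count = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, count);
            total += count;
        }
        return total;
    }
}
```

With the stream and fs variables from the download code, this would be called as CopySketch.CopyInChunks(stream, fs). The built-in Stream.CopyTo method does the same job and is usually the simpler choice.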
From jungexingchi
 
