About the reuse of input streams in Java

Last Update:2015-02-27 Source: Internet

Author: User


A few days ago I was writing a crawler. Most examples online use Jsoup directly to connect to and crawl the target URL, but my impression is that, while Jsoup is good at parsing HTML, its ability to connect to the target page is weaker than HttpClient's. So I used HttpClient to connect to and download the target page, and used Jsoup only to parse it.

Jsoup can parse a web page from several sources: an input stream, a local File object, a URL, or an HTML string. To avoid Jsoup's own network interface while also reducing disk I/O, I chose to parse from the input stream.

InputStream htmlStream = entity.getContent();
reStream = saveToLocal(htmlStream, filePath);

The input stream returned by getContent() is passed to the saveToLocal method shown below, which stores the web page on the local disk. The same input stream object is then handed to Jsoup for parsing.

Document doc = Jsoup.parse(htmlIn, "UTF-8", "");

When I ran the program, the set of pending links parsed from the input stream was always empty. Stepping through with a breakpoint showed that this was because doc itself was empty. The reason is that once an input stream has been read, its internal position has moved to the end, so the same input stream generally cannot be read a second time.
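The effect is easy to reproduce without any network access. The following is a minimal sketch, not the crawler's actual code: the class name ExhaustedStreamDemo and the helper drain are made up for illustration, and a ByteArrayInputStream stands in for the page content. It drains the same stream twice and shows that the second pass gets nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original crawler.
class ExhaustedStreamDemo {

    // Reads everything that is left in the stream and returns it as a string.
    static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int count;
        while ((count = in.read(chunk)) != -1) {
            out.write(chunk, 0, count);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the downloaded page.
        InputStream in = new ByteArrayInputStream(
                "<html>hello</html>".getBytes(StandardCharsets.UTF_8));

        String first = drain(in);   // consumes the whole stream
        String second = drain(in);  // position is already at the end, so read() returns -1 at once

        System.out.println("first pass length:  " + first.length());
        System.out.println("second pass length: " + second.length());
    }
}
```

The second pass is empty for the same reason doc was empty: the first consumer (the file writer) left the position at the end of the stream.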

There is one way around this: the mark and reset methods defined in the InputStream class, which let you mark a position and later return to it. Not every stream implements them. The base InputStream does not support them (markSupported() returns false, and reset() throws an IOException), and among the standard streams only a few, such as BufferedInputStream and ByteArrayInputStream, override them to do something useful. Most other input streams do not offer this kind of second pass unless you subclass them yourself. So I wrapped the network stream directly in a BufferedInputStream to use its mark-and-reset capability.
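As a rough illustration of the idea, the sketch below checks markSupported() before and after wrapping, then reads the same bytes twice via mark and reset. Everything named here (MarkResetDemo, plainStream, readTwice) is hypothetical; plainStream is only a stand-in for a stream that, like the base InputStream, does not support mark.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original crawler.
class MarkResetDemo {

    // Stand-in for a stream without mark support: it forwards reads but
    // inherits InputStream's default markSupported() == false.
    static InputStream plainStream(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    // Reads up to n bytes, resets to the mark, and reads them again.
    static String[] readTwice(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by " + in.getClass().getName());
        }
        byte[] buf = new byte[n];
        in.mark(n);                               // remember the current position
        int firstCount = in.read(buf, 0, n);
        String first = new String(buf, 0, Math.max(firstCount, 0), StandardCharsets.UTF_8);
        in.reset();                               // jump back to the mark
        int secondCount = in.read(buf, 0, n);
        String second = new String(buf, 0, Math.max(secondCount, 0), StandardCharsets.UTF_8);
        return new String[] { first, second };
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = plainStream("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println("raw supports mark:      " + raw.markSupported());

        InputStream buffered = new BufferedInputStream(raw);
        System.out.println("buffered supports mark: " + buffered.markSupported());

        String[] passes = readTwice(buffered, 5);
        System.out.println("first pass:  " + passes[0]);
        System.out.println("second pass: " + passes[1]);
    }
}
```

Checking markSupported() first is a cheap guard: calling reset() on a stream that does not support it fails only at run time.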

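import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public BufferedInputStream saveToLocal(InputStream content, String filePath) throws IOException {
    // Wrap the network stream so that mark/reset becomes available.
    BufferedInputStream reader = new BufferedInputStream(content, 8192000);
    byte[] container = new byte[1024];
    // Mark the start before reading anything. Note that available() is only an
    // estimate of what can be read without blocking, not the total length.
    reader.mark(reader.available() + 1);
    try (FileOutputStream out = new FileOutputStream(filePath)) {
        int count;
        while ((count = reader.read(container)) != -1) {
            // Write only the bytes actually read, not the whole array.
            out.write(container, 0, count);
        }
        out.flush();
    }
    // Move back to the mark; throws IOException if the mark is no longer valid.
    reader.reset();
    return reader;
}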

In other words, the network stream is first wrapped in a BufferedInputStream and marked before anything is read. The content is then read and written to the output file, and reset is called to move the stream position back to the marked start. The stream is then handed to Jsoup. This solves the problem of using the stream twice.

In practice, however, another problem appeared. Crawling Baidu pages worked without trouble, but crawling Sina threw a "reset failure" exception. After some searching, it turned out that mark and reset have a limitation: they only work within the BufferedInputStream's buffer. That is:

Suppose the buffer size is 10 and the file being read has size 5. We mark the start of the file and then read. The whole file fits in the buffer from start to end, so when we call reset the mark can still be found and the position moves back to it. That case is fine. But if the file is larger than the buffer, the buffer contents have to be replaced as reading proceeds. With a file of size 100, for example, by the end of the read the buffer holds only bytes 90 to 100, so when we call reset the mark can no longer be found and reset fails with an error.
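The two cases above can be tried out with a small sketch. The class ResetLimitDemo and the method canResetAfterFullRead are hypothetical names, and the sizes are the toy numbers from the paragraph rather than realistic values. The results reflect how OpenJDK's BufferedInputStream behaves; the API contract itself only promises that the mark stays valid for up to readlimit bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical demo class; not part of the original crawler.
class ResetLimitDemo {

    // Marks the start, reads the whole stream through a buffer of the given
    // size, then tries to reset. Returns true if reset() succeeded.
    static boolean canResetAfterFullRead(int dataSize, int bufferSize, int readLimit)
            throws IOException {
        BufferedInputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[dataSize]), bufferSize);
        in.mark(readLimit);
        byte[] chunk = new byte[4];
        while (in.read(chunk) != -1) {
            // discard the data; only the stream position matters here
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            // The mark was dropped while the buffer was being refilled.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // File of 5 bytes, buffer of 10: everything fits, reset works.
        System.out.println("5 bytes, buffer 10, limit 10:    "
                + canResetAfterFullRead(5, 10, 10));
        // File of 100 bytes, buffer of 10: the mark is lost.
        System.out.println("100 bytes, buffer 10, limit 10:  "
                + canResetAfterFullRead(100, 10, 10));
        // Same file, but a read limit that covers it: the buffer grows to keep the mark.
        System.out.println("100 bytes, buffer 10, limit 200: "
                + canResetAfterFullRead(100, 10, 200));
    }
}
```

The third call suggests that the readlimit argument to mark is worth as much attention as the initial buffer size, because the stream enlarges its buffer up to that limit while a mark is active.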

So when using mark and reset, take care to set the buffer size sensibly (the readlimit passed to mark matters as well, since BufferedInputStream enlarges its buffer up to that limit to keep the mark valid). If the files to be read are always very large and the buffer cannot be made that large, we may simply have to accept that such a large input stream cannot be read more than once.
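For very large content, one possible alternative (not what the original code does) is to stop trying to rewind the network stream at all: since the page is being saved to disk anyway, the saved file can be reopened as a fresh stream for each later consumer, or handed to Jsoup as a File. The sketch below shows only the stream-handling part with the standard library; FileBackedReuseDemo, saveToLocal, and reopen are hypothetical names, and the trade-off is extra disk reads in exchange for constant memory use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class; an alternative to mark/reset, not the article's method.
class FileBackedReuseDemo {

    // Copies the stream to disk exactly once and returns the number of bytes written.
    static long saveToLocal(InputStream content, Path target) throws IOException {
        return Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens a brand-new stream over the saved file; can be called as often as needed.
    static InputStream reopen(Path saved) throws IOException {
        return Files.newInputStream(saved);
    }

    public static void main(String[] args) throws IOException {
        Path page = Files.createTempFile("page", ".html");
        try {
            // Stand-in for the network stream.
            InputStream network = new ByteArrayInputStream(
                    "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
            long written = saveToLocal(network, page);
            System.out.println("bytes saved: " + written);

            // Each consumer gets its own stream that starts at the beginning.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = reopen(page)) {
                    System.out.println("pass " + pass + " first byte: " + (char) in.read());
                }
            }
        } finally {
            Files.deleteIfExists(page);
        }
    }
}
```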

This article is from the "Science-mymind" blog. Please keep this source attribution: http://qkkcoolmax.blog.51cto.com/8843422/1615522

