About the reuse of input streams in Java

Source: Internet
Author: User

A few days ago I was writing a crawler. Most examples online use Jsoup directly to fetch and crawl the target URL, but my impression is that Jsoup is mainly an HTML parser and that its ability to connect to a target page is weaker than HttpClient's. So I used HttpClient to connect to and download the target page, and used Jsoup only to parse it.

Jsoup can parse a page from several sources: an input stream, a local File object, a URL, or an HTML string. To avoid Jsoup's own network interface while also reducing disk I/O, I chose to parse from the input stream.

InputStream htmlStream = entity.getContent();
reStream = saveToLocal(htmlStream, filePath);

The input stream returned by getContent() is passed to saveToLocal (shown further down), which stores the page as a file on the local disk. The same input stream object is then handed to Jsoup for parsing.

Document doc = Jsoup.parse(htmlStream, "UTF-8", "");

When I ran the program, the set of pending URLs parsed from the target input stream was always empty. Stepping through with a breakpoint showed that this was because doc was empty. It turned out that once an input stream has been read, its internal position has moved to the end, so the same input stream generally cannot be read a second time.
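The effect is easy to reproduce without any network access. The following is a minimal sketch, not the crawler's code: the helper `drain` is hypothetical, and a `ByteArrayInputStream` stands in for the network stream. The first pass reads every byte; the second pass finds the stream already at its end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ExhaustedStreamDemo {

    // Hypothetical helper: reads until end of stream and returns the byte count.
    static int drain(InputStream in) throws IOException {
        byte[] chunk = new byte[1024];
        int total = 0;
        int count;
        while ((count = in.read(chunk)) != -1) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        // Stand-in for the stream returned by the HTTP client (an assumption).
        InputStream in = new ByteArrayInputStream(page);
        System.out.println("first pass:  " + drain(in) + " bytes");
        // Nothing is left: the position is already at the end of the stream.
        System.out.println("second pass: " + drain(in) + " bytes");
    }
}
```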

There is, however, a way around this: the mark and reset methods defined in the InputStream class provide a mark-and-rewind facility. Among the common input streams, BufferedInputStream is the one that overrides mark and reset with a working implementation (ByteArrayInputStream does too, but it only wraps data that is already in memory). Most other input streams do not support a second pass unless you subclass them and implement it yourself. So I wrapped the network stream directly in a BufferedInputStream to use its mark-and-rewind capability.
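Whether a given stream can mark and rewind can be checked at run time with `markSupported()`. The sketch below is illustrative only: `rawStream` and `ensureMarkSupported` are hypothetical names, and a `SequenceInputStream` is used as a stand-in for the network stream because it keeps the default `InputStream` behaviour of not supporting marks.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;

class MarkSupportDemo {

    // Hypothetical stand-in for a network stream: SequenceInputStream inherits
    // the default InputStream behaviour, where markSupported() returns false.
    static InputStream rawStream() {
        return new SequenceInputStream(
                new ByteArrayInputStream(new byte[] {1, 2, 3}),
                new ByteArrayInputStream(new byte[] {4, 5, 6}));
    }

    // Wraps the stream only when it cannot already mark and reset.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream raw = rawStream();
        System.out.println("raw supports mark:     " + raw.markSupported());
        System.out.println("wrapped supports mark: " + ensureMarkSupported(raw).markSupported());
    }
}
```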

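import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public BufferedInputStream saveToLocal(InputStream content, String filePath) throws IOException {
    BufferedInputStream reader = new BufferedInputStream(content, 8192000);
    byte[] container = new byte[1024];
    // Mark the start of the stream before reading. Note that available() is only
    // an estimate of the bytes readable without blocking, not the total length.
    reader.mark(reader.available() + 1);
    // try-with-resources closes the file even if a read or write fails
    try (FileOutputStream out = new FileOutputStream(filePath)) {
        int count;
        while ((count = reader.read(container)) != -1) {
            // Write only the bytes actually read, not the whole array
            out.write(container, 0, count);
        }
        out.flush();
    }
    // Rewind to the mark so the caller can read the content again
    reader.reset();
    return reader;
}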

In other words, the network stream is first wrapped in a BufferedInputStream and marked before anything is read. The content is then read and written to the output file, reset is called to move the stream position back to the mark, and the stream is handed to Jsoup. This solves the problem of using the stream twice.
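The same mark, copy, reset sequence can be tried entirely in memory. This is a sketch under assumptions, not the saveToLocal method itself: `copyAndRewind` is a hypothetical name, a `ByteArrayOutputStream` replaces the file, and the mark limit is simply set to the buffer size. It requires Java 9 or later for `readAllBytes`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class CopyThenResetDemo {

    // Copies everything to 'sink', then rewinds so the caller can read it again.
    // 'bufferSize' should exceed the content length, otherwise reset() can fail.
    static BufferedInputStream copyAndRewind(InputStream content, OutputStream sink, int bufferSize)
            throws IOException {
        BufferedInputStream reader = new BufferedInputStream(content, bufferSize);
        reader.mark(bufferSize); // remember the starting position
        byte[] chunk = new byte[1024];
        int count;
        while ((count = reader.read(chunk)) != -1) {
            sink.write(chunk, 0, count);
        }
        sink.flush();
        reader.reset(); // back to the marked position
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        InputStream again = copyAndRewind(new ByteArrayInputStream(page), saved, 8192);
        // The second consumer (the parser, in the article) sees the same bytes.
        String secondRead = new String(again.readAllBytes(), StandardCharsets.UTF_8);
        String firstCopy = new String(saved.toByteArray(), StandardCharsets.UTF_8);
        System.out.println("copies match: " + firstCopy.equals(secondRead));
    }
}
```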

In practice, however, another problem appeared: crawling Baidu pages worked without trouble, but crawling Sina pages threw a "reset failure" exception. After some investigation, it turned out that mark and reset have a limitation: they only work within the BufferedInputStream's buffer. That is:

Suppose the buffer size is 10 and the file being read has size 5. We mark the start of the file and then read. The whole file, from start to end, now sits in the buffer, so reset can find the mark and reposition to it without any problem. But if the file is larger than the buffer, the buffer has to be refilled. With a file of size 100, for example, by the end of the read the buffer holds only the section from 90 to 100. When we then call reset, the mark can no longer be found, so reset throws an exception.
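The two cases described above (size 5 and size 100 through a buffer of 10) can be checked with a small experiment. This is a sketch with a hypothetical helper `canRewind`; it uses in-memory data and sets the mark limit equal to the buffer size, which is an assumption made to keep the example close to the description.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

class ResetLimitDemo {

    // Marks the start, reads 'fileSize' bytes through a buffer of 'bufferSize'
    // bytes, then tries to reset. Returns true if reset() succeeded.
    static boolean canRewind(int fileSize, int bufferSize) throws IOException {
        BufferedInputStream reader =
                new BufferedInputStream(new ByteArrayInputStream(new byte[fileSize]), bufferSize);
        reader.mark(bufferSize);
        byte[] chunk = new byte[4];
        while (reader.read(chunk) != -1) {
            // Discard the data; only the stream position matters here.
        }
        try {
            reader.reset();
            return true;
        } catch (IOException markLost) {
            // The mark was dropped when the full buffer had to be refilled.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("file 5, buffer 10:   " + canRewind(5, 10));
        System.out.println("file 100, buffer 10: " + canRewind(100, 10));
    }
}
```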

So when using mark and reset, take care to set the buffer size sensibly. If the files to be read are always very large and the buffer cannot be made that big, we have to accept that such a large input stream may simply not be reusable.
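One way to apply this advice is to pass a deliberate upper bound to mark() rather than available() + 1. In the OpenJDK implementation, BufferedInputStream enlarges its internal buffer on demand up to that read limit, so the initial buffer can stay small. The sketch below assumes a hypothetical helper `readTwice` and in-memory data; what to do when the content exceeds the limit (re-fetch, or re-open the saved file) is left to the caller.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLimitDemo {

    // Reads the whole stream once, then tries to rewind it. 'readLimit' is the
    // caller's upper bound on the content size, passed to mark().
    static boolean readTwice(InputStream source, int readLimit) throws IOException {
        BufferedInputStream reader = new BufferedInputStream(source); // default buffer size
        reader.mark(readLimit);
        byte[] chunk = new byte[1024];
        while (reader.read(chunk) != -1) {
            // First pass, for example saving the page to disk.
        }
        try {
            reader.reset();
        } catch (IOException tooLarge) {
            // Content did not fit within readLimit; the stream cannot be reused.
            return false;
        }
        // Second pass starts from the beginning again.
        return reader.read() != -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[100_000];
        System.out.println("limit 1 MiB: " + readTwice(new ByteArrayInputStream(page), 1 << 20));
        System.out.println("limit 1 KiB: " + readTwice(new ByteArrayInputStream(page), 1024));
    }
}
```

The trade-off is memory: everything read since the mark is held in the buffer, so the read limit is effectively the largest page the crawler is willing to keep in memory.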

This article is from the "Science-mymind" blog. Please keep this source link: http://qkkcoolmax.blog.51cto.com/8843422/1615522
