A summary of ways to fetch web page content in Java

Last Update:2014-08-30 Source: Internet

Author: User


To write a web-scraping program in Java, I learned from online sources three ways to fetch the content of a single page. They rely mainly on Java I/O streams, which I was not familiar with, so I am writing this summary.

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Get_html {

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        String str_url = "http://www.hiphop8.com/city/guangdong/guangzhou.php";
        // Matches 7-digit number segments that sit between '>' and '<'
        Pattern p = Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");

        // Enable exactly one of the three methods per run
        String html = get_html_2(str_url);
        // String html = get_html_1(str_url);
        // String html = get_html_3(str_url);

        Matcher m = p.matcher(html);
        int num = 0;
        while (m.find()) {
            System.out.println("Number segment: " + m.group(1) + ", count: " + (++num));
        }
        System.out.println(num);
        long end = System.currentTimeMillis();
        System.out.println("Time spent: " + (end - start) + " ms");
    }

    // Method 2: url.openStream() wrapped in a BufferedReader
    public static String get_html_2(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content;
        StringBuffer page = new StringBuffer();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "utf-8"))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page.toString();
    }

    // Method 1: HttpURLConnection.getInputStream() wrapped in a BufferedReader
    public static String get_html_1(String str_url) throws IOException {
        URL url = new URL(str_url);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStreamReader input = new InputStreamReader(conn.getInputStream(), "utf-8");
        StringBuilder contentBuf = new StringBuilder();
        try (BufferedReader bufReader = new BufferedReader(input)) {
            String line;
            while ((line = bufReader.readLine()) != null) {
                contentBuf.append(line);
            }
        }
        return contentBuf.toString();
    }

    /**
     * Method 3: fetch the page source from the given URL as raw bytes.
     * @param str_url the page address
     * @return the page source as a String
     * @throws Exception if the connection or the read fails
     */
    public static String get_html_3(String str_url) throws Exception {
        URL url = new URL(str_url);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000);              // set the connection timeout
        InputStream inStream = conn.getInputStream();  // get the HTML as binary data via the input stream
        byte[] data = readInputStream(inStream);       // collect the binary data into a byte array
        // Decode with the same charset as methods 1 and 2 instead of the platform default
        String htmlSource = new String(data, "utf-8");
        return htmlSource;
    }

    /**
     * Read all bytes from the input stream into a byte array.
     * @param inStream the stream to read
     * @return byte[] the bytes that were read
     * @throws Exception if the read fails
     */
    public static byte[] readInputStream(InputStream inStream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inStream.read(buffer)) != -1) {
            outStream.write(buffer, 0, len);
        }
        inStream.close();
        return outStream.toByteArray();
    }
}

Results of 6 test runs for each method (times in milliseconds, as printed by the program). Perhaps because the fetched page is small, the three methods perform about the same, but method 2 is probably the best and the simplest.

get_html_1    967   2658   1132   1199    988   1236
get_html_2   2323   2244   1202   1166   1081   1011
get_html_3    978   1219   1527   1133   1192   1774

1. About url.openStream() and conn.getInputStream()

Both return an InputStream object. url.openStream() is shorthand: it obtains a URLConnection object through openConnection() and then calls getInputStream() on it. Method 2 and method 1 therefore do the same thing, but the former is more convenient.
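A minimal sketch of this equivalence, assuming Java 9 or later for readAllBytes(). To stay runnable offline it reads a temporary file through a file: URL instead of the article's HTTP address; the names OpenStreamDemo, viaOpenStream, viaOpenConnection and sampleUrl are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OpenStreamDemo {

    // Shortcut form: url.openStream()
    static String viaOpenStream(URL url) {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Long form: url.openConnection().getInputStream()
    static String viaOpenConnection(URL url) {
        try (InputStream in = url.openConnection().getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes the text to a temp file and returns a file: URL for it
    static URL sampleUrl(String text) {
        try {
            Path tmp = Files.createTempFile("sample", ".html");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return tmp.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL url = sampleUrl("<td>1380000</td>");
        // Both forms return the same content
        System.out.println(viaOpenStream(url).equals(viaOpenConnection(url))); // prints "true"
    }
}
```

The long form is still worth knowing: it is the only one that lets you set a timeout or a request method on the connection before reading, as method 3 does.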

2. About the BufferedReader class

What the class does: it reads a character stream through a buffer (a small area of memory) so that reading is efficient.

Constructors:

BufferedReader(Reader in) creates a buffering character-input stream that uses a default-sized input buffer.

BufferedReader(Reader in, int sz) creates a buffering character-input stream that uses an input buffer of the specified size.

Common methods: readLine() reads one line of text at a time. It returns the line without its line terminator, or null once the end of the stream is reached.

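The readLine() loop can be tried without a network connection. The sketch below is only an illustration: it reads from a StringReader, and the names ReadLineDemo and readAll are hypothetical. Unlike the methods in the article, it appends a '\n' after every line, because readLine() drops the line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineDemo {

    // Reads every line from the source and re-joins the lines with '\n'
    static String readAll(Reader source) {
        StringBuilder page = new StringBuilder();
        // Second constructor: explicit buffer size in characters
        try (BufferedReader in = new BufferedReader(source, 8192)) {
            String line;
            while ((line = in.readLine()) != null) { // null marks the end of the stream
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // "\r\n" and "\n" are both treated as line terminators
        System.out.print(readAll(new StringReader("<html>\r\n<body>\n</html>")));
    }
}
```

Whether to keep the line breaks depends on the pattern being matched; the article's regular expression does not span lines, so dropping them there is harmless.
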
3. About the InputStreamReader class

InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using the specified charset. It is a subclass of Reader.

For better efficiency, an InputStreamReader is usually wrapped in a BufferedReader, so the usage we often see is

BufferedReader buf = new BufferedReader(new InputStreamReader(System.in));

Here InputStreamReader converts the byte stream into a character stream, so the statement above turns a byte input stream into a character stream and then puts a buffer on top of it.
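To show why the charset argument matters, here is a small sketch that decodes the same bytes twice. It is only an illustration: the bytes come from a ByteArrayInputStream rather than a web page, and the names DecodeDemo and decode are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Decodes the bytes into text with the given charset
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            int c;
            while ((c = in.read()) != -1) { // read() returns one char, or -1 at the end
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String city = "\u5e7f\u5dde"; // two Chinese characters, written as Unicode escapes
        byte[] utf8 = city.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(city));      // prints "true"
        // Decoding with the wrong charset yields one garbled char per byte
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());     // prints "6"
    }
}
```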


4. About the ByteArrayOutputStream class

It is a subclass of OutputStream. Its constructors are ByteArrayOutputStream() and ByteArrayOutputStream(int size). The data written to it accumulates in an in-memory byte array that grows as needed, and toString() or toByteArray() returns the data in the desired form. The readInputStream method in method 3 could return a String instead, by replacing the final outStream.toByteArray() with outStream.toString(), which shortens the code.
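The suggested simplification might look like the sketch below. It is an assumption about how the change could be written, not the article's code: the names CollectDemo and readToString are hypothetical, and toString("UTF-8") is used instead of the no-argument toString() so that the result does not depend on the platform default charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CollectDemo {

    // Copies the whole stream into memory and returns it as UTF-8 text
    static String readToString(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        try (InputStream src = in) {
            while ((len = src.read(buffer)) != -1) {
                out.write(buffer, 0, len); // write only the bytes actually read
            }
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] page = "<td>1380000</td>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readToString(new ByteArrayInputStream(page))); // prints "<td>1380000</td>"
    }
}
```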

5. About the InputStream class

InputStream and OutputStream are the base classes for 8-bit byte input and output streams. They are mainly used for binary data, which is processed byte by byte. Files on disk or in transit, including images, are stored as bytes. The other byte-stream classes extend these two, for example the ByteArrayOutputStream class mentioned above.

Because InputStream.read() reads only one byte at a time from the stream, it is very inefficient. InputStream.read(byte[] b) and InputStream.read(byte[] b, int off, int len) can read many bytes per call and are more efficient, which is why method 3 creates a byte array to read more bytes at once. The read() methods return -1 when the end of the stream is reached.

The other pair of base classes, Reader and Writer, handles character input and output. Characters exist only in memory; on disk or on the network they are encoded as bytes. A Java char occupies 2 bytes (16 bits) in memory, so a Chinese character takes two bytes there, although its encoded size depends on the charset (for example, 3 bytes in UTF-8). The subclasses of Reader include the InputStreamReader and BufferedReader classes mentioned above.
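The return values of the array-based read can be observed directly. In this sketch the data is an in-memory array of 2500 bytes, chosen only for illustration, and the names ChunkDemo and chunkSizes are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ChunkDemo {

    // Records how many bytes each read(byte[]) call delivers until the stream ends
    static List<Integer> chunkSizes(InputStream in, int bufferSize) {
        List<Integer> sizes = new ArrayList<>();
        byte[] buffer = new byte[bufferSize];
        int len;
        try {
            while ((len = in.read(buffer)) != -1) { // -1 signals the end of the stream
                sizes.add(len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 2500 bytes through a 1024-byte buffer: three reads instead of 2500
        System.out.println(chunkSizes(new ByteArrayInputStream(new byte[2500]), 1024)); // prints "[1024, 1024, 452]"
        // On an empty stream the single-byte read() returns -1 immediately
        System.out.println(new ByteArrayInputStream(new byte[0]).read());               // prints "-1"
    }
}
```

With a network stream a call may deliver fewer bytes than the buffer holds even before the end, which is why the loop must use the returned length rather than the buffer size.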
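How many bytes a character occupies once it leaves memory depends on the charset, which a short check makes concrete. The names CharBytesDemo and byteLength are hypothetical, and the character is written as a Unicode escape so that the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharBytesDemo {

    // Number of bytes the string occupies when encoded with the given charset
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one Chinese character, a single Java char

        System.out.println(zhong.length());                                 // prints "1"
        System.out.println(byteLength(zhong, StandardCharsets.UTF_16BE));   // prints "2"
        System.out.println(byteLength(zhong, StandardCharsets.UTF_8));      // prints "3"
        System.out.println(byteLength("a", StandardCharsets.UTF_8));        // prints "1"
    }
}
```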

Having written these notes, I see that they are all about Java I/O streams. Should I change the title? I decided not to: I/O streams are an important part of a scraping program, and Java provides a rich class library for them, so I will keep accumulating knowledge as I learn.
