Summary of Java methods for collecting web page content
To write a Java collection (web scraping) program, I found three ways on the Internet to fetch the content of a single web page. Since I am not familiar with Java I/O streams, I wrote this summary.
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Get_Html {

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        String str_url = "http://www.hiphop8.com/city/guangdong/guangzhou.php";
        // Matches number sections that sit between '>' and '<' in the page source.
        Pattern p = Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");
        // String html = get_Html_2(str_url);
        // String html = get_Html_1(str_url);
        String html = get_Html_3(str_url);
        Matcher m = p.matcher(html);
        int num = 0;
        while (m.find()) {
            System.out.println("printed number section: " + m.group(1) + " Number " + (++num));
        }
        System.out.println(num);
        long end = System.currentTimeMillis();
        System.out.println("time spent " + (end - start) + " milliseconds");
    }

    // Method 2: read the page line by line through url.openStream().
    public static String get_Html_2(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content = "";
        StringBuffer page = new StringBuffer();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return page.toString();
    }

    // Method 1: read the page line by line through an HttpURLConnection.
    public static String get_Html_1(String str_url) throws IOException {
        URL url = new URL(str_url);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        StringBuilder contentBuf = new StringBuilder();
        try (BufferedReader bufReader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line = "";
            while ((line = bufReader.readLine()) != null) {
                contentBuf.append(line);
            }
        }
        return contentBuf.toString();
    }

    /**
     * Method 3: get the source code of a web page from its URL.
     * @param str_url the address of the page
     * @return String
     * @throws Exception
     */
    public static String get_Html_3(String str_url) throws Exception {
        URL url = new URL(str_url);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000); // set the connection timeout to 5 seconds
        java.io.InputStream inStream = conn.getInputStream(); // get the HTML as raw bytes through the input stream
        byte[] data = readInputStream(inStream); // read the whole stream into a byte array
        String htmlSource = new String(data); // decode the bytes with the platform default charset
        return htmlSource;
    }

    /**
     * Read all bytes of an input stream into a byte array.
     * @param inStream
     * @return byte[]
     * @throws Exception
     */
    public static byte[] readInputStream(java.io.InputStream inStream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len = 0;
        while ((len = inStream.read(buffer)) != -1) {
            outStream.write(buffer, 0, len);
        }
        inStream.close();
        return outStream.toByteArray();
    }
}
[Results of 6 test runs per method, in milliseconds] Perhaps because only a single small page was fetched, the three methods perform about the same, but method 2 is probably the best choice because it is the simplest.
get_Html_1: 967, 2658, 1132, 1199, 988, 1236
get_Html_2: 2323, 2244, 1202, 1166, 1081, 1011
get_Html_3: 978, 1219, 1527, 1133, 1192, 1774
1. About url.openStream() and conn.getInputStream()
Both return an InputStream object. url.openStream() is shorthand for url.openConnection().getInputStream(): it obtains a URLConnection object through the openConnection() method and then calls its getInputStream() method. Therefore method 2 and method 1 do the same thing, but method 2 is more convenient.
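A minimal sketch of this equivalence. To avoid depending on the network, it assumes a temporary local file (reached through a file: URL) stands in for the web page; the class and method names are hypothetical and not part of the program above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OpenStreamDemo {

    // Drains the stream into memory, closes it, and decodes the bytes as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        try (InputStream stream = in) {
            while ((len = stream.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // The short form, as used by method 2.
    static String viaOpenStream(URL url) throws IOException {
        return readAll(url.openStream());
    }

    // The long form, as used by methods 1 and 3.
    static String viaOpenConnection(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        return readAll(conn.getInputStream());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a web page: a small temporary HTML file.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        URL url = page.toUri().toURL();
        System.out.println(viaOpenStream(url).equals(viaOpenConnection(url)));
        Files.delete(page);
    }
}
```

The long form is still worth having when the connection needs to be configured first (for example a timeout or request method, as method 3 does).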
2. About the BufferedReader class
[Function of this class]: it reads a character stream into a buffer (a small area in memory) so that reading is efficient.
[Constructors]:
BufferedReader(Reader in) creates a buffered character input stream that uses an input buffer of the default size.
BufferedReader(Reader in, int sz) creates a buffered character input stream that uses an input buffer of the specified size.
[Common method]: readLine() reads one line of text at a time, without the line terminator, and returns null when the end of the stream is reached.
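The readLine() loop can be tried without a network connection. This is only a sketch: a StringReader stands in for the page, the class and method names are hypothetical, and the buffer size of 8192 chars is an arbitrary choice for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class BufferedReaderDemo {

    // Collects every line of the source. readLine() strips the line terminator
    // ("\n", "\r" or "\r\n") and returns null at the end of the stream.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        // Two-argument constructor: the second argument is the buffer size in chars.
        try (BufferedReader in = new BufferedReader(source, 8192)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = readLines(new StringReader("<html>\n<body>hi</body>\r\n</html>"));
        System.out.println(lines.size() + " lines: " + lines);
    }
}
```

Because the terminators are stripped, appending the lines directly (as methods 1 and 2 do) joins the whole page into a single line.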
3. About the InputStreamReader class
InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters according to the specified encoding. It is a subclass of Reader.
For higher efficiency, we often wrap an InputStreamReader in a BufferedReader, so a usage we often see is:
BufferedReader buf = new BufferedReader(new InputStreamReader(System.in));
Here the InputStreamReader converts the byte stream, so the statement above turns a byte input stream into a character input stream and then places it in a buffer.
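To see why the encoding matters, here is a small sketch that decodes the same bytes with two different charsets. A ByteArrayInputStream stands in for the network stream, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InputStreamReaderDemo {

    // Wraps a byte stream in an InputStreamReader (bytes -> chars) and a
    // BufferedReader (buffering), then reads it character by character.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two Chinese characters (U+7F51 U+9875), encoded as UTF-8: 3 bytes each.
        byte[] utf8 = "\u7f51\u9875".getBytes(StandardCharsets.UTF_8);
        String right = decode(utf8, StandardCharsets.UTF_8);      // the original 2 chars
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1); // 6 garbled chars
        System.out.println(right.length() + " vs " + wrong.length());
    }
}
```

Decoding with a charset that differs from the one the page was written in produces garbled text, which is why methods 1 and 2 pass "UTF-8" explicitly.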
4. About the ByteArrayOutputStream class
ByteArrayOutputStream is a subclass of OutputStream that writes data into an in-memory byte array, which grows automatically as data is written. Its constructors are ByteArrayOutputStream() and ByteArrayOutputStream(int size). The collected data can be retrieved in the desired format with the toByteArray() or toString() method. (Its counterpart ByteArrayInputStream(byte[] buf) does the reverse: it turns the byte array buf into an input stream.) The readInputStream method in method 3 could be changed to return a String by replacing the final outStream.toByteArray() with outStream.toString(), which simplifies the code.
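A sketch of that simplification, under stated assumptions: the method name readToString and the charsetName parameter are my own additions, and toString(String charsetName) is used instead of the no-argument toString() so that the result does not depend on the platform default charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ReadToStringDemo {

    // Variant of readInputStream that returns the decoded text directly.
    static String readToString(InputStream inStream, String charsetName) throws IOException {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        try {
            while ((len = inStream.read(buffer)) != -1) {
                outStream.write(buffer, 0, len);
            }
        } finally {
            inStream.close();
        }
        // Decode the collected bytes with the named charset.
        return outStream.toString(charsetName);
    }

    public static void main(String[] args) throws IOException {
        // A ByteArrayInputStream stands in for conn.getInputStream().
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readToString(new ByteArrayInputStream(page), "UTF-8"));
    }
}
```

Decoding only after all bytes have been collected also avoids splitting a multi-byte character across two buffer reads.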
5. About the InputStream class
InputStream and OutputStream are the base classes of the 8-bit (byte) input/output streams. They are mainly used to process binary data and work in units of bytes. Files, including images, are stored on disk and transmitted as bytes. The other byte stream classes extend them, for example the ByteArrayOutputStream class mentioned above.
Because the InputStream.read() method reads only one byte from the stream per call, it is very inefficient. The InputStream.read(byte[] b) and InputStream.read(byte[] b, int off, int len) methods can read multiple bytes per call, which is much more efficient. That is why method 3 creates a byte array: to read more bytes at a time. All of these read methods return -1 when the end of the stream has been reached.
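The difference can be made visible by counting calls. In this sketch the class and method names are hypothetical, and an in-memory ByteArrayInputStream is assumed as the source; a network stream may return fewer bytes per call than the buffer size, so the bulk count there would usually be higher.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCallsDemo {

    // Number of read() calls needed to drain the stream one byte at a time,
    // including the final call that returns -1.
    static int countSingleByteReads(InputStream in) throws IOException {
        int calls = 0;
        int b;
        do {
            b = in.read();
            calls++;
        } while (b != -1);
        return calls;
    }

    // Number of read(byte[]) calls needed to drain the stream with a buffer,
    // including the final call that returns -1.
    static int countBulkReads(InputStream in, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        int len;
        do {
            len = in.read(buffer);
            calls++;
        } while (len != -1);
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[2500];
        System.out.println(countSingleByteReads(new ByteArrayInputStream(data)));
        System.out.println(countBulkReads(new ByteArrayInputStream(data), 1024));
    }
}
```

For a 2500-byte in-memory source, the single-byte loop needs one call per byte plus the final -1, while the 1024-byte buffer needs only a handful of calls.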
In addition, Reader and Writer are the base classes of the character input/output streams. They work in units of char, and in memory 1 char = 2 bytes, so a common Chinese character occupies one char, that is, two bytes. Their subclasses include the InputStreamReader and BufferedReader classes mentioned above.
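A short sketch of the char/byte distinction. The class and method names are hypothetical; note that the two-byte figure describes the in-memory char, while the size on the wire depends on the encoding the page uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVsByteDemo {

    // Number of 16-bit chars the string occupies in memory.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the string occupies once encoded with the given charset.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one common Chinese character
        System.out.println(charCount(zhong));                            // chars in memory
        System.out.println(byteCount(zhong, StandardCharsets.UTF_16BE)); // bytes as UTF-16
        System.out.println(byteCount(zhong, StandardCharsets.UTF_8));    // bytes as UTF-8
    }
}
```

This is why a byte stream has to be decoded with the right charset before it is treated as text: the same character is two bytes in UTF-16 but three bytes in UTF-8.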
Most of the points in this summary turned out to be about Java IO streams. Should I change the title? I thought about it and decided not to. After all, IO streams are a very important part of a collection program, and Java provides a rich class library for them. Keep learning and accumulating.