Implementation ideas:
1. Use a java.net.URL object to represent the address of the web page.
2. Obtain a URLConnection object by calling the openConnection() method of the java.net.URL object.
3. Call the getInputStream() method of the URLConnection object to obtain an InputStream for the remote page.
4. Read the stream line by line in a loop, and apply the regular expression compiled into a Pattern object to each line to find the email addresses it contains.
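Step 4 can be tried on its own, without a network connection. The following is only a minimal sketch: the class name EmailExtractor, the helper method extractEmails, and the sample text are hypothetical, and the pattern is a deliberately simple email pattern assumed for illustration, not a complete validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // A simple email pattern (illustrative assumption), compiled once and reused.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_-]+@\\w+\\.[a-z]+(\\.[a-z]+)?");

    // Returns every substring of the given line that matches the pattern.
    static List<String> extractEmails(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample line standing in for one line of a web page.
        String sample = "Contact alice_01@example.com or bob@mail.example.cn for details.";
        for (String email : extractEmails(sample)) {
            System.out.println(email);
        }
    }
}
```

Run against the sample line, this should print alice_01@example.com and bob@mail.example.cn, one per line. Because find() is called in a loop, a line containing several addresses yields all of them, not just the first.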
package regex;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Web crawler: captures the email addresses in a web page.
 */
public class WebCrawlersDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.tianya.cn/publicforum/content/english/1/129176.shtml");
        // Open the connection
        URLConnection conn = url.openConnection();
        // Set the connection timeout to 10 seconds
        conn.setConnectTimeout(1000 * 10);
        // Wrap the input stream so the page can be read line by line
        BufferedReader bufr = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line = null;
        // Regular expression that matches an email address
        String regex = "[a-zA-Z0-9_-]+@\\w+\\.[a-z]+(\\.[a-z]+)?";
        Pattern p = Pattern.compile(regex);
        while ((line = bufr.readLine()) != null) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                System.out.println(m.group()); // print the matched email address
            }
        }
        bufr.close();
    }
}
Result: