Java examples for fetching the HTML source of a web page
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class Url {
    public static void main(String[] args) throws Exception {
        String html = getURLContent();
        System.out.println(html);
    }

    /** Get the web page content. */
    private static String getURLContent()
            throws MalformedURLException, IOException, UnsupportedEncodingException {
        URL urlmy = new URL("http://www.baidu.com");
        HttpURLConnection con = (HttpURLConnection) urlmy.openConnection();
        HttpURLConnection.setFollowRedirects(true);
        // The per-connection setting below overrides the static default above,
        // so this connection does not follow redirects.
        con.setInstanceFollowRedirects(false);
        con.connect();
        StringBuilder sb = new StringBuilder();
        // Read the response line by line, decoding it as UTF-8.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String s;
            while ((s = br.readLine()) != null) {
                sb.append(s).append("\r\n");
            }
        }
        return sb.toString();
    }
}
How do I fetch the HTML source of a web page in Java?
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class Test {
    public static void main(String[] argv) {
        try {
            Test.testNetStream();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void testNetStream() throws Exception {
        // The URL must include a protocol; "www.baidu.com" alone
        // throws MalformedURLException.
        URL url = new URL("http://www.baidu.com");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream ins = url.openStream()) {
            byte[] b = new byte[8192];
            int n;
            // A single read() may return only part of the page,
            // so keep reading until the end of the stream.
            while ((n = ins.read(b)) != -1) {
                out.write(b, 0, n);
            }
        }
        // Assumes the page is encoded as UTF-8.
        String s = new String(out.toByteArray(), "UTF-8");
        System.out.println(s);
    }
}
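The examples above do not look at the encoding the server declares, so a page served in another charset can come out garbled. Below is a minimal sketch of one way to pick the charset from the Content-Type response header. The class name CharsetSniffer and the method fromContentType are hypothetical, and falling back to UTF-8 is an assumption, not something the original posts state.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {
    /**
     * Returns the charset named in a Content-Type header value,
     * or UTF-8 (an assumed default) when none is given or it is unknown.
     */
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for a "charset=" parameter.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Malformed or unsupported charset name: use the fallback.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(fromContentType("text/html"));
    }
}
```

With an HttpURLConnection named con, the header value could be obtained with con.getContentType() and the result passed to InputStreamReader or to the String constructor. Pages that declare their encoding only in a meta tag are not handled by this sketch.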
Java development: crawling HTML pages (urgent)
First fetch the page:
String html = getContent(url, Constants.ENCODING_UTF8);
Then parse it:
Document doc = Jsoup.parse(html);
Then read the tag you want:
String tag = doc.getElementsByTag("title").first().text();
If the pages use many different tags, you have to check for each case, so look for whatever they have in common. When I scrape web page data, the most annoying part is that the formats differ and many of the tags are different. I have been doing this for a long time and have not found a better way. If you have one, please tell me; it would make my work much faster. Thank you.
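One common way to cope with pages that use different tags is to try a list of candidate tags in priority order and take the first one that matches. The sketch below illustrates only that idea, using plain regular expressions so that it runs without extra libraries. That is a simplification: regular expressions are not a real HTML parser, and the class name FirstTagText and the method firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstTagText {
    /**
     * Tries each tag name in order and returns the text of the first one
     * found in the HTML, or null if none of the candidates is present.
     */
    static String firstMatch(String html, String... tagNames) {
        for (String tag : tagNames) {
            String name = Pattern.quote(tag);
            Pattern p = Pattern.compile(
                    "<" + name + "\\b[^>]*>(.*?)</" + name + ">",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            if (m.find()) {
                // Drop nested tags and surrounding whitespace.
                return m.group(1).replaceAll("<[^>]+>", "").trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><h1>Big <b>News</b></h1></body></html>";
        // Prefer the h1 heading, fall back to the title.
        System.out.println(firstMatch(html, "h1", "title"));
    }
}
```

With Jsoup, the same loop could call doc.select(selector).first() for each candidate selector and stop at the first non-null element.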