Java examples for capturing HTML pages

Source: Internet
Author: User


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class Url {
    public static void main(String[] args) throws Exception {
        String html = getURLContent();
        System.out.println(html);
    }

    /** Get webpage content. */
    private static String getURLContent() throws MalformedURLException, IOException, UnsupportedEncodingException {
        URL urlmy = new URL("http://www.baidu.com");
        HttpURLConnection con = (HttpURLConnection) urlmy.openConnection();
        HttpURLConnection.setFollowRedirects(true);
        // The per-connection setting overrides the static default above,
        // so this connection does not follow redirects.
        con.setInstanceFollowRedirects(false);
        con.connect();
        // try-with-resources closes the reader (and the underlying stream) when done.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String s;
            StringBuilder sb = new StringBuilder();
            while ((s = br.readLine()) != null) {
                sb.append(s).append("\r\n");
            }
            return sb.toString();
        }
    }
}
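The example above hard-codes "UTF-8" when decoding the response, so a page served in another encoding would come out garbled. As a minimal sketch, the charset could instead be taken from the Content-Type response header, falling back to UTF-8 when none is declared. The class name CharsetSniffer and the method charsetFrom are hypothetical names used for illustration; they are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSniffer {
    /**
     * Picks the charset named in a Content-Type header value such as
     * "text/html; charset=ISO-8859-1". Falls back to UTF-8 if the header
     * is missing, has no charset parameter, or names an unsupported charset.
     */
    public static Charset charsetFrom(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the default.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
    }
}
```

With a helper like this, the reader in the example could be built as new InputStreamReader(con.getInputStream(), CharsetSniffer.charsetFrom(con.getContentType())). Pages that declare their encoding only in a meta tag would still need separate handling.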

How can Java fetch the HTML source of a web page?

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Test {
    public static void main(String[] argv) {
        try {
            Test.testNetStream();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void testNetStream() throws Exception {
        // The URL needs a protocol; "www.baidu.com" alone throws MalformedURLException.
        URL url = new URL("http://www.baidu.com");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream ins = url.openStream()) {
            byte[] b = new byte[8192];
            int n;
            // A single read() may return only part of the page, so loop until end of stream.
            while ((n = ins.read(b)) != -1) {
                out.write(b, 0, n);
            }
        }
        // Assumes the page is UTF-8 encoded.
        String s = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(s);
    }
}

Java development: crawling HTML pages (urgent)

First, fetch the page:
    String html = getContent(url, Constants.ENCODING_UTF8);
Then parse the page:
    Document doc = Jsoup.parse(html);
Then get the tag you want:
    String tag = doc.getElementsByTag("title").first().text();
If the pages use many different tags, you have to check for each case, so start by looking for what the pages have in common. When I scrape web page data, the most annoying part is that the formats differ and many of the tags are different. I have been doing this for a long time and have not found a better way. If you know one, I hope you can tell me; it would make my work much faster. Thank you.
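The first step above calls getContent(url, Constants.ENCODING_UTF8), which the original does not show. Below is a minimal sketch of what such a helper might look like, using only java.net and java.io. The class name PageFetcher, the readAll helper, and the timeout values are assumptions made for illustration, and the encoding is passed as a plain charset name rather than through the poster's Constants class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {
    /** Reads the whole stream and decodes it with the given charset name. */
    public static String readAll(InputStream in, String encoding) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // Loop until end of stream; one read() may return only part of the data.
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString(encoding);
    }

    /** Hypothetical stand-in for the poster's getContent(url, encoding). */
    public static String getContent(String url, String encoding) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(5000); // assumed timeout, in milliseconds
        con.setReadTimeout(5000);    // assumed timeout, in milliseconds
        try (InputStream in = con.getInputStream()) {
            return readAll(in, encoding);
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline check of the decoding step; no network access needed.
        byte[] bytes = "<title>Demo</title>".getBytes("UTF-8");
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8"));
    }
}
```

The string returned by a helper like this can be handed to Jsoup.parse(html) as in the steps above.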
