How can I grab something valuable from a web page? Read the following (very simple) program. Whatever information you want to crawl from a page (title, content, e-mail, price, and so on), you can extract it in the same way.
package catchhtml;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetHtmlTitle {

    public GetHtmlTitle(String htmlUrl) {
        System.out.println("\n------------ Start reading the web page (" + htmlUrl + ") -----------");
        String htmlSource = getHtmlSource(htmlUrl); // get the source code of the page at htmlUrl
        System.out.println("------------ Finished reading the web page (" + htmlUrl + ") -----------\n");
        System.out.println("------------ Analysis results for (" + htmlUrl + ") -----------\n");
        String title = getTitle(htmlSource);
        System.out.println("Site title: " + title);
    }

    /**
     * Returns the source code of the web page at the given URL.
     * @param htmlUrl address of the page to download
     * @return the page source, or an empty string if the page could not be read
     */
    public String getHtmlSource(String htmlUrl) {
        StringBuilder sb = new StringBuilder();
        try {
            URL url = new URL(htmlUrl);
            // Read the whole page; try-with-resources closes the reader even on error
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String temp;
                while ((temp = in.readLine()) != null) {
                    sb.append(temp);
                }
            }
        } catch (MalformedURLException e) {
            System.out.println("There is a problem with the format of the URL you entered! Please enter it carefully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    /**
     * Extracts the title from the HTML source (a string).
     * @param htmlSource the page source
     * @return the title text with its tags removed
     */
    public String getTitle(String htmlSource) {
        List<String> list = new ArrayList<String>();
        String title = "";
        // Pattern pa = Pattern.compile("<title>.*?</title>", Pattern.CANON_EQ);
        // Regular expression for the title in the source; CASE_INSENSITIVE also matches <TITLE>
        Pattern pa = Pattern.compile("<title>.*?</title>", Pattern.CASE_INSENSITIVE);
        Matcher ma = pa.matcher(htmlSource);
        while (ma.find()) { // search for strings that match the regular expression
            list.add(ma.group()); // add each matching string to the list
        }
        for (int i = 0; i < list.size(); i++) {
            title = title + list.get(i);
        }
        return outTag(title);
    }

    /**
     * Removes tags from HTML source.
     * @param s a fragment of HTML
     * @return the fragment with everything between < and > removed
     */
    public String outTag(String s) {
        return s.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String htmlUrl = "http://www.157buy.com";
        new GetHtmlTitle(htmlUrl);
    }
}
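The same find-and-collect loop works for the other kinds of information mentioned at the top, such as e-mail addresses and prices. Below is a minimal sketch, not part of the original program: the class name HtmlFieldExtractor, its method names, and both regular expressions are assumptions made for illustration. The e-mail pattern is deliberately simplified, and the price pattern assumes prices are written with a leading dollar sign, so both would need adjusting for a real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original program.
class HtmlFieldExtractor {

    // Simplified e-mail pattern (an assumption): it will miss some valid addresses.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Assumes prices look like "$19.99" or "$5"; change this for the target site.
    private static final Pattern PRICE =
            Pattern.compile("\\$\\s?\\d+(?:\\.\\d{1,2})?");

    // Collects every substring of htmlSource that matches the given pattern.
    static List<String> findAll(Pattern pattern, String htmlSource) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(htmlSource);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    static List<String> extractEmails(String htmlSource) {
        return findAll(EMAIL, htmlSource);
    }

    static List<String> extractPrices(String htmlSource) {
        return findAll(PRICE, htmlSource);
    }

    public static void main(String[] args) {
        String sample = "<p>Contact: sales@example.com</p><span class=\"price\">$19.99</span>";
        System.out.println(extractEmails(sample)); // prints [sales@example.com]
        System.out.println(extractPrices(sample)); // prints [$19.99]
    }
}
```

To use it with the program above, pass the string returned by getHtmlSource to extractEmails or extractPrices. Regular expressions are fine for quick jobs like this, but they break easily when the page layout changes, so treat the patterns as a starting point.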
Java uses regular expressions to extract site titles from web pages