First, some background: during an internship project, we did not yet have access to the partner company's news API, but the project had to go online urgently, so my supervisor asked me to write a simple crawler instead. Below is the main utility class, NewsUtil.java, for your reference.
NewsUtil.java
package org.news.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Utility class for crawling news content.
 * @author GEENKDC
 * @time 2015-07-28 15:15:04
 */
public class NewsUtil {

    /**
     * Collects news links from the page at the given URL.
     * @param url the page to scan for news links
     * @return the list of matching news URLs
     * @throws Exception
     */
    public static ArrayList<String> findUrlByUrl(String url) throws Exception {
        URL url0 = new URL(url);
        ArrayList<String> urlList = new ArrayList<String>();
        URLConnection con;
        BufferedReader br = null;
        try {
            con = url0.openConnection();
            InputStream in = con.getInputStream();
            br = new BufferedReader(new InputStreamReader(in));
            String str;
            while ((str = br.readLine()) != null) {
                urlList.addAll(findUrl(str));
            }
        } catch (IOException e) {
            throw new RuntimeException("URL read-write error: " + e.getMessage());
        } finally {
            // close in finally so the stream is released even if reading fails
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    throw new RuntimeException("URL stream close exception: " + e.getMessage());
                }
            }
        }
        return urlList;
    }

    /**
     * Does the actual work of extracting news URLs from a line of HTML.
     * @param str one line of page source
     * @return the news URLs found on that line
     */
    public static ArrayList<String> findUrl(String str) {
        ArrayList<String> urlList = new ArrayList<String>();
        // regex matching news URLs (this site's articles end in .jhtml)
        String regex = "http://[a-zA-Z0-9_\\.:\\d/?=&%]+\\.jhtml";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        // collect every substring that matches the pattern
        while (m.find()) {
            // the last path segment (without ".jhtml") is the article id
            String subStr = m.group().substring(m.group().lastIndexOf("/") + 1,
                    m.group().lastIndexOf(".jhtml"));
            try {
                // keep only links whose id is purely numeric
                if (subStr.matches("[0-9]*")) {
                    urlList.add(m.group());
                }
            } catch (Exception e) {
                throw new RuntimeException("Error matching news URL: " + e.getMessage());
            }
        }
        return urlList;
    }

    /**
     * Fetches the news content from the page at the given URL.
     * @param url the news article URL
     * @return the extracted content blocks
     * @throws Exception
     */
    public static ArrayList<String> findContentByUrl(String url) throws Exception {
        URL url1 = new URL(url);
        ArrayList<String> conList = new ArrayList<String>();
        URLConnection con;
        BufferedReader br = null;
        try {
            con = url1.openConnection();
            InputStream in = con.getInputStream();
            InputStreamReader isr = new InputStreamReader(in, "UTF-8");
            br = new BufferedReader(isr);
            String str;
            StringBuffer sb = new StringBuffer();
            while ((str = br.readLine()) != null) {
                sb.append(str);
            }
            conList.addAll(findContent(sb.toString()));
        } catch (IOException e) {
            throw new RuntimeException("URL read-write error: " + e.getMessage());
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    throw new RuntimeException("URL stream close exception: " + e.getMessage());
                }
            }
        }
        return conList;
    }

    /**
     * Does the actual work of extracting news content from a page source.
     * @param str the full page source
     * @return the matched content blocks
     */
    public static ArrayList<String> findContent(String str) {
        ArrayList<String> strList = new ArrayList<String>();
        // regex matching the news content div (site-specific markup)
        String regex = "<div class=\"con_box\">([\\s\\S]*)</div>([\\s\\S]*)<div class=\"left_con\">";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        while (m.find()) {
            try {
                strList.add(m.group());
            } catch (Exception e) {
                throw new RuntimeException("Error extracting news content: " + e.getMessage());
            }
        }
        return strList;
    }
}
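To see the link filter in isolation, here is a minimal, self-contained sketch of the same idea as findUrl: match `.jhtml` links, then keep only those whose last path segment is a numeric article id. The URLs and HTML below are made up for illustration, and the pattern is a simplified variant of the one in the class above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // Simplified variant of the pattern in NewsUtil.findUrl:
    // absolute http links ending in .jhtml
    static final Pattern NEWS_URL =
            Pattern.compile("http://[a-zA-Z0-9_.:/?=&%]+\\.jhtml");

    static List<String> extractNewsUrls(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = NEWS_URL.matcher(line);
        while (m.find()) {
            String match = m.group();
            // last path segment without ".jhtml", e.g. "12345" in .../news/12345.jhtml
            String id = match.substring(match.lastIndexOf('/') + 1,
                    match.lastIndexOf(".jhtml"));
            // keep only links with a purely numeric article id
            if (id.matches("[0-9]+")) {
                urls.add(match);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/news/12345.jhtml\">a</a>"
                + " <a href=\"http://example.com/about/contact.jhtml\">b</a>";
        // only the numeric-id link survives the filter
        System.out.println(extractNewsUrls(html));
    }
}
```

The numeric-id check is what keeps navigation pages like `contact.jhtml` out of the result while article pages like `12345.jhtml` get through.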
A brief description of how it works:
Given the URL of the site's homepage, the program automatically collects the URLs of the matching news items, then fetches the content of each news item from its URL.
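The two steps above can be sketched end to end without touching the network by substituting a map of canned pages for the HTTP fetch. Everything here, including the URLs and the con_box markup, is hypothetical and simplified from the class above (a non-greedy group is used instead of the site-specific two-div pattern):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlPipelineDemo {
    static final Pattern NEWS_URL =
            Pattern.compile("http://[a-zA-Z0-9_.:/?=&%]+\\.jhtml");
    // simplified content pattern: inner text of the con_box div
    static final Pattern CONTENT =
            Pattern.compile("<div class=\"con_box\">([\\s\\S]*?)</div>");

    /** Step 1: collect article links from the homepage HTML. */
    static List<String> findNewsLinks(String homepageHtml) {
        List<String> urls = new ArrayList<>();
        Matcher m = NEWS_URL.matcher(homepageHtml);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    /** Step 2: pull the article body out of one article page. */
    static List<String> findContent(String pageHtml) {
        List<String> blocks = new ArrayList<>();
        Matcher m = CONTENT.matcher(pageHtml);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // canned "site": homepage links to one article page
        Map<String, String> site = new HashMap<>();
        site.put("home", "<a href=\"http://example.com/news/101.jhtml\">item</a>");
        site.put("http://example.com/news/101.jhtml",
                "<div class=\"con_box\">Article body</div>");

        // chain the two steps, as NewsUtil does over real connections
        for (String link : findNewsLinks(site.get("home"))) {
            System.out.println(findContent(site.get(link)));
        }
    }
}
```

Worth noting: regex-based HTML extraction like this is fragile (nested divs or template changes break it), which is why it only made sense here as a stopgap until the real news API became available.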
That is the whole Java implementation for crawling a company's official website news.