Java Implementation: Crawling a Company's Official Website News

Source: Internet
Author: User

This comes from an internship project. At the time we had no cooperation from the partner company and no access to their news API, but the project had to go live urgently, so my director asked me to write a simple crawler. The main utility class, NewsUtil.java, is pasted below for reference.

NewsUtil.java

package org.news.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Helper class for crawling news content.
 * @author GEENKDC
 * @time 2015-07-28 15:15:04
 */
public class NewsUtil {

    /**
     * Collects links to news entries from the page at the given URL.
     * @param url the page to scan for news links
     * @return the list of news-entry URLs found on the page
     * @throws Exception if the URL is malformed or cannot be read
     */
    public static ArrayList<String> findUrlByUrl(String url) throws Exception {
        URL url0 = new URL(url);
        ArrayList<String> urlList = new ArrayList<String>();
        URLConnection con;
        BufferedReader br = null;
        try {
            con = url0.openConnection();
            InputStream in = con.getInputStream();
            br = new BufferedReader(new InputStreamReader(in));
            String str;
            // Scan the page line by line and collect every news link.
            while ((str = br.readLine()) != null) {
                urlList.addAll(findUrl(str));
            }
        } catch (IOException e) {
            throw new RuntimeException("URL read-write error: " + e.getMessage());
        } finally {
            // Close the reader even if reading failed.
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    throw new RuntimeException("URL stream close exception: " + e.getMessage());
                }
            }
        }
        return urlList;
    }

    /**
     * Does the actual extraction of news URLs from a chunk of HTML.
     * @param str a line (or block) of HTML
     * @return the news-entry URLs found in the string
     */
    public static ArrayList<String> findUrl(String str) {
        ArrayList<String> urlList = new ArrayList<String>();
        // Pattern for news-entry URLs (this site exposes them as *.jhtml pages).
        String regex = "http://[a-zA-Z0-9_\\.:\\d/?=&%]+\\.jhtml";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        // Walk through every match of the pattern.
        while (m.find()) {
            // The part between the last '/' and '.jhtml' must be a numeric article id.
            String subStr = m.group().substring(
                    m.group().lastIndexOf("/") + 1,
                    m.group().lastIndexOf(".jhtml"));
            try {
                if (subStr.matches("[0-9]*")) {
                    urlList.add(m.group());
                }
            } catch (Exception e) {
                throw new RuntimeException("Error matching news URL: " + e.getMessage());
            }
        }
        return urlList;
    }

    /**
     * Fetches the news content from the page at the given URL.
     * @param url the URL of a single news entry
     * @return the extracted content blocks
     * @throws Exception if the URL is malformed or cannot be read
     */
    public static ArrayList<String> findContentByUrl(String url) throws Exception {
        URL url1 = new URL(url);
        ArrayList<String> conList = new ArrayList<String>();
        URLConnection con;
        BufferedReader br = null;
        try {
            con = url1.openConnection();
            InputStream in = con.getInputStream();
            InputStreamReader isr = new InputStreamReader(in, "UTF-8");
            br = new BufferedReader(isr);
            String str;
            StringBuffer sb = new StringBuffer();
            // Read the whole page into one string before matching,
            // since the content div spans many lines.
            while ((str = br.readLine()) != null) {
                sb.append(str);
            }
            conList.addAll(findContent(sb.toString()));
        } catch (IOException e) {
            throw new RuntimeException("URL read-write error: " + e.getMessage());
        } finally {
            // Close the reader even if reading failed.
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    throw new RuntimeException("URL stream close exception: " + e.getMessage());
                }
            }
        }
        return conList;
    }

    /**
     * Does the actual extraction of news content from a page's HTML.
     * @param str the full HTML of a news page
     * @return the content blocks that match the site's layout
     */
    public static ArrayList<String> findContent(String str) {
        ArrayList<String> strList = new ArrayList<String>();
        // Pattern for the div that holds the news content on this site.
        String regex = "<div class=\"con_box\">([\\s\\S]*)</div>([\\s\\S]*)<div class=\"left_con\">";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        // Walk through every match of the pattern.
        while (m.find()) {
            try {
                strList.add(m.group());
            } catch (Exception e) {
                throw new RuntimeException("Error extracting news content: " + e.getMessage());
            }
        }
        return strList;
    }
}
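To make the numeric-id filter in findUrl concrete, here is a small illustrative test (not part of the original post; the example.com URLs are placeholders, not the real site):

package org.news.util;

/**
 * Tiny illustration of the numeric-id filter in NewsUtil.findUrl.
 * The URLs below are hypothetical placeholders.
 */
public class FindUrlDemo {
    public static void main(String[] args) {
        String hit  = "<a href=\"http://www.example.com/news/12345.jhtml\">Headline</a>";
        String miss = "<a href=\"http://www.example.com/news/about.jhtml\">About us</a>";

        // "12345" is a numeric article id, so the link is kept.
        System.out.println(NewsUtil.findUrl(hit));  // [http://www.example.com/news/12345.jhtml]
        // "about" is not numeric, so the link is dropped.
        System.out.println(NewsUtil.findUrl(miss)); // []
    }
}

The first link survives because its last path segment, 12345, matches [0-9]*; the second is rejected because "about" does not, which keeps navigation pages out of the crawl.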

Brief functional description:

Given the URL of the site's homepage, the program automatically collects the URLs of matching news entries and then fetches the content of each entry from its URL.
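As a usage sketch (not from the original post), the two public methods might be wired together like this; the homepage URL below is a hypothetical placeholder:

package org.news.util;

import java.util.ArrayList;

/**
 * End-to-end driver: collect news links from the homepage,
 * then fetch and print the content of each entry.
 */
public class NewsCrawlerDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical homepage; substitute the real site's URL.
        String homepage = "http://www.example.com/";
        ArrayList<String> newsUrls = NewsUtil.findUrlByUrl(homepage);
        for (String newsUrl : newsUrls) {
            ArrayList<String> contents = NewsUtil.findContentByUrl(newsUrl);
            for (String content : contents) {
                System.out.println(content);
            }
        }
    }
}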
