Implementing a Java Crawler for Information Crawling


Today the company needed to crawl query data from some designated websites, so I spent a little time writing a demo for demonstration purposes.

The idea is simple: access a link through Java, get back the HTML string, and then parse the needed data out of it. Technically I use Jsoup for convenient page parsing; Jsoup is very convenient and very simple, and a few lines of code show how to use it:

Document doc = Jsoup.connect("http://www.oschina.net/")
    .data("query", "Java")   // request parameter
    .userAgent("I'm Jsoup")  // set the User-Agent
    .cookie("auth", "token") // set a cookie
    .timeout(3000)           // set the connection timeout
    .post();                 // access the URL using the POST method
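For a GET request, the .data() calls above essentially amount to URL-encoding each key/value pair into the query string. Here is a minimal sketch of that encoding step in plain Java (the base URL and parameter names are just illustrative, not part of Jsoup):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryStringSketch {
    // Build "base?k1=v1&k2=v2" from parallel key/value arrays, UTF-8 encoded.
    public static String buildQuery(String baseUrl, String[] params, String[] values)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(baseUrl);
        for (int i = 0; i < params.length; i++) {
            sb.append(i == 0 ? '?' : '&');
            sb.append(URLEncoder.encode(params[i], "UTF-8"));
            sb.append('=');
            sb.append(URLEncoder.encode(values[i], "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildQuery("http://www.oschina.net/search",
                new String[] { "query" }, new String[] { "Java" }));
        // http://www.oschina.net/search?query=Java
    }
}
```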

The entire implementation process is described below:

1. Analyze the page that needs to be parsed:

URL: http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery

Page:

First, perform a query on this page and observe the requested URL, parameters, method, and so on.

Here we use Chrome's built-in developer tools (shortcut key F12); the following is the result of the query:

We can see the URL, the method, and the parameters. Now that we know how to query the URL, we can start coding. For reuse and extensibility, I defined several classes:

1. Rule.java is used to specify the query URL, method, parameters, and so on:

package com.zhy.spider.rule;

/**
 * @author zhy
 */
public class Rule
{
    /**
     * link
     */
    private String url;

    /**
     * parameter set
     */
    private String[] params;

    /**
     * values corresponding to the parameters
     */
    private String[] values;

    /**
     * the tag used for the first filter on the returned HTML; please also set the type
     */
    private String resultTagName;

    /**
     * CLASS / ID / SELECTION
     * sets the type of resultTagName; the default is ID
     */
    private int type = ID;

    /**
     * GET / POST
     * request type; the default is GET
     */
    private int requestMoethod = GET;

    public final static int GET = 0;
    public final static int POST = 1;

    public final static int CLASS = 0;
    public final static int ID = 1;
    public final static int SELECTION = 2;

    public Rule()
    {
    }

    public Rule(String url, String[] params, String[] values,
            String resultTagName, int type, int requestMoethod)
    {
        super();
        this.url = url;
        this.params = params;
        this.values = values;
        this.resultTagName = resultTagName;
        this.type = type;
        this.requestMoethod = requestMoethod;
    }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public String[] getParams() { return params; }
    public void setParams(String[] params) { this.params = params; }

    public String[] getValues() { return values; }
    public void setValues(String[] values) { this.values = values; }

    public String getResultTagName() { return resultTagName; }
    public void setResultTagName(String resultTagName) { this.resultTagName = resultTagName; }

    public int getType() { return type; }
    public void setType(int type) { this.type = type; }

    public int getRequestMoethod() { return requestMoethod; }
    public void setRequestMoethod(int requestMoethod) { this.requestMoethod = requestMoethod; }
}

To put it simply: this Rule class defines all the information we need during the query process. It makes extension and code reuse easier; we do not want to write fresh code for every site that needs to be crawled.
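As a quick sanity check of how the rule is meant to be used, here is a self-contained sketch (with a trimmed-down copy of the Rule fields, so the snippet compiles on its own; the URL is a placeholder) showing the one consistency requirement a rule must satisfy: the parameter names and values must line up one-to-one.

```java
public class RuleSketch {
    // Trimmed-down stand-in for Rule: just the fields relevant to the check.
    static class Rule {
        final String url;
        final String[] params;
        final String[] values;

        Rule(String url, String[] params, String[] values) {
            this.url = url;
            this.params = params;
            this.values = values;
        }

        // true when every parameter name has a matching value
        boolean isConsistent() {
            if (params == null || values == null) {
                return params == values; // both null is acceptable
            }
            return params.length == values.length;
        }
    }

    public static void main(String[] args) {
        Rule ok = new Rule("http://example.com/search",
                new String[] { "query" }, new String[] { "Java" });
        Rule bad = new Rule("http://example.com/search",
                new String[] { "query", "page" }, new String[] { "Java" });
        System.out.println(ok.isConsistent());  // true
        System.out.println(bad.isConsistent()); // false
    }
}
```

This is the same check the core service performs before issuing a request; a rule that fails it is rejected up front rather than producing a half-built query.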

2. The data object we need; currently only links are required: LinkTypeData.java

package com.zhy.spider.bean;

public class LinkTypeData
{
    private int id;

    /**
     * link address
     */
    private String linkHref;

    /**
     * link title
     */
    private String linkText;

    /**
     * abstract
     */
    private String summary;

    /**
     * contents
     */
    private String content;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getLinkHref() { return linkHref; }
    public void setLinkHref(String linkHref) { this.linkHref = linkHref; }

    public String getLinkText() { return linkText; }
    public void setLinkText(String linkText) { this.linkText = linkText; }

    public String getSummary() { return summary; }
    public void setSummary(String summary) { this.summary = summary; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}

3. The core query class: ExtractService.java

package com.zhy.spider.core;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zhy.spider.bean.LinkTypeData;
import com.zhy.spider.rule.Rule;
import com.zhy.spider.rule.RuleException;
import com.zhy.spider.util.TextUtil;

/**
 * @author zhy
 */
public class ExtractService
{
    /**
     * @param rule
     * @return
     */
    public static List<LinkTypeData> extract(Rule rule)
    {
        // make the necessary validation of the rule
        validateRule(rule);

        List<LinkTypeData> datas = new ArrayList<LinkTypeData>();
        LinkTypeData data = null;
        try
        {
            /**
             * parse the rule
             */
            String url = rule.getUrl();
            String[] params = rule.getParams();
            String[] values = rule.getValues();
            String resultTagName = rule.getResultTagName();
            int type = rule.getType();
            int requestType = rule.getRequestMoethod();

            Connection conn = Jsoup.connect(url);
            // set the query parameters
            if (params != null)
            {
                for (int i = 0; i < params.length; i++)
                {
                    conn.data(params[i], values[i]);
                }
            }

            // set the request type
            Document doc = null;
            switch (requestType)
            {
            case Rule.GET:
                doc = conn.timeout(100000).get();
                break;
            case Rule.POST:
                doc = conn.timeout(100000).post();
                break;
            }

            // process the returned data
            Elements results = new Elements();
            switch (type)
            {
            case Rule.CLASS:
                results = doc.getElementsByClass(resultTagName);
                break;
            case Rule.ID:
                Element result = doc.getElementById(resultTagName);
                results.add(result);
                break;
            case Rule.SELECTION:
                results = doc.select(resultTagName);
                break;
            default:
                // when resultTagName is empty, use the body tag
                if (TextUtil.isEmpty(resultTagName))
                {
                    results = doc.getElementsByTag("body");
                }
            }

            for (Element result : results)
            {
                Elements links = result.getElementsByTag("a");
                for (Element link : links)
                {
                    // necessary filtering
                    String linkHref = link.attr("href");
                    String linkText = link.text();

                    data = new LinkTypeData();
                    data.setLinkHref(linkHref);
                    data.setLinkText(linkText);
                    datas.add(data);
                }
            }
        } catch (IOException e)
        {
            e.printStackTrace();
        }

        return datas;
    }

    /**
     * the necessary validation of the incoming parameters
     */
    private static void validateRule(Rule rule)
    {
        String url = rule.getUrl();
        if (TextUtil.isEmpty(url))
        {
            throw new RuleException("url cannot be empty!");
        }
        if (!url.startsWith("http://"))
        {
            throw new RuleException("url is not in the correct format!");
        }

        if (rule.getParams() != null && rule.getValues() != null)
        {
            if (rule.getParams().length != rule.getValues().length)
            {
                throw new RuleException("the number of parameter keys and values does not match!");
            }
        }
    }
}
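The service above calls TextUtil.isEmpty(), but the article never shows that helper. A plausible minimal implementation (an assumption on my part; the original TextUtil is not included in the post) would be:

```java
// Assumed helper; would live in com.zhy.spider.util in the project layout above.
public class TextUtil
{
    // true when the string is null or contains only whitespace
    public static boolean isEmpty(String s)
    {
        return s == null || s.trim().length() == 0;
    }
}
```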

4. An exception class used inside: RuleException.java

package com.zhy.spider.rule;

public class RuleException extends RuntimeException
{
    public RuleException()
    {
        super();
    }

    public RuleException(String message, Throwable cause)
    {
        super(message, cause);
    }

    public RuleException(String message)
    {
        super(message);
    }

    public RuleException(Throwable cause)
    {
        super(cause);
    }
}

5. Finally, the tests. Two websites are used for testing here, with different rules; see the code for the specifics.

package com.zhy.spider.test;

import java.util.List;

import com.zhy.spider.bean.LinkTypeData;
import com.zhy.spider.core.ExtractService;
import com.zhy.spider.rule.Rule;

public class Test
{
    @org.junit.Test
    public void getDatasByClass()
    {
        Rule rule = new Rule(
                "http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery",
                new String[] { "query.enterprisename", "query.registationNumber" },
                new String[] { "兴网", "" }, // "兴网" is the search keyword ("Hing Net")
                "cont_right", Rule.CLASS, Rule.POST);
        List<LinkTypeData> extracts = ExtractService.extract(rule);
        printf(extracts);
    }

    @org.junit.Test
    public void getDatasByCssQuery()
    {
        Rule rule = new Rule("http://www.11315.com/search",
                new String[] { "name" }, new String[] { "兴网" },
                "div.g-mn div.con-model", Rule.SELECTION, Rule.GET);
        List<LinkTypeData> extracts = ExtractService.extract(rule);
        printf(extracts);
    }

    public void printf(List<LinkTypeData> datas)
    {
        for (LinkTypeData data : datas)
        {
            System.out.println(data.getLinkText());
            System.out.println(data.getLinkHref());
            System.out.println("***********************************");
        }
    }
}

Output results:

Shenzhen Network Hing Technology Co., Ltd.
http://14603257.11315.com
***********************************
Jinzhou Hing Network Road Materials Co., Ltd.
http://05155980.11315.com
***********************************
Xian City All Hing Internet Café
***********************************
Zichang County Emerging Network City
#
***********************************
Shaanxi Xing Network Information Co., Ltd. Third Branch
***********************************
Xi'an Happy Network Technology Co., Ltd.
#
***********************************
Shaanxi Tong Xing Network Information Co., Ltd. Xian Branch
#
***********************************

Finally, use a Baidu News query to test whether our code is generic:

/**
 * Baidu News: only set the url, the keyword, and the return type
 */
@org.junit.Test
public void getDatasByCssQueryUserBaidu()
{
    Rule rule = new Rule("http://news.baidu.com/ns",
            new String[] { "word" }, new String[] { "支付宝" }, // "支付宝" = Alipay
            null, -1, Rule.GET);
    List<LinkTypeData> extracts = ExtractService.extract(rule);
    printf(extracts);
}

We only set the link, the keyword, and the request type, and did not set any specific filter criteria.

Results: some junk data is inevitable, but the data we need is definitely crawled out as well. We can set Rule.SELECTION to further restrict the filter criteria.

Sorted by Time
/ns?word=支付宝&ie=utf-8&bs=支付宝&sr=0&cl=2&rn=20&tn=news&ct=0&clk=sortbytime
***********************************
x
javascript:void(0)
***********************************
Alipay will jointly build a security fund; the first input is 40 million
http://finance.ifeng.com/a/20140409/12081871_0.shtml
***********************************
7 same news
/ns?word=%e6%94%af%e4%bb%98%e5%ae%9d+cont:2465146414%7c697779368%7c3832159921&same=7&cl=1&tn=news&rn=30&fm=sd
***********************************
Baidu Snapshot
http://cache.baidu.com/c?m=9d78d513d9d437ab4f9e91697d1cc0161d4381132ba7d3020cd0870fd33a541b0120a1ac26510d19879e20345dfe1e4bea876d26605f75a09bbfd91782a6c1352f8a2432721a844a0fd019adc1452fc423875d9dad0ee7cdb168d5f18c&p=c96ec64ad48b2def49bd9b780b64&newp=c4769a4790934ea95ea28e281c4092695912c10e3dd796&user=baidu&fm=sc&query=%d6%a7%b8%b6%b1%a6&qid=a400f3660007a6c5&p1=1
***********************************
OpenSSL vulnerability involves many sites; Alipay says no data leaked
http://tech.ifeng.com/internet/detail_2014_04/09/35590390_0.shtml
***********************************
26 same news
/ns?word=%e6%94%af%e4%bb%98%e5%ae%9d+cont:3869124100&same=26&cl=1&tn=news&rn=30&fm=sd
***********************************
Baidu Snapshot
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece7631050803743438014678387492ac3933fc239045c1c3aa5ec677e4742ce932b2152f4174bed843670340537b0efca8e57dfb08f29288f2c367117845615a71bb8cb31649b66cf04fdea44a7ecff25e5aac5a0da4323c044757e97f1fb4d7017dd1cf4&p=8b2a970d95df11a05aa4c32013&newp=9e39c64ad4dd50fa40bd9b7c5253d8304503c52251d5ce042acc&user=baidu&fm=sc&query=%d6%a7%b8%b6%b1%a6&qid=a400f3660007a6c5&p1=2
***********************************
Yahoo Japan begins to support Alipay payment in June
http://www.techweb.com.cn/ucweb/news/id/2025843
***********************************
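Besides tightening the selector with Rule.SELECTION, the junk entries visible in the output (href values like "#" and "javascript:void(0)") could also be dropped with a small post-filter over the extracted hrefs. A sketch, using plain strings rather than the LinkTypeData bean so it stands alone:

```java
import java.util.ArrayList;
import java.util.List;

public class LinkFilterSketch {
    // Drop hrefs that clearly are not article links: empty values,
    // in-page anchors ("#"), and javascript pseudo-links.
    public static List<String> filterHrefs(List<String> hrefs) {
        List<String> kept = new ArrayList<String>();
        for (String href : hrefs) {
            if (href == null || href.isEmpty()) continue;
            if (href.equals("#")) continue;
            if (href.startsWith("javascript:")) continue;
            kept.add(href);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hrefs = new ArrayList<String>();
        hrefs.add("http://finance.ifeng.com/a/20140409/12081871_0.shtml");
        hrefs.add("javascript:void(0)");
        hrefs.add("#");
        System.out.println(filterHrefs(hrefs)); // keeps only the ifeng link
    }
}
```

The same predicate could be applied inside ExtractService's link loop before a LinkTypeData is added to the result list.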

If there are any deficiencies, please point them out; if you found this useful, give it a thumbs up ~~ haha.


The above is an example of Java crawler information crawling. Related material will continue to be supplemented in follow-ups; thank you for your support of this site!
