Crawling web page data with Java, including data that requires a login

Source: Internet
Author: User

I recently wrote a small program to fetch data from the web. It is mainly for credit checking: it crawls the blacklists published on some websites into our own system.

I looked for references but did not find a good, comprehensive example, so I am writing these notes as a reminder to myself.

First, you need the jsoup jar; I used version 1.6.0. Download link: http://pan.baidu.com/s/1mgqOuHa

1. Get web page content (core code only; with my limited skill I have not encapsulated it).

2. Crawl web data after logging in (how to carry a cookie in the request).

3. Call a site's Ajax request (which returns JSON).

I put all three points into a single class. It is fairly rough, but you should be able to copy the code and use it directly.

First, this class covers the three points above; its main method can be run directly as a test.

package com.minxinloan.black.web.utils;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CookieUtil {

    public final static String CONTENT_TYPE = "Content-Type";

    public static void main(String[] args) {
        // String loginURL = "http://www.p2peye.com/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=lsc66&username=puqiuxiaomao&password=a1234567";
        String listURL = "http://www.p2peye.com/blacklist.php?p=2";
        String logURL = "http://www.p2peye.com/member.php";

        // ******************** Log in first ********************
        try {
            Connection.Response res = Jsoup.connect(logURL)
                    .data("mod", "logging", "action", "login", "loginsubmit", "yes",
                          "loginhash", "lsc66", "username", "puqiuxiaomao", "password", "a1234567")
                    .method(Method.POST)
                    .execute();

            // The session ID cookie name depends on what the target site sets at login
            Connection con = Jsoup.connect(listURL);
            // Set the client type (desktop or mobile browser); User-Agent values are easy to look up
            con.header("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
            // Cookies holding the login information, as a Map
            Map<String, String> map = res.cookies();
            Iterator<Entry<String, String>> it = map.entrySet().iterator();
            while (it.hasNext()) {
                Entry<String, String> en = it.next();
                // Put the login information into the request
                con = con.cookie(en.getKey(), en.getValue());
            }
            // Fetch the Document again, this time as a logged-in user
            Document objectDoc = con.get();
            // Every element of the returned page (the parsed document, which may differ from the raw source)
            Elements elements = objectDoc.getAllElements();
            for (Element element : elements) {
                // element is one tag in the iteration, for example <div><span></span></div>
                Elements elements2 = element.getAllElements();
                for (Element element2 : elements2) {
                    element2.attr("href"); // get a tag attribute; element2 is the tag, href the attribute
                    element2.text();       // get the tag text
                }
            }

            // ******************** No login required ********************
            String url = "http://www.p2peye.com/blacklist.php?p=2";
            Document conTemp = Jsoup.connect(url).get();
            Elements elementsTemps = conTemp.getAllElements();
            for (Element elementsTemp : elementsTemps) {
                elementsTemp.attr("href"); // get a tag attribute
                elementsTemp.text();       // get the tag text
            }

            // ******************** Get content from an Ajax request ********************
            // Note: url must point at an endpoint that returns a JSON array
            HttpURLConnection connection = null;
            BufferedReader reader = null;
            try {
                StringBuffer sb = new StringBuffer();
                URL getUrl = new URL(url);
                connection = (HttpURLConnection) getUrl.openConnection();
                reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "utf-8"));
                String lines;
                while ((lines = reader.readLine()) != null) {
                    sb.append(lines);
                }
                List<Map<String, Object>> list = parseJSON2List(sb.toString()); // convert the JSON to a List
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        // ignore failures while closing
                    }
                }
                // Disconnect (guarded, because openConnection may have failed)
                if (connection != null) {
                    connection.disconnect();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static Map<String, Object> parseJSON2Map(String jsonStr) {
        Map<String, Object> map = new HashMap<String, Object>();
        // Parse the outermost object
        JSONObject json = JSONObject.fromObject(jsonStr);
        for (Object k : json.keySet()) {
            Object v = json.get(k);
            // If the inner value is an array, keep parsing
            if (v instanceof JSONArray) {
                List<Map<String, Object>> list = new ArrayList<Map<String, Object>>();
                Iterator<JSONObject> it = ((JSONArray) v).iterator();
                while (it.hasNext()) {
                    JSONObject json2 = it.next();
                    list.add(parseJSON2Map(json2.toString()));
                }
                map.put(k.toString(), list);
            } else {
                map.put(k.toString(), v);
            }
        }
        return map;
    }

    public static List<Map<String, Object>> parseJSON2List(String jsonStr) {
        JSONArray jsonArr = JSONArray.fromObject(jsonStr);
        List<Map<String, Object>> list = new ArrayList<Map<String, Object>>();
        Iterator<JSONObject> it = jsonArr.iterator();
        while (it.hasNext()) {
            JSONObject json2 = it.next();
            list.add(parseJSON2Map(json2.toString()));
        }
        return list;
    }
}
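The cookie hand-off in the class above (log in, read the cookies, attach them to the next request) can also be done without jsoup. The following is only a minimal sketch under that assumption, using nothing but the JDK. The names `CookieHeaderSketch`, `buildCookieHeader` and `openWithCookies` are hypothetical and do not appear in the original code, and the cookie values are made up.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieHeaderSketch {

    // Join cookies into the "name=value; name2=value2" form used by the Cookie request header.
    public static String buildCookieHeader(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> en : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(en.getKey()).append('=').append(en.getValue());
        }
        return sb.toString();
    }

    // Open a connection that carries the given cookies. Nothing is sent
    // until the caller connects or reads the response.
    public static HttpURLConnection openWithCookies(String pageUrl, Map<String, String> cookies)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        if (!cookies.isEmpty()) {
            conn.setRequestProperty("Cookie", buildCookieHeader(cookies));
        }
        return conn;
    }

    public static void main(String[] args) {
        // Hypothetical cookies, standing in for what a login response would return.
        Map<String, String> cookies = new LinkedHashMap<String, String>();
        cookies.put("sid", "abc123");
        cookies.put("auth", "xyz");
        System.out.println(buildCookieHeader(cookies)); // prints "sid=abc123; auth=xyz"
    }
}
```

In practice the map would be filled from the login response (for example from `res.cookies()` in the jsoup version), and `openWithCookies` would then be called with the URL of the page that requires a login.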

Second, this is the class that fetches the verification code (captcha); it is worth studying. (You need to work out the request address of the site's verification code yourself.)

package com.minxinloan.black.web.utils;

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class Utils {

    // Fetch the verification code image.
    // paramMap and requestHeaderMap are accepted but not yet applied to the request.
    public static Content getRandom(String method,
            String sUrl,                          // URL to request
            Map<String, String> paramMap,         // map holding the user name and password
            Map<String, String> requestHeaderMap, // map holding the cookies
            boolean isOnlyReturnHeader,
            String path) {
        Content content = null;
        HttpURLConnection httpUrlConnection = null;
        try {
            URL url = new URL(sUrl);
            if (method == null
                    || (!"GET".equalsIgnoreCase(method) && !"POST".equalsIgnoreCase(method))) {
                method = "POST";
            }
            URL resolvedURL = url;
            URLConnection urlConnection = resolvedURL.openConnection();
            httpUrlConnection = (HttpURLConnection) urlConnection;
            httpUrlConnection.setRequestMethod(method);
            httpUrlConnection.setRequestProperty("Accept-Language", "zh-cn,zh;q=0.5");
            // Do not follow redirects; we will handle redirects ourselves
            httpUrlConnection.setInstanceFollowRedirects(false);
            httpUrlConnection.setDoOutput(true);
            httpUrlConnection.setDoInput(true);
            httpUrlConnection.setConnectTimeout(5000);
            httpUrlConnection.setReadTimeout(5000);
            httpUrlConnection.setUseCaches(false);
            httpUrlConnection.setDefaultUseCaches(false);
            httpUrlConnection.connect();
            int responseCode = httpUrlConnection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK
                    || responseCode == HttpURLConnection.HTTP_CREATED) {
                byte[] bytes = new byte[0];
                if (!isOnlyReturnHeader) {
                    DataInputStream ins = new DataInputStream(httpUrlConnection.getInputStream());
                    // Where the verification code image is saved
                    DataOutputStream out = new DataOutputStream(new FileOutputStream(path + "/code.bmp"));
                    byte[] buffer = new byte[4096];
                    int count = 0;
                    while ((count = ins.read(buffer)) > 0) {
                        out.write(buffer, 0, count);
                    }
                    out.close();
                    ins.close();
                }
                // The original passed an empty header name here, which always yields null
                String encoding = getEncodingFromContentType(
                        httpUrlConnection.getHeaderField("Content-Type"));
                content = new Content(sUrl, new String(bytes, encoding),
                        httpUrlConnection.getHeaderFields());
            }
        } catch (Exception e) {
            return null;
        } finally {
            if (httpUrlConnection != null) {
                httpUrlConnection.disconnect();
            }
        }
        return content;
    }

    public static String getEncodingFromContentType(String contentType) {
        String encoding = null;
        if (contentType == null) {
            return "UTF-8"; // no header: fall back to the default instead of returning null
        }
        StringTokenizer tok = new StringTokenizer(contentType, ";");
        if (tok.hasMoreTokens()) {
            tok.nextToken(); // skip the media type itself
            while (tok.hasMoreTokens()) {
                String assignment = tok.nextToken().trim();
                int eqIdx = assignment.indexOf('=');
                if (eqIdx != -1) {
                    String varName = assignment.substring(0, eqIdx).trim();
                    if ("charset".equalsIgnoreCase(varName)) {
                        String varValue = assignment.substring(eqIdx + 1).trim();
                        if (varValue.startsWith("\"") && varValue.endsWith("\"")) {
                            // substring works on indices
                            varValue = varValue.substring(1, varValue.length() - 1);
                        }
                        if (Charset.isSupported(varValue)) {
                            encoding = varValue;
                        }
                    }
                }
            }
        }
        if (encoding == null) {
            return "UTF-8";
        }
        return encoding;
    }

    // Write the content out to a file
    public static boolean inFile(String content, String path) {
        PrintWriter out = null;
        File file = new File(path);
        try {
            if (!file.exists()) {
                file.createNewFile();
            }
            out = new PrintWriter(new FileWriter(file));
            out.write(content);
            out.flush();
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return false;
    }

    public static String getHtmlReadLine(String httpUrl) {
        String currentLine = "";
        String totalString = "";
        InputStream urlStream;
        String content = "";
        try {
            URL url = new URL(httpUrl);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.connect();
            System.out.println(connection.getResponseCode());
            urlStream = connection.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(urlStream, "utf-8"));
            while ((currentLine = reader.readLine()) != null) {
                totalString += currentLine + "\n";
            }
            content = totalString;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }
}

class Content {
    private String url;
    private String body;
    private Map<String, List<String>> m_mHeaders = new HashMap<String, List<String>>();

    public Content(String url, String body, Map<String, List<String>> headers) {
        this.url = url;
        this.body = body;
        this.m_mHeaders = headers;
    }

    public String getUrl() {
        return url;
    }

    public String getBody() {
        return body;
    }

    public Map<String, List<String>> getHeaders() {
        return m_mHeaders;
    }
}
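To make the Content-Type parsing step in the class above more concrete, here is a small self-contained sketch that runs the same idea on a few sample header values. It is a condensed rewrite for illustration, not the author's code: `CharsetSketch` and `charsetOf` are hypothetical names, the sample header values are made up, and unlike the original it returns the first usable charset and treats an illegal charset name as "not found".

```java
import java.nio.charset.Charset;

public class CharsetSketch {

    // Return the charset named in a Content-Type value, or "UTF-8" when none is usable.
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return "UTF-8";
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (for example "text/html"); parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq == -1 || !"charset".equalsIgnoreCase(part.substring(0, eq).trim())) {
                continue;
            }
            String value = part.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip surrounding quotes
            }
            try {
                if (Charset.isSupported(value)) {
                    return value;
                }
            } catch (IllegalArgumentException e) {
                // Illegal or empty charset name: keep looking, then fall back to the default.
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));       // prints "ISO-8859-1"
        System.out.println(charsetOf("application/json; charset=\"utf-8\"")); // prints "utf-8"
        System.out.println(charsetOf("image/bmp"));                           // prints "UTF-8"
    }
}
```

The last case is the one that matters for the verification code: an image response normally carries no charset parameter, so the default is what ends up being used.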
