Crawling Zhihu users' basic information with Java

Source: Internet
Author: User

This article shares a Java crawler that collects the basic information of Zhihu users. It is built on HttpClient 4.5 and is offered for your reference. The specifics are as follows.
Details:
The crawler has collected information on 900,000+ users (essentially all active users are included).
General idea:
1. First, simulate a login to Zhihu. After a successful login, serialize the cookies to disk so that you do not have to log in on every run. (If you do not want to simulate the login, you can also copy the cookies directly from the browser.)
2. Create two thread pools and one storage area. The page-fetching thread pool executes the HTTP requests and puts the returned page content into storage. The page-parsing thread pool takes page content out of storage and parses it: it writes the parsed user data to the database, extracts the home pages of the people that user follows, and adds those addresses as new requests to the page-fetching thread pool. The loop continues this way.
3. For URL deduplication, I store the MD5 hash of each visited link in the database and, before each visit, check whether the link already exists there.
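The two-pool loop in step 2 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name TwoPoolCrawler, the "user|link1,link2" page format, and the in-memory fake web are all assumptions made so that the example runs without network access or a database.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class TwoPoolCrawler {
    private final ExecutorService fetchPool = Executors.newFixedThreadPool(4);
    private final ExecutorService parsePool = Executors.newFixedThreadPool(2);
    // The "storage" that sits between the two pools.
    private final BlockingQueue<String> storage = new LinkedBlockingQueue<>();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    // Stand-in for the user table in the database (assumption).
    private final Set<String> users = ConcurrentHashMap.newKeySet();
    // Counts tasks that are queued or running, so we know when the crawl is over.
    private final AtomicInteger pending = new AtomicInteger();
    private final CountDownLatch finished = new CountDownLatch(1);
    private final Function<String, String> fetcher; // URL -> page content

    TwoPoolCrawler(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    Set<String> crawl(String seedUrl) {
        submitFetch(seedUrl);
        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        fetchPool.shutdown();
        parsePool.shutdown();
        return users;
    }

    private void submitFetch(String url) {
        if (!visited.add(url)) {
            return; // already requested
        }
        pending.incrementAndGet();
        fetchPool.execute(() -> {
            try {
                storage.add(fetcher.apply(url)); // fetched page goes into storage
                pending.incrementAndGet();
                parsePool.execute(this::parseOne);
            } finally {
                taskDone();
            }
        });
    }

    private void parseOne() {
        try {
            String page = storage.poll(); // take one page out of storage
            if (page == null) {
                return;
            }
            // Hypothetical page format: "<user>|<link>,<link>,..."
            String[] parts = page.split("\\|", 2);
            users.add(parts[0]);
            if (parts.length > 1 && !parts[1].isEmpty()) {
                for (String link : parts[1].split(",")) {
                    submitFetch(link); // followee pages go back to the fetch pool
                }
            }
        } finally {
            taskDone();
        }
    }

    private void taskDone() {
        if (pending.decrementAndGet() == 0) {
            finished.countDown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = Map.of(
                "/people/a", "a|/people/b,/people/c",
                "/people/b", "b|/people/c",
                "/people/c", "c|/people/a");
        Set<String> parsed = new TwoPoolCrawler(fakeWeb::get).crawl("/people/a");
        System.out.println(parsed.size() + " users parsed");
    }
}
```

The pending counter is only there so that the demo terminates; a long-running crawler like the one described would more likely keep both pools alive until it is stopped.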
So far, the crawler has collected 1,000,000 users and visited 2,200,000+ links. The users being crawled now are mostly less active ones; the more active users should essentially all have been crawled already.
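The MD5-based deduplication from step 3 might look something like the sketch below. It assumes an in-memory set in place of the database table of visited links, and the names UrlDeduplicator, md5Hex, and markIfNew are made up for this illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class UrlDeduplicator {
    // Stand-in for the database table of visited links (assumption).
    private final Set<String> visitedHashes = ConcurrentHashMap.newKeySet();

    /** Returns the MD5 hash of the URL as a 32-character lowercase hex string. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Records the URL and returns true only if it has not been seen before. */
    boolean markIfNew(String url) {
        return visitedHashes.add(md5Hex(url));
    }

    public static void main(String[] args) {
        UrlDeduplicator dedup = new UrlDeduplicator();
        String url = "https://www.zhihu.com/people/example"; // hypothetical URL
        System.out.println(dedup.markIfNew(url)); // true: first visit
        System.out.println(dedup.markIfNew(url)); // false: already visited
    }
}
```

Storing a fixed-length hash rather than the raw link keeps the lookup key short and uniform, which is presumably why the links are hashed before being written to the database.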
Project Address: https://github.com/wycm/mycrawler
Implementation code:

Author: Lying yan silent
Link: https://www.zhihu.com/question/36909173/answer/97643000
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.

import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.json.JSONObject; // assumed JSON library; the original does not name it

// HttpClientUtil is the project's own helper class (not shown here).

/**
 * Log in to Zhihu.
 *
 * @param httpClient HTTP client
 * @param context    HTTP context
 * @return true if the login succeeded
 */
public boolean login(CloseableHttpClient httpClient, HttpClientContext context) {
    String yzm = null;
    String loginState = null;
    HttpGet getRequest = new HttpGet("https://www.zhihu.com/#signin");
    HttpClientUtil.getWebPage(httpClient, context, getRequest, "utf-8", false);
    HttpPost request = new HttpPost("https://www.zhihu.com/login/email");
    List<NameValuePair> formParams = new ArrayList<NameValuePair>();
    // Read the captcha by eye
    yzm = yzm(httpClient, context, "https://www.zhihu.com/captcha.gif?type=login");
    formParams.add(new BasicNameValuePair("captcha", yzm));
    formParams.add(new BasicNameValuePair("_xsrf", "")); // this parameter can be omitted
    formParams.add(new BasicNameValuePair("email", "email address"));
    formParams.add(new BasicNameValuePair("password", "password"));
    formParams.add(new BasicNameValuePair("remember_me", "true"));
    UrlEncodedFormEntity entity = null;
    try {
        entity = new UrlEncodedFormEntity(formParams, "utf-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    request.setEntity(entity);
    // Log in
    loginState = HttpClientUtil.getWebPage(httpClient, context, request, "utf-8", false);
    JSONObject jo = new JSONObject(loginState);
    if (jo.get("r").toString().equals("0")) {
        System.out.println("Login succeeded");
        getRequest = new HttpGet("https://www.zhihu.com");
        // Visit the home page
        HttpClientUtil.getWebPage(httpClient, context, getRequest, "utf-8", false);
        // Serialize the cookies so that the next run can log in with them directly
        HttpClientUtil.serializeObject(context.getCookieStore(), "resources/zhihucookies");
        return true;
    } else {
        System.out.println("Login failed: " + loginState);
        return false;
    }
}

/**
 * Read the captcha by eye.
 *
 * @param httpClient HTTP client
 * @param context    HTTP context
 * @param url        captcha image address
 * @return the captcha text typed in by the user
 */
public String yzm(CloseableHttpClient httpClient, HttpClientContext context, String url) {
    // Download the captcha image, then wait for the user to type what it shows
    HttpClientUtil.downloadFile(httpClient, context, url, "d:/test/", "1.gif", true);
    Scanner sc = new Scanner(System.in);
    String yzm = sc.nextLine();
    return yzm;
}
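The login code above relies on HttpClientUtil.serializeObject to write the cookie store to disk, but the helper itself is not shown. One plausible way to implement that kind of persistence is plain Java object serialization, sketched below. The class ObjectFileStore and its save and load methods are hypothetical names, and a HashMap stands in for the cookie store so that the example needs only the standard library; HttpClient 4.5's BasicCookieStore implements Serializable, so it could be passed to the same methods.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;

class ObjectFileStore {
    /** Writes any Serializable object (for example a cookie store) to a file. */
    static void save(Serializable object, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back an object previously written by save. */
    static Object load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Stored class is not on the classpath", e);
        }
    }

    public static void main(String[] args) {
        // A HashMap stands in for the cookie store in this sketch (assumption).
        HashMap<String, String> cookies = new HashMap<>();
        cookies.put("session", "placeholder-value");

        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "cookies-demo.ser");
        save(cookies, file);
        Object restored = load(file);
        System.out.println("Round trip equal: " + cookies.equals(restored));
        file.toFile().delete();
    }
}
```

On a later run, the crawler would load the stored object, cast it to a cookie store, and attach it to the HTTP context instead of calling login again.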


That is all for this article; I hope it helps with your learning.
