Sometimes when you crawl a website, some pages show their real content only after you log in. The method from our previous article cannot fetch that data directly.
Previous article: Jsoup Crawl Web Information (1): Crawling International Disease Codes
For example, consider crawling the CPT codes on this page: http://www.findacode.com/code-set.php?set=CPT
Before logging in, the following appears:
After logging in, the page displays:
The information we want to crawl is what is displayed after the login.
The solution is simple. Jsoup supports cookies, so if we pass the cookies from our own logged-in session to Jsoup, it can access the pages as the logged-in user.
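To make the mechanism concrete: from the server's point of view, a logged-in visit is just a request that carries the right Cookie header. The sketch below uses only the Java standard library (not Jsoup) to show how a map of cookies becomes that header. The class name CookieHeaderSketch, the method toCookieHeader, and the sample values are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {

    // Join name=value pairs with "; ", the format of an HTTP Cookie request header.
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the header is predictable.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("PHPSESSID", "example-session-id"); // placeholder, not a real session
        cookies.put("show_sign_in", "T");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

An HTTP client that is given a cookie map sends a header of this shape with the request, which is why the server treats the crawler as the logged-in user.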
1. First, log in to your account.
2. Then, in Chrome, go to Settings -> Content settings -> Cookies -> "All cookies and site data" and find the cookies for www.findacode.com.
As shown in the following figure:
The figure shows 4 cookie key-value pairs. These are the ones we need.
3. Using Jsoup to load cookies
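import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Copy each cookie name and value exactly as shown in your own logged-in browser session.
Map<String, String> cookies = new HashMap<String, String>();
cookies.put("PHPSESSID", "pue5b6f642cu21v7qun47hs3b5");
cookies.put("__ar_v4", "cqjdwev345ej7p3mlkozx3%3a20151008%3a3%7czfx3a7fmxjarpkt2gz64xz%3a20151008%3a3%7cyudumc5denc6lmix6uqp3e%3a20151008%3a3");
cookies.put("Fac_type", "F");
cookies.put("show_sign_in", "T");
// pageurl is the address of the page to crawl; the cookies are sent along with the request.
Document doc = Jsoup.connect(pageurl).cookies(cookies).get();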
...
4. Once the connection succeeds, crawl the data using the method from the first article.
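Typing each cookie from step 2 by hand is error-prone. As an alternative, here is a minimal sketch that assumes you have copied the whole Cookie request header as one string, for example from the browser's developer tools, and want it as a map. The class name CookieStringParser, the method parse, and the sample string are hypothetical; only the Java standard library is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieStringParser {

    // Turn "a=1; b=2" (optionally prefixed with "Cookie:") into a name -> value map.
    public static Map<String, String> parse(String raw) {
        Map<String, String> cookies = new LinkedHashMap<>();
        String s = raw.trim();
        // Drop a leading "Cookie:" label if the whole header line was pasted.
        if (s.regionMatches(true, 0, "Cookie:", 0, 7)) {
            s = s.substring(7);
        }
        for (String pair : s.split(";")) {
            String p = pair.trim();
            int eq = p.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and pairs without a name
            }
            // Split on the first '=' only, so values containing '=' stay intact.
            cookies.put(p.substring(0, eq).trim(), p.substring(eq + 1).trim());
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Placeholder header, not a real session.
        Map<String, String> cookies =
                parse("Cookie: PHPSESSID=example-session-id; show_sign_in=T");
        System.out.println(cookies);
    }
}
```

The returned map has the same type as the cookies variable in step 3, so it could be passed to the cookies(...) call there in place of the individual put calls.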