Data Capture (1): Crawling vehicle violation records from the Beijing traffic management site (completed)

http://blog.csdn.net/limenglin0927/article/details/17539171

My personal information:

This "code ape" was born in 1992 and enrolled in 2010 as a software engineering undergraduate at a third-rate university. I began my internship in October 2013 and have since spent a while, neither long nor short, at small and medium-sized internet companies, mainly doing Java development. More precisely, my job is implementing data-related work.

In short, I have not yet fully escaped the clutches of my alma mater. I have neither the deep inner strength of the experts who do low-level algorithm research, nor the dazzling moves of engineers with more than 10 years of project experience. But I am a coder who deeply loves the internet industry, and even if I leave only a small footprint, I am going to keep walking down this road.


My personal wish:

I hope that experts and veterans, and anyone who studies or is interested in data capture, will join me to discuss and trade ideas on technology, engineering, and hobbies. Thank you.


I recently started crawling some data. Some industry expert, I don't know who, once said: anything you can see on a page can be captured; it is only a question of how difficult it is.

The internet is like a vast net full of mysteries: countless industries, countless opportunities, countless users, countless pieces of information (data)... Tumbling and drifting in it, you find it filled with countless riches, and what it can bring people is beyond imagination.

So an idea took shape: as I crawl data from different websites, I will write up each site I analyze and crawl and keep the notes online. After a long time they may settle into a record of my data-crawling journey, and I can also ask for advice, discuss, and share with you. As long as life goes on, learning never stops.


0. Background information for the data capture:


Beijing Traffic Management official website: http://www.bjjtgl.gov.cn/publish/portal0/tab72/
The target is the "Vehicle Violation Query" module in the left-hand column.

Test data: a Beijing vehicle (license plate number + engine number)
This is private information and not something I can disclose. If you own a car, you can test with your own data.


I. Analyzing the site to crawl


To automate data acquisition from a site with a program, the first step is to analyze by hand the site's structure, the steps by which its data is generated, the measures it uses to limit automation, and any other information that will help the later automated implementation. Know your enemy and you can fight a hundred battles without peril.

My personal recommendation is to master Chrome (Google's browser) for analyzing sites. Skilled use of this tool not only helps with data capture, it also teaches you a little about front-end technology and system architecture design. Steady accumulation is what pays off in the end.


First, manually go through the normal query process:




Figure 1-Home Page Query window

Press F12 in Chrome to open its built-in developer tools.
There you can see information about the page, such as the HTML source, the page element tree, and the CSS styles.


Figure 2-Chrome developer tools screenshot

Back to the point. Further Chrome usage tips and details are not the focus of this post; everyone should learn them and use them often. If necessary, I will write a separate post to share and discuss them.

After entering the correct information, click the "Query" button.
The page jumps to http://sslk.bjjtgl.gov.cn/jgjwwcx/wzcx/wzcx_preview.jsp.

Figure 3-Verification Code Entry page

At this point the page's obstacles to automation are plain to see, and the overall flow can be roughly worked out.
You have to click the "click to get Verification Code" button to see the verification code, and the code is one of the harder kinds to recognize: after refreshing a few times, I found that it is a question about driving a vehicle.
(Truly the haunting "Subject One" driving theory questions~~) O(∩_∩)O~

Open the developer tools (F12) and select the "Network" tab, which is the tool's network request monitor. Refresh the page again, and you can see the URL requests your browser sent when the page was refreshed or accessed.

The left-hand column lists many requests: JSP server scripts, CSS stylesheets, JS browser scripts, JPG (PNG) images, multimedia, and so on. Click the first one, wzcx_preview.jsp, and select the Headers tab on the right to see the information submitted by this main request, as shown:

Figure 4-Verification code page analysis diagram

Anyone a little familiar with HTTP requests will easily notice that the verification code page has in fact received the values we filled in earlier: city (SF) = 11, license plate number (Carno) = xxxxxx, and engine number (FDJH) = xxxxx.
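The three values above could be packed into a form-encoded string roughly as follows. This is only a minimal sketch: the class and method names are hypothetical, the parameter names are written exactly as they appear in the text above (the real request may use different capitalization), and UTF-8 is an assumption because the site's actual charset is not stated.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not taken from the original post.
final class QueryFormSketch {

    /** Encodes name/value pairs as application/x-www-form-urlencoded text. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Assumption: UTF-8; the site may expect another charset such as GBK.
            String name = URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8);
            body.add(name + "=" + value);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("SF", "11");         // city code seen in the captured request
        fields.put("Carno", "xxxxxx");  // placeholder, use your own plate number
        fields.put("FDJH", "xxxxx");    // placeholder, use your own engine number
        System.out.println(encodeForm(fields)); // prints SF=11&Carno=xxxxxx&FDJH=xxxxx
    }
}
```

Whether these fields travel in the URL or in a request body is not stated in the text, so check the method shown in the Headers tab before reusing the string.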

So the first form page evidently does not need to be visited at all. Looking further at this page, when you click the "click to get Verification Code" button, a new request appears at the bottom of the left-hand column of the "Network" panel; this is the request that fetches the verification code image. Click this request and look at its headers: the request headers carry the cookie that was generated when the JSP page was visited earlier. After a validity check (see the note after Figure 5), I concluded that the answer to the verification code image is stored in the server-side session and bound to the cookie value of the current visit. The server uses the saved cookie value to check whether the user's verification code input is correct, and only then are the later steps allowed.


Figure 5-Request information for a verification code

(Validity check: my conjecture concerned what happens if the JSP page has not been visited and the verification code is requested directly with GET. The test result was that when the yzmimg?t=xxxxx request carries no corresponding cookie, the response includes Set-Cookie, that is, it sets a cookie. This supports the conclusion I drew above.)

What finally confirmed my conclusion that "the site binds the verification code answer in the session to the cookie of the user's visit" was the following experiment:
When I right-click "yzmimg?t=xxxx" and choose "Open in New tab", the new tab shows only a verification code image. With the developer tools (F12) open, I refreshed it repeatedly and found that the image keeps changing while the cookie does not. Suppose the verification code on the original JSP input page is "show", and after many refreshes the one in the new tab has become "pass": entering "pass" on the JSP page is then accepted as correct. In the end, the server-side session records only the latest verification code answer requested with this cookie.
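Because one cookie identifies the session, a crawler has to send the same cookie with the JSP page request, the verification code request, and everything after it. The sketch below shows that idea with the JDK's java.net.CookieManager. It is an illustration under assumptions, not the author's code: the cookie name JSESSIONID is a guess (the text does not name the cookie), the full path of yzmimg is hypothetical (the text only shows "yzmimg?t=xxxxx"), and the Set-Cookie header is simulated, so no network request is made.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the original post.
final class SessionCookieSketch {

    /** Stores Set-Cookie header values from a response, as an HTTP client would. */
    static void remember(CookieManager jar, URI page, String... setCookieValues) {
        try {
            jar.put(page, Map.of("Set-Cookie", List.of(setCookieValues)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the Cookie header value the jar would attach to a request for target. */
    static String cookieHeaderFor(CookieManager jar, URI target) {
        try {
            List<String> values = jar.get(target, Map.of()).getOrDefault("Cookie", List.of());
            return String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager jar = new CookieManager();
        URI previewPage = URI.create("http://sslk.bjjtgl.gov.cn/jgjwwcx/wzcx/wzcx_preview.jsp");
        // Hypothetical location: only "yzmimg?t=xxxxx" is shown in the text.
        URI captcha = URI.create("http://sslk.bjjtgl.gov.cn/jgjwwcx/wzcx/yzmimg?t=12345");

        // Simulated response header; the cookie name and value are assumptions.
        remember(jar, previewPage, "JSESSIONID=abc123; Path=/");

        // The verification code request on the same host reuses the stored cookie.
        System.out.println(cookieHeaderFor(jar, captcha)); // prints JSESSIONID=abc123
    }
}
```

With a shared jar like this, the answer the server stores for the latest verification code image stays tied to the session that later submits the query.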

Next, enter the correct verification code and click query to reach the main information page. As before, open the developer tools (F12) on that page and analyze the URL requests.

Analyzing the requests of this final information page, the figure below shows clearly that the main request is an action request, accompanied by many secondary requests. For now we only need to look at the main request, "getwzcxxx.action".



Figure 6-Request structure of the final Information presentation page

Figure 7-Header information for the action request

In the Form Data section you can clearly see the submitted form data, as well as the cookie set in the request headers.

With that, the general structure of the site and its request logic are basically clear. The most important step is done, and the rest is straightforward to implement.
I use Java; you can use the HttpClient jar package, the native network connection classes in java.net, or Spring's xxxTemplate classes.
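As one possible shape for the final step, here is a sketch that builds (but does not send) the "getwzcxxx.action" request. It uses the JDK 11+ java.net.http API as a stand-in for the libraries listed above, which is my substitution and not the author's choice. Several things are assumptions: POST is inferred from the Form Data section, the full URL of the action is hypothetical, the cookie name is a guess, and the form body is a placeholder because the real field names (including the one for the verification code answer) appear only in the figures.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not taken from the original post.
final class ActionRequestSketch {

    /** Builds a form POST that carries the session cookie from the earlier steps. */
    static HttpRequest buildQuery(URI action, String formBody, String cookieHeader) {
        return HttpRequest.newBuilder(action)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Cookie", cookieHeader) // same cookie as the verification code request
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical full path: only the name "getwzcxxx.action" is given in the text.
        URI action = URI.create("http://sslk.bjjtgl.gov.cn/jgjwwcx/wzcx/getwzcxxx.action");

        // Placeholder body; the real form also needs the verification code answer.
        String formBody = "SF=11&Carno=xxxxxx&FDJH=xxxxx";

        HttpRequest request = buildQuery(action, formBody, "JSESSIONID=abc123");
        System.out.println(request.method() + " " + request.uri());

        // To actually send it (not done here):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Setting the Cookie header by hand keeps the sketch short; a client configured with a cookie handler would attach it automatically.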

