A summary of Python crawlers (II)

Source: Internet
Author: User

To write a web crawler you need some basic knowledge of the HTTP protocol, and it helps to know a little HTML. Let's start with the HTTP header. Whenever we click a button on a web page or request a page, the browser sends a request header to the server and the server replies with a response header. These headers are not shown on the page itself. To view them you can use Firebug in Firefox, Inspect Element in Chrome, or a packet-capture tool such as Wireshark (extremely powerful, but probably not much help to beginners) or HttpWatch (which can capture IE traffic). Since Inspect Element is built into Chrome and is powerful enough (I gave up on Firefox because of its startup speed), I usually use it to view HTTP headers. Right-click a blank area of the page and choose Inspect Element, as shown:

               

Figure 1

            

Figure 2 (select Network and tick Preserve log)

            

Figure 3
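The same request header and response header exchange can also be observed from Python instead of the browser. The following is only a minimal sketch: it starts a throwaway local server (the `EchoHandler` class and `show_headers` function are hypothetical names made up for this illustration) that echoes back whatever request headers it receives, so both sides of the exchange can be printed without touching a real site.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: replies with the request headers it received."""

    def do_GET(self):
        body = str(self.headers).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def show_headers():
    """Send one GET to a local server and return (status, request headers, response headers)."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: let the OS pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An empty ProxyHandler makes sure the local request is not sent through a proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            status = response.status
            request_headers = response.read().decode("utf-8")
            response_headers = dict(response.headers.items())
    finally:
        server.shutdown()
        server.server_close()
    return status, request_headers, response_headers


if __name__ == "__main__":
    status, sent, received = show_headers()
    print("Status:", status)
    print("Request headers as seen by the server:")
    print(sent)
    print("Response headers as seen by the client:")
    print(received)
```

Run against a real site, the request headers would be whatever the crawler sets; here urllib's defaults are echoed back, which is roughly what the Network panel shows for a browser request.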

When we click the login button, we can see that a header was sent to the server and some data was posted. Most of the header's contents can be understood at a glance (if not, see http://en.wikipedia.org/wiki/List_of_HTTP_header_fields). The two fields worth mentioning here are Referer and User-Agent: some sites check these two fields to decide whether a request is a malicious crawl. For example, when crawling articles on the CSDN site, the server rejects the download if these two fields are not sent. You can also see that the request method is POST and that the status code is 200, which means the response is normal (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). When writing a program we can check this status code to determine whether the access succeeded.
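As a rough illustration of sending Referer and User-Agent and checking the status code, here is a small sketch using the standard library's urllib. The header values in `DEFAULT_HEADERS` and the helper names `build_request`, `fetch` and `is_success` are assumptions for this example; in practice you would copy the real values from the browser's Network panel.

```python
import urllib.error
import urllib.request

# Assumed values for illustration only; copy the real ones from the browser.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
    "Referer": "http://www.example.com/",
}


def build_request(url, headers=None):
    """Create a Request that carries the User-Agent and Referer headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def fetch(url, headers=None, timeout=10):
    """Return (status_code, body_bytes); HTTP error statuses are returned, not raised."""
    request = build_request(url, headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still available here.
        return error.code, error.read()


def is_success(status_code):
    """True for 2xx status codes such as 200."""
    return 200 <= status_code < 300
```

A caller would then do something like `status, body = fetch(url)` and only process `body` when `is_success(status)` is true.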

Further down we can see the data we posted to the server and the server's response headers. Next, look at the source of the login page (right-click, View Source) to find the code of the login form. The code may look confusing; you can format it with an online formatter, and most code editors come with a code-formatting function as well.

            

Figure 4

<form method="POST" id="loginForm" class="login-form" action="http://www.renren.com/PLogin.do">
    <dl class="top clearfix">
        <dd>
            <input type="text" name="email" class="input-text" id="email" tabindex="1" value="" />
        </dd>
    </dl>
    <dl class="pwd clearfix">
        <dd>
            <input type="password" id="password" name="password" error="Please enter password" class="input-text" tabindex="2" autocomplete="off" />
            <label class="pwdtip" id="pwdtip" for="password">Please enter your password</label>
            <a class="forgetpwd" id="forgetpwd" href="http://safe.renren.com/findPass.do" stats="home_findpassword">Forgot your password?</a>
        </dd>
    </dl>
    <div class="caps-lock-tips" id="capslockmessage" style="display:none">
    </div>
    <dl class="savepassword clearfix">
        <dt>
            <label title="To ensure your information is safe, please do not check this in an Internet café or public computer room!" for="autologin" class="labelcheckbox">
                <input type="checkbox" name="autologin" id="autologin" value="true" tabindex="4" />Log in automatically next time</label>
        </dt>
        <dd>
            <span class="getPassword" id="getPassword">
                <a href="http://safe.renren.com/findPass.do" stats="home_findpassword">Forgot your password?</a>
            </span>
        </dd>
    </dl>
    <dl id="code" class="code clearfix">
        <dt>
            <label for="code">Verification Code:</label>
        </dt>
        <dd>
            <input id="icode" type="text" name="icode" class="input-text" tabindex="3" autocomplete="off" />
            <label class="codetip" id="codetip" for="icode">Please enter a verification code</label>
        </dd>
    </dl>
    <dl id="codeimg" class="codeimg clearfix">
        <dt>
        </dt>
        <dd>
            <img id="verifypic_login" src="http://icode.renren.com/getcode.do?t=web_login&rnd=Math.random()" />
        </dd>
        <a class="changeone" href="javascript:refreshcode_login();">Change one</a>
    </dl>
    <dl class="bottom">
        <input type="hidden" name="origurl" value="http://www.renren.com/home" />
        <input type="hidden" name="domain" value="renren.com" />
        <input type="hidden" name="key_id" value="1" />
        <input type="hidden" name="captcha_type" id="captcha_type" value="web_login" />
        <input type="submit" id="login" class="input-submit login-btn" stats="loginpage_login_button" value="Login" tabindex="5" />
    </dl>
</form>

By comparing the elements of the form in the source code with the information posted to the server, we can see that email is my login username, autologin indicates whether to log in automatically, and so on. You can also see that the password has already been encrypted before it is posted (many sites post the data in clear text). The specific encryption algorithm is not the focus of this discussion (there are articles online that analyze it; search Google yourself).
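Instead of reading the form by eye, the field names and default values can also be collected with the standard library's `html.parser`. This is only a sketch: `SAMPLE_FORM` is a shortened stand-in for the login form above (only a few of its fields are kept), and `FormFieldParser` and `parse_form` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Shortened stand-in for the login form; only a few fields are kept.
SAMPLE_FORM = """
<form method="POST" id="loginForm" action="http://www.renren.com/PLogin.do">
    <input type="text" name="email" value="" />
    <input type="password" name="password" />
    <input type="hidden" name="domain" value="renren.com" />
    <input type="hidden" name="key_id" value="1" />
    <input type="submit" id="login" value="Login" />
</form>
"""


class FormFieldParser(HTMLParser):
    """Collect the form's action, method and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "form":
            self.action = attributes.get("action")
            self.method = (attributes.get("method") or "get").upper()
        elif tag == "input" and attributes.get("name"):
            # Inputs without a name (like the submit button here) are not posted.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def parse_form(html):
    """Parse an HTML snippet and return the populated FormFieldParser."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser


if __name__ == "__main__":
    form = parse_form(SAMPLE_FORM)
    print(form.method, form.action)
    print(form.fields)
```

The hidden fields picked up this way are the ones that are easy to miss when building the POST data by hand.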

The information obtained from the web page is now sufficient; the next step is how to turn it into a program. Because the HTTP protocol is stateless, we only need to send the server the same data the browser would send; the server then treats us as that browser, and our goal is achieved.
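A first sketch of that idea, using only the standard library, might look like the following. It is not a working login: the field subset, the header values and the helper names are assumptions for illustration, the password encryption mentioned above is not reproduced, and the site may well require more than this. Since HTTP itself is stateless, the cookie jar is what would carry the logged-in session from one request to the next.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://www.renren.com/PLogin.do"  # the form's action attribute


def build_opener_with_cookies():
    """Return (opener, jar); the jar keeps cookies the server sets between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(email, password):
    """Build the POST request a browser would send when the form is submitted."""
    # Assumed subset of the form fields; the real form posts more than these.
    form = {
        "email": email,
        "password": password,  # the real page encrypts this before posting
        "domain": "renren.com",
        "key_id": "1",
    }
    body = urllib.parse.urlencode(form).encode("utf-8")
    headers = {
        # Assumed header values; copy the real ones from the browser.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
        "Referer": "http://www.renren.com/",
    }
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(LOGIN_URL, data=body, headers=headers)


if __name__ == "__main__":
    opener, jar = build_opener_with_cookies()
    request = build_login_request("user@example.com", "secret")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # response = opener.open(request)  # uncomment to actually send the request
```

Later requests made through the same `opener` would send back any cookies stored in `jar`, which is how the server recognizes them as coming from the same "browser".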

          
