Create a personalized word book (i)

Source: Internet
Author: User

Over the past couple of days I needed to export the words in my Youdao word book to a local file. Youdao itself provides an export function, but it is very limited: it can only export each word's spelling, phonetic transcription, and Chinese meaning, which is not much help for learning and memorizing words. So I wrote this crawler in Ruby. Three problems had to be solved along the way:

    1. URL redirection
    2. Simulated login to Youdao Dictionary
    3. Management of cookies in HTTP access

HTTP redirection

Open the home page of Youdao Dictionary and you will find a "My Word book" link below the search box. Click it, and instead of the word book page you will see the login screen. That makes sense: everyone's word book is different, so nothing can be shown without authentication. A URL redirection has just taken place. We can run the following small experiment:

    require 'net/http'
    require 'nokogiri'

    main_uri = URI("http://dict.youdao.com")
    main_page = Net::HTTP.get_response(main_uri)
    main_html = Nokogiri::HTML(main_page.body)

    # Find the href of the "Word book" link on the home page.
    words_str = nil
    main_html.search("a").each do |tag|
      words_str = tag.attribute("href").to_s if tag.content == "Word book"
    end

    words_uri = URI(words_str)
    puts Net::HTTP.get(words_uri)

You will see that the message printed to the console is:

URL has moved <a href="http://account.youdao.com/login?service=dict&back_url=http://dict.youdao.com/wordbook/wordlist%3Fkeyfrom%3Dnull">here</a>

This matches what we see in the browser. When we are not logged in, clicking the word book link first brings up a login screen, and after we enter our login information we are sent back to the word book page. Two URL redirections occur during this process.
When a redirection occurs, the Location field in the returned HTTP header carries the new URL. The Ruby documentation describes how to handle this in detail; in our case it can be written as follows:

    def fetch(uri)
      page = Net::HTTP.get_response(uri)
      case page
      when Net::HTTPSuccess then page
      when Net::HTTPRedirection then
        location = page['location']
        fetch(URI(location))
      end
    end

So, to avoid the redirection problem, we use the fetch method instead of the get method. That solves the first problem.
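
The fetch method recurses without a bound and assumes the Location header always holds an absolute URL. A slightly more defensive variant might look like the sketch below. The names fetch_following, limit, and getter are illustrative and not part of the original crawler; getter exists only so the logic can be exercised without touching the network.

```ruby
require 'net/http'
require 'uri'

# Follow redirects up to `limit` hops, resolving relative Location values.
# `getter` is a hypothetical injection point; by default it performs a real GET.
def fetch_following(uri, limit: 5, getter: ->(u) { Net::HTTP.get_response(u) })
  raise ArgumentError, 'too many redirects' if limit.zero?

  page = getter.call(uri)
  case page
  when Net::HTTPSuccess
    page
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URI.
    next_uri = URI.join(uri.to_s, page['location'])
    fetch_following(next_uri, limit: limit - 1, getter: getter)
  else
    raise "unexpected response: #{page.code}"
  end
end
```

Unlike the original fetch, which returns nil for any other response, this version raises, so a failed request cannot be mistaken for an empty page. Calling fetch_following(words_uri) would behave like fetch for the word book link.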

Simulated login

Simulating the login takes more work. Analysis of the login page shows that the user's login information is embedded in a form. Apart from the user name, password, button, and check box, the other inputs are hidden fields and therefore not visible. On submission, these hidden fields simply need to be sent as name => value pairs, with our own credentials added. Finding the form in the HTML document requires the Nokogiri package. Once the form is filled in, it has to be submitted to the server. net/http provides the post_form method, which takes all the form fields as a hash and submits them to a specified URL. So this problem can be solved too. But wait: which URL? Where do we submit? Is it the URL from the redirection in the first step? Let's see what a real browser does.

Yes, the request URL shown by the browser is the URL the form is really submitted to.
The implementation can be written as:

    words_page = fetch(words_uri)
    words_uri = words_page.uri
    words_page = Nokogiri::HTML(words_page.body)

    # Collect every named input (including the hidden ones) as name => value.
    params = {}
    words_page.xpath("//form//input").each do |t|
      params[t.attribute("name").to_s] = t.attribute("value").to_s if t.attribute("name") != nil
    end
    params["username"] = "xxxx"
    params["password"] = "xxxx"

    form_action = URI("https://logindict.youdao.com/login/acc/login")
    words_page = Net::HTTP.post_form(form_action, params)
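
Net::HTTP.post_form sends the hash as an application/x-www-form-urlencoded body. To inspect what would be submitted before sending anything, one could build the request by hand, roughly as in this sketch. The helper name build_login_request and the hidden field app are made up for illustration, and the field values are placeholders.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) the POST request that a form submission would produce.
def build_login_request(action_uri, params)
  req = Net::HTTP::Post.new(action_uri)
  req.set_form_data(params) # URL-encodes the hash and sets the Content-Type
  req
end

action = URI('https://logindict.youdao.com/login/acc/login')
fields = { 'username' => 'xxxx', 'password' => 'xxxx', 'app' => 'web' }
req = build_login_request(action, fields)
puts req['content-type']
puts req.body
```

Printing the body this way makes it easy to compare our submission with the one a real browser sends.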

You are probably eager to run it and see the new page, but unless something unexpected happens, printing words_page.body will show yet another redirect page, one that points back to the earlier login page. This means our login has failed. Having been through this myself, I suggest turning off JavaScript support in your browser and trying to log in, which imitates what our script does. If you do, even the browser cannot log in successfully: however many times you try, you are redirected back to the same familiar login page. By now you can guess what is going on. JavaScript sits between the browser and the remote server, and the data we submit is probably being modified before it is sent.
Here is the source code of the login page:

    function validate(disableSubmit) {
        var form = document.f;
        var name = form.username.value.trim();
        var password = form.password.value;
        if ((name.length < 1 || name == _hint) || (password.length < 1)
            || password.length > -) {  // upper bound lost in the source
            // alert("Please enter the correct NetEase email user name")
            showWarning1(true);
            return false;
        }
        form.password.value = hex_md5(form.password.value);
        var savelogin = yd_get_elem("savelogin");
        if (savelogin.checked) {
            form.cf.value = 7;
        }
        if (disableSubmit) {
            form.submit.disabled = true;
        }
        return true;
    }

This JavaScript function validates the user's login data and also hashes the password with hex_md5 before submission. Clearly we should submit the hashed password too, just as the JavaScript does. Also note the cf field in the form, which is affected by the save-login check box and must be set accordingly. With these changes the login succeeds: printing words_page.body no longer shows a redirect to the login page but a redirect to your word book. Either way, we are one step closer to success.
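
In Ruby, the same preparation could be sketched as follows. This assumes that the page's hex_md5 is a plain MD5 hex digest, which the source does not confirm; the helper name login_params and its save_login option are hypothetical.

```ruby
require 'digest/md5'

# Mirror the page's validate(): hash the password and set cf when
# "save login" is wanted. Assumes hex_md5 is a plain MD5 hex digest.
def login_params(form_fields, username, password, save_login: false)
  params = form_fields.dup
  params['username'] = username.strip
  params['password'] = Digest::MD5.hexdigest(password)
  params['cf'] = '7' if save_login
  params
end
```

The result can be passed to Net::HTTP.post_form in place of the params hash that holds the plain-text password.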

Managing cookies

HTTP is a stateless protocol, so if we now simply use the get method to fetch the HTML of the word book, we are redirected back to that damned login page. After a successful login we must therefore keep the key the server gives us: the cookie. The server specifies cookies in the Set-Cookie field of its HTTP response, and they are used to record a specific state.

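    # Keep only the name=value part of each Set-Cookie header.
    cookies = words_page.get_fields('set-cookie').map { |c| c.split(';').first }.join('; ')
    new_uri = URI.parse(words_page['location'])
    http = Net::HTTP.new(new_uri.host, new_uri.port)
    http.use_ssl = (new_uri.scheme == 'https')
    res = http.get(new_uri.request_uri, { 'Cookie' => cookies })
    doc = Nokogiri::HTML(res.body)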

Now printing res.body gives exactly the word book page we want to download. The next step is to extract the data we are interested in from the page, which the nokogiri package handles well.
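
If the crawler goes on to fetch several pages, cookies set by different responses need to be carried along. A minimal sketch of that bookkeeping is shown below. The class SimpleCookieJar is hypothetical and deliberately naive: it keeps only name=value pairs and ignores attributes such as Path, Domain, and Expires.

```ruby
require 'net/http'

# A tiny, hypothetical cookie jar: it remembers name=value pairs from
# Set-Cookie headers and ignores attributes such as Path or Expires.
class SimpleCookieJar
  def initialize
    @cookies = {}
  end

  # Record every cookie the response sets; later values overwrite earlier ones.
  def store(response)
    (response.get_fields('set-cookie') || []).each do |line|
      pair = line.split(';').first.to_s.strip
      next if pair.empty?

      name, value = pair.split('=', 2)
      @cookies[name] = value.to_s
    end
    self
  end

  # Value for the Cookie request header.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end
```

After each response, jar.store(response) updates the jar, and the next request can send { 'Cookie' => jar.header } as its headers.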
