Over the past two days I needed to export the words in my Youdao word book to a local file. Youdao itself provides an export function, but it is very limited: it exports only the spelling, phonetic transcription, and Chinese meaning, which is not much help for learning and memorizing words. So I wrote this crawler in Ruby. Three problems had to be solved along the way:
- URL redirection
- Simulating a login to Youdao Dictionary
- Managing cookies across HTTP requests
HTTP redirection
Open the dictionary home page and you will find a "My Word book" link below the search box. Click it and, instead of the word book, you get the login page. That makes sense: everyone's word book is different, so nothing can be shown without authentication. A URL redirection happened in the process. We can check this with a small test:
require 'net/http'
require 'nokogiri'

main_uri = URI("http://dict.youdao.com")
main_page = Net::HTTP.get_response(main_uri)
main_html = Nokogiri::HTML(main_page.body)
words_str = nil
# Find the word-book link by its link text and keep its href as a String.
main_html.search("a").each do |tag|
  words_str = tag["href"] if tag.content == "Word book"
end
words_uri = URI(words_str)
puts Net::HTTP.get(words_uri)
You will see that the message printed to the console is:
URL has moved <a href="http://account.youdao.com/login?service=dict&back_url=http://dict.youdao.com/wordbook/wordlist%3Fkeyfrom%3Dnull">here</a>
This matches what we see in the browser: when we are not logged in, clicking the word-book link first brings up a login page, and after we enter our login information we are returned to the word-book page. Two URL redirections occur along the way.
When a redirection occurs, the new URL is given in the Location field of the returned HTTP header. The Net::HTTP documentation describes how to handle this in detail; in our case it can be written as follows:
def fetch(uri)
  page = Net::HTTP.get_response(uri)
  case page
  when Net::HTTPSuccess then page
  when Net::HTTPRedirection
    location = page['location']
    fetch(URI(location))
  end
end
So, to avoid the redirection problem, we use the fetch method instead of the get method. That solves the first problem.
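A recursive fetch like the one above can loop forever if two pages redirect to each other. The following is a minimal sketch of a bounded variant, not the author's code: the name fetch_with_limit, the limit parameter, and the getter parameter (an injection point so the function can be tried without network access) are all assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Follow redirects for at most `limit` requests.
# `getter` is a hypothetical injection point; by default it performs a real GET.
def fetch_with_limit(uri, limit: 5, getter: Net::HTTP.method(:get_response))
  raise ArgumentError, 'too many HTTP redirects' if limit.zero?

  response = getter.call(uri)
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URI.
    next_uri = URI.join(uri.to_s, response['location'])
    fetch_with_limit(next_uri, limit: limit - 1, getter: getter)
  else
    response.value # raises unless the response is 2xx
  end
end
```

It would be called the same way as fetch, for example fetch_with_limit(URI("http://dict.youdao.com")).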
Simulated login
Simulating the login takes more work. Analyzing the login page shows that the user's login information sits inside a form. Apart from the user name, password, button, and check box, every other input is hidden and therefore not visible. On submission these hidden fields just need to be sent unchanged as name => value pairs, together with our own credentials. Nokogiri is used to find the form in the HTML document. Once the form is filled in, it has to be submitted to the server. net/http has a post_form method that takes all form fields as a hash and submits them to a given URL. So this problem can be solved too. But wait, which URL? Where do we submit? Is it the URL from the first redirection? Let's see what a real browser does.
The request URL that the browser shows when the form is submitted is the URL the form is really posted to.
The specific implementation code can be written as:
words_page = fetch(words_uri)
words_uri = words_page.uri
words_page = Nokogiri::HTML(words_page.body)
params = {}
# Copy every named input of the form, hidden fields included.
words_page.xpath("//form//input").each do |t|
  params[t.attribute("name").to_s] = t.attribute("value").to_s if t.attribute("name") != nil
end
params["username"] = "xxxx"
params["password"] = "xxxx"
form_action = URI("https://logindict.youdao.com/login/acc/login")
words_page = Net::HTTP.post_form(form_action, params)
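To see what such a hash looks like on the wire, it can be encoded by hand. The sketch below assumes that post_form sends the fields as an application/x-www-form-urlencoded body, which is what URI.encode_www_form produces. The service and back_url entries are hypothetical stand-ins for the hidden fields; the real names and values come from the login form.

```ruby
require 'uri'

# Hypothetical field values; the real hidden fields come from the login form.
params = {
  'service'  => 'dict',
  'back_url' => 'http://dict.youdao.com/wordbook/wordlist?keyfrom=null',
  'username' => 'xxxx',
  'password' => 'xxxx'
}

# Encode the hash the way an HTML form submission would.
body = URI.encode_www_form(params)
puts body
```

Reserved characters in the values, such as the colon and slashes in back_url, are percent-encoded, so hidden fields that contain URLs survive the round trip unchanged.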
Eager to run it and see the new page? If nothing unexpected happens, printing words_page.body shows yet another redirect page, and it points back to the login page. Our login has failed. Having gone through this myself, I suggest turning off JavaScript support in the browser to reproduce what our simulated login does. With JavaScript disabled, even the browser cannot log in: however many times you try, you are redirected back to the familiar login page. By now you can guess what is going on. JavaScript sits between the browser and the remote server, and the data we submit is probably modified before it is sent.
Look at the source code of the login page:
function validate(disableSubmit) {
  var form = document.f;
  var name = form.username.value.trim();
  var password = form.password.value;
  if ((name.length < 1 || name == _hint) || (password.length < 1) || password.length > -) {
    // alert("Please enter the correct netease email user name")
    showWarning1(true);
    return false;
  }
  form.password.value = hex_md5(form.password.value);
  var savelogin = yd_get_elem("savelogin");
  if (savelogin.checked) {
    form.cf.value = 7;
  }
  if (disableSubmit) {
    form.submit.disabled = true;
  }
  return true;
}
This is a JavaScript function that validates the user's login data, and it also hashes our password with MD5. Clearly we should submit the hashed password as well, just as the JavaScript does. Also note the cf field in the form, which is affected by the "save login" check box. With these changes the login succeeds: printing words_page.body no longer shows a redirect to the login page but a redirect to your word book. That is one step closer to success.
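The same two adjustments can be made on the Ruby side before posting the form. This is a sketch under two assumptions: that hex_md5 on the page is an ordinary MD5 digest written as lowercase hex, and that the server accepts cf as the string "7". The helper name apply_login_script and the remember flag are made up for illustration.

```ruby
require 'digest/md5'

# Mirror the page's validate() step on a copy of the form fields:
# hash the password and, when "save login" would be ticked, set cf to 7.
def apply_login_script(params, remember: false)
  prepared = params.dup
  prepared['password'] = Digest::MD5.hexdigest(params.fetch('password'))
  prepared['cf'] = '7' if remember
  prepared
end
```

The returned hash can then be passed to Net::HTTP.post_form in place of the original params.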
Managing cookies
HTTP is a stateless protocol, so if we now fetch the word-book HTML with a plain GET, we are redirected back to that damned login page. After a successful login we therefore have to keep the key the server gave us: the cookie. The server specifies the cookie in the Set-Cookie field of its HTTP response, and it is used to record a particular state.
cookies = words_page['set-cookie']
new_uri = URI.parse(words_page['location'])
http = Net::HTTP.new(new_uri.host, new_uri.port)
# request_uri keeps the query string, which path alone would drop.
res = http.get(new_uri.request_uri, { 'cookie' => cookies })
res = Nokogiri::HTML(res.body)
Printing res.body now shows the word-book page we wanted to download. The next step is to extract the data we are interested in from the page, which Nokogiri handles well.
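Each Set-Cookie field carries attributes such as Path or Domain in addition to the name=value pair, while the Cookie request header needs only the pairs. A minimal sketch of that conversion follows; the helper name cookie_header and the cookie names in the usage note are hypothetical.

```ruby
# Build a Cookie request header from raw Set-Cookie field values,
# keeping only the leading name=value pair of each field.
def cookie_header(set_cookie_fields)
  Array(set_cookie_fields)
    .map { |field| field.split(';', 2).first.strip }
    .reject(&:empty?)
    .join('; ')
end
```

For example, cookie_header(['DICT_SESS=abc; Path=/; HttpOnly', 'DICT_LOGIN=1; Domain=.youdao.com']) returns "DICT_SESS=abc; DICT_LOGIN=1". With a real response, the individual fields are available from words_page.get_fields('set-cookie').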
Create a personalized word book (i)