[Project] Simulating an HTTP POST Request to Obtain Data from a Web Page with Python's Scrapy Framework

Source: Internet
Author: User
Tags: http post, ming, xpath, chrome developer tools, python, scrapy



1. Background



Though it is always difficult to give a child a perfect name, parents never give up trying. A friend of mine ran into this problem: his baby girl had just come into the world, and he wanted to find a perfect name for her. He found a web page where he could enter a baby's name, birthday, and birth time, and the page would return two scores indicating whether the name was good or bad for the baby according to China's old philosophy, "The Book of Changes" (I Ching). The two scores, which we will call score1 and score2, range from 0 to 100. My friend asked me whether it was possible to write a script that inputs thousands of popular names in batches, so that he could then select a baby name from the top-scoring ones, such as names with both score1 and score2 over 95.



The website



https://www.threetong.com/ceming/



2. Analysis and Plan



The information we have is the surname and the birth date and time (the "eight characters" of the birth chart); what we need is the given name. Since the parents want a two-character given name, the task can be refined to choosing word 1 and word 2. Entering this information into the website returns two scores: a name score and a name-with-birth-chart score. If both scores are good, for example both 100 or above 95, the name goes onto a shortlist. I can, for example, feed 100,000 commonly used names into this website together with the birth chart, so that each name gets a score. Sorting from high to low produces a ranked list of names for the child's parents to screen; they then pick the ones they like for pronunciation, meaning, and so on, which is their subjective choice. At least this way, whatever name they choose conforms to the theory of the Book of Changes, promising a lifetime of peace and happiness.



A Chinese name usually consists of a family name and a given name. The family name is usually one or two Chinese characters; my friend's family name is one character. The given name is also usually one or two characters, and two-character given names have recently been more popular. My friend wanted a two-character given name. Since the baby girl's family name is known (the same as her father's), I just needed to generate thousands of given names suitable for a girl, input them automatically on the website, and finally collect the displayed score1 and score2.



3. Steps



A. Obtain Chinese characters suitable for naming a girl



Because the goal is a two-character given name, both word 1 and word 2 can be drawn from this character list, and a double loop then forms every possible combination of word 1 and word 2. I chose a list of about 800 characters, so each surname yields 800 × 800 possible two-character combinations. The scraping code is very basic Scrapy and just pulls the character list from the website.



Traditionally, certain characters are favored for naming a girl, and I found such a list online. The spider below fetches it:


# spider code
# -*- coding: utf-8 -*-
import scrapy
from getname.items import GetnameItem

class DownnameSpider(scrapy.Spider):
    name = 'downname'
    start_urls = ['http://xh.5156edu.com/xm/nu.html']

    # Handler for the response to the default HTTP request; this is a callback.
    def parse(self, response):
        item = GetnameItem()
        item['ming'] = response.xpath("//a[@class='fontbox']/text()").extract()
        yield item

# item definition to hold the fetched characters
import scrapy

class GetnameItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ming = scrapy.Field()


The rest is the auto-generated code for the Scrapy framework.
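The combination step described above can then be sketched with a double loop. This is a minimal sketch: the three characters here are placeholders standing in for the roughly 800 scraped ones.

```python
# Sketch: build every two-character given name from one character list.
from itertools import product

chars = ["雅", "欣", "婷"]  # placeholder; the real scraped list has ~800 entries

# Every ordered pair (word 1, word 2) becomes one candidate given name.
names = [w1 + w2 for w1, w2 in product(chars, repeat=2)]

print(len(names))  # len(chars) ** 2 combinations, i.e. 9 here
```

With 800 characters the same loop produces 800 × 800 candidates, which is why the results later need aggressive filtering.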



B. Combine the characters into two-character given names, attach the surname and birth chart, submit each name to the naming website, collect the scores, filter out the low scorers (for example, anything below 95 points), and present the rest to the child's parents.
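A minimal sketch of that filtering-and-sorting step follows; the scored tuples are made-up illustrative values, not real output from the site.

```python
# Sketch: keep only names whose two scores are both >= 95,
# then sort the survivors from high to low by total score.
scored = [
    ("刘雅欣", 98.0, 96.5),  # (name, score1, score2) -- illustrative values
    ("刘婷婷", 80.0, 99.0),
    ("刘欣怡", 95.5, 95.0),
]

shortlist = [t for t in scored if t[1] >= 95 and t[2] >= 95]
shortlist.sort(key=lambda t: t[1] + t[2], reverse=True)

for name, s1, s2 in shortlist:
    print(name, s1, s2)
```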



4. Difficulties and Techniques



A. How to quickly obtain the XPath of a target element on a web page



If you already know XPath syntax well, this section is a bit redundant; you can write the expression directly from the grammar rules. If you are not familiar with it, Chrome, Firefox, and other browsers have tools to help you.



Take Chrome as an example



Open the page in Chrome, right-click the element you want to extract, and select Inspect.






This opens Chrome's developer tools. Find the line of source code that contains the element, right-click it, and select Copy > Copy XPath.






Finally, paste it into your Scrapy code.



This only gets the content of that one tag, i.e., it is "static". If you need to get all matching content, you still need to understand XPath. For example:


item['ming'] = response.xpath("//a[@class='fontbox']/text()").extract()


The matching expression inside the XPath uses the @ attribute syntax; it means: fetch the text content of every a tag whose class is fontbox.



As shown






See the code implementation in step 3.A for details.
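The attribute match above can be illustrated without a live page. This sketch uses only the standard library (rather than Scrapy's selector, which the article itself uses) and a made-up HTML fragment.

```python
# Sketch: an XPath-style attribute predicate picks out every <a> whose
# class is "fontbox", just like //a[@class='fontbox']/text() does.
import xml.etree.ElementTree as ET

html = """<div>
  <a class="fontbox">梅</a>
  <a class="other">x</a>
  <a class="fontbox">兰</a>
</div>"""

root = ET.fromstring(html)
# findall supports the [@attr='value'] predicate from XPath.
texts = [a.text for a in root.findall(".//a[@class='fontbox']")]
print(texts)  # the text of every matching tag, not just one
```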



B. How to implement automatic input on the target website



Automatic input here does not mean finding the input boxes, filling in the form, and simulating a click on submit. Instead, we directly simulate the HTTP request that the target web page sends. There are two common methods, HTTP GET and HTTP POST. By observing the target site "https://www.threetong.com/ceming/", we find that it uses POST. To learn which page the data is sent to, and in what format, we can again use our good friend, Chrome's developer tools.



First simulate the operation by entering data at "https://www.threetong.com/ceming/":






You can then open Chrome developer tools to view the source code (the Elements tab).






We can see that this form is sent by POST, and that the destination is the page named in the action attribute.



Then click the name-test button; this sends the data to the target page and jumps to it.






On this page, open Chrome's developer tools, go to the Network tab, select the xingmingceshi.php entry, and click Headers on the right to see the details of the request.






You can see how this page was reached, including the Request URL and the request method (POST). Scrolling down shows the submitted form data.






So we just need to simulate an HTTP POST request that sends the form data to that request URL.
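As a hedged sketch of what such a POST looks like at the HTTP level, the request can be built with the standard library alone (the article itself uses Scrapy's FormRequest instead). The field names are copied from the form discussed above; the values are illustrative, and the request is only constructed here, not actually sent.

```python
# Sketch: build the POST request that the form submission produces.
# urllib is used purely to show the raw mechanics, not to replace Scrapy.
from urllib.parse import urlencode
from urllib.request import Request

form = {
    "isbz": "1",
    "txtname": "刘",   # surname -- illustrative value
    "name": "雅欣",    # one generated given name
    "rdosex": "0",
}
req = Request(
    "https://www.threetong.com/ceming/baziceming/xingmingceshi.php",
    data=urlencode(form).encode("utf-8"),  # a request body makes it a POST
)
print(req.get_method())  # POST
```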



C. Specific code implementation


# items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# item to hold the final generated data
class DaxiangnameItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    score1 = scrapy.Field()
    score2 = scrapy.Field()
    name = scrapy.Field()

# spider
# -*- coding: utf-8 -*-
import scrapy
import csv

# import the item defined above
from daxiangname.items import DaxiangnameItem

class CemingSpider(scrapy.Spider):
    name = 'ceming'

    # Scrapy's default entry function; the program starts running here
    def start_requests(self):
        # Double loop: open two CSV files and read one character at a time.
        # Read them as UTF-8 to avoid mojibake.
        with open(getattr(self, 'file', './ming1.csv'), encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for line in reader:
                with open(getattr(self, 'file2', './ming2.csv'), encoding='utf-8') as f2:
                    reader2 = csv.DictReader(f2)
                    for line2 in reader2:
                        # Note the keys: reading the CSV leaves a BOM (\ufeff)
                        # in front of the header names, verified with print()
                        mingzi = line['\ufeffming1'] + line2['\ufeffming2']
                        # The core call: Scrapy's FormRequest simulates
                        # sending an HTTP POST request
                        yield scrapy.http.FormRequest(
                            url='https://www.threetong.com/ceming/baziceming/xingmingceshi.php',
                            formdata={
                                'isbz': '1',
                                'txtname': u'刘',     # surname; shown as "Liu" in the translated source
                                'name': mingzi,
                                'rdosex': '0',
                                'data_type': '0',
                                'cboyear': '2017',
                                'cbomonth': '7',
                                'cboday': '30',
                                'cbohour': u'20时',   # hour/minute dropdown values,
                                'cbominute': u'39分', # reconstructed from the translated source
                            },
                            # callback: the function the returned response is sent to
                            callback=self.after_login,
                        )
                        # The yield matters: in Scrapy, every HTTP request goes into
                        # a pool, forming an iterator/generator that feeds the sender

    def after_login(self, response):
        '''
        # save the response body into a file for debugging
        filename = 'source.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        '''
        # Extract the scores from the returned page; the regular expression
        # is a small trick to grab integers and decimals
        score1 = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div[1]/span[1]/text()').re('[\d.]+')
        score2 = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div[1]/span[2]/text()').re('[\d.]+')
        name = response.xpath('/html/body/div[6]/div/div[2]/div[3]/ul[1]/li[1]/text()').extract()
        # Keep only the so-called good scores
        # (the score1 threshold is garbled in the source; 90 matches score2's)
        if float(score1[0]) >= 90 and float(score2[0]) >= 90:
            item = DaxiangnameItem()
            item['score1'] = score1
            item['score2'] = score2
            item['name'] = name
            # yield into the output pool; run with the -o parameter
            # to export all items
            yield item


5. PostScript



In the end, the script unexpectedly returned more than 4,000 names, which made the final manual screening very difficult. By pruning the characters in the two input CSV files, we finally narrowed the list down to a few dozen names that were good from every angle.




