A small example of crawling HTML information from a web page with a Python program
This article walks through a small example of using a Python program to capture the HTML content of a web page. The approach shown here is also the foundation for writing crawlers in Python.
There are many ways to capture web page data: sending HTTP requests directly, simulating a browser's requests (usually needed when login or other verification is involved), or driving a real browser to collect data. This article shows a small example of reading simple web page data, without tackling the more complex cases:
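Of the approaches above, sending the HTTP request directly is the simplest. As a minimal sketch, fetching a page's HTML needs nothing beyond the standard library (here the request is made against a tiny local server with made-up page content, so the example runs without external network access):

```python
import http.server
import threading
import urllib.request

# A tiny local server standing in for a real website, so the
# example is self-contained. The page content is illustrative.
PAGE = b"<html><body><a href='WR_Table_3_A2_Details.asp?id=1'>player</a></body></html>"

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Send the HTTP request directly -- no browser involved.
url = 'http://127.0.0.1:%d/' % server.server_address[1]
html = urllib.request.urlopen(url).read().decode('utf-8')
print(html)
server.shutdown()
```

The other two approaches (simulated or real browser sessions) only become necessary when the site requires login state or JavaScript rendering.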
Target Data
Save the hyperlinks to all of the players listed on a ranking page of the ITTF website.
Data Request
I really like libraries that match the way people think, and requests is one of them. If you just want the text of a web page, it takes a single line:
```python
doc = requests.get(url).text
```
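In real crawls that one line benefits from a little defensiveness: a timeout so a stalled server cannot hang the crawler, and a status check so HTTP errors are not silently parsed as pages. A hedged sketch (the helper name `fetch_text` is my own, not from the original article):

```python
import requests

def fetch_text(url, timeout=10):
    """Fetch a page and return its text.

    A slightly more defensive variant of requests.get(url).text:
    a timeout prevents the crawl from hanging on a dead server,
    and raise_for_status() turns 4xx/5xx responses into exceptions
    instead of returning an error page as if it were data.
    """
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```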
Parse the HTML to get the data
Take BeautifulSoup as an example: it can look up tags, extract links, and traverse the HTML hierarchy. The snippet below collects the links at the specified position on a given ranking page of the ITTF website.
```python
import requests
from bs4 import BeautifulSoup

page = 1                  # ranking page index
linkfile = 'links.txt'    # output file for the collected links
links = []

url = ('http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp'
       '?Age_category_1=&Age_category_2=&Age_category_3='
       '&Age_category_4=&Age_category_5=&Category=100W&Cont=&Country='
       '&Gender=W&Month1=4&Year1=2015&s_Player_Name='
       '&Formv_WR_Table_3_Page=' + str(page))
doc = requests.get(url).text
soup = BeautifulSoup(doc, 'html.parser')
atags = soup.find_all('a')
rank_link_pre = 'http://www.ittf.com/ittf_ranking/'

mlfile = open(linkfile, 'a')
for atag in atags:
    if atag is not None and atag.get('href') is not None:
        # keep only the links that point at player detail pages
        if 'WR_Table_3_A2_Details.asp' in atag['href']:
            link = rank_link_pre + atag['href']
            links.append(link)
            mlfile.write(link + '\n')
            print('fetch link: ' + link)
mlfile.close()
```
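The filtering logic is easier to test if it is factored into a function over an HTML string, so it can run against a saved page without any network access. A sketch (the function name and the sample document below are illustrative, not from the original):

```python
from bs4 import BeautifulSoup

DETAIL_PAGE = 'WR_Table_3_A2_Details.asp'
RANK_LINK_PRE = 'http://www.ittf.com/ittf_ranking/'

def extract_player_links(html):
    """Pull the player-detail links out of one ranking page.

    Same filtering as the snippet above: keep only <a> tags whose
    href points at the player detail page, and prefix them with
    the site path to make them absolute.
    """
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for atag in soup.find_all('a'):
        href = atag.get('href')
        if href is not None and DETAIL_PAGE in href:
            result.append(RANK_LINK_PRE + href)
    return result

# A tiny sample document standing in for a real ranking page.
sample = """
<html><body>
  <a href="WR_Table_3_A2_Details.asp?Player_ID=1">Player One</a>
  <a href="index.html">home</a>
  <a href="WR_Table_3_A2_Details.asp?Player_ID=2">Player Two</a>
</body></html>
"""
print(extract_player_links(sample))
```

Separating fetching from parsing this way also means a multi-page crawl is just a loop that calls the function once per downloaded page.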