This article introduces a small example of using a Python program to crawl HTML information from a web page. The approach shown here is also the basic technique for writing crawlers in Python; readers who need it can use it as a reference.
There are several general approaches to crawling Web data: requesting HTTP directly from code, simulating a browser when requesting the data (usually needed when login verification is involved), and driving a real browser to capture the data. This article leaves the complicated cases aside and gives a small example of reading data from a simple Web page:
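As a rough illustration of the first two approaches, both can be done with the requests library. The URL and the User-Agent value below are placeholders for illustration only, not taken from this article:

import requests

url = 'http://example.com/some_page.html'   # hypothetical target page

# Approach 1: request the page directly from code.
html = requests.get(url).text

# Approach 2: make the request look like it comes from a browser by sending
# browser-like headers (some sites reject requests that do not send them).
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, headers=headers).text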
Target data
Save the hyperlinks for all of the players listed on this page of the ITTF Web site.
Data request
I really like libraries that match the way people think, such as requests: if you just want the text of a page, one line is enough:
doc = requests.get(url).text
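The one-liner assumes that requests has been imported and that url has been set. A minimal self-contained sketch, using the ITTF ranking page as the target and adding a status check that the one-liner does not have:

import requests

url = 'http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp'   # base ranking URL; the full query string appears below
resp = requests.get(url)
resp.raise_for_status()      # stop early on HTTP errors
doc = resp.text              # the raw HTML of the page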
Parse HTML to get data
Take BeautifulSoup as an example: it covers getting tags, extracting links, and traversing the HTML hierarchy (see the BeautifulSoup documentation for reference). A short sketch of those basics comes first; after it is the fragment used against the ITTF Web site, which collects the links at the specified positions on the specified page.
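A minimal sketch of those basics, using made-up HTML purely for illustration:

from bs4 import BeautifulSoup

html = '<div id="rank"><table><tr><td><a href="players/1.asp">Some Player</a></td></tr></table></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', id='rank')        # get a tag by name and attribute
for a in div.find_all('a'):              # traverse the hierarchy below it
    print(a.get('href'), a.get_text())   # prints: players/1.asp Some Player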
And the ITTF fragment itself:

import requests
from bs4 import BeautifulSoup

# page, linkfile and links are assumed to be defined earlier in the script:
# the ranking page number, the output file path and the list collecting the results.
url = ('http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp?'
       'age_category_1=&age_category_2=&age_category_3=&age_category_4=&age_category_5='
       '&category=100w&cont=&country=&gender=w&month1=4&year1=2015'
       '&s_player_name=&formv_wr_table_3_page=' + str(page))
doc = requests.get(url).text
soup = BeautifulSoup(doc, 'html.parser')
atags = soup.find_all('a')                            # every <a> tag on the page
rank_link_pre = 'http://www.ittf.com/ittf_ranking/'

mlfile = open(linkfile, 'a')
for atag in atags:
    # keep only anchors that point to a player's detail page
    if atag is not None and atag.get('href') is not None:
        if 'wr_table_3_a2_details.asp' in atag['href'].lower():
            link = rank_link_pre + atag['href']
            links.append(link)
            mlfile.write(link + '\n')
            print('Fetch link: ' + link)
mlfile.close()
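The fragment reads a single page; since the page number is part of the query string, the same logic can be run over several pages. A rough sketch of that loop, with the number of pages and the output file name chosen arbitrarily for the example:

import requests
from bs4 import BeautifulSoup

BASE = ('http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp?'
        'age_category_1=&age_category_2=&age_category_3=&age_category_4=&age_category_5='
        '&category=100w&cont=&country=&gender=w&month1=4&year1=2015'
        '&s_player_name=&formv_wr_table_3_page=')

def fetch_page_links(page):
    # Collect the player-detail links found on one ranking page.
    soup = BeautifulSoup(requests.get(BASE + str(page)).text, 'html.parser')
    pre = 'http://www.ittf.com/ittf_ranking/'
    return [pre + a['href'] for a in soup.find_all('a')
            if a.get('href') and 'wr_table_3_a2_details.asp' in a['href'].lower()]

links = []
for page in range(1, 6):                     # the number of pages is an arbitrary example value
    links.extend(fetch_page_links(page))

with open('ittf_links.txt', 'w') as f:       # hypothetical output file name
    f.write('\n'.join(links))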