There are many ways to crawl web data: issuing HTTP requests directly from code, simulating a browser's requests (which usually requires login authentication), driving a real browser to scrape pages, and so on. This article sets the complicated cases aside and walks through a small example of reading data from a simple web page:
Target Data
Save the hyperlinks of all the players listed on this page of the ITTF website.
Data request
I really like libraries such as requests that match the way people think: if you want the text of a page, one line is enough:
doc = requests.get(url).text
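Before fetching, the page URL itself has to be assembled; the ITTF ranking URL used later in this article carries a long query string, and concatenating it by hand is error-prone. A minimal sketch of building the same query with the standard library's urlencode (the parameter names are copied from that URL; the function name and the subset of parameters shown are my own simplification):

```python
from urllib.parse import urlencode

def ranking_url(page):
    """Build the ITTF women's ranking URL for a given result page."""
    base = 'http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp'
    # Only the non-empty parameters from the article's URL are listed here.
    params = {
        'category': '100w',
        'gender': 'w',
        'month1': 4,
        'year1': 2015,
        'formv_wr_table_3_page': page,
    }
    return base + '?' + urlencode(params)

print(ranking_url(1))
```

urlencode also takes care of escaping, which matters once parameters like player names contain spaces or non-ASCII characters.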
Parse HTML to get data
Take BeautifulSoup as an example: it can find tags, extract links, and traverse the HTML hierarchy. In the following snippet, we collect the links at a specified location on a specified page of the ITTF website.
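As a minimal illustration of those three operations before the full scraper, here is BeautifulSoup run on a made-up HTML snippet (not the real ITTF page markup):

```python
from bs4 import BeautifulSoup

html = '''
<div id="ranking">
  <a href="WR_Table_3_A2_Details.asp?id=1">Player A</a>
  <a href="WR_Table_3_A2_Details.asp?id=2">Player B</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Find every <a> tag and read its href attribute.
hrefs = [a.get('href') for a in soup.find_all('a')]

# Traverse the hierarchy: locate the <div>, then walk the tags inside it.
div = soup.find('div', id='ranking')
names = [a.get_text() for a in div.find_all('a')]

print(hrefs)  # the two detail-page links
print(names)  # ['Player A', 'Player B']
```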
import requests
from bs4 import BeautifulSoup

# page, linkfile, and links are assumed to be defined earlier.
url = 'http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp?age_category_1=&age_category_2=&age_category_3=&age_category_4=&age_category_5=&category=100w&cont=&country=&gender=w&month1=4&year1=2015&s_player_name=&formv_wr_table_3_page=' + str(page)
doc = requests.get(url).text
soup = BeautifulSoup(doc, 'html.parser')
atags = soup.find_all('a')
rank_link_pre = 'http://www.ittf.com/ittf_ranking/'
mlfile = open(linkfile, 'a')
for atag in atags:
    # Keep only anchors that actually carry an href.
    if atag is not None and atag.get('href') is not None:
        if 'wr_table_3_a2_details.asp' in atag['href']:
            link = rank_link_pre + atag['href']
            links.append(link)
            mlfile.write(link + '\n')
            print('fetch link: ' + link)
mlfile.close()
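The link-filtering part of that loop can be factored into a small pure function that turns one page of HTML into absolute detail links, which makes it easy to test without touching the network. A sketch (the function name and sample HTML are mine; the filtering rule is the one used above, made case-insensitive since the site's hrefs may be mixed case):

```python
from bs4 import BeautifulSoup

RANK_LINK_PRE = 'http://www.ittf.com/ittf_ranking/'

def extract_player_links(html):
    """Return absolute URLs of every player-detail link on one ranking page."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for atag in soup.find_all('a'):
        href = atag.get('href')
        # Keep only links that point at the player detail page.
        if href and 'wr_table_3_a2_details.asp' in href.lower():
            links.append(RANK_LINK_PRE + href)
    return links

sample = '<a href="WR_Table_3_A2_Details.asp?id=7">X</a><a href="other.asp">Y</a>'
print(extract_player_links(sample))
```

Separating fetching from parsing this way also means the same function works unchanged if the HTML later comes from a cache or a browser-automation tool.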