This article walks through a small example of fetching HTML data with a Python program; the method used is also the basis for writing crawlers in Python. There are roughly three ways to capture web page data: requesting the HTTP resource directly, simulating a browser request (usually needed when login or other verification is involved), and driving a real browser to capture the data. The example below reads simple web page data directly, without any of those complications:
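As a brief illustration of the second approach (simulating a browser request), a request can carry browser-like headers. This is my own hedged sketch, not part of the original example: the User-Agent string is just a placeholder, and real login flows usually also need cookies or a session.

```python
import requests

# A browser-like User-Agent header (a placeholder string, not something
# the article prescribes; any common browser string works).
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0 Safari/537.36',
}

def fetch_as_browser(url, timeout=10):
    """GET a page while presenting browser-like headers."""
    resp = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx responses
    return resp.text

# Usage (performs a real network request):
# html = fetch_as_browser('http://www.ittf.com/')
```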
Target data
Save the hyperlinks to all of the contestants listed on the ranking pages of the ITTF website.
Data request
I really like libraries that match the way people think, and requests is one of them. If you just want the text of a web page, one line is enough:
import requests

doc = requests.get(url).text
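In practice the one-liner benefits from a timeout and basic error handling. The wrapper below is my own minimal sketch (the function name is mine, not from the article):

```python
import requests

def get_html(url, timeout=10):
    """Return the page text, or None if the request or HTTP status fails."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # raise on 4xx/5xx responses
    except requests.RequestException:
        return None
    return resp.text
```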
Parse html to get data
Take BeautifulSoup as an example: it supports getting tags, extracting links, and traversing the HTML hierarchy; see the BeautifulSoup documentation for details. The following snippet collects the links at the specified position on a given ranking page of the ITTF website.
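Before the real snippet, here is a small self-contained illustration of the three operations mentioned (getting tags, reading links, traversing the hierarchy); the HTML fragment is made up for demonstration:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking a ranking table with player detail links.
html = """
<html><body>
  <div id="ranking">
    <a href="WR_Table_3_A2_Details.asp?ID=1">Player One</a>
    <a href="WR_Table_3_A2_Details.asp?ID=2">Player Two</a>
    <a href="other.asp">Other</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# 1. Get all tags of a kind.
atags = soup.find_all('a')

# 2. Read a link attribute from a tag.
first_href = atags[0].get('href')

# 3. Traverse by hierarchy: restrict the search to one <div>.
div = soup.find('div', id='ranking')
link_texts = [a.get_text() for a in div.find_all('a')]
```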
import requests
from bs4 import BeautifulSoup

# page: the ranking page number to fetch; linkfile: path of the output file
url = 'http://www.ittf.com/ittf_ranking/WR_Table_3_A2.asp?Age_category_1=&Age_category_2=&Age_category_3=&Age_category_4=&Age_category_5=&Category=100W&Cont=&Country=&Gender=W&Month1=4&Year1=2015&s_Player_Name=&Formv_WR_Table_3_Page=' + str(page)
doc = requests.get(url).text
soup = BeautifulSoup(doc, 'html.parser')
atags = soup.find_all('a')
rank_link_pre = 'http://www.ittf.com/ittf_ranking/'
links = []
mlfile = open(linkfile, 'a')
for atag in atags:
    href = atag.get('href')
    if href is not None and 'WR_Table_3_A2_Details.asp' in href:
        link = rank_link_pre + href
        links.append(link)
        mlfile.write(link + '\n')
        print('fetch link: ' + link)
mlfile.close()
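The filtering part of the snippet can also be factored into a small standalone function, which makes it easy to check without network access. The function name and structure below are my own sketch, not from the article:

```python
from bs4 import BeautifulSoup

RANK_LINK_PRE = 'http://www.ittf.com/ittf_ranking/'

def extract_detail_links(html, prefix=RANK_LINK_PRE):
    """Collect absolute URLs of the per-player Details links in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return [prefix + a['href']
            for a in soup.find_all('a')
            if a.get('href') and 'WR_Table_3_A2_Details.asp' in a['href']]

# Usage on a tiny made-up fragment:
sample = '<a href="WR_Table_3_A2_Details.asp?ID=7">X</a><a href="home.asp">Y</a>'
detail_links = extract_detail_links(sample)
```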