I. What is a crawler?
A crawler is a tool that fetches resources and data from the network automatically. You can search Baidu for more detail.
II. How to write a simple crawler
1. Get Web Content
You can download the content of a web page with Python 3's built-in urllib module. It is easy to implement:
```python
import urllib.request

url = "http://www.baidu.com"
response = urllib.request.urlopen(url)
html_content = response.read()
```
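One detail worth knowing: `urlopen(...).read()` returns raw bytes, not a string, so you usually need to decode it before treating it as text. A minimal sketch of this, using a `data:` URL (an assumption made here so the sketch runs without network access; swap in a real URL like `http://www.baidu.com` in practice):

```python
import urllib.request

# a data: URL stands in for a real web page so this runs offline
url = "data:text/html;charset=utf-8,<html><body>hello</body></html>"
response = urllib.request.urlopen(url)
raw_bytes = response.read()               # urlopen returns bytes
html_content = raw_bytes.decode("utf-8")  # decode to get a str
print(html_content)
```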
You can also use the third-party library requests, which is just as convenient. Of course you need to install it first: pip install requests (pip ships with Python 3.4 and later).
```python
import requests

html_content = requests.get(url).text
```
2. Parsing Web content
The downloaded html_content is really just HTML source code; we need to parse it to extract the parts we want.
There are many ways to parse a web page. Here I introduce BeautifulSoup. Since this is also a third-party library, install it before use: pip install bs4
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")
```
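To make the parsing step concrete, here is a self-contained sketch that runs BeautifulSoup over a small hand-written HTML snippet (the snippet and its links are made up for illustration, not taken from any real page):

```python
from bs4 import BeautifulSoup

# a tiny HTML snippet standing in for a downloaded page (hypothetical data)
html_content = """
<html><body>
  <h1>Top Apps</h1>
  <a href="/details?id=com.example.one" class="app">App One</a>
  <a href="/details?id=com.example.two" class="app">App Two</a>
</body></html>
"""

soup = BeautifulSoup(html_content, "html.parser")
print(soup.h1.string)  # the text inside the <h1> tag

# find_all collects every matching tag; class_ avoids the "class" keyword
links = soup.find_all("a", class_="app")
hrefs = [link["href"] for link in links]
print(hrefs)
```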
III. Case study
The best way to understand how crawlers work is to analyze real examples; crawlers vary, but the principle stays the same. Enough talk, on to the good stuff.
=================================== I'm a split line ===================================================
Requirement: crawl the top N apps from the Xiaomi app store
Open the Xiaomi App Store ranking page in a browser and press F12 to inspect the elements.
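Inspecting the page shows that each app's detail link contains "/details?". A small sketch of filtering such links with a regex, on made-up href values (note that "?" must be escaped in a regex, or it would also match "/detail"):

```python
import re

# hypothetical href values, like what F12 might reveal in the page source
hrefs = [
    "/details?id=com.example.one",
    "/category/5",
    "/details?id=com.example.two",
]

# escape "?" so it matches a literal question mark
pattern = re.compile(r"/details\?")
matched = [h for h in hrefs if pattern.search(h)]
print(matched)
```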
```python
# coding=utf-8
import re
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup


def parser_apks(count=0):
    """Parse the Xiaomi application market ranking pages."""
    _root_url = "http://app.mi.com"  # app market home page
    res_parser = {}
    page_num = 1  # start crawling from page 1, then page 2, and so on
    while count:
        # fetch the content of the ranking page
        wbdata = requests.get("http://app.mi.com/topList?page=" + str(page_num)).text
        print("Start crawling page " + str(page_num))
        # parse the page to get the links to each app's detail page
        # (see the BeautifulSoup documentation for the specifics of find_all)
        soup = BeautifulSoup(wbdata, "html.parser")
        links = soup.body.contents[3].find_all(
            "a", href=re.compile("/details?"), class_="", alt="")
        for link in links:
            detail_link = urllib.parse.urljoin(_root_url, str(link["href"]))
            package_name = detail_link.split("=")[1]
            # get the apk download address from the detail page
            download_page = requests.get(detail_link).text
            soup1 = BeautifulSoup(download_page, "html.parser")
            download_link = soup1.find(class_="download")["href"]
            download_url = urllib.parse.urljoin(_root_url, str(download_link))
            # parsing produces duplicate results, so deduplicate here
            if download_url not in res_parser.values():
                res_parser[package_name] = download_url
                count = count - 1
            if count == 0:
                break
        if count > 0:
            page_num = page_num + 1
    print("the number of crawled apks is: " + str(len(res_parser)))
    return res_parser


def craw_apks(count=1, save_path="d:\\apk\\"):
    res_dic = parser_apks(count)
    for apk in res_dic.keys():
        print("Downloading app: " + apk)
        urllib.request.urlretrieve(res_dic[apk], save_path + apk + ".apk")
        print("Download complete")


if __name__ == "__main__":
    craw_apks(10)
```
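Two small string operations do most of the work in the script above: `urllib.parse.urljoin` turns the relative detail links into absolute URLs, and `split("=")` pulls the package name out of the query string. A quick sketch of both (the app id here is made up):

```python
from urllib.parse import urljoin

_root_url = "http://app.mi.com"

# a relative href from the ranking page becomes an absolute URL
detail_link = urljoin(_root_url, "/details?id=com.example.app")
print(detail_link)  # http://app.mi.com/details?id=com.example.app

# the package name is whatever follows the "=" in the query string
package_name = detail_link.split("=")[1]
print(package_name)  # com.example.app
```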
Running result:
Downloading app: com.tencent.tmgp.sgame
Download complete
...
That is all for this simple crawler. In practice, crawlers can get quite complex: different web pages call for different parsing approaches, which takes further study...
Python crawler starter example (crawling the top apps' APKs from the Xiaomi store)