Python Crawler Starter Example (Crawling the Top App APKs from the Xiaomi Store)


I. What is a crawler?

A crawler is a tool for fetching all kinds of resources and data from the web. For the details, you can Baidu it yourself.

II. How to write a simple crawler

1. Get Web Content

You can download the content of a web page with urllib, which is built into Python 3.x. It's easy to implement:

    import urllib.request

    url = "http://www.baidu.com"
    response = urllib.request.urlopen(url)
    html_content = response.read()
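Note that response.read() returns bytes rather than a string; if you want text, you generally decode it first:

    html_content = response.read().decode("utf-8")  # decode, assuming the page is UTF-8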

You can also use the third-party library requests, which is just as convenient. Before using it, you of course need to install it: pip install requests (pip ships with Python 3.4+, which makes this painless)

    import requests

    html_content = requests.get(url).text

2. Parse Web Content

The fetched html_content is really just HTML source code; we need to parse it to extract the parts we want.

There are many ways to parse a web page. The one I introduce here is BeautifulSoup. It is also a third-party library, so install it first: pip install bs4

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_content, "html.parser")
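To show what "getting what we need" looks like, here is a minimal sketch that continues the snippets above (baidu.com is used purely as an example page; which tags are worth extracting depends entirely on the page you are scraping):

    import requests
    from bs4 import BeautifulSoup

    html_content = requests.get("http://www.baidu.com").text
    soup = BeautifulSoup(html_content, "html.parser")
    print(soup.title.string)          # the text of the page's <title> tag
    for a in soup.find_all("a")[:5]:  # the first few <a> (link) tags
        print(a.get("href"))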

III. Case analysis

The best way to understand how crawlers work is to analyze some real examples; however much crawlers vary, the principle stays the same. Enough talk, on to the good stuff.

=================================== I'm a split line ===================================================

Requirement: crawl the top N apps from the Xiaomi app store

Open the Xiaomi App Store ranking page in a browser and press F12 to inspect the elements.
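What the inspector shows (an observation about the page's markup at the time of writing, not a guaranteed structure): each app on the ranking list is an <a> tag whose href points at a detail page like /details?id=<package name>, and each detail page carries a download link marked with the class "download". The sketch below runs the same kind of selector the full code uses, but against a hand-written fragment of that assumed shape:

    import re
    from bs4 import BeautifulSoup

    # A hand-written fragment mimicking the assumed ranking-page markup;
    # the real page may differ, so always re-check it with F12 first.
    sample_html = '<body><a href="/details?id=com.example.app">Example App</a></body>'
    soup = BeautifulSoup(sample_html, "html.parser")
    for link in soup.find_all("a", href=re.compile("/details?")):
        print(link["href"].split("=")[1])  # prints: com.example.app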

    # coding=utf-8
    import re
    import urllib.parse
    import urllib.request

    import requests
    from bs4 import BeautifulSoup


    def parser_apks(count=0):
        """Xiaomi application market"""
        _root_url = "http://app.mi.com"  # app market home page URL
        res_parser = {}
        # Start crawling from page 1; when a page is finished, move on to the next
        page_num = 1
        while count:
            # Fetch the HTML of the ranking page
            wbdata = requests.get("http://app.mi.com/topList?page=" + str(page_num)).text
            print("Start crawling page " + str(page_num))
            # Parse the page to collect the links to each app's detail page;
            # contents[3] picks out the list container found via F12
            # (for the specifics of BeautifulSoup usage, please Baidu it...)
            soup = BeautifulSoup(wbdata, "html.parser")
            links = soup.body.contents[3].find_all(
                "a", href=re.compile("/details?"), class_="", alt="")
            for link in links:
                detail_link = urllib.parse.urljoin(_root_url, str(link["href"]))
                package_name = detail_link.split("=")[1]
                # Get the apk download address from the detail page
                download_page = requests.get(detail_link).text
                soup1 = BeautifulSoup(download_page, "html.parser")
                download_link = soup1.find(class_="download")["href"]
                download_url = urllib.parse.urljoin(_root_url, str(download_link))
                # Parsing yields duplicate results, so deduplicate before storing
                if download_url not in res_parser.values():
                    res_parser[package_name] = download_url
                    count = count - 1
                if count == 0:
                    break
            if count > 0:
                page_num = page_num + 1
        print("The number of crawled apks is: " + str(len(res_parser)))
        return res_parser


    def craw_apks(count=1, save_path="d:\\apk\\"):
        res_dic = parser_apks(count)
        for apk in res_dic.keys():
            print("Downloading app: " + apk)
            urllib.request.urlretrieve(res_dic[apk], save_path + apk + ".apk")
            print("Download complete")


    if __name__ == "__main__":
        craw_apks(10)

Run result:

Downloading app: com.tencent.tmgp.sgame
Download complete
.
.
.
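One practical caveat: urllib.request.urlretrieve will fail if the destination folder does not exist, and the script assumes d:\apk\ is already there. A small variant of craw_apks (the os.makedirs call is my addition, not part of the original) creates the folder before downloading:

    import os
    import urllib.request

    def craw_apks(count=1, save_path="d:\\apk\\"):
        os.makedirs(save_path, exist_ok=True)  # added safeguard: create the folder if missing
        res_dic = parser_apks(count)  # parser_apks as defined above
        for apk in res_dic:
            print("Downloading app: " + apk)
            urllib.request.urlretrieve(res_dic[apk], save_path + apk + ".apk")
            print("Download complete")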

That's all for this simple crawler. In practice, crawler implementations can get quite complex: different web pages call for different parsing approaches, and that takes further study...
