First of all, some prep work.
Python 2.7.11: Download Python
PyCharm: Download PyCharm
Python 2 and Python 3 are currently both released and maintained in parallel; I use Python 2 as the environment here. PyCharm is an efficient Python IDE, but it is paid software.
The basic idea
First of all, our target website: Android Market
Click the "Apps" tab to reach our key page:
After jumping to the app list page, we need to focus on three places, marked with red boxes:
Focus on the URL in the address bar, the "Free download" button, and the paging controls at the bottom. Clicking the "Free download" button immediately downloads the corresponding app, so our idea is to grab the download link behind that click, which lets us download the app directly.
Writing crawlers
The first problem to solve: how do we get that download link? Here we have to introduce the basic principle of how a browser displays a page. Simply put, a browser is a parser-like tool: it fetches HTML and other code and parses and renders it according to the corresponding rules, so that we see the page.
Here I am using Chrome. Right-click on the page and choose "Inspect" to see the original HTML code:
Don't worry about the dazzling wall of HTML. Chrome's element inspector has a handy little feature that helps us locate the HTML code corresponding to any control on the page.
As shown, click the small arrow in the upper part of the panel, then click a spot on the page, and the corresponding HTML code on the right is automatically located and highlighted.
Next we locate the HTML code corresponding to the download button:
You can see that the button's code contains the corresponding download link: "/appdown/com.tecent.mm". Prefixing it with the site root gives the full download link: http://apk.hiapk.com/appdown/com.tecent.mm
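As an aside, simple string concatenation works here, but the standard library can also join a site root and a relative link for us. A minimal sketch, using the link scraped above:

```python
try:
    from urlparse import urljoin       # Python 2
except ImportError:
    from urllib.parse import urljoin   # Python 3

# The relative link taken from the download button's HTML
relative_link = "/appdown/com.tecent.mm"

# Join it with the site root to get the complete download URL
full_link = urljoin("http://apk.hiapk.com", relative_link)
print(full_link)  # http://apk.hiapk.com/appdown/com.tecent.mm
```

Unlike naive concatenation, urljoin also handles trailing slashes on the base URL correctly.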
First use Python to fetch the entire page's HTML. This is very simple: call requests.get(url), with url filled in with the corresponding address.
Then, when extracting the key information from the page, follow the idea of "grab the big pieces first, then the small ones". You can see that a page listing 10 apps corresponds to 10 items in the HTML code:
Each li tag contains one app's attributes (name, download link, and so on). So the first step is to extract the 10 li tags:
def geteveryapp(self, source):
    # "Grab big first": pull out the whole <li> block for each app
    everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
    return everyapp
Only simple regular expression knowledge is needed here.
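To see what the regular expression does, here is a standalone sketch run on a made-up two-item snippet (the tag structure is assumed to match the real page):

```python
import re

# A tiny stand-in for the real page source: two list items
sample = '''
<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>
<li class="list_item"><a href="/appinfo/com.example.two">App Two</a></li>
'''

# re.S lets "." match newlines too, so a <li> block spanning several
# lines is still captured whole; ".*?" keeps each match non-greedy,
# stopping at the first closing </li> instead of the last
items = re.findall(r'(<li class="list_item".*?</li>)', sample, re.S)
print(len(items))  # 2
```

Without the non-greedy `?`, the pattern would swallow everything from the first `<li` to the last `</li>` as a single match.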
Then extract the download link from each li tag:
def getinfo(self, eachclass):
    info = {}
    # Grab the whole <a href="..."> fragment, then pull the quoted link out of it
    str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
    app_url = re.search(r'"(.*?)"', str1).group(1)
    # The detail-page path becomes the direct-download path
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
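Applied to a single hypothetical li block, the two-step search plus the appinfo-to-appdown replacement looks like this:

```python
import re

# A made-up <li> block in the same shape as the market page
eachclass = '<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>'

# First grab the whole <a href="..."> fragment...
str1 = re.search(r'<a href="(.*?)">', eachclass).group(0)
# ...then pull the quoted link out of it
app_url = re.search(r'"(.*?)"', str1).group(1)

# Swap the detail-page path for the direct-download path
appdown_url = app_url.replace('appinfo', 'appdown')
print(appdown_url)  # /appdown/com.example.one
```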
The next tricky point is paging. Clicking the page buttons at the bottom, we can see the address bar change as follows:
Suddenly it's clear: we can change the URL in each request, replacing the corresponding page-index value, to implement paging.
def changepage(self, url, total_page):
    # Read the current page index from the pi= parameter
    now_page = int(re.search(r'pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # Substitute the index to build each page's URL
        link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group
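A standalone sketch of the same idea, without the class wrapper, shows the URLs it generates (the starting URL is the one used later in the full code):

```python
import re

def changepage(url, total_page):
    # Read the current page index from the pi= parameter
    now_page = int(re.search(r'pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # Substitute the index to build each page's URL
        page_group.append(re.sub(r'pi=\d+', 'pi=%s' % i, url))
    return page_group

links = changepage('http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1', 3)
for link in links:
    print(link)
# ...?sort=5&pi=1
# ...?sort=5&pi=2
# ...?sort=5&pi=3
```

Note that `re.sub`'s fourth positional argument is a replacement count, not a flags field, so a flag like re.S must not be passed there.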
Crawler effects
With the key parts done, let's first look at the finished crawler in action:
The results are saved in a TXT file as follows:
Copied directly into Thunder (Xunlei), the links can be batch-downloaded at high speed.
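Saving is just writing one full URL per line. A minimal sketch, assuming the same info dicts the crawler builds (the two entries below are made up for illustration):

```python
# Made-up results in the same shape the crawler collects
site = "http://apk.hiapk.com"
appinfo = [{'app_url': '/appdown/com.example.one'},
           {'app_url': '/appdown/com.example.two'}]

# One complete download URL per line, ready to paste into a downloader
with open('info.txt', 'w') as f:
    for each in appinfo:
        f.write(site + each['app_url'] + '\n')
```

The crawler's own saveinfo method opens the file in append mode instead, so repeated runs accumulate links rather than overwrite them.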
The full code
#-*- coding:utf8 -*-
import requests
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

class spider(object):

    def __init__(self):
        print u'start crawling content'

    def getsource(self, url):
        # Fetch the raw HTML of one page
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        # Build the URL for every page from now_page up to total_page
        now_page = int(re.search(r'pi=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        # Pull out the whole <li> block for each app
        everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        # Extract the link from one <li> block and turn it into a download link
        info = {}
        str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
        app_url = re.search(r'"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info

    def saveinfo(self, classinfo):
        # Append one complete download URL per line
        f = open('info.txt', 'a')
        str2 = "http://apk.hiapk.com"
        for each in classinfo:
            f.write(str2)
            f.writelines(each['app_url'] + '\n')
        f.close()

if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
    appurl = spider()
    all_links = appurl.changepage(url, 5)
    for link in all_links:
        print u'processing page ' + link
        html = appurl.getsource(link)
        every_app = appurl.geteveryapp(html)
        for each in every_app:
            info = appurl.getinfo(each)
            appinfo.append(info)
    appurl.saveinfo(appinfo)
Summary
The target page chosen here has a relatively simple structure, so this is a fairly basic crawler. Please forgive the somewhat messy code. That is the whole content of this article; I hope it can be of some help in your study or work, and if you run into problems, feel free to leave a message to discuss.