Crawling app download links with Python

First of all, the prep work.

Python 2.7.11: Download Python

Pycharm: Download Pycharm

Python 2 and Python 3 are both maintained in parallel at the moment; I use Python 2 as the environment here. PyCharm is an efficient Python IDE, but it is a paid product.

The basic idea

First of all, our target website: the Android Market (apk.hiapk.com)

Click "Apply" to go to our key page:

After jumping to the application page, we need to focus on three places, marked with red boxes:

Focus on the URL in the address bar, the free-download button, and the paging controls at the bottom. Clicking the "Free download" button immediately downloads the corresponding app, so our idea is to capture the link behind that click, which lets us download the app directly.

Writing the crawler

The first problem to solve: how do we get that download link? Here we have to introduce the basic principle of how a browser displays a web page. Simply put, the browser is a parser-like tool: it fetches HTML and other code and parses and renders it according to the corresponding rules, so that we see the finished page.

Here I am using Chrome. Right-click on the page and choose "Inspect" to see the original HTML code:

Don't worry when you see the dazzling HTML code: the Chrome inspector has a handy little feature that helps us locate the HTML code corresponding to any control on the page.

Locating an element:

As shown, click the small arrow in the rectangle at the top, then click the element on the page, and the corresponding HTML code on the right is automatically located and highlighted.

Next we navigate to the HTML code corresponding to the download button:

You can see that the button's code contains the relative download link "/appdown/com.tecent.mm"; prefixed with the site's domain, the full download link is http://apk.hiapk.com/appdown/com.tecent.mm
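
As a small aside, here is a minimal sketch (my own, not from the original article) of joining that relative href to the domain, using urljoin from Python 2's urlparse module:

from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

base = 'http://apk.hiapk.com'
href = '/appdown/com.tecent.mm'  # the relative link taken from the button's HTML
print(urljoin(base, href))       # -> http://apk.hiapk.com/appdown/com.tecent.mm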

First, use Python to fetch the HTML of the whole page. This is very simple: requests.get(url), where url is the address of the page you want.
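
For example, a minimal fetch sketch (assuming the Media & Video listing URL that the full script below also uses):

import requests

url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
html = requests.get(url).text  # the raw HTML of the listing page
print(html[:300])              # preview the beginning of the page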


Then, when extracting the key information from the page, follow the idea of "grab the big pieces first, then the small ones". You can see that one page contains 10 apps, which correspond to 10 items in the HTML code:

Each <li> tag contains one app's attributes (name, download link, and so on). So the first step is to extract the 10 <li> tags:


def geteveryapp(self, source):
    everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
    #everyapp2 = re.findall('(<p class="button_bg button_1 right_mt">.*?</p>)', everyapp, re.S)
    return everyapp


Simple regular expression knowledge is used here.
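
To illustrate, a toy example (my own, with a made-up <li> fragment) showing why the re.S flag matters: the list items span multiple lines, and re.S lets the dot match newlines:

import re

snippet = '''<li class="list_item">
  <a href="/appinfo/com.example.app">Example App</a>
</li>'''
# without re.S this pattern finds nothing, because '.' stops at newlines
print(re.findall('(<li class="list_item".*?</li>)', snippet, re.S))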

Then extract the download link from each <li> tag:


def getinfo(self, eachclass):
    info = {}
    str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
    app_url = re.search('"(.*?)"', str1).group(1)
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
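
For instance, applied to a hypothetical fragment (my own sample, not the site's actual markup), the detail link /appinfo/... becomes the download link /appdown/...:

import re

eachclass = '<a href="/appinfo/com.tecent.mm">WeChat</a>'  # made-up sample
str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
app_url = re.search('"(.*?)"', str1).group(1)
print(app_url.replace('appinfo', 'appdown'))  # -> /appdown/com.tecent.mm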


The next tricky point is paging. When we click the page buttons at the bottom of the page, we can see the address bar change as follows:

That makes it clear: in each request we can simply replace the page-index value (the pi parameter) in the URL to reach every page.


def changepage(self, url, total_page):
    now_page = int(re.search('pi=(\d)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # re.sub's fourth positional argument is count, not flags, so the
        # re.S passed here in the original did nothing useful; dropped
        link = re.sub('pi=\d', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group
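
A quick check of the idea (my own sketch, reusing the listing URL from the full script below):

import re

url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
now_page = int(re.search('pi=(\d)', url).group(1))
for i in range(now_page, 6):                # pages 1 through 5
    print(re.sub('pi=\d', 'pi=%s' % i, url))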


Crawler results

With the key parts done, let's first look at the finished crawler in action:

The results are saved to a txt file, which looks like this:

Copied directly into Thunder (Xunlei), these links can be batch-downloaded at high speed.

The full code


# -*- coding: utf-8 -*-
import requests
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 encoding workaround

class spider(object):
    def __init__(self):
        print u'start crawling content'

    def getsource(self, url):
        # fetch the raw HTML of a page
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        # build the URL list for pages now_page..total_page by rewriting pi=
        now_page = int(re.search('pi=(\d)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            link = re.sub('pi=\d', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        # grab the big pieces first: one <li> block per app
        everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        # then the small pieces: pull the href and turn appinfo into appdown
        info = {}
        str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
        app_url = re.search('"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info

    def saveinfo(self, classinfo):
        # append the full download links to info.txt
        f = open('info.txt', 'a')
        str2 = "http://apk.hiapk.com"
        for each in classinfo:
            f.write(str2)
            f.writelines(each['app_url'] + '\n')
        f.close()

if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
    appurl = spider()
    all_links = appurl.changepage(url, 5)
    for link in all_links:
        print u'processing page ' + link
        html = appurl.getsource(link)
        every_app = appurl.geteveryapp(html)
        for each in every_app:
            info = appurl.getinfo(each)
            appinfo.append(info)
    appurl.saveinfo(appinfo)


Summary

The target page chosen here has a relatively simple structure, so this is a fairly basic crawler. Please forgive the somewhat messy code. That is the whole content of this article; I hope it brings some help to your study or work, and if you have any questions, feel free to leave a message.
