How to crawl app download links with Python

Source: Internet
Author: User
Tags: appdown

First of all, prep work.

Python 2.7.11: Download Python

Pycharm: Download Pycharm

Python 2 and Python 3 are both in circulation at the moment; I'm using Python 2 as the environment here. PyCharm is a fairly efficient Python IDE, but it is paid software.

The basic idea

First of all, our target website: the Android Market (HiAPK).

Click "Apps" to reach the key page:

After landing on the app listing page, we need to focus on three places, marked with red boxes in the figure below:

First note the URL in the address bar, then the "Free download" button, and finally the paging controls at the bottom. Clicking the "Free download" button immediately downloads the corresponding app, so the idea is to capture the link behind that click; with it, we can download the app directly.

Writing the crawler

The first question to address: how do we get the download link mentioned above? This calls for a word on how browsers display web pages. Simply put, a browser is something like a parser: it fetches the HTML and other code, then parses and renders it according to the corresponding rules, producing the page we see.

Here I use Chrome. Right-click anywhere on the page and choose "Inspect" to see the page's original HTML code:

Don't worry about the dazzling wall of HTML. Chrome's element inspector has a handy little feature that helps us locate the HTML code corresponding to any control on the page.

Locating an element:

As shown in the picture above, click the small arrow in the box at the top, then click the element on the page; the corresponding HTML code on the right is automatically located and highlighted.

Next we navigate to the HTML code for the download button:

You can see the button's code and the download link inside it: "/appdown/com.tecent.mm". Adding the site prefix gives the complete download link: http://apk.hiapk.com/appdown/com.tecent.mm

First, use Python to fetch the HTML of the entire page. This is easy with requests.get(url), where url is the page's address.


Then, when extracting the page's key information, take a "grab the big pieces first, then the small" approach. You can see that each page contains 10 apps, corresponding to 10 <li> items in the HTML code:

Each <li> tag contains one app's attributes (name, download link, and so on). So as a first step, we extract the 10 <li> blocks:

def geteveryapp(self, source):
    # re.S makes "." also match newlines, so each <li>...</li> block is captured whole
    everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
    return everyapp
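As a minimal, self-contained illustration of the "big pieces first" idea, the following sketch runs the same kind of pattern against a made-up HTML snippet (the class name and package names are hypothetical; the real page's markup may differ or have changed since this was written). It is written in Python 3 syntax:

```python
import re

# Hypothetical snippet mimicking the listing page's structure
sample = '''
<ul>
<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>
<li class="list_item"><a href="/appinfo/com.example.two">App Two</a></li>
</ul>
'''

# Grab the big blocks first: one match per <li class="list_item">...</li>
blocks = re.findall(r'(<li class="list_item".*?</li>)', sample, re.S)
print(len(blocks))  # 2
```

Each element of blocks is a complete <li>...</li> string, ready to be mined for the smaller pieces in the next step.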

Only basic regular-expression knowledge is needed here: a non-greedy .*? together with the re.S flag.

Next, extract the download link from each <li> block:

def getinfo(self, eachclass):
    info = {}
    # Grab the whole <a href="..."> tag, then pull out the quoted URL
    str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
    app_url = re.search('"(.*?)"', str1).group(1)
    # Turning the detail-page path into the download path
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
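On one of those blocks, the extraction plus replacement step can be sketched like this (again using a made-up block; "com.example.one" is hypothetical, and the sketch uses Python 3 syntax):

```python
import re

# A hypothetical <li> block as returned by the previous step
block = '<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>'

# Pull out the quoted href value, then swap the detail-page path for the download path
app_url = re.search(r'<a href="(.*?)"', block).group(1)
appdown_url = app_url.replace('appinfo', 'appdown')
print(appdown_url)  # /appdown/com.example.one
```

Prefixing the site's domain to that path yields the direct download URL, exactly as described above.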

The next tricky part is paging. Clicking the page buttons at the bottom, you can see the address bar change as follows:

So for each request we can simply substitute the corresponding page number (the pi parameter) into the URL to turn the page:

def changepage(self, url, total_page):
    now_page = int(re.search('pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # Replace the pi= page number to build each page's URL
        link = re.sub('pi=\d+', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group
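A quick sanity check of the substitution, using the article's listing URL (whether those pages still exist on the live site is of course not guaranteed; the sketch uses Python 3 syntax):

```python
import re

url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'

# Substitute the pi= page number to generate the URL of each page
links = [re.sub(r'pi=\d+', 'pi=%s' % i, url) for i in range(1, 4)]
for link in links:
    print(link)
```

This prints the same URL with pi=1, pi=2, and pi=3, which is all the paging logic needs.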

Crawler results

With the key parts covered, let's look at the finished crawler in action:

The results saved to a txt file look like this:

Copy them straight into Thunder (Xunlei) for batch high-speed downloading.

Full code

# -*- coding: utf8 -*-
import requests
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

class spider(object):
    def __init__(self):
        print u'start crawling content'

    def getsource(self, url):
        # Fetch the raw HTML of one listing page
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        now_page = int(re.search('pi=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # Replace the pi= page number to build each page's URL
            link = re.sub('pi=\d+', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        # re.S makes "." also match newlines, capturing each <li> block whole
        everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        info = {}
        str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
        app_url = re.search('"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info

    def saveinfo(self, classinfo):
        f = open('info.txt', 'a')
        str2 = "http://apk.hiapk.com"
        for each in classinfo:
            f.write(str2)
            f.writelines(each['app_url'] + '\n')
        f.close()

if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
    appurl = spider()
    all_links = appurl.changepage(url, 5)
    for link in all_links:
        print u'processing page ' + link
        html = appurl.getsource(link)
        every_app = appurl.geteveryapp(html)
        for each in every_app:
            info = appurl.getinfo(each)
            appinfo.append(info)
    appurl.saveinfo(appinfo)

Summary

The chosen target page has a fairly clear and simple structure, so this is a relatively basic crawler. Please forgive the somewhat messy code. That's the whole of this article; I hope it brings some help to your study or work. If you have questions, feel free to leave a message.
