Preparation
Python 2.7.11: Download Python
PyCharm: Download PyCharm
Python 2 and Python 3 are currently released side by side; I'm using Python 2 as the environment here. PyCharm is a fairly efficient Python IDE, but it is a paid product.
Basic approach
First, our target website: Android Market (apk.hiapk.com)
Click "Apps" to reach the page we care about:
After jumping to the app listing page, we need to pay attention to three places, marked with red boxes in the figure:
First look at the URL in the address bar, then at the "Free download" button, and finally at the paging controls at the bottom of the page. Clicking the "Free download" button immediately downloads the corresponding app, so the idea is to obtain the link behind that button; with that link we can download the app directly.
Writing the crawler
The first problem to solve: how do we get the download link mentioned above? To answer that, we need the basic principle of how a browser displays a web page. Put simply, a browser is a parser-like tool: it receives HTML and other code, parses and renders it according to the corresponding rules, and the result is the page we see.
Here I use Google Chrome. Right-click on the page and click "Inspect" to see the page's underlying HTML code:
Don't worry if the HTML looks overwhelming. Chrome's element inspector has a handy feature that helps us locate the HTML code corresponding to a control on the page:
As shown in the picture above, click the small arrow in the rectangular box at the top, then click the desired location on the page; the corresponding HTML code on the right is automatically located and highlighted.
Next we navigate to the HTML code for the download button:
You can see that the button's code contains the download link "/appdown/com.tecent.mm". With the site prefix added, the complete download link is http://apk.hiapk.com/appdown/com.tecent.mm
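As a minimal sketch of the "plus the prefix" step, the relative link can be joined onto the site address with the standard library's urljoin. The function name full_download_url and the constant BASE_URL are illustrative names, not part of the original code, which concatenates the strings by hand.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Site prefix taken from the article; the constant name is an assumption.
BASE_URL = "http://apk.hiapk.com"


def full_download_url(href):
    """Join a site-relative href onto the site prefix."""
    return urljoin(BASE_URL, href)


print(full_download_url("/appdown/com.tecent.mm"))
```

Plain string concatenation gives the same result for this link; urljoin is simply more forgiving if the href ever turns out to be an absolute URL.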
First, use Python to fetch the HTML of the entire page. This is very simple: call "requests.get(url)", passing the address of the page as url.
Then, when extracting the key information from the page, take a "first the big pieces, then the small ones" approach. You can see that one page lists 10 apps, which correspond to 10 li items in the HTML code:
Each li tag contains the properties of one app (name, download link, etc.). So as a first step, we extract these 10 li tags:
def geteveryapp(self, source):
    # re.S lets "." match newlines, so each multi-line <li> item is captured whole
    everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
    #everyapp2 = re.findall('(<div class="button_bg button_1 right_mt">.*?</div>)', everyapp, re.S)
    return everyapp
This relies on some basic regular-expression knowledge.
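To illustrate the two regular-expression features this step depends on, here is a small sketch run against made-up markup. SAMPLE_HTML and the com.example package names are hypothetical stand-ins for the real page, whose exact structure is not reproduced here.

```python
import re

# Hypothetical, simplified markup standing in for the real list page.
SAMPLE_HTML = """
<ul>
<li class="list_item">
  <a href="/appinfo/com.example.one">One</a>
</li>
<li class="list_item">
  <a href="/appinfo/com.example.two">Two</a>
</li>
</ul>
"""

pattern = r'(<li class="list_item".*?</li>)'

# Without re.S, "." does not match newlines, so multi-line items are missed.
without_dotall = re.findall(pattern, SAMPLE_HTML)

# With re.S, "." also matches newlines; the non-greedy ".*?" stops at the
# first "</li>", so each item is returned separately.
with_dotall = re.findall(pattern, SAMPLE_HTML, re.S)

# A greedy ".*" runs on to the last "</li>" and merges both items into one.
greedy = re.findall(r'(<li class="list_item".*</li>)', SAMPLE_HTML, re.S)

print(len(without_dotall))  # 0
print(len(with_dotall))     # 2
print(len(greedy))          # 1
```

Regular expressions are adequate for a page this regular; for messier markup a real HTML parser is usually the safer choice.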
Extract the download link from each li tag:
def getinfo(self, eachclass):
    info = {}
    # group(0) is the whole matched <a href="..."> tag
    str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
    # pull the quoted href value out of that tag
    app_url = re.search(r'"(.*?)"', str1).group(1)
    # the detail-page path and the download path differ only in this segment
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
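The function above raises an AttributeError when an item contains no matching link, because re.search returns None in that case. The following is a hedged sketch of the same extraction with that case handled; extract_download_path and the sample markup are illustrative, not taken from the original code or the real page.

```python
import re


def extract_download_path(item_html):
    """Return the download path for one list item, or None if no link is found."""
    match = re.search(r'<a href="(.*?)"', item_html)
    if match is None:
        return None
    # Same substitution as in the article: detail path -> download path.
    return match.group(1).replace('appinfo', 'appdown')


# Hypothetical list item standing in for one <li> from the real page.
sample_item = '<li class="list_item"><a href="/appinfo/com.example.one">One</a></li>'
print(extract_download_path(sample_item))
```

Capturing the href with group(1) in a single search also avoids the second regular expression used in the original.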
The next difficulty is paging. When we click a page button at the bottom, the address bar changes as follows:
This makes things clear: to turn pages, we only need to replace the pi value in the URL on each request.
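def changepage(self, url, total_page):
    # \d+ rather than \d, so page numbers with more than one digit also work
    now_page = int(re.search(r'pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # re.sub's fourth positional argument is count, not flags, so no re.S here
        link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group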
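Two details of the paging step are easy to get wrong: a pattern of a single \d only matches one-digit page numbers, and in re.sub the fourth positional argument is count rather than flags, so a flag passed there is interpreted as a replacement limit. A standalone sketch that avoids both, with page_links as an illustrative name:

```python
import re


def page_links(url, total_page):
    """Build the URLs from the page number found in url up to total_page."""
    match = re.search(r'pi=(\d+)', url)
    if match is None:
        raise ValueError('no pi= page parameter in %r' % url)
    now_page = int(match.group(1))
    return [re.sub(r'pi=\d+', 'pi=%d' % page, url)
            for page in range(now_page, total_page + 1)]


# Start at page 9 to show that two-digit page numbers are handled.
links = page_links('http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=9', 11)
for link in links:
    print(link)
```

For the five pages crawled in this article the single-digit pattern happens to work, which is why the problem is easy to overlook.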
Crawler results
With the key parts covered, let's look at the final crawler in action:
The results are saved to a TXT file, as follows:
The links can be copied straight into Xunlei (Thunder) for high-speed batch downloading.
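The saving step can be sketched as follows: one complete link per line, appended to a text file. save_links is a hypothetical helper and the com.example paths are made up; a with block is used so the file is closed even if a write fails. The demo writes into a temporary directory so it does not touch any existing info.txt.

```python
import os
import tempfile

# Site prefix from the article; the constant name is an assumption.
BASE_URL = "http://apk.hiapk.com"


def save_links(paths, filename):
    """Append one full download link per line to filename."""
    with open(filename, 'a') as f:
        for path in paths:
            f.write(BASE_URL + path + '\n')


# Demo with made-up paths, written to a throwaway directory.
demo_file = os.path.join(tempfile.mkdtemp(), 'info.txt')
save_links(['/appdown/com.example.one', '/appdown/com.example.two'], demo_file)

with open(demo_file) as f:
    saved = f.read().splitlines()
print(saved)
```

Because the file is opened in append mode, running the crawler twice adds the same links again; delete the file first if a clean list is wanted.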
Full code
# -*- coding: utf-8 -*-
import requests
import re
import sys

# Python 2 only: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding("utf-8")


class spider(object):
    def __init__(self):
        print u'start crawling content'

    def getsource(self, url):
        # fetch the page and return its HTML as text
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        # build one URL per page, from the page in url up to total_page
        now_page = int(re.search(r'pi=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        # one <li class="list_item"> block per app
        everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        info = {}
        str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
        app_url = re.search(r'"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info

    def saveinfo(self, classinfo):
        # append one full download link per line
        f = open('info.txt', 'a')
        str2 = "http://apk.hiapk.com"
        for each in classinfo:
            f.write(str2)
            f.writelines(each['app_url'] + '\n')
        f.close()


if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
    appurl = spider()
    all_links = appurl.changepage(url, 5)
    for link in all_links:
        print u'processing page ' + link
        html = appurl.getsource(link)
        every_app = appurl.geteveryapp(html)
        for each in every_app:
            info = appurl.getinfo(each)
            appinfo.append(info)
    appurl.saveinfo(appinfo)
Summary
The target page chosen here has a relatively clear and simple structure, so this is a fairly basic crawler. Please forgive the somewhat messy code. That is the entire content of this article; I hope it is of some help in your study or work. If you have questions, leave a message and we can discuss.