First of all, some prep work.
Python 2.7.11: Download Python
PyCharm: Download PyCharm
Python 2 and Python 3 are currently both released and maintained in parallel; I use Python 2 as the environment here. PyCharm is an efficient Python IDE, but it is paid software.
The basic idea
First of all, our target website: Android Market
Click the "Apps" tab to reach our key page:
After jumping to the app list page, we need to focus on three places, marked with red boxes:
Focus on the URL in the address bar, the "Free download" button, and the paging controls at the bottom. Clicking the "Free download" button immediately downloads the corresponding app, so our idea is to grab the download link behind that click, which lets us download the app directly.
Writing crawlers
The first problem to solve: how do we get that download link? Here we have to introduce the basic principle of how a browser displays a page. Simply put, a browser is a parser-like tool: it fetches HTML and other code and parses and renders it according to the corresponding rules, so that we see the page.
Here I am using Chrome. Right-click on the page and choose "Inspect" to see the original HTML code:
Don't worry about the dazzling wall of HTML. Chrome's element inspector has a handy little feature that helps us locate the HTML code corresponding to any control on the page.
As shown, click the small arrow in the upper part of the panel, then click a spot on the page, and the corresponding HTML code on the right is automatically located and highlighted.
Next we locate the HTML code corresponding to the download button:
You can see that the button's code contains the corresponding download link: "/appdown/com.tecent.mm". Prefixing it with the site root gives the full download link: http://apk.hiapk.com/appdown/com.tecent.mm
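As an aside, simple string concatenation works here, but the standard library can also join a site root and a relative link for us. A minimal sketch, using the link scraped above:

```python
try:
    from urlparse import urljoin       # Python 2
except ImportError:
    from urllib.parse import urljoin   # Python 3

# The relative link taken from the download button's HTML
relative_link = "/appdown/com.tecent.mm"

# Join it with the site root to get the complete download URL
full_link = urljoin("http://apk.hiapk.com", relative_link)
print(full_link)  # http://apk.hiapk.com/appdown/com.tecent.mm
```

Unlike naive concatenation, urljoin also handles trailing slashes on the base URL correctly.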
First use Python to fetch the entire page's HTML. This is very simple: call requests.get(url), with url filled in with the corresponding address.
Then, when extracting the key information from the page, follow the idea of "grab the big pieces first, then the small ones". You can see that a page listing 10 apps corresponds to 10 items in the HTML code:
Each li tag contains one app's attributes (name, download link, and so on). So the first step is to extract the 10 li tags:
def geteveryapp(self, source):
    # "Grab big first": pull out the whole <li> block for each app
    everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
    return everyapp
Only simple regular expression knowledge is needed here.
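To see what the regular expression does, here is a standalone sketch run on a made-up two-item snippet (the tag structure is assumed to match the real page):

```python
import re

# A tiny stand-in for the real page source: two list items
sample = '''
<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>
<li class="list_item"><a href="/appinfo/com.example.two">App Two</a></li>
'''

# re.S lets "." match newlines too, so a <li> block spanning several
# lines is still captured whole; ".*?" keeps each match non-greedy,
# stopping at the first closing </li> instead of the last
items = re.findall(r'(<li class="list_item".*?</li>)', sample, re.S)
print(len(items))  # 2
```

Without the non-greedy `?`, the pattern would swallow everything from the first `<li` to the last `</li>` as a single match.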
Then extract the download link from each li tag:
def getinfo(self, eachclass):
    info = {}
    # Grab the whole <a href="..."> fragment, then pull the quoted link out of it
    str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
    app_url = re.search(r'"(.*?)"', str1).group(1)
    # The detail-page path becomes the direct-download path
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
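Applied to a single hypothetical li block, the two-step search plus the appinfo-to-appdown replacement looks like this:

```python
import re

# A made-up <li> block in the same shape as the market page
eachclass = '<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>'

# First grab the whole <a href="..."> fragment...
str1 = re.search(r'<a href="(.*?)">', eachclass).group(0)
# ...then pull the quoted link out of it
app_url = re.search(r'"(.*?)"', str1).group(1)

# Swap the detail-page path for the direct-download path
appdown_url = app_url.replace('appinfo', 'appdown')
print(appdown_url)  # /appdown/com.example.one
```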
The next tricky point is paging. Clicking the page buttons at the bottom, we can see the address bar change as follows:
Suddenly it's clear: we can change the URL in each request, replacing the corresponding page-index value, to implement paging.
def changepage(self, url, total_page):
    # Read the current page index from the pi= parameter
    now_page = int(re.search(r'pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # Substitute the index to build each page's URL
        link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group
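A standalone sketch of the same idea, without the class wrapper, shows the URLs it generates (the starting URL is the one used later in the full code):

```python
import re

def changepage(url, total_page):
    # Read the current page index from the pi= parameter
    now_page = int(re.search(r'pi=(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        # Substitute the index to build each page's URL
        page_group.append(re.sub(r'pi=\d+', 'pi=%s' % i, url))
    return page_group

links = changepage('http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1', 3)
for link in links:
    print(link)
# ...?sort=5&pi=1
# ...?sort=5&pi=2
# ...?sort=5&pi=3
```

Note that `re.sub`'s fourth positional argument is a replacement count, not a flags field, so a flag like re.S must not be passed there.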
Crawler effects
With the key parts done, let's first look at the finished crawler in action:
The results are saved in a TXT file as follows:
Copied directly into Thunder (Xunlei), the links can be batch-downloaded at high speed.
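Saving is just writing one full URL per line. A minimal sketch, assuming the same info dicts the crawler builds (the two entries below are made up for illustration):

```python
# Made-up results in the same shape the crawler collects
site = "http://apk.hiapk.com"
appinfo = [{'app_url': '/appdown/com.example.one'},
           {'app_url': '/appdown/com.example.two'}]

# One complete download URL per line, ready to paste into a downloader
with open('info.txt', 'w') as f:
    for each in appinfo:
        f.write(site + each['app_url'] + '\n')
```

The crawler's own saveinfo method opens the file in append mode instead, so repeated runs accumulate links rather than overwrite them.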
The full code
#-*- coding:utf8 -*-
import requests
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

class spider(object):

    def __init__(self):
        print u'start crawling content'

    def getsource(self, url):
        # Fetch the raw HTML of one page
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        # Build the URL for every page from now_page up to total_page
        now_page = int(re.search(r'pi=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            link = re.sub(r'pi=\d+', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        # Pull out the whole <li> block for each app
        everyapp = re.findall(r'(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        # Extract the link from one <li> block and turn it into a download link
        info = {}
        str1 = str(re.search(r'<a href="(.*?)">', eachclass).group(0))
        app_url = re.search(r'"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info

    def saveinfo(self, classinfo):
        # Append one complete download URL per line
        f = open('info.txt', 'a')
        str2 = "http://apk.hiapk.com"
        for each in classinfo:
            f.write(str2)
            f.writelines(each['app_url'] + '\n')
        f.close()

if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
    appurl = spider()
    all_links = appurl.changepage(url, 5)
    for link in all_links:
        print u'processing page ' + link
        html = appurl.getsource(link)
        every_app = appurl.geteveryapp(html)
        for each in every_app:
            info = appurl.getinfo(each)
            appinfo.append(info)
    appurl.saveinfo(appinfo)
Summary
The target page chosen here has a relatively simple structure, so this is a fairly basic crawler. Please forgive the somewhat messy code. That is the whole content of this article; I hope it can be of some help in your study or work, and if you run into problems, feel free to leave a message to discuss.