Mmtao
Ajax data crawling (using Taobao models as an example)
If in doubt, go to the Wiki.
Taobao Model Crawling Tutorial
Website: https://0x9.me/xrh6z
How to tell whether a page is loaded via Ajax:
View the page source and look for the data shown on the page; if it does not appear in the source code, the page is loaded via Ajax.
If the site's source code does contain the information to be crawled, a regular expression is enough to extract the data.
But if the data is not in the page source, it comes via Ajax, and you can capture the network traffic to find the relevant interface and fetch the data from it directly. The steps are as follows (crawling the model information as an example):
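This source-code check can be sketched as a small helper; the sample HTML snippets below are placeholders, and in practice the `html` argument would come from `requests.get(url).text`:

```python
def is_ajax_loaded(html: str, needle: str) -> bool:
    """Return True when `needle` (text visible on the rendered page) is
    missing from the raw HTML source -- a strong hint that the page is
    filled in by Ajax and we should look for the API instead.
    `html` would normally be fetched with requests.get(url).text."""
    return needle not in html

# Data present in the raw source -> plain regex scraping is enough:
print(is_ajax_loaded('<ul><li>model A</li></ul>', 'model A'))  # False
# Data absent from the source -> capture traffic and find the Ajax API:
print(is_ajax_loaded('<ul id="model-list"></ul>', 'model A'))  # True
```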
Part 1: Basics
- Find the API interface that returns the model list.
If you are using Chrome, filter by XHR first to find the data API more quickly; if nothing relevant shows up under XHR, search under JS instead.
- The URL of the API turns out to be: https://mm.taobao.com/alive/list.do
After some experimentation, the query parameters can all be removed; the default page is 1, so to get every page you need a for loop that fetches the model list page by page.
- Then open a model's detail page; the red boxes mark all the data we want to extract.
- Open the developer tools and repeat the same capture steps as before: filter by XHR to quickly locate the API interface that returns the data, and the address is easy to find.
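The page-by-page fetching described above can be sketched as follows. Note the parameter name `page` is an assumption based on the capture (the article only says the default page is 1):

```python
LIST_URL = 'https://mm.taobao.com/alive/list.do'

def build_page_urls(total_pages: int):
    """Build one list-API URL per page. With no parameters the interface
    returns page 1 by default; `page` as the parameter name is assumed."""
    return ['%s?page=%d' % (LIST_URL, n) for n in range(1, total_pages + 1)]

# Each URL would then be fetched in turn, e.g. requests.get(url).json(),
# with a short sleep between requests to be polite to the server.
print(build_page_urls(3))
```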
Part 2: Intermediate
Below we crawl all the model data into a file:
- It is not hard to find that the address serving the background data is: https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8
- However, the only GET parameter in that address is _input_charset=utf-8, and by default it returns the first page of the model list. Normally we would expect something like page=1 among the GET parameters, but there is nothing of the kind, so evidently the paging is done with POST rather than GET, and that turns out to be the case.
- So this part is simple: use the Requests library to POST the paging parameters and write the returned JSON data into a table, and the job is done.
The code is as follows:
A. myheaders.py ---- this file holds some common User-Agent header strings
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date:    2018-02-02 19:40:50
# @Author:  Cnsimo ([email protected])
# @Link:    http://www.scriptboy.com
# @Version: 1.0
import random

# One User-Agent string per line (reconstructed from the original list).
uastr = """Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)
MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"""

def getUA():
    """Return a randomly chosen User-Agent string from the list above."""
    ualist = uastr.split('\n')
    return ualist[random.randint(0, len(ualist) - 1)]

if __name__ == '__main__':
    print(getUA())
```
B. mmtao.py ---- the main program
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date:    2018-02-02 23:11:08
# @Author:  Cnsimo ([email protected])
# @Link:    http://www.scriptboy.com
# @Version: 1.0
from myheaders import getUA
import requests
import time
import csv

mmListUrl = 'https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8'

# Get the total number of pages
def getTotalPage():
    headers = {'User-Agent': getUA()}
    req = requests.get(mmListUrl, headers=headers)
    res = req.json()
    return res['data']['totalPage']

# Get the model list for one page
def getMMList(cPage=1):
    headers = {'User-Agent': getUA()}
    # Note: the pageSize value was garbled in the original; 100 is assumed.
    payload = {'currentPage': cPage, 'pageSize': 100,
               'sortType': 'default', 'viewFlag': 'A'}
    req = requests.post(mmListUrl, headers=headers, data=payload)
    res = req.json()
    if 'data' in res.keys():
        return res['data']['searchDOList']
    else:
        return

if __name__ == '__main__':
    totalPage = getTotalPage()
    with open(r'mmlist.csv', 'w+', newline='') as fs:
        count = 1
        cPage = 1
        csvWriter = csv.writer(fs, dialect='excel')
        page1 = getMMList(cPage)
        csvWriter.writerow(page1[0].keys())   # header row from the field names
        print('Processing page %s...' % cPage)
        for mm in page1:
            csvWriter.writerow(mm.values())
            print(str(count) + ' ', end='')
            count += 1
        print()
        while cPage < totalPage:
            cPage += 1
            print('Processing page %s...' % cPage)
            time.sleep(2)                     # be polite between requests
            mmList = getMMList(cPage)
            if not mmList:
                break
            for mm in mmList:
                csvWriter.writerow(mm.values())
                print(str(count) + ' ', end='')
                count += 1
            print()
        print('All data processed!')
```
The exported data is as follows:
Part 3: Advanced
Although the data is now coming out, the description of each model is not specific enough. More detailed data can be obtained through each model's profile card, for example: https://mm.taobao.com/self/model_info.htm?spm=719.7800510.a312r.22.bkq7m9&user_id=277949921
The information here is more comprehensive, so we only take the model ID from the list and then fetch the more detailed information through the profile card.
- Analyzing the profile card page first, again with the developer tools, we can easily find the URL that serves the data: https://mm.taobao.com/self/info/model_info_show.htm?user_id=277949921
- The data in this response is not formatted as JSON, but that does not matter: we can use regular expressions to match out the information.
- So compared with the program we just wrote, there are only a few extra analysis steps for the profile card, and the code can be written quickly.
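The regex matching on the unformatted response might look like the sketch below. Since the actual markup of model_info_show.htm is not reproduced here, the sample fragment and its field labels are hypothetical stand-ins:

```python
import re

# Hypothetical fragment in the loose HTML style such interfaces return;
# the real tags and labels may differ.
sample = '''
<li><span>height</span><p>170CM</p></li>
<li><span>weight</span><p>48KG</p></li>
'''

def extract_pairs(text: str) -> dict:
    """Pull label/value pairs out of the unformatted response with a
    non-greedy regex, returning them as a dict."""
    return dict(re.findall(r'<span>(.*?)</span>\s*<p>(.*?)</p>', text))

print(extract_pairs(sample))  # {'height': '170CM', 'weight': '48KG'}
```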
Part of the data:
For the code, see mmtao_plus.py; if in doubt, go to the Wiki.