Taobao Model (Tao Girl) Information Crawling Tutorial
Source Address: Cnsimo/mmtao
Website: https://0x9.me/xrh6z
How to determine whether a page is loaded via Ajax:
View the page source and look for the data that is displayed on the page. If the data does not appear in the source, the page loads it via Ajax.
If the page source does contain the information to be crawled, simply extract it with regular expressions.
If it does not, the data comes from an Ajax request: capture the network traffic to find the relevant API endpoint and get the data from there. The procedure is as follows (using the crawl of Taobao model information as an example):
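The check described above can be sketched in a few lines. This is only an illustration: the helper name `appears_in_source` and the two sample HTML strings are assumptions made for the example, not part of the original project.

```python
def appears_in_source(page_source: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is already present in the raw HTML."""
    return visible_text in page_source


# Hypothetical page sources, used only for illustration.
static_page = "<html><body><li>Model A</li></body></html>"
ajax_page = "<html><body><ul id='list'></ul><script src='app.js'></script></body></html>"

# Data present in the source: regular expressions on the HTML are enough.
print(appears_in_source(static_page, "Model A"))
# Data missing from the source: it is probably loaded later via Ajax.
print(appears_in_source(ajax_page, "Model A"))
```

In real use, `page_source` would be the text of the downloaded page, for example the `.text` attribute of a `requests.get(...)` response.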
First, basic
- Find the API endpoint that returns the list of models.
If you are using Chrome, filter the Network panel by XHR first to find the data API faster; if it is not listed under XHR, search through the JS requests.
- The URL of the API found is: https://mm.taobao.com/alive/list.do
After some testing, all of the parameters can be removed, and the page defaults to 1. To get every page, use a for loop to request the model list for each page separately.
- Then open a model's details page; the data we want to get is marked with red boxes.
- Open the developer tools and repeat the same packet-capture steps as before. Check XHR first to quickly find the API that returns the data; the address is easy to find:
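For the model list step above, the per-page loop might look like the following minimal sketch. The source does not say what the page parameter is called, so the name `page` and the helper `build_page_urls` are assumptions; only the list URL itself comes from the tutorial.

```python
from urllib.parse import urlencode

LIST_API = "https://mm.taobao.com/alive/list.do"


def build_page_urls(total_pages, param_name="page"):
    """Build one list-API URL per page.

    `param_name` is a guess for illustration, not a confirmed parameter name.
    """
    return [
        f"{LIST_API}?{urlencode({param_name: page})}"
        for page in range(1, total_pages + 1)
    ]


for url in build_page_urls(3):
    # Each URL could then be requested in turn, e.g. with requests.get(url).
    print(url)
```

Building the URLs separately from sending the requests keeps the loop easy to check before any traffic is sent to the site.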
Second, intermediate
Next, we crawl the data for all the models into a file:
- It is not hard to find that the address for getting the background data is: https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8
- However, the only GET parameter in this address is _input_charset=utf-8, and by default it returns the first page of the model list. Normally we would expect to see something like page=1 among the GET parameters, but there is nothing of the kind. The obvious guess is that the page number is sent by POST rather than GET, and this turns out to be true.
- So this is simple: use the Requests library to POST the request, save the returned JSON data into a table, and the work is done.
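The difference between the two request styles can be shown without sending anything. The sketch below uses only the standard library to prepare a POST request; the field name `currentPage` is taken from the payload in the main program (its exact casing is an assumption), and `build_list_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8"


def build_list_request(page):
    """Prepare (but do not send) a POST request asking for one page of the list."""
    # The page number travels in the request body, not in the query string.
    body = urlencode({"currentPage": page}).encode("utf-8")
    return Request(API_URL, data=body)


req = build_list_request(2)
print(req.get_method())  # the method becomes POST because a body is attached
print(req.full_url)      # the query string still holds only _input_charset
print(req.data)          # the form-encoded body carrying the page number
```

This is why no page=1 style item shows up in the address bar or in the GET parameters: the paging information is only visible in the request body shown by the developer tools.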
The code is as follows:
- headers.py ----------- This file holds common request header information (note that the main program below imports it as myheaders)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date: 2018-02-02 19:40:50
# @Author: Cnsimo
# @Link: http://www.scriptboy.com
# @Version: 1.0

import random

# One User-Agent string per line
uastr = """Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)
MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"""


def getua():
    # Split the block into a list, skipping blank lines, and pick one at random
    ualist = [ua for ua in uastr.split('\n') if ua.strip()]
    length = len(ualist)
    return ualist[random.randint(0, length - 1)]


if __name__ == '__main__':
    print(getua())
- mmtao.py ------------Main program
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date: 2018-02-02 23:11:08
# @Author: Cnsimo
# @Link: http://www.scriptboy.com
# @Version: 1.0

from myheaders import getua
import requests
import re
import time
import csv

mmListUrl = 'https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8'
mmUrl = ''


# Get the total number of pages
def getTotalPage():
    headers = {'User-Agent': getua()}
    req = requests.get(mmListUrl, headers=headers)
    res = req.json()
    return res['data']['totalPage']


# Get the model list for one page
def getMMList(cpage=1):
    headers = {'User-Agent': getua()}
    # NOTE: the original pageSize value was lost in extraction; 100 is an assumption.
    # The camelCase key names are restored from a garbled listing.
    payload = {'currentPage': cpage, 'pageSize': 100,
               'sortType': 'default', 'viewFlag': 'A'}
    req = requests.post(mmListUrl, headers=headers, data=payload)
    res = req.json()
    if 'data' in res.keys():
        return res['data']['searchDOList']
    else:
        return None


if __name__ == '__main__':
    totalpage = getTotalPage()
    with open(r'mmlist.csv', 'w+', newline='') as fs:
        count = 1
        cpage = 1
        csvwriter = csv.writer(fs, dialect='excel')

        # The first page also supplies the CSV header row
        page1 = getMMList(cpage)
        csvwriter.writerow(page1[0].keys())
        print('Processing page %s ...' % cpage)
        for mm in page1:
            csvwriter.writerow(mm.values())
            print(str(count) + ' ', end='')
            count += 1
        print()

        # Remaining pages, pausing between requests
        while cpage < totalpage:
            cpage += 1
            print('Processing page %s ...' % cpage)
            time.sleep(2)
            mmlist = getMMList(cpage)
            if not mmlist:
                break
            for mm in mmlist:
                csvwriter.writerow(mm.values())
                print(str(count) + ' ', end='')
                count += 1
            print()

    print('All data processing finished!')
The exported data is as follows:
Third, advanced
Although we now have the data, the descriptions of the models are not specific enough. More detailed data can be obtained through each model's card, for example: https://mm.taobao.com/self/model_info.htm?spm=719.7800510.a312r.22.bkq7m9&user_id=277949921
The information there is more comprehensive, so we take only the model ID from the list and then get the detailed information through the model card.
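Turning list entries into model card addresses could be sketched as follows. The key name `userId` and the sample rows are assumptions for illustration, since the tutorial does not show the fields of a list entry; the URL pattern is the one quoted above, without the spm tracking parameter.

```python
CARD_URL = "https://mm.taobao.com/self/model_info.htm?user_id={user_id}"


def card_urls(model_list, id_key="userId"):
    """Build a model card URL for every list entry that carries an ID.

    `id_key` is a hypothetical field name; adjust it to the real JSON key.
    """
    return [
        CARD_URL.format(user_id=model[id_key])
        for model in model_list
        if id_key in model
    ]


# Hypothetical list entries shaped like one page of list data.
sample = [{"userId": 277949921, "realName": "example"}, {"realName": "no id"}]
print(card_urls(sample))
```

Entries without an ID are skipped rather than raising an error, so one malformed row does not stop the whole crawl.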
- First analyze the model card page, again with the developer tools. We can easily find that the URL for getting the data is: https://mm.taobao.com/self/info/model_info_show.htm?user_id=277949921
- The data in this response is not structured, but that does not matter: we can use regular expressions to match the information.
- Compared with the program we just wrote, this only adds the step of analyzing the model card, so the code can be written quickly.
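As a rough illustration of the regular-expression step, the sketch below pulls label/value pairs out of an HTML fragment. The markup in `sample_html` is invented for the example and will not match the real response exactly; the pattern would need to be adapted to the actual page, and mmtao_plus.py may well do this differently.

```python
import re

# Hypothetical fragment standing in for the unstructured response.
sample_html = """
<ul class="info">
  <li><label>Nickname:</label><span>example_user</span></li>
  <li><label>Height:</label><span>170CM</span></li>
</ul>
"""

# Capture the label text before the colon and the value inside the span.
FIELD_RE = re.compile(r"<label>(.*?):</label>\s*<span>(.*?)</span>", re.S)


def parse_fields(html):
    """Return a dict mapping each label to its value."""
    return {label.strip(): value.strip() for label, value in FIELD_RE.findall(html)}


print(parse_fields(sample_html))
```

The resulting dict can be merged with the list entry for the same model before the row is written to the CSV file.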
For the code, see mmtao_plus.py in the compressed archive.