Mmtao
Ajax data crawling (using Taobao models as an example)
If in doubt, go to the Wiki.
Taobao Model Crawling Tutorial
Website: https://0x9.me/xrh6z
How to tell whether a page is loaded via Ajax:
View the page source and look for the data shown on the page; if it does not appear in the source code, the page is loaded via Ajax.
If the site's source code does contain the information to be crawled, a regular expression is enough to extract the data.
But if the data is not in the page source, it comes via Ajax, and you can capture the network traffic to find the relevant interface and fetch the data from it directly. The steps are as follows (crawling the model information as an example):
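This source-code check can be sketched as a small helper; the sample HTML snippets below are placeholders, and in practice the `html` argument would come from `requests.get(url).text`:

```python
def is_ajax_loaded(html: str, needle: str) -> bool:
    """Return True when `needle` (text visible on the rendered page) is
    missing from the raw HTML source -- a strong hint that the page is
    filled in by Ajax and we should look for the API instead.
    `html` would normally be fetched with requests.get(url).text."""
    return needle not in html

# Data present in the raw source -> plain regex scraping is enough:
print(is_ajax_loaded('<ul><li>model A</li></ul>', 'model A'))  # False
# Data absent from the source -> capture traffic and find the Ajax API:
print(is_ajax_loaded('<ul id="model-list"></ul>', 'model A'))  # True
```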
Part 1: Basics
- Find the API interface that returns the model list.
If you are using Chrome, filter by XHR first to find the data API more quickly; if nothing relevant shows up under XHR, search under JS instead.
- The URL of the API turns out to be: https://mm.taobao.com/alive/list.do
After some experimentation, the query parameters can all be removed; the default page is 1, so to get every page you need a for loop that fetches the model list page by page.
- Then open a model's detail page; the red boxes mark all the data we want to extract.
- Open the developer tools and repeat the same capture steps as before: filter by XHR to quickly locate the API interface that returns the data, and the address is easy to find.
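The page-by-page fetching described above can be sketched as follows. Note the parameter name `page` is an assumption based on the capture (the article only says the default page is 1):

```python
LIST_URL = 'https://mm.taobao.com/alive/list.do'

def build_page_urls(total_pages: int):
    """Build one list-API URL per page. With no parameters the interface
    returns page 1 by default; `page` as the parameter name is assumed."""
    return ['%s?page=%d' % (LIST_URL, n) for n in range(1, total_pages + 1)]

# Each URL would then be fetched in turn, e.g. requests.get(url).json(),
# with a short sleep between requests to be polite to the server.
print(build_page_urls(3))
```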
Part 2: Intermediate
Below we crawl all the model data into a file:
- It is not hard to find that the address serving the background data is: https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8
- However, the only GET parameter in that address is _input_charset=utf-8, and by default it returns the first page of the model list. Normally we would expect something like page=1 among the GET parameters, but there is nothing of the kind, so evidently the paging is done with POST rather than GET, and that turns out to be the case.
- So this part is simple: use the Requests library to POST the paging parameters and write the returned JSON data into a table, and the job is done.
The code is as follows:
A. myheaders.py ---- this file holds some common User-Agent header strings
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date:    2018-02-02 19:40:50
# @Author:  Cnsimo ([email protected])
# @Link:    http://www.scriptboy.com
# @Version: 1.0
import random

# One User-Agent string per line (reconstructed from the original list).
uastr = """Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)
MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"""

def getUA():
    """Return a randomly chosen User-Agent string from the list above."""
    ualist = uastr.split('\n')
    return ualist[random.randint(0, len(ualist) - 1)]

if __name__ == '__main__':
    print(getUA())
```
B. mmtao.py ---- the main program
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date:    2018-02-02 23:11:08
# @Author:  Cnsimo ([email protected])
# @Link:    http://www.scriptboy.com
# @Version: 1.0
from myheaders import getUA
import requests
import time
import csv

mmListUrl = 'https://mm.taobao.com/tstar/search/tstar_model.do?_input_charset=utf-8'

# Get the total number of pages
def getTotalPage():
    headers = {'User-Agent': getUA()}
    req = requests.get(mmListUrl, headers=headers)
    res = req.json()
    return res['data']['totalPage']

# Get the model list for one page
def getMMList(cPage=1):
    headers = {'User-Agent': getUA()}
    # Note: the pageSize value was garbled in the original; 100 is assumed.
    payload = {'currentPage': cPage, 'pageSize': 100,
               'sortType': 'default', 'viewFlag': 'A'}
    req = requests.post(mmListUrl, headers=headers, data=payload)
    res = req.json()
    if 'data' in res.keys():
        return res['data']['searchDOList']
    else:
        return

if __name__ == '__main__':
    totalPage = getTotalPage()
    with open(r'mmlist.csv', 'w+', newline='') as fs:
        count = 1
        cPage = 1
        csvWriter = csv.writer(fs, dialect='excel')
        page1 = getMMList(cPage)
        csvWriter.writerow(page1[0].keys())   # header row from the field names
        print('Processing page %s...' % cPage)
        for mm in page1:
            csvWriter.writerow(mm.values())
            print(str(count) + ' ', end='')
            count += 1
        print()
        while cPage < totalPage:
            cPage += 1
            print('Processing page %s...' % cPage)
            time.sleep(2)                     # be polite between requests
            mmList = getMMList(cPage)
            if not mmList:
                break
            for mm in mmList:
                csvWriter.writerow(mm.values())
                print(str(count) + ' ', end='')
                count += 1
            print()
        print('All data processed!')
```
The exported data is as follows:
Part 3: Advanced
Although the data is now coming out, the description of each model is not specific enough. More detailed data can be obtained through each model's profile card, for example: https://mm.taobao.com/self/model_info.htm?spm=719.7800510.a312r.22.bkq7m9&user_id=277949921
The information here is more comprehensive, so we only take the model ID from the list and then fetch the more detailed information through the profile card.
- Analyzing the profile card page first, again with the developer tools, we can easily find the URL that serves the data: https://mm.taobao.com/self/info/model_info_show.htm?user_id=277949921
- The data in this response is not formatted as JSON, but that does not matter: we can use regular expressions to match out the information.
- So compared with the program we just wrote, there are only a few extra analysis steps for the profile card, and the code can be written quickly.
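The regex matching on the unformatted response might look like the sketch below. Since the actual markup of model_info_show.htm is not reproduced here, the sample fragment and its field labels are hypothetical stand-ins:

```python
import re

# Hypothetical fragment in the loose HTML style such interfaces return;
# the real tags and labels may differ.
sample = '''
<li><span>height</span><p>170CM</p></li>
<li><span>weight</span><p>48KG</p></li>
'''

def extract_pairs(text: str) -> dict:
    """Pull label/value pairs out of the unformatted response with a
    non-greedy regex, returning them as a dict."""
    return dict(re.findall(r'<span>(.*?)</span>\s*<p>(.*?)</p>', text))

print(extract_pairs(sample))  # {'height': '170CM', 'weight': '48KG'}
```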
Part of the data:
For the code, see mmtao_plus.py; if in doubt, go to the Wiki.