Crawler Re-exploration (V) ——— Crawling App Data from Super Curriculum (Part One)


When I first learned about crawlers, I thought they could only scrape web pages; later I realized that app data can be crawled too. So I spent about two weeks of spare time at school implementing the data capture and some simple data analysis.

The goal: crawl 20,000 posts from students of XX University (which is, in fact, my own university...) on the Super Curriculum app. The idea is as follows:

    STEP1: Find the entry point for our crawler

    The app also requests its data over the network, so we can capture packets to locate the entry point; here I use Fiddler. Refer to this article for how to set up your phone and Fiddler.
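    As a quick sanity check once Fiddler is capturing, you can also route the crawler's own requests through it and compare them side by side with the app's traffic. A small sketch, assuming Fiddler is listening on its default port 8888 on the same machine:

    import requests

    # send this script's traffic through the local Fiddler proxy so the
    # request shows up in Fiddler next to the requests made by the app
    proxies = {'http': 'http://127.0.0.1:8888'}
    resp = requests.get('http://120.55.151.61', proxies=proxies, timeout=10)
    print(resp.status_code)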

The login entry found is: http://120.55.151.61/V2/StudentSkip/loginCheckV4.action

    STEP2: Log in

    Logging in here works on the same principle as simulating a web login: POST the form data to the login entry. It is even simpler in this case, because the Fiddler capture gives you the POST data directly (to keep the train of thought clear I will skip the packet analysis here and catch up on that part later...), so there is no need to build the form yourself; see the code below. Once the login succeeds, the server returns JSON data containing your personal information.
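    To make this concrete, here is a minimal sketch of the login step on its own. It assumes the url-encoded form body and the headers have been copied from your own Fiddler capture; the values below are placeholders, not working credentials.

    import requests

    session = requests.Session()  # a Session keeps the cookies for the later requests

    login_url = 'http://120.55.151.61/V2/StudentSkip/loginCheckV4.action'
    # paste the form body captured by Fiddler here (placeholder values)
    login_data = ('account=<captured>&password=<captured>&deviceCode=<captured>'
                  '&phoneBrand=xiaomi&platform=1&phoneVersion=19&versionNumber=7.4.1&')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; Charset=UTF-8',
        'User-Agent': 'Dalvik/1.6.1 (Linux; U; Android 4.4.4; HM 2A MIUI/V7.3.1.0.KHLCNDD)',
    }

    resp = session.post(login_url, data=login_data, headers=headers)
    print(resp.text)  # on success this is your personal information as JSON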

    STEP3: Locate the data source

    What we want to crawl is the post information: tap around in the phone app and use the Fiddler capture to locate it. The entry is http://120.55.151.61/Treehole/V4/Cave/getList.action, requested via POST, and the data also comes back as JSON. One important detail: to make continuous requests you have to carry a timestamp. Simply put, the timestamp returned by the nth request is applied to the (n+1)th request, and so on in a loop; see the code for details. As in my earlier posts, I used a simple recursive call to implement the looping crawl.
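    The same timestamp chaining can also be written as a plain loop instead of recursion, which is worth noting because a function that calls itself once per page can run into Python's default recursion limit (about 1,000 frames) on a long crawl. A minimal sketch of the iterative variant, reusing the session and headers from the login sketch above and assuming the response parses cleanly with resp.json() (the full code below works around an encoding quirk with eval instead):

    post_url = 'http://120.55.151.61/Treehole/V4/Cave/getList.action'
    # form body of the first page, taken from the Fiddler capture
    req_data = ('timestamp=146646454916&preMoodTimestap=1461641185406&phoneBrand=xiaomi'
                '&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')

    posts = []
    while len(posts) < 20000:
        data = session.post(post_url, data=req_data, headers=headers).json()['data']
        posts.extend(data['messageBOs'])
        # feed the timestamp returned by request n into request n+1
        req_data = ('timestamp=' + str(data['timestampLong'])
                    + '&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1'
                      '&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')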

    STEP4: Parse the JSON, clean the data, and save it

    Here I still used an Excel spreadsheet, although a database is recommended... Because the amount of data is large (this crawl includes the post content, so it is not small...), Excel lags badly when opening the file.
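    If you do go the database route, a minimal sketch with Python's built-in sqlite3 could replace the worksheet.write() calls; the table and column names here are my own and simply mirror the Excel columns used in the code below.

    import sqlite3

    conn = sqlite3.connect('kebiao_test.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS posts
                    (school TEXT, topic TEXT, gender TEXT,
                     date TEXT, time TEXT, content TEXT)''')

    def save_post(school, topic, gender, date, time_t, content):
        # one row per post; this file will not lag the way a huge Excel workbook does
        conn.execute('INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)',
                     (school, topic, gender, date, time_t, content))

    # call conn.commit() periodically and conn.close() when the crawl finishes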

With the ideas above, the crawling itself is just manual labor... On to the code. (Note: for privacy and security, logindata and the initial req_data in the code have been altered; replace them with the data from your own packet capture.)

import requests
import json
import re
import time
import simplejson
import xlsxwriter

workbook = xlsxwriter.Workbook('kebiao_test.xlsx')
worksheet = workbook.add_worksheet()
# column widths (a few of the original values were unreadable in the listing; these are guesses)
worksheet.set_column('A:A', 10)
worksheet.set_column('B:B', 10)
worksheet.set_column('C:C', 5)
worksheet.set_column('D:D', 10)
worksheet.set_column('E:E', 20)
worksheet.set_column('F:F', 50)
# header row
worksheet.write(0, 0, 'School')
worksheet.write(0, 1, 'Topic')
worksheet.write(0, 2, 'Gender')
worksheet.write(0, 3, 'Date')
worksheet.write(0, 4, 'Time')
worksheet.write(0, 5, 'Content')

# Login section
# use requests.Session() so the cookies are kept for later requests
s = requests.Session()
loginurl = 'http://120.55.151.61/V2/StudentSkip/loginCheckV4.action'
# form body captured with Fiddler (account/password altered; replace with your own capture)
logindata = 'phoneBrand=xiaomi&platform=1&deviceCode=867711022104024&account=6603135b883f9b40ded6374a22593&phoneVersion=19&password=98a09b5680df25a934acf9b3614af4ea&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; Charset=UTF-8',
    'User-Agent': 'Dalvik/1.6.1 (Linux; U; Android 4.4.4; HM 2A MIUI/V7.3.1.0.KHLCNDD)',
    'Host': '120.55.151.61',
    'Connection': 'Keep-Alive',
    'Accept-Encoding': 'gzip',
    'Content-Length': '213',
}

# submit the login information and log in
data = s.post(url=loginurl, data=logindata, headers=headers, stream=True, verify=False)
loginresult = data.text
# print the login response to check whether the login succeeded; it contains your personal information
print(loginresult)

# define a function that gets the information and saves it
# req_data is the form data to POST
def get_data(req_data):
    global i
    req_url = 'http://120.55.151.61/Treehole/V4/Cave/getList.action'
    data_r = s.post(url=req_url, data=req_data, headers=headers)
    data_r = data_r.text
    # the response spells the booleans in lowercase, so define them before eval()
    true = True
    false = False
    # crude JSON handling; parsing the text directly seemed to hit encoding problems, not entirely clear why...
    data_j = eval(data_r)
    data_js = json.dumps(data_j)
    data_dict = simplejson.loads(data_js)
    # get the timestamp for the next request
    data = data_dict['data']
    timestampLong = data['timestampLong']
    #print(timestampLong)
    messageBO = data['messageBOs']
    # walk through the posts on this page
    for each in messageBO:
        if 'studentBO' in each:
            print(each)
        # pick out the target fields and save them
        #topicDict = {}
        if each.get('content', False):
            schoolNamex = each['schoolName']
            worksheet.write(i, 0, schoolNamex)
            tag = each['moodTagBO']['moodTagName']
            worksheet.write(i, 1, tag)
            genderx = each['studentBO']['gender']
            worksheet.write(i, 2, genderx)
            contentx = each['content']
            #print(contentx)
            worksheet.write(i, 5, contentx)
            #topicDict['messageId'] = each['messageId']
            # issueTime is in milliseconds; drop the last three digits to get seconds
            time_f = list(time.localtime(int(str(each['issueTime'])[:-3])))
            time_s = str(time_f[0]) + '/' + str(time_f[1]) + '/' + str(time_f[2])
            h = time_f[3]
            m = time_f[4]
            sec = time_f[5]
            if h < 10:
                h = '0' + str(h)
            if m < 10:
                m = '0' + str(m)
            if sec < 10:
                sec = '0' + str(sec)
            time_t = str(h) + ':' + str(m) + ':' + str(sec)
            datex = time_s
            #print(datex)
            worksheet.write(i, 3, datex)
            worksheet.write(i, 4, time_t)
            i += 1
    # build the next request's form data from the returned timestamp, so the crawl keeps looping
    new_req_data = 'timestamp=' + str(timestampLong) + '&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&'
    print('--------------------------new page-------------------------------')
    #print(new_req_data)
    # limit on how many posts to crawl
    if i <= 20000:
        try:
            get_data(new_req_data)
        except:
            print('fail')
        finally:
            workbook.close()
    else:
        workbook.close()
        print('All crawled successfully!!!')

# i is the row index for the posts being written; row 0 already holds the headers
i = 1
# pass in the form data for the first page here, then the loop takes over
get_data(req_data='timestamp=146646454916&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')

 

    The console output of the code is as follows:

    The resulting Excel table data is as follows:

At this point the crawl is complete. But since I studied statistics after all, how could I not analyze the data? Next, I plan to do some simple visualization of it.
