Crawler Re-exploration (V) ——— Crawling App Data from Super Curriculum (Part One)


When I first learned about crawlers, I thought they could only scrape web pages; later I realized that app data can be crawled too. So I spent about two weeks of spare time at school implementing the data capture and some simple data analysis.

The goal: crawl 20,000 posts from students of XX University (which is, in fact, my own university...) on the Super Curriculum app. The idea is as follows:

    STEP1: Find the entry point for our crawler

    The app also requests its data over the network, so we can capture packets to locate the entry point; here I use Fiddler. Refer to this article for how to set up your phone and Fiddler.
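    As a quick sanity check once Fiddler is capturing, you can also route the crawler's own requests through it and compare them side by side with the app's traffic. A small sketch, assuming Fiddler is listening on its default port 8888 on the same machine:

    import requests

    # send this script's traffic through the local Fiddler proxy so the
    # request shows up in Fiddler next to the requests made by the app
    proxies = {'http': 'http://127.0.0.1:8888'}
    resp = requests.get('http://120.55.151.61', proxies=proxies, timeout=10)
    print(resp.status_code)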

The login entry found is: http://120.55.151.61/V2/StudentSkip/loginCheckV4.action

    STEP2: Log in

    Logging in here works on the same principle as simulating a web login: POST the form data to the login entry. It is even simpler in this case, because the Fiddler capture gives you the POST data directly (to keep the train of thought clear I will skip the packet analysis here and catch up on that part later...), so there is no need to build the form yourself; see the code below. Once the login succeeds, the server returns JSON data containing your personal information.
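    To make this concrete, here is a minimal sketch of the login step on its own. It assumes the url-encoded form body and the headers have been copied from your own Fiddler capture; the values below are placeholders, not working credentials.

    import requests

    session = requests.Session()  # a Session keeps the cookies for the later requests

    login_url = 'http://120.55.151.61/V2/StudentSkip/loginCheckV4.action'
    # paste the form body captured by Fiddler here (placeholder values)
    login_data = ('account=<captured>&password=<captured>&deviceCode=<captured>'
                  '&phoneBrand=xiaomi&platform=1&phoneVersion=19&versionNumber=7.4.1&')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; Charset=UTF-8',
        'User-Agent': 'Dalvik/1.6.1 (Linux; U; Android 4.4.4; HM 2A MIUI/V7.3.1.0.KHLCNDD)',
    }

    resp = session.post(login_url, data=login_data, headers=headers)
    print(resp.text)  # on success this is your personal information as JSON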

    STEP3: Locate the data source

    What we want to crawl is the post information: tap around in the phone app and use the Fiddler capture to locate it. The entry is http://120.55.151.61/Treehole/V4/Cave/getList.action, requested via POST, and the data also comes back as JSON. One important detail: to make continuous requests you have to carry a timestamp. Simply put, the timestamp returned by the nth request is applied to the (n+1)th request, and so on in a loop; see the code for details. As in my earlier posts, I used a simple recursive call to implement the looping crawl.
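    The same timestamp chaining can also be written as a plain loop instead of recursion, which is worth noting because a function that calls itself once per page can run into Python's default recursion limit (about 1,000 frames) on a long crawl. A minimal sketch of the iterative variant, reusing the session and headers from the login sketch above and assuming the response parses cleanly with resp.json() (the full code below works around an encoding quirk with eval instead):

    post_url = 'http://120.55.151.61/Treehole/V4/Cave/getList.action'
    # form body of the first page, taken from the Fiddler capture
    req_data = ('timestamp=146646454916&preMoodTimestap=1461641185406&phoneBrand=xiaomi'
                '&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')

    posts = []
    while len(posts) < 20000:
        data = session.post(post_url, data=req_data, headers=headers).json()['data']
        posts.extend(data['messageBOs'])
        # feed the timestamp returned by request n into request n+1
        req_data = ('timestamp=' + str(data['timestampLong'])
                    + '&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1'
                      '&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')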

    STEP4: Parse the JSON, clean the data, and save it

    Here I still used an Excel spreadsheet, although a database is recommended... Because the amount of data is large (this crawl includes the post content, so it is not small...), Excel lags badly when opening the file.
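    If you do go the database route, a minimal sketch with Python's built-in sqlite3 could replace the worksheet.write() calls; the table and column names here are my own and simply mirror the Excel columns used in the code below.

    import sqlite3

    conn = sqlite3.connect('kebiao_test.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS posts
                    (school TEXT, topic TEXT, gender TEXT,
                     date TEXT, time TEXT, content TEXT)''')

    def save_post(school, topic, gender, date, time_t, content):
        # one row per post; this file will not lag the way a huge Excel workbook does
        conn.execute('INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)',
                     (school, topic, gender, date, time_t, content))

    # call conn.commit() periodically and conn.close() when the crawl finishes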

With the ideas above, the crawling itself is just manual labor... On to the code. (Note: for privacy and security, logindata and the initial req_data in the code have been altered; replace them with the data from your own packet capture.)

import requests
import json
import re
import time
import simplejson
import xlsxwriter

workbook = xlsxwriter.Workbook('kebiao_test.xlsx')
worksheet = workbook.add_worksheet()
# column widths (a few of the original values were unreadable in the listing; these are guesses)
worksheet.set_column('A:A', 10)
worksheet.set_column('B:B', 10)
worksheet.set_column('C:C', 5)
worksheet.set_column('D:D', 10)
worksheet.set_column('E:E', 20)
worksheet.set_column('F:F', 50)
# header row
worksheet.write(0, 0, 'School')
worksheet.write(0, 1, 'Topic')
worksheet.write(0, 2, 'Gender')
worksheet.write(0, 3, 'Date')
worksheet.write(0, 4, 'Time')
worksheet.write(0, 5, 'Content')

# Login section
# use requests.Session() so the cookies are kept for later requests
s = requests.Session()
loginurl = 'http://120.55.151.61/V2/StudentSkip/loginCheckV4.action'
# form body captured with Fiddler (account/password altered; replace with your own capture)
logindata = 'phoneBrand=xiaomi&platform=1&deviceCode=867711022104024&account=6603135b883f9b40ded6374a22593&phoneVersion=19&password=98a09b5680df25a934acf9b3614af4ea&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; Charset=UTF-8',
    'User-Agent': 'Dalvik/1.6.1 (Linux; U; Android 4.4.4; HM 2A MIUI/V7.3.1.0.KHLCNDD)',
    'Host': '120.55.151.61',
    'Connection': 'Keep-Alive',
    'Accept-Encoding': 'gzip',
    'Content-Length': '213',
}

# submit the login information and log in
data = s.post(url=loginurl, data=logindata, headers=headers, stream=True, verify=False)
loginresult = data.text
# print the login response to check whether the login succeeded; it contains your personal information
print(loginresult)

# define a function that gets the information and saves it
# req_data is the form data to POST
def get_data(req_data):
    global i
    req_url = 'http://120.55.151.61/Treehole/V4/Cave/getList.action'
    data_r = s.post(url=req_url, data=req_data, headers=headers)
    data_r = data_r.text
    # the response spells the booleans in lowercase, so define them before eval()
    true = True
    false = False
    # crude JSON handling; parsing the text directly seemed to hit encoding problems, not entirely clear why...
    data_j = eval(data_r)
    data_js = json.dumps(data_j)
    data_dict = simplejson.loads(data_js)
    # get the timestamp for the next request
    data = data_dict['data']
    timestampLong = data['timestampLong']
    #print(timestampLong)
    messageBO = data['messageBOs']
    # walk through the posts on this page
    for each in messageBO:
        if 'studentBO' in each:
            print(each)
        # pick out the target fields and save them
        #topicDict = {}
        if each.get('content', False):
            schoolNamex = each['schoolName']
            worksheet.write(i, 0, schoolNamex)
            tag = each['moodTagBO']['moodTagName']
            worksheet.write(i, 1, tag)
            genderx = each['studentBO']['gender']
            worksheet.write(i, 2, genderx)
            contentx = each['content']
            #print(contentx)
            worksheet.write(i, 5, contentx)
            #topicDict['messageId'] = each['messageId']
            # issueTime is in milliseconds; drop the last three digits to get seconds
            time_f = list(time.localtime(int(str(each['issueTime'])[:-3])))
            time_s = str(time_f[0]) + '/' + str(time_f[1]) + '/' + str(time_f[2])
            h = time_f[3]
            m = time_f[4]
            sec = time_f[5]
            if h < 10:
                h = '0' + str(h)
            if m < 10:
                m = '0' + str(m)
            if sec < 10:
                sec = '0' + str(sec)
            time_t = str(h) + ':' + str(m) + ':' + str(sec)
            datex = time_s
            #print(datex)
            worksheet.write(i, 3, datex)
            worksheet.write(i, 4, time_t)
            i += 1
    # build the next request's form data from the returned timestamp, so the crawl keeps looping
    new_req_data = 'timestamp=' + str(timestampLong) + '&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&'
    print('--------------------------new page-------------------------------')
    #print(new_req_data)
    # limit on how many posts to crawl
    if i <= 20000:
        try:
            get_data(new_req_data)
        except:
            print('fail')
        finally:
            workbook.close()
    else:
        workbook.close()
        print('All crawled successfully!!!')

# i is the row index for the posts being written; row 0 already holds the headers
i = 1
# pass in the form data for the first page here, then the loop takes over
get_data(req_data='timestamp=146646454916&preMoodTimestap=1461641185406&phoneBrand=xiaomi&platform=1&phoneVersion=19&channel=XiaoMiMarket&phoneModel=HM+2A&versionNumber=7.4.1&')

 

    The console output of the code is as follows:

    The resulting Excel table data is as follows:

At this point the crawl is complete. But since I studied statistics after all, how could I not analyze the data? Next, I plan to do some simple visualization of it.
