Sina Weibo crawl notes (4): Data cleanup

Source: Internet
Author: User

There is a lot to data cleanup. In fact, the cleaning done in the intervals between crawls is fiddly, tedious work. Summing up my experience:

1. Always use a database to store the data.

(I am not good with databases, so to "save learning time" I stored every data item in txt files. Only at the end, when I had to look things up across several categories and the folder tree kept getting more complex, did I realize that even using MySQL would have improved efficiency.)

2. You can never have too many exception-handling statements.

3. Scripts that process the data are best wrapped in functions, with variables passed in from outside, to minimize how often the source has to be changed.

4. Write down the whole workflow and draw a diagram so it is easy to look things up; once there are many steps and files, it gets confusing.
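As a minimal sketch of points 1 to 3, the snippet below stores records in a database from inside a function whose inputs are all passed in from outside. SQLite (through Python's built-in sqlite3 module) is swapped in here only because it needs no server; the notes themselves mention MySQL. The table name, the column names, and the helpers save_forwards and count_forwards are hypothetical, chosen for illustration.

```python
import sqlite3


def save_forwards(db_path, rows):
    """Insert (user_id, post_time, forward_time) rows; return how many were stored.

    db_path and rows are parameters, so the function body does not need
    editing between runs (point 3).
    """
    conn = sqlite3.connect(db_path)
    stored = 0
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS forwards ("
            "user_id TEXT, post_time TEXT, forward_time TEXT)"
        )
        for row in rows:
            try:
                conn.execute("INSERT INTO forwards VALUES (?, ?, ?)", row)
                stored += 1
            except sqlite3.Error as exc:
                # Point 2: report the malformed row and keep going.
                print("skipped row %r: %s" % (row, exc))
        conn.commit()
    finally:
        conn.close()
    return stored


def count_forwards(db_path):
    """Return the number of rows currently in the forwards table."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM forwards").fetchone()[0]
    finally:
        conn.close()
```

A query such as the one in count_forwards replaces walking a folder tree of txt files, which is the efficiency gain point 1 is about.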

Take time processing as an example:

I need each user's latency, that is, the interval between the moment a microblog was posted and the moment the user forwarded it.
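Once both moments are available as datetime objects, the latency is a plain subtraction. The sketch below only illustrates that step; the helper name forward_latency and the two timestamps are made up for the example.

```python
import datetime


def forward_latency(post_time, forward_time):
    """Return the delay between posting and forwarding as a timedelta."""
    return forward_time - post_time


posted = datetime.datetime(2014, 12, 4, 20, 49, 48)
forwarded = datetime.datetime(2014, 12, 4, 22, 0, 30)
delay = forward_latency(posted, forwarded)
print(delay)                  # 1:10:42
print(delay.total_seconds())  # 4242.0
```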

A microblog message line in the TXT file looks like this:

A forward record:

To process such a line in Python, the basic tools are .split() and various list operations.

Take the time field as an example. The Python modules used for handling time are mainly datetime and time, which also help when working with file information such as creation and modification times.

The time section has different formats, such as:

Today 15:00
2014-12-04 20:49:48
22:00 (preceded by a month and day, without the year)

Each format needs its own matching scheme:

    import datetime
    import re
    import sys

    # time_area and file_time are set earlier in the script (not shown);
    # presumably time_area is the split time text and file_time the
    # reference datetime for relative formats.
    digits = re.compile(r'\d+')

    if len(re.compile(r'\d+-\d+-\d+').findall(time_area[0])) == 1:  # 2014-12-04 20:49:48
        [year, month, day] = digits.findall(time_area[0])
        [hour, minute, sec] = digits.findall(time_area[1])[0:3]
        t = [year, month, day] + [hour, minute, sec]
        t = [int(i) for i in t]
        resulttime = datetime.datetime(*t)
    elif len(digits.findall(time_area[0])) == 1:  # minutes ago
        posttime = file_time - datetime.timedelta(minutes=int(digits.findall(time_area[0])[0]))
        resulttime = posttime
    elif len(digits.findall(time_area[0])) == 2:  # month and day, then 22:00
        [year, month, day] = [file_time.year, digits.findall(time_area[0])[0], digits.findall(time_area[0])[1]]
        [hour, minute, sec] = [digits.findall(time_area[1])[0], digits.findall(time_area[1])[1], 30]
        t = [year, month, day] + [hour, minute, sec]
        t = [int(i) for i in t]
        resulttime = datetime.datetime(*t)
    elif len(digits.findall(time_area[0])) == 0:  # Today 15:00
        [year, month, day] = [file_time.year, file_time.month, file_time.day]
        [hour, minute, sec] = [digits.findall(time_area[1])[0], digits.findall(time_area[1])[1], 30]
        t = [year, month, day] + [hour, minute, sec]
        t = [int(i) for i in t]
        resulttime = datetime.datetime(*t)
    else:
        print("Unexpected time type, check plz")
        sys.exit(1)  # non-zero status signals the failure
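Following point 3 above, the same branching can be wrapped in a function so that the inputs arrive as parameters. This is only a sketch under assumptions: the name parse_post_time is hypothetical, time_area is assumed to be the time text split on whitespace, file_time is assumed to be a datetime that stands in for "now", and raising ValueError replaces exiting the script so the caller can decide what to do with a bad line.

```python
import datetime
import re

DIGITS = re.compile(r"\d+")
FULL_DATE = re.compile(r"\d+-\d+-\d+")


def parse_post_time(time_area, file_time):
    """Convert a split time field into a datetime.

    time_area: list of pieces of the time text (assumed whitespace-split).
    file_time: datetime used as the reference for relative formats.
    """
    head = [int(n) for n in DIGITS.findall(time_area[0])]
    if FULL_DATE.search(time_area[0]):          # 2014-12-04 20:49:48
        clock = [int(n) for n in DIGITS.findall(time_area[1])[:3]]
        return datetime.datetime(*(head + clock))
    if len(head) == 1:                          # N minutes ago
        return file_time - datetime.timedelta(minutes=head[0])
    if len(head) == 2:                          # month and day, then HH:MM
        year = file_time.year
        month, day = head
    elif len(head) == 0:                        # Today 15:00
        year, month, day = file_time.year, file_time.month, file_time.day
    else:
        raise ValueError("unexpected time format: %r" % (time_area,))
    hour, minute = [int(n) for n in DIGITS.findall(time_area[1])[:2]]
    # Seconds are unknown in these formats; 30 is used as in the notes.
    return datetime.datetime(year, month, day, hour, minute, 30)


saved_at = datetime.datetime(2014, 12, 5, 10, 0, 0)
print(parse_post_time(["Today", "15:00"], saved_at))  # 2014-12-05 15:00:30
```

Because nothing inside the function refers to global state, it can be tried on a handful of sample fields before being run over the whole data set.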

Many small details deserve attention here. For example, len() returns the length of a list as well as the length of a string, and the results matched through re.compile() are strings, which is why they are converted with int() before being passed to datetime.
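A short illustration of these two details, using a date string of the same shape as the one in the listing:

```python
import re

numbers = re.compile(r"\d+").findall("2014-12-04")
print(numbers)                    # ['2014', '12', '04']
print(len(numbers))               # 3, the length of the list
print(len(numbers[0]))            # 4, the length of the string '2014'
print([int(n) for n in numbers])  # [2014, 12, 4]
```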

The data structures produced by each run can be stored with pickle; in my case it is a dictionary (dict). The usage looks like this:

    import pickle
    try:
        with open('foredic.pickle', '

This runs at the start of the script and loads the data structure stored by the previous run. pickle.load() raises an exception when it meets an empty file; that exception is handled, so if the file is empty or cannot be found, a new empty dictionary is created.
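A minimal sketch of the loading behaviour just described, not the author's exact code: the helper name load_dict is hypothetical, and the exception types are the ones the standard library raises for a missing file and for an empty pickle file.

```python
import pickle


def load_dict(path):
    """Return the dictionary pickled at path, or {} if it is missing or empty."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (IOError, EOFError):
        # IOError: the file does not exist or cannot be opened.
        # EOFError: pickle.load() hit the end of an empty file.
        return {}
```

Calling result = load_dict('foredic.pickle') at the top of a script would then give either the previous run's dictionary or a fresh empty one.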

    with open('foredic.pickle', 'wb') as f:
        pickle.dump(result, f)

Finally, the result dictionary is stored in the pickle file.

 
