"Detailed" Python handles a large number of data and dict traversal optimization problems __python

Source: Internet
Author: User
Tags comparison table

Our demand for this example is to write a script that runs at 0 o ' Day. This script extracts data from a database that is updated in real time.

Run an Excel table every day, showing the difference between 0 points of the day and yesterday 0 o'clock.


In fact, a very simple demand, the key problem encountered is the amount of data. The amount of data in this case is too large to take 20多万条 data from the database each time.

There are a number of problems with large amounts of data:

1. Data cannot be loaded into Excel tables, because using Python to handle Excel data, insert up to 65536 rows of data, will be more error;

2. Traverse the filtering problem. We get two days of data to compare and then generate a diff table. You need to traverse the data between the two tables, the amount of data is too large, traversing the time spent too long.


We are now making an elaboration of these two key issues.

"Problem one: Change Excel table to CSV table"

We found that in CSV format, there is no limit to the number of rows, we can insert 20多万条 data directly into the CSV table.

"problem two: Traversal of dict type data"

According to our previous experience, the dictionary code to generate the contrast information is as follows:

def getCurrentCompareMessageDict0 (Dict0, Dict1):
    ' not optimized for obtaining current contrast information dictionary '
    dlist0=list (Dict0.keys ())
    Dlist1=list (Dict1.keys ())
    dict2={} for
    I in range (len (dlist1)):
        if dlist1[i] not in dlist0:
            key= Dlist1[i]
            value=[0, Dict1[dlist1[i]]]
            dict2[key]=value
        Else:
            if dict1[dlist1[i]]/100.0!= dict0[ Dlist1[i]]:
                key=dlist1[i]
                value=[dict0[dlist1[i], Dict1[dlist1[i]]
                dict2[key]=value
    Return Dict2
That is, the two Dict key list is built first.

Then, with the length of the KEY list as the upper bound, for the loop, the Dict[key] is used to filter the list data.

The operation of this method is super slow.

After the study we have improved the method as follows:

def getcurrentcomparemessagedict (Dict0, Dict1):
    ' optimized get current contrast information dictionary '
    dict2={}
    i=0 for
    D, X in Dict1.items ():
        if Dict0.has_key (str (d)):
            if x/100.0!= string.atof (dict0[str (d)):
                key=d
                value=[ String.atof (Dict0[str (d)]), X]
                Dict2[key] = value
        else:
            key=d
            value=[0, X]
            Dict2[key]=value return
    Dict2
After adopting this method, the comparison of two groups of 20多万条 data was screened, and completed in 1 seconds.

After testing, the optimization method is about 400 times times faster.

Where is this method optimized?

first, the method of traversing Dict is changed to

For D, X in Dict1.items ():
where d is the value of key,x. I can actually write this.

for (d, X) in Dict1.items ():
Data found on the Internet that the parentheses in the 200-time traversal efficiency is higher, without parentheses in more than 200 times the traversal efficiency is higher. (Reference link: Python Two comparison of traversal methods)

We did not go to the test and adopted the method without parentheses.

Secondly, the method of detecting whether a key exists in Dict is changed to

If Dict0.has_key (str (d)):
This Has_key function returns a Boolean value of TRUE or false.

The original detection method:

If dlist1[i] not in dlist0:
Abandon.

Actually improve the efficiency of the part on two steps, traversal and detection. As to exactly which step was raised, ... It should have been improved.

Because these two steps of the code is not separate, is linked together.

Only a for-d,x in Dict.items () traversal method is used to directly use the two parameters of D and X without taking a value.



The key issue is two. There are several problems encountered in the process to be elaborated:

1. Python compares the elements in two arrays to be completely equal.

>>> a = [(1,1), (2,2), (3,3), (4,4)]
>>> b = [(4,4), (1,1), (2,2), (3,3)]
>>> a.sort ()
>>> b.sort ()
>>> a==b
True
That is, first sort after comparison. Only verify that the elements are consistent, regardless of the order effect.

Reference link: Python compares elements in two arrays to be exactly equal

2.python How to convert a string to a number.

In the final code, we used the

String.atof (floating-point string)
String.atoi (integer string)

Note: you need

Import string

3. Read CSV file

We've all been writing CSV files before. Here you need to read and load the data into dict for ease of use.

The method is as follows:

def gethandledatadict (filename):
    ' Get yesterday 0 point data Dictionary '
    dict={}
    csvfile=file (filename, ' RB ')
    reader= Csv.reader (csvfile) for
    i-reader:
        key=i[0]
        value=i[1]
        dict[key]=value return
    dict
Two lines of critical code:

    Csvfile=file (FileName, ' RB ')
    Reader=csv.reader (CSVFile)
    For I in Reader:
I is every piece of data in Dict. Each i is a list, i[0] is a key,i[1] is value.


4.Python of Keyerror

This mistake is not the first time we encounter, here is highlighted, to show attention

Keyerror means: This key is not present in Dict. In this case, if we dict[key] to fetch the value of this key, we will report the Keyerror error.

It is possible that the key 's data type is wrong, or that it may not exist, both of which are considered.

We encountered a data type error in this example, so there are 2 problems, the string into a digital Blabala ....


"Scripting Ideas"

There is also the idea of writing a script that first pastes the final version of the code as follows.

#!/usr/bin/python #-*-coding:utf-8-*-__author__ = "$Author: wangxin.xie$" __version__ = "$Revision: 1.0 $" __date__ = "$Date: 2015-01-05 10:01$" ############################################################### #功能: Current 0 points and yesterday 0 balance information comparison table,
Run ############################################################### import sys import datetime import XLWT daily 00:00 Import CSV import string from Myyutil. Dbutil import dbutil ###################### #全局变量 #################################### memberdbutil = DBUtil (' Moyoyo_ Member '] today = Datetime.datetime.today () Todaystr = Datetime.datetime.strftime (Today, "%y-%m-%d") Handledate = today-
Datetime.timedelta (1) handledatestr = Datetime.datetime.strftime (handledate, "%y-%m-%d") Filedir = ' d://' Handlecsvfilename= filedir+handledatestr+ ' _balance_data.csv ' currentcsvfilename = fileDir+todayStr+ ' _balance_ Data.csv ' currentexcelfilename= filedir+todaystr+ ' _balance_compare_message.xls ' style1 = xlwt. Xfstyle () Font1 = XLWT. Font () font1.height = Font1.name = ' SimSun ' Style1.font = Font1 csvfile1=file (currentcsvfilename, ' WB ') Writer1 = Csv.writer (csvfile1, dialect= ' Excel ') ### ############################################################### def gencurrentbalancedata (): ' Get current balance data ' sql = ' SELECT member_id, (Temp_balance_amount + temp_frozen_amount) from Moyoyo_member.
    Money_info WHERE (Temp_balance_amount + temp_frozen_amount)!= 0 ' rs = memberdbutil.querylist (sql, ()) If not rs:return the None return Rs def getcurrentdatadict (RS): ' Assembles the current data into the dictionary ' dict={} for I-in range (
    Len (RS)): Key=rs[i][0] value=rs[i][1] dict[key]=value return dict def writecsv (x,writer): 
    ' CSV data write function ' Writer.writerow (x) def writecurrentcsvfile (): ' Write CSV file containing current data ' rs=gencurrentbalancedata () Dict=getcurrentdatadict (RS) for D, X in Dict.items (): Writecsv ([D, x/100.0], writer1) csvfile1.close ( ) Return dict def GethandledatadiCT (filename): ' Get yesterday 0 point data Dictionary ' dict={} csvfile=file (filename, ' RB ') Reader=csv.reader (csvfile) for I I N reader:key=i[0] value=i[1] dict[key]=value return dict def getcurrentcomparemessagedict (di
            Ct0, Dict1): ' Get Current contrast Information dictionary ' dict2={} for D, X in Dict1.items (): If Dict0.has_key (str (d)):
                If x/100.0!= string.atof (dict0[str (d))): Key=d Value=[string.atof (Dict0[str (d)), X]
    Dict2[key] = value Else:key=d value=[0, X] Dict2[key]=value Return Dict2 def writeexcelheader (): ' Write Excel table header ' WB = XLWT. Workbook (encoding = "UTF-8", style_compression = True) sht0 = Wb.add_sheet ("Balance Information comparison list", CELL_OVERWRITE_OK = True) s  Ht0.col (0). width=3000 sht0.col (1). width=4000 Sht0.col (2). width=4000 num=today.day (0, 0, ' User ID ', Style1) sht0.write (0, 1, str (num-1) + ' Day 0 point balance ', Style1) Sht0.write (0, 2, str (num) + ' Day 0 point balance ', Style1 return WB def writecurrentcomparemessageinfo (sht,dict): ' Write current contrast information data ' dlist =list (Dict.keys ()) for I in Range (len (dlist)): Sht.write (i+1, 0, Dlist[i], Style1) sht.write (i+1, 1, D Ict[dlist[i]][0], Style1) sht.write (i+1, 2, dict[dlist[i]][1]/100.0, Style1) def writecurrentcomparemessageexcel ( dict): ' Write current contrast information Excel table ' WB = Writeexcelheader () sheet0 = Wb.get_sheet (0) Writecurrentcomparemessageinfo (Sheet0, Dict) Wb.save (Currentexcelfilename) def main (): print "===%s start===%s"% (sys.argv[0), Datetime.datetime . Strftime (Datetime.datetime.now (), "%y-%m-%d%h:%m:%s")) currentdatadict=writecurrentcsvfile () handledatadict = ge Thandledatadict (handlecsvfilename) currentcomparemessagedict = Getcurrentcomparemessagedict (HandleDataDict, currentdatadict) Writecurrentcomparemessageexcel (currentcomparemessagedict) print "===%s end===%s"% (sys.argv[0), D Atetime.datetime.strftime (datetime. DateTime.Now (), "%y-%m-%d%h:%m:%s")) if __name__ = = ' __main__ ': Try:main () finally:if Memberd BUtil:memberDBUtil.close ()
The reason to say, scripting ideas.

Because we are writing this script, we need to pay attention to a lot of issues, not to be taken seriously.

Especially the process aspect. What to do first, what to do, how to deal with the data. There are no steps to omit.

Should be noted when writing each method.


thought one: The guiding role of script running time

The script needs to say that the script needs to fetch the data at 0 points per day. The data in the database is changed in real time.

So since it requires 0 points to take the data, so the method of fetching data must be put in the front.

That is, the way the script is arranged is closely related to the running time required by the script.

Why does the script choose to run at 0 o ' hour, what to do at 0 o ' time, we need to consider more.

because, the final effect is the accuracy of the data.

That is, if we run a different method first, such as reading the CSV file at the 0 point yesterday.

After reading for more than 20 seconds, this method of fetching data is not run. This time the data is not 0 point data.


thought two: Do not repeat the work.

Let's analyze the data flow in this example.

DICT0--------data from yesterday's 0 in CSV.

Dict1--------The data that the script takes from the database when it runs 0 o'clock the same day. Write to the CSV first.

Dict2--------yesterday's data and just ran out of data, after contrasting the filtered difference data dictionary.


What you need to be aware of is the operation of the code when the DICT2 is generated. Dict0 's data is naturally directly fetched, DICT1 data exists in the Dict1 of the code and can be directly return.

But before we made a mistake, we extracted the DICT1 data from the CSV file we just generated.

This is not necessary. We can take it directly from the code. This data is in the code, do not need to be extracted from the file.


Because of this unwarranted extension of the script's running time. is a basic logical omission.

So this method in the final version of the code.

Def writecurrentcsvfile ():
    ' Write a CSV file containing current data '
    rs=gencurrentbalancedata ()
    dict=getcurrentdatadict ( RS)
    for D, X-Dict.items ():
        writecsv ([D, x/100.0], writer1)
    csvfile1.close () return
    dict
After the CSV file is written, the used dict directly return, because it will be used later.

The generated CSV file is only for comparison with tomorrow's data.


Thought III: The significance of data generation.

Made the above mistake. We can reflect on the role of data. There is also the role of documents.

We generate dict for what, of course, the data may have more than one effect, this should be noted.

Csv0 is designed to provide dict0,dict0 in contrast to the data of the day.

Dict1 is to generate tomorrow's CSV, and to generate the dict2 of the day.

That is, CSV1 does not exist for Dict1 at all. Just to prepare for tomorrow.

With that in mind, it would not be foolish to take dict1 from the CSV1.


Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.