Recently, we helped someone solve a loop-optimization problem.

Source: Internet
Author: User
Over the past couple of days I took on a task: helping someone solve a loop-optimization problem.

I contacted the developer and figured out what was going on. Because the database was relatively slow, they had decided to pull the data into memory for processing. Two tables of roughly ten million rows were then joined with a nested loop...
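For context, here is a minimal sketch (with made-up miniature row data, not the actual tables) of what a nested-loop join costs:

```python
# Hypothetical miniature of the situation: two in-memory row lists
# joined on "key" with a nested loop -- m * n comparisons in total.
def nested_loop_join(rows1, rows2):
    result = []
    for r1 in rows1:              # m iterations
        for r2 in rows2:          # n iterations per outer row -> O(m * n)
            if r1['key'] == r2['key']:
                result.append((r1, r2))
    return result

rows1 = [{'key': i} for i in range(3)]     # keys 0, 1, 2
rows2 = [{'key': i} for i in range(1, 4)]  # keys 1, 2, 3
print(nested_loop_join(rows1, rows2))      # the pairs for keys 1 and 2
```

At ten million rows per side, that is on the order of 10^14 comparisons, which is why the task landed on my desk.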


When I heard that the database data was being loaded into memory, my first thought was whether a memcached cache could be added. But once I looked at the SQL, I gave up on that idea: all kinds of complicated logic and multi-table joins...

I told him his approach was fundamentally wrong. If the database is the problem, run EXPLAIN on the SQL to see where it is slow... The cache at the VFS layer already keeps hot data in memory anyway. If you load the database into memory and process it yourself, aren't you just reimplementing an RDBMS? What could I say.
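As an aside, the "explain the SQL first" advice can be tried without any server database. A sketch using Python's built-in sqlite3 module (the schema here is invented for illustration and is not the real system's):

```python
import sqlite3

# Two toy tables joined on "key"; names and columns are hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data1 (key INTEGER, value INTEGER)')
conn.execute('CREATE TABLE data2 (key INTEGER, value TEXT)')

# Without an index the planner can only scan; EXPLAIN QUERY PLAN shows it.
plan = conn.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT * FROM data1 JOIN data2 ON data1.key = data2.key '
    "WHERE data1.value < 5 AND data2.value LIKE 'A%'"
).fetchall()
for row in plan:
    print(row)

# After adding an index on the join key, re-check the plan: the inner
# table should now be probed via the index instead of scanned.
conn.execute('CREATE INDEX idx_data2_key ON data2 (key)')
plan = conn.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT * FROM data1 JOIN data2 ON data1.key = data2.key'
).fetchall()
for row in plan:
    print(row)
```

The last column of each plan row is a human-readable description of the chosen access strategy; comparing it before and after the CREATE INDEX is the quickest way to see whether the database can use an index for the join.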

Then I asked him whether the in-memory data needed to stay synchronized with the data source. He said the data is thrown away once it has been used...

Okay... so building an index over data that is used once and discarded would be pointless...


In the end, all I could offer was an O(m + n) algorithm. I made a demo for him, which is very simple:

#!/usr/bin/env python
# coding: utf-8
# A simple demo of an inner join between data1 and data2 on the key column.
# More expressive conditions would need a redesign -- for example, a small
# SQL parsing engine. The SQL this demo emulates:
#
#   SELECT * FROM data1, data2
#   WHERE data1.value < 5 AND data2.value LIKE 'A%' AND data1.key = data2.key

import random


# ---------------------------------------------------------------------------
class Data1(object):
    '''Row structure of the first table.'''
    __slots__ = ('key', 'value')

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return str((self.key, self.value))

    def __repr__(self):
        return str((self.key, self.value))


class Data2(object):
    '''Row structure of the second table.'''
    __slots__ = ('key', 'value')

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return str((self.key, self.value))

    def __repr__(self):
        return str((self.key, self.value))


# ---------------------------------------------------------------------------
# WHERE matching strategies: each condition becomes a small class. The design
# is rough and only meant to demonstrate the algorithm; conditions could be
# combined with boolean logic.
class WhereStrategy(object):
    @staticmethod
    def judge(data):
        pass


class Little_Then_5(WhereStrategy):
    @staticmethod
    def judge(data):
        return data.value < 5


class Begin_With_A(WhereStrategy):
    @staticmethod
    def judge(data):
        return data.value.startswith('A')


# ---------------------------------------------------------------------------
# The core of the algorithm. Python's built-in dict is a hash table (the .NET
# base library also has one, in System.Collections), and these tables resize
# automatically. If hash-table performance drops sharply past some size, the
# collision rate is severe -- most likely the load factor is too high. Under
# normal circumstances read/write time stays stable as the table grows, with
# the occasional collision chain of two or three entries. A periodic slow
# insert means the table is resizing; each resize takes longer, but the next
# one comes proportionally later, so reads and writes are amortized O(1).
#
# The `result` list here stands in for the real output. In practice you would
# pick fields from both tables into a struct: since the location of every
# matching row in both tables is known, any required structure can be built.
# To join on several fields at once, use a composite key in one hash table if
# the conditions are ANDed; if they are ORed, build a separate hash table per
# condition. The same applies to multi-table joins.
#
# Note that both the hash table and the result buffer are large: they must
# live on the heap, not on a thread stack (not an issue in Python). Also, a
# hash table can only implement equality join conditions, not partial-order
# (range) conditions -- but such join requirements should be rare.
def connect_data(data_list1, data_list2, method1, method2):
    hashtable = {}
    result = []
    # Build phase: index the filtered rows of table 1 by key -- O(m).
    for data in data_list1:
        if method1.judge(data):
            hashtable[data.key] = data
    # Probe phase: look up each filtered row of table 2 -- O(n).
    for data in data_list2:
        if data.key not in hashtable:
            continue
        elif method2.judge(data):
            result.append(hashtable[data.key])
    return result


# ---------------------------------------------------------------------------
# Only 10 x 10 tables are tested, to verify correctness. The algorithm is
# simple enough that O(m + n) is evident by inspection, so no performance
# test is included in the demo.
# The keys and values of the first table are random values below 10. The keys
# of the second table run from 5 to 14; its values are random strings drawn
# from the letters A, B, C.
# I first wanted to build the two tables as generators, producing rows on the
# fly to save memory, but decided that would not match the real environment.
def gen_string(length, chartable):
    '''Build a string of `length` characters chosen from `chartable`.'''
    return ''.join([random.choice(chartable) for i in range(length)])


def test():
    '''Simple test case.'''
    keylist1 = list(range(10))
    keylist2 = list(range(5, 15))
    valuelist1 = list(range(10))
    valuelist2 = [gen_string(3, ['A', 'B', 'C']) for i in range(10)]
    random.shuffle(keylist1)
    random.shuffle(valuelist1)
    random.shuffle(keylist2)
    datalist1 = list(map(Data1, keylist1, valuelist1))
    datalist2 = list(map(Data2, keylist2, valuelist2))
    print('Data of the first table:\n')
    for data in datalist1:
        print(data)
    print('\nData of the second table:\n')
    for data in datalist2:
        print(data)
    print('\nJoined data:\n')
    r = connect_data(datalist1, datalist2, Little_Then_5, Begin_With_A)
    for data in r:
        print(data)


if __name__ == '__main__':
    test()
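To make the complexity difference concrete, here is a small self-contained micro-benchmark. It is my own simplified re-sketch (plain tuples instead of the demo's row classes, sizes chosen arbitrarily), contrasting the O(m*n) nested loop with the O(m+n) hash join on the same data:

```python
import random
import time

# Hypothetical sample data: table 1 has unique keys 0..m-1 in shuffled
# order; table 2 has n rows with random keys drawn from the same range.
m = n = 1000
keys1 = list(range(m))
random.shuffle(keys1)
table1 = [(k, k * 2) for k in keys1]
table2 = [(random.randrange(m), 'x') for _ in range(n)]

def nested_loop_join(rows1, rows2):
    # Every pair of rows is compared: m * n iterations.
    return [(k1, v1, v2) for k1, v1 in rows1
            for k2, v2 in rows2 if k1 == k2]

def hash_join(rows1, rows2):
    # Build a dict over table 1 (O(m)), then probe once per row of
    # table 2 (O(n)).
    index = dict(rows1)
    return [(k, index[k], v) for k, v in rows2 if k in index]

start = time.time()
slow = nested_loop_join(table1, table2)
t_nested = time.time() - start

start = time.time()
fast = hash_join(table1, table2)
t_hash = time.time() - start

print('nested loop: %.3fs, hash join: %.3fs' % (t_nested, t_hash))
```

Both joins produce the same set of result tuples; only the row order differs, because the nested loop is driven by table 1 and the hash join by table 2. The gap in wall-clock time widens roughly linearly with table size, which is exactly the O(m*n) versus O(m+n) difference.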
