Python Handles a Huge Number of Mobile Phone Numbers

I. Task description
Last week my boss gave me a small task: batch-generate mobile phone numbers and deduplicate them. He handed me an Excel workbook containing the first 7 digits of the phone numbers for each region, spread across 13 sheets with roughly 8,000 prefixes per sheet. Each 7-digit prefix must be expanded with every possible 4-digit suffix, so each prefix generates 10,000 full 11-digit phone numbers. On top of that, a batch of numbers is already in use on the server, and those must be removed from the generated list. The used numbers had been exported into txt files, about 40 million entries in total, 100,000 per file, with duplicates; the boss estimated around 30 million unique ones. For the job he assigned me a Windows server with 16 GB of RAM and an 8-core CPU.
II. Task analysis
With this much data, the program cannot be too hungry for CPU or memory: it should be able to run on my 4 GB development machine, meaning that with every IDE and all server software closed, and only Notepad++ and a browser open (for looking up data), the machine must not freeze. The skills this task will likely need: Excel processing from code, file traversal and read/write, operations on large arrays, and multithreaded concurrency. Estimated completion time: one week (on top of normal daily work).
III. Technical analysis
- PHP: Very familiar, but execution efficiency and memory consumption are mediocre and might freeze the machine, and multithreading looks complicated to set up. (Pending)
- JavaScript: Fairly familiar; efficiency and memory footprint unclear, but the weak typing is worrying and all the callbacks scatter my thinking. (Rejected)
- Java: Know a little; learning and writing it is tedious, so development would be slow. (Pending)
- C#: Never used; harder to learn (easier than Java, harder than a scripting language). (Pending)
- C++: Know a little; array handling and multithreading both look painful to deal with. (Rejected)
- Python: Never used, but it is said to be easy to learn; a graduate student I know uses it for physics computations and the like, so execution efficiency should not be low. (Try it)
So with a "just give it a shot" mindset I opened the Runoob Python tutorial and skimmed it. The table of contents covered arrays (list, tuple, dictionary), file I/O, and multithreading, and the examples looked genuinely simple. Another quick search on processing Excel with Python confirmed it was decisively quick and painless. And so began my road of snake charming.
IV. Merging the used numbers (./4000w/hebing.py)
As mentioned at the start, there is a list of roughly 40 million used numbers containing duplicates, and every later step needs it. The 400-plus txt files add up to about 500 MB, so reading them all into memory at once is bearable. The first step is therefore to read all the files, merge and deduplicate them, and write the result to a single txt file. The final list came to just over 18 million numbers, in a txt file of about 209 MB.
hebing.py
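The script itself was attached to the original post; a minimal sketch of what it does, with hypothetical file names and paths, might look like this:

```python
import glob

def dedup(lines):
    # set() drops duplicates in a single pass; order is lost,
    # which does not matter for this task
    return list(set(lines))

def merge_and_dedup(src_pattern, dest_path):
    """Read every txt file matching src_pattern, merge the numbers,
    deduplicate, and write the result to dest_path."""
    txt_arr = []
    for path in glob.glob(src_pattern):
        with open(path) as f:
            txt_arr.extend(f.read().splitlines())
    txt_arr = dedup(txt_arr)
    with open(dest_path, 'w') as f:
        # one join call instead of concatenating in a loop
        f.write('\n'.join(txt_arr))
    return len(txt_arr)
```

The single `'\n'.join(...)` call is the point stressed below: building the output string piece by piece in a loop would be dramatically slower.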
The in-memory deduplication is done with txtArr = list(set(txtArr)). Magical, right? What are these two functions? They are both type conversions: the list is converted to a set and then back to a list (a list is Python's array). Python's set type is an unordered collection of unique elements, so converting a list to a set automatically removes the duplicates. The order gets scrambled in the process, of course, but order does not matter here.
When converting the final array back to a string, also use the string join method on the array directly; do not concatenate in a for loop, which is extremely time-consuming.
V. Excel processing (./prenum.py)
Following an online example, I read the contents of the first sheet of the Excel file, merged the array into a string, and wrote it to a text file. While converting to a string I got an error saying the data type was wrong. Only then did I learn that unlike PHP and JS, Python is strongly typed =_=! So I first converted the data inside the sheet to string format within Excel itself (Excel calls it "text" format; converted cells show a small green triangle in the upper-left corner). Each sheet is processed separately into its own file, each containing about 8,000 phone number prefixes.
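The original prenum.py was attached to the post; a rough sketch of the idea, assuming the xlrd library that was the usual choice for .xls files at the time (file names and function names here are placeholders, not the originals):

```python
def to_prefix_str(cell):
    """Coerce one Excel cell value to a prefix string: xlrd hands
    numeric cells back as floats (e.g. 1380000.0), which is exactly
    the wrong-data-type surprise described above."""
    if isinstance(cell, float):
        return str(int(cell))
    return str(cell)

def export_prefixes(xls_path, out_path, sheet_index=0):
    # pip install xlrd; imported lazily so the helper above
    # stays usable without the library installed
    import xlrd
    book = xlrd.open_workbook(xls_path)
    sheet = book.sheet_by_index(sheet_index)
    prefixes = [to_prefix_str(sheet.cell_value(r, 0))
                for r in range(sheet.nrows)]
    with open(out_path, 'w') as f:
        f.write('\n'.join(prefixes))
```

Converting the cells to text format inside Excel, as described above, sidesteps the float coercion entirely; the helper just makes the script robust either way.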
prenum.py

VI. Generate numbers and remove used ones
First, think it through: there are 13 sheets, each with about 8,000 prefixes, each prefix generating 10,000 numbers, and every generated number must be checked against the 18 million used numbers. What would you do???
↓↓
↓↓
My idea: do it in 13 passes, one sheet at a time. Each sheet has 8,000-odd prefixes, so spawn 8,000-odd threads, one per prefix, with each thread generating its 10,000 numbers and checking them against the 18 million used ones.
The core generate-and-dedup code is below; each number took roughly 0.5 seconds to generate and check. A quick time estimate: 10,000 × 0.5 s ≈ 83 min per thread, so with 8,000 threads running, about 1.5 hours or so in total.
    # fileobj = open('testBugNum.txt')   # test data
    fileobj = open('list-dataall.txt')   # the real list of used numbers
    bugTxtStr = fileobj.read()
    fileobj.close()
    bugTxtArr = bugTxtStr.splitlines()   # list of already-used numbers

    while j < 10000:                     # generate 10,000 numbers for this prefix
        newnum = str(int(txtnum) * 10000 + j)
        j += 1
        # print(newnum)
        if newnum not in bugTxtArr:      # skip it if it is in the used-number list
            numArr.append('+86' + newnum)
            print('+86' + newnum)
I ran it briefly on my own machine and watched for a few minutes. Thread creation was slow: some threads had already processed a dozen-plus numbers while others had only just been created. With thread scheduling, nothing ran in order and the output was a mess, but otherwise there seemed to be no problem. So I deployed it to the server, started it, and went home a few hours later. Back at the company the next day, I connected to the server and was stunned: the output showed those threads had only reached 300-odd numbers each. God knows when they would ever reach 10,000. And then the real kicker: every now and then I saw a thread-creation failure ⊙o⊙
VII. Rethinking (./dodata.py)
Judging from the run logs, multithreading was not speeding the program up at all here. Why? And if multithreading is out, then the deduplication step needs to change to some more efficient approach, right?
First, the answer from the Internet: for compute-intensive programs, multithreading in Python really can be worse than a single thread. A CPU only has so many cores, the threads all contend for them anyway (and in CPython the GIL lets only one thread execute Python bytecode at a time), and thread scheduling adds its own time and space overhead. A real pitfall. Second, PHP has a function that subtracts one array from another (officially, the array difference, array_diff()), so Python should have an equivalent. Wouldn't generating the 10,000 numbers first and then taking the difference against the used list in one shot be much faster than comparing number by number?
In practice the efficiency proved extremely high. The array difference in Python is numArr = list(set(numArr) - set(bugTxtArr)). Measured: about three seconds per batch (one batch ≈ 10,000 numbers), versus half a second per single number before \(^▽^)/. Note, though, that the 10,000 generated numbers now come out in scrambled order, because the intermediate set type is unordered. If ascending order were needed, one sort call would fix it; I asked the boss, order doesn't matter, so I left it as is.
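Put together, the per-prefix step might look like the sketch below (the names are illustrative, not the original script's):

```python
def gen_numbers(prefix, used):
    """Expand one 7-digit prefix into its 10,000 candidate numbers,
    then strip the used ones with a single set difference."""
    base = int(prefix) * 10000
    candidates = {str(base + i) for i in range(10000)}
    # one C-level pass instead of 10,000 scans of an 18-million list
    fresh = candidates - used
    # order comes out scrambled; sort here if it mattered
    return ['+86' + n for n in fresh]
```

Passing the used numbers in as a set (built once, reused for every prefix) is what makes the difference operation cheap.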
During debugging I found that multithreaded runs would sometimes die with errors like "Unhandled exception in thread started by ... sys.excepthook is missing". The explanation online: once the main process finishes, the threads it created get shut down. My workaround was to keep the main process spinning in an empty statement at the end, much like the idle loops from my microcontroller days in C →_→ In the end I dropped multithreading anyway and processed everything in a single thread, for safety and stability.
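For what it's worth, the usual fix for that symptom is to have the main thread join() its workers rather than spin in an empty loop; a tiny sketch:

```python
import threading

results = []
lock = threading.Lock()

def worker(n):
    value = n * n  # stand-in for the real per-prefix work
    with lock:     # guard the shared list against concurrent appends
        results.append(value)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # main thread blocks here until every worker has finished
```

join() makes the main thread wait for each worker, so the process cannot exit while work is still in flight.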
dodata.py

VIII. Merging the output (./hbdata.py)
After about six hours of number generation, the last step is to take the 8,000-plus files generated from one Excel sheet and combine every 10 of them into one, giving files of about 100,000 numbers each, every number prefixed with "+86". Just walk the directory; there is nothing technically notable here.
hbdata.py
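hbdata.py was attached to the original post; a minimal sketch of the grouping logic, with hypothetical paths and output names:

```python
import os

def chunk(items, size=10):
    """Split a list into consecutive groups of `size`
    (the last group may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def merge_groups(src_dir, dest_dir, size=10):
    """Walk src_dir and concatenate every `size` files into one
    output file in dest_dir."""
    paths = sorted(os.path.join(src_dir, name) for name in os.listdir(src_dir))
    for idx, group in enumerate(chunk(paths, size)):
        out_path = os.path.join(dest_dir, 'out%03d.txt' % idx)
        with open(out_path, 'w') as out:
            for path in group:
                with open(path) as f:
                    out.write(f.read())
```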
One more thing: there are only 13 sheets in total, so I just tweaked these scripts slightly and ran them 13 times. Worth noting: almost every script here carries a global variable tblindex, because if a directory or file name is modified carelessly between runs, one sheet's output could silently overwrite another's.
Summary
- A very important point for scripting languages: use the functions the language provides wherever possible instead of hand-rolling the algorithm, especially loops; the execution speeds are not even in the same order of magnitude.
- When processing data in bulk, split the work into steps and write intermediate files. Debug complex operations on a small sample of the data first, and switch to the real data only once the results are correct.
- Multithreading is no help in compute-intensive scenarios, even on a multi-core CPU. It only scrambles the output order and makes it hard to observe, makes threads fail unpredictably, and adds memory and context-switching overhead.
- GPUs are said to be used for mining, brute-forcing, and the like; this kind of highly parallel, simple-logic computation should suit them well too. GPU programming might be worth learning in the future (tutorial recommendations welcome).
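A tiny illustration of the first point, contrasting a hand-rolled membership loop with the built-in set difference (the sizes here are small placeholders):

```python
candidates = {str(13800000000 + i) for i in range(10000)}
used = {str(13800000000 + i) for i in range(0, 10000, 2)}

# hand-rolled loop: tests each candidate one by one (slow at scale,
# catastrophically so if `used` were a list instead of a set)
loop_result = sorted(n for n in candidates if n not in used)

# built-in: a single set difference executed in C
set_result = sorted(candidates - used)

assert loop_result == set_result
```

Both produce the same numbers; the difference is purely in how much work happens per element inside the interpreter.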