Batch large-text filtering tool: development notes

Last Update: 2015-07-04 Source: Internet

Author: User

Tags emit keyword list



This week I spent two or three days building a filtering tool for large text data. It targets CSV and TXT files of hundreds of megabytes or even several gigabytes, which Excel opens very slowly or cannot open at all, and it provides general-purpose filtering, statistics, and output functions. Filtering large text files like this is a common need in production data selection and data analysis. This article briefly records the development process under the following questions:

What development language is used?
How to ensure the user experience?
How to maintain optimization?

What development language is used?

This is almost a non-question. I know Python well, it is quick to develop in and flexible, and above all its powerful eval function can execute code held in a string. That string can reference variables and functions, so I can bind a specific variable to each row of the file and evaluate the expression to decide whether the row should be output, which suits user-defined filter rules very well. As for speed, Python processes millions of rows in a matter of minutes, which is within a tolerable range.
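The eval idea above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the helper names make_row_filter and filter_rows, the variable name row, and the sample data are all assumptions made for the example.

```python
import csv
import io


def make_row_filter(expression):
    """Compile a filter expression once and return a predicate over a row dict."""
    code = compile(expression, "<filter>", "eval")
    # Only a few harmless builtins are exposed to the expression (an assumption
    # of this sketch). eval is still not safe for untrusted input.
    allowed = {"__builtins__": {}, "len": len, "int": int, "float": float}

    def keep(row):
        # The expression sees the current line as the variable `row`.
        return bool(eval(code, allowed, {"row": row}))

    return keep


def filter_rows(lines, expression, delimiter=","):
    """Yield the rows (as dicts keyed by header name) that satisfy the expression."""
    keep = make_row_filter(expression)
    for row in csv.DictReader(lines, delimiter=delimiter):
        if keep(row):
            yield row


# Hypothetical two-row sample standing in for a large file.
sample = io.StringIO("field1,field2\nabcd,12345\nxyzw,67890\n")
matches = list(filter_rows(sample, '(row["field1"][:2] + row["field2"][-3:]) == "ab345"'))
```

Compiling the expression once and reusing the code object avoids re-parsing the string for every row, which matters when a file has millions of rows.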

How to ensure the user experience?

The users of this tool are mainly production staff and analysts. For them raw speed comes second; what matters is that the tool is simple to use and saves them time. So I break the user experience down into two parts: ease of operation and a friendly interface. Most users normally view and filter data in Excel, so it is best to offer a data view and filtering workflow similar to Excel's. That raises the question of which framework to build the interface with. Here too I went with what I know best, which is Qt; its signals and slots mechanism is a real pleasure to use. I had only used Qt from C++ before, but the interfaces of PyQt, the Python binding for Qt, are very similar, so whenever something was unclear I just checked the documentation.

Ease of operation principle

Modeled on Excel's data import feature, and after a morning's work, the import interface looks as follows:

The data is usually encoded as GBK, GB18030, UTF-8, and so on, but many users neither know nor care about the encoding (so when they open a CSV and see a screen of garbage, they just ask why it is garbled). So on import I use the chardet module to guess the file's encoding, sparing the user the choice. The code is as follows:

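    import chardet

    # filepath is a placeholder name for the file being imported
    with open(filepath, 'rb') as rf:
        # Read the first 2 KB to improve detection accuracy
        charset = chardet.detect(rf.read(2048))['encoding']
        # Presumably GB2312 results are widened to GBK, its superset
        if charset == 'GB2312':
            charset = 'GBK'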

With large text files, users are unlikely to read everything; knowing the format of the data is usually enough, so each file's preview shows only the first 100 rows. To make it easy to look at other files in the same directory, I added a selection box listing the files available for preview; choosing one updates the preview table immediately. The preview also refreshes immediately when the user changes the file encoding, the has-header option, or the column delimiter. The status bar shows which file is currently being previewed.
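A preview like the one described above only needs the first rows of the file, so the whole file never has to be loaded. The sketch below shows one way to do this with the standard csv module; the function name preview_rows, its parameters, and the demo file are assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile


def preview_rows(path, encoding="utf-8", delimiter=",", has_header=True, limit=100):
    """Return (header, rows) with at most `limit` data rows, reading lazily."""
    with open(path, "r", encoding=encoding, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, None) if has_header else None
        # islice stops after `limit` rows, so the rest of the file is never read.
        rows = list(itertools.islice(reader, limit))
    return header, rows


# Demo on a throwaway 1000-row GBK file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="gbk", newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "name"])
    for i in range(1000):
        writer.writerow([i, "用户%d" % i])
    demo_path = tmp.name

header, rows = preview_rows(demo_path, encoding="gbk")
os.remove(demo_path)
```

Re-running preview_rows with a different encoding, delimiter, or has_header value is cheap, which is what makes an immediately refreshing preview practical.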

Since data filtering is mostly text and numeric filtering, the filter condition settings offer only string, integer, and floating-point operations. The specific operations are also modeled on Excel, but users may want to filter data against a text file (for example, supplying a keyword list file and picking out the rows that contain the keywords in it), so for strings I added two operations, "included in (file)" and "not included in (file)", which let the user select a file. Some filters are awkward to build by picking options, such as matching a value against the first 2 characters of field 1 joined with the last 3 characters of field 2, or arithmetic across numeric fields. For these I also provide direct editing of the filter expression, so that users who know Python can set more complex conditions, such as: ((row["field1"][:2]+row["field2"][-3:])=="value") .
Once the filter conditions are set, the user can click the Filter Test button to run them against the file currently being previewed. The results are shown in place, in the preview table of the import settings, rather than in a pop-up window, so the user can see exactly which rows would be filtered out and can easily compare and verify that the filter expression is correct.
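The "included in (file)" and "not included in (file)" operations could be implemented along these lines. This sketch assumes one keyword per line and treats a row as a match when the field contains any keyword; the function names and the sample rows are hypothetical.

```python
import os
import tempfile


def load_keywords(path, encoding="utf-8"):
    """Read one keyword per line, skipping blank lines."""
    with open(path, "r", encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def contained_in_file(field, keywords, negate=False):
    """Build a predicate that is True when row[field] contains any keyword.

    With negate=True the predicate is True when no keyword occurs.
    """
    def predicate(row):
        hit = any(keyword in row[field] for keyword in keywords)
        return not hit if negate else hit

    return predicate


# Demo with a throwaway keyword file and three hypothetical rows.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("refund\ncomplaint\n\n")
    keyword_path = tmp.name

keywords = load_keywords(keyword_path)
os.remove(keyword_path)

rows = [
    {"id": "1", "text": "customer asked for a refund"},
    {"id": "2", "text": "delivery on time"},
    {"id": "3", "text": "formal complaint filed"},
]
included = [r["id"] for r in rows if contained_in_file("text", keywords)(r)]
excluded = [r["id"] for r in rows if contained_in_file("text", keywords, negate=True)(r)]
```

Loading the keyword file once and closing over the resulting set keeps the per-row work small; for very long keyword lists a dedicated multi-pattern matcher would likely be faster than the linear scan used here.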

Users do not care whether the result file is CSV or TXT, so for simplicity I provide text output only. On this screen users can choose which fields to output, as well as the output encoding and column delimiter. Because the tool runs on Windows, the encoding defaults to GBK, so that users do not drag the file into Excel only to see garbled text, which is not a good feeling.
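The output settings described above (chosen fields, encoding, column delimiter) map naturally onto csv.DictWriter. The following is a sketch under assumed names; write_filtered, its parameters, and the sample rows are not from the original tool.

```python
import csv
import os
import tempfile


def write_filtered(rows, path, fields, encoding="gbk", delimiter=","):
    """Write only `fields` of each row dict to `path`; return the number of rows written."""
    count = 0
    # errors="replace" writes "?" for characters the target encoding cannot represent.
    with open(path, "w", encoding=encoding, errors="replace", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter=delimiter,
                                extrasaction="ignore")  # drop unselected fields
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# Demo: keep two of three fields, tab-separated, GBK-encoded.
out_path = os.path.join(tempfile.mkdtemp(), "result.txt")
written = write_filtered(
    [{"id": "1", "name": "张三", "note": "a"},
     {"id": "2", "name": "李四", "note": "b"}],
    out_path, fields=["id", "name"], delimiter="\t",
)
with open(out_path, "r", encoding="gbk", newline="") as f:
    content = f.read()
```

Because rows can be any iterable, a generator of filtered rows can be passed in directly, so the result is streamed to disk without holding the whole file in memory.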

Friendly interface principle

As for a friendly interface, I think it mostly comes down to showing progress during processing. If users click Start Processing and see no progress for a long time, they are likely to assume the tool has hung, and may well close it. So every file needs real-time progress feedback.
Multithreading is the obvious approach: process the data in a worker thread and update the interface from the main thread. At first I used Python's own threads, but in testing the tool would crash outright, inexplicably, after running for a while. I added a lot of exception handling but caught nothing, and that night I stayed up until nearly 2 a.m. without solving it, which nearly drove me mad.
After some sleep my thinking was clearer. The first step was to determine whether the crash came from a latent bug in the filtering code or from a bug in the interface handling. I pulled all the actual filtering code out and ran it on its own, and it worked perfectly, so the problem had to be in the interface handling triggered by the Start Processing button. After another hour and a half I still could not find why the Qt interface crashed, so I gave up looking and decided to simply work around it. Since the filtering code ran fine on its own without the interface, I introduced a configuration file: the standalone filter execution module reads its settings from the configuration file at run time, while the Qt interface is only responsible for syncing the user's parameters into that file, then launching a separate process to run the filter module and reading that process's progress output through a pipe to update the interface. That turned out rather well: data processing and the interface are clearly separated, and both the stability of the data processing and the maintainability of the code improved.
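The process separation described above can be sketched without any GUI code. In this illustration the configuration file is JSON and the worker is a tiny generated script; both choices, and all the names used, are assumptions for the example rather than details of the original tool.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the standalone filter module: it reads its settings
# from a JSON config file and prints one progress line per processed file.
WORKER_SOURCE = r'''
import json, sys
with open(sys.argv[1], encoding="utf-8") as f:
    config = json.load(f)
total = len(config["files"])
for index, name in enumerate(config["files"], start=1):
    # flush=True pushes each line through the pipe immediately
    print("%d/%d %s" % (index, total, name), flush=True)
'''


def run_worker(config, on_progress):
    """Write the config to disk, launch the worker process, and relay its progress lines."""
    workdir = tempfile.mkdtemp()
    config_path = os.path.join(workdir, "config.json")
    script_path = os.path.join(workdir, "worker.py")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f)
    with open(script_path, "w", encoding="utf-8") as f:
        f.write(WORKER_SOURCE)

    proc = subprocess.Popen([sys.executable, script_path, config_path],
                            stdout=subprocess.PIPE)
    for raw in proc.stdout:  # yields each line as the worker emits it
        on_progress(raw.decode("utf-8").strip())
    proc.stdout.close()
    return proc.wait()


progress = []
exit_code = run_worker({"files": ["a.csv", "b.txt"]}, progress.append)
```

In a GUI, on_progress would be the place to emit a signal toward the main thread instead of appending to a list. A side benefit of this layout is that the worker can be run and debugged from the command line with nothing but a config file.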
No sooner said than done. After backing up the code (an important habit, because recovering from a mistake is otherwise painful), the first step was to make the execution module runnable directly from the command line. Along the way I ran into a relative-import problem; the fix is simple, but given the amount of code involved I will skip it here. It is also important to call print with flush=True when printing progress, to avoid buffering; otherwise the main Qt process cannot get progress data in real time when it reads from the pipe. Then I started another thread from the main interface, and in that thread's run method used subprocess.Popen(['python', scriptfile], stdout=subprocess.PIPE) to open a new process that executes the actual filter module. I was still using a Python thread for this, but this time testing finally surfaced an error message:
QObject::connect: Cannot queue arguments of type 'QVector<int>' (Make sure 'QVector<int>' is registered using qRegisterMetaType().)
Once the error was known, the fix was easy. A quick Google search turned up the answer: Qt widgets may only be accessed from the main thread! I am ashamed to say that I had never used multithreading with Qt before and so never paid attention to this, and now I finally paid for it. The solution is convenient: I subclass QThread, emit the progress update signal from its run method, and update the UI in the main thread. The code is as follows:

    import os
    import subprocess

    # PyQt5 is assumed here; with PyQt4 import from PyQt4.QtCore instead
    from PyQt5.QtCore import QThread, pyqtSignal


    class WorkThread(QThread):
        fileFinished = pyqtSignal(str)  # carries one line of progress output
        completed = pyqtSignal()        # emitted once the worker process ends

        def __init__(self, parent=None):
            super(WorkThread, self).__init__(parent)

        def run(self):
            # Use the command line to start the real filtering program
            # and communicate with it through a pipe
            scriptfile = os.path.join(os.path.dirname(__file__), 'core/__init__.py')
            popen = subprocess.Popen(['python', scriptfile], stdout=subprocess.PIPE)
            while True:
                try:
                    next_line = popen.stdout.readline()
                    try:
                        next_line = next_line.decode('utf8')
                    except UnicodeDecodeError:
                        next_line = next_line.decode('gbk')
                    # An empty read plus an exit code means the process is done
                    if next_line == '' and popen.poll() is not None:
                        break
                    self.fileFinished.emit(next_line)
                except Exception as ex:
                    print(str(ex))
                    popen.terminate()
                    break
            self.completed.emit()

The final progress update interface is as follows:

How to maintain optimization?

This tool is used by production and analysis staff, and there is no telling how they will use it or what data they will process, so I treat them as testers: in the background the tool quietly records their operations and any run-time exceptions. For the sake of mailbox security, the tool sends these background records to a designated platform, which then notifies me by email. Of course I also collect what kind of data was processed and how much, which is handy for performance reports, haha.

Wrapping up

Well, that is about it; any more showing off and I would be tempting a lightning strike. Having written this up, I found that reading articles is much easier than writing them.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
