Efficient Log Parsing in Python


There are generally three important steps when a Python script parses log files into a database: reading the file, parsing its contents, and inserting the data. Getting each of these three steps right ensures good performance (concurrency is not discussed here).

1. Reading the file: reading one line at a time with a separate read call means too much disk I/O and low efficiency, while reading the whole file at once may not fit in memory. The compromise is to read a chunk of some number of bytes at a time (the exact size depends on the actual situation).

After testing, the conclusion is that the code should be written like this:

f = open(path, 'r')
for line in f:
    ...

This iteration over the file object is the reading method the language itself provides, and it is generally faster than reading a fixed number of bytes each time (e.g. f.read(20000)). Exactly why is still unclear; in short, since the system provides it, it should not be bad, or that would be too embarrassing. Aha!
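If you want to verify this on your own data, a rough timing sketch like the following can be used (the log path "app.log" is a hypothetical stand-in):

import time

path = "app.log"  # hypothetical log file

# Time plain line iteration over the file object.
start = time.time()
with open(path, 'r') as f:
    for line in f:
        pass
print("line iteration: %.3f s" % (time.time() - start))

# Time fixed-size chunk reads for comparison.
start = time.time()
with open(path, 'r') as f:
    while True:
        if not f.read(20000):
            break
print("20000-byte chunks: %.3f s" % (time.time() - start))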

2. If you use regular expressions to parse the logs, compile the regular expression first and then search with the compiled object; this improves speed. For example:

import re

regex0 = re.compile(r"(^|;)mobile=(\d+)")
mobile_number = regex0.search(self.resp_log).group(2)

Of course, compiling is only the big picture; there is also work to do in the fine details of the regular expressions themselves. Efficient ways of writing Python regular expressions will be covered in a later article.
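As a rough illustration of the compile-once advice, a minimal timing sketch (the sample log line here is made up):

import re
import timeit

line = "ts=1;mobile=13800138000;status=ok"  # made-up sample line
pattern = r"(^|;)mobile=(\d+)"
regex0 = re.compile(pattern)

# Compiled once up front, reused on every call.
print(timeit.timeit(lambda: regex0.search(line), number=100000))

# re.search() consults re's internal pattern cache on every call,
# which adds a small per-call overhead.
print(timeit.timeit(lambda: re.search(pattern, line), number=100000))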

3. Inserting into the database: the advice commonly found online is to use executemany(), but splicing the SQL yourself into the multi-row form insert into tablename (xx, xx) values (yy, yy), (yy, yy)... is much faster than executemany(). Test how many rows to insert per statement in your own environment; 10,000 rows per second should be achievable.
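A minimal sketch of such a batched multi-row INSERT, assuming a MySQL database accessed through pymysql; the table name, columns, and sample rows here are made up for the example:

import pymysql

# Sample parsed rows; in practice these come from the parsing step.
rows = [("13800138000", "raw line 1"), ("13900139000", "raw line 2")]

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="logs")

# One statement with many value groups; the values themselves are
# still passed as parameters so the driver handles escaping.
placeholders = ",".join(["(%s, %s)"] * len(rows))
sql = "insert into log_records (mobile, raw_line) values " + placeholders
flat = [v for row in rows for v in row]

with conn.cursor() as cur:
    cur.execute(sql, flat)
conn.commit()
conn.close()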

======================================================================

With the above in place, each of the reading, parsing, and storing steps is optimal on its own, but there is still room for optimization in the overall structure, as follows:

1. Start a thread, readThread, that only reads the file and puts what it reads into queue Queue1;

2. Start a thread, manageThread, that only parses the file content taken from Queue1 and puts the parsed items into Queue2;

3. Start a third thread, writeDB, that stores the parsed content from Queue2 into the database;

4. Start a background thread that monitors, records, and handles the running status of the three threads above. A minimal sketch of this pipeline follows the list.
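Here is a minimal sketch of the three-stage pipeline, assuming Python 3; parse_line() and write_batch() are stubs standing in for the parsing and database code from steps 2 and 3, "app.log" is a hypothetical path, and the monitoring thread is omitted for brevity:

import queue
import threading

queue1 = queue.Queue(maxsize=10000)  # raw lines from readThread
queue2 = queue.Queue(maxsize=10000)  # parsed items from manageThread
SENTINEL = object()                  # marks the end of the stream

def parse_line(line):
    # Stub: real code would apply the compiled regex from step 2.
    return line.strip()

def write_batch(batch):
    # Stub: real code would run the multi-row INSERT from step 3.
    pass

def read_thread(path):
    with open(path, 'r') as f:
        for line in f:
            queue1.put(line)
    queue1.put(SENTINEL)

def manage_thread():
    while True:
        line = queue1.get()
        if line is SENTINEL:
            queue2.put(SENTINEL)
            break
        queue2.put(parse_line(line))

def write_db():
    batch = []
    while True:
        item = queue2.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) >= 1000:
            write_batch(batch)
            batch = []
    if batch:
        write_batch(batch)

threads = [
    threading.Thread(target=read_thread, args=("app.log",)),
    threading.Thread(target=manage_thread),
    threading.Thread(target=write_db),
]
for t in threads:
    t.start()
for t in threads:
    t.join()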
