Why Stackless? http://www.stackless.com
Stackless can be thought of simply as an enhanced version of Python, and its most eye-catching feature is without doubt the "micro-thread". A micro-thread is a lightweight thread: it consumes fewer resources than an OS thread, makes sharing data between threads easier, and yields code that is more concise and readable than multi-threaded code. Stackless is famously used by EVE Online and is indeed strong in concurrency and performance. Since it stays compatible with ordinary Python, you can even consider replacing your system's Python with it. :)
Why MongoDB? http://www.mongodb.org
The official website lists many popular applications using MongoDB, such as SourceForge and GitHub. What are its advantages over an RDBMS? First, speed and performance: the edge here is the most obvious. It works not only as a key-value store but also supports database-style queries (distinct, group, random, indexes, and more; a query sketch follows below). The other big feature is simplicity: installation, documentation, and the third-party APIs all get you going in minutes. The pity is that its data files are large, 2 to 4 times the raw data; the 2 GB Apache log tested in this article produced 6 GB of data files. Ouch... I hope newer versions slim this down. Of course, this is the obvious cost of trading space for speed.
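As a rough illustration of those query features (my sketch, not from the original article; it assumes the apache database and logs collection created later in this post, and the pymongo 1.x-era API used below):

from pymongo import Connection

db = Connection().apache                   # default host and port
db.logs.create_index('ip')                 # secondary index on a field
top_ips = db.logs.distinct('ip')           # distinct values of a field
by_status = db.logs.group(                 # group: count requests per status code
    ['status'], None, {'count': 0},
    'function(doc, out) { out.count += 1; }')
print len(top_ips), by_status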
Apart from the two pieces of software above, you also need to install the pymongo module: http://api.mongodb.org/python/
The module can be installed either by compiling from source or via easy_install (e.g. easy_install pymongo).
- Work out which fields to save from the Apache logs, such as the IP address, time, GET/POST method, and return status code.
import re

# NOTE: the named groups below (ip, time, method, path, status, size,
# referer, agent) are reconstructed; any consistent names work, since
# fmt_name is read back out of the pattern itself.
fmt_str = '(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] "(?P<method>.*?) (?P<path>.*?) HTTP/1.\d" (?P<status>\d+) (?P<size>.*?) "(?P<referer>.*?)" "(?P<agent>.*?)"'
fmt_name = re.findall('\?P<(.*?)>', fmt_str)
fmt_re = re.compile(fmt_str)
A regular expression extracts the fields of each log line; fmt_name collects the group names out of the pattern (with the names above: ['ip', 'time', 'method', 'path', 'status', 'size', 'referer', 'agent']).
- Define the MongoDB-related variables, i.e. the database and the collection the documents are stored in. Connection() uses the default host and port.
from pymongo import Connection

conn = Connection()
apache = conn.apache
logs = apache.logs
- Save log lines
def make_line(line):
    m = fmt_re.search(line)
    if m:
        logs.insert(dict(zip(fmt_name, m.groups())))
- Read the Apache log file
def make_log(log_path):
    with open(log_path) as fp:
        for line in fp:
            make_line(line.strip())
- Run.
if __name__ == '__main__':
    make_log('d:/apachelog.txt')
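Once the import finishes, a quick sanity check might look like this (illustrative; note the regex captures everything as strings, so the status code is stored as '404', not 404):

print logs.count()                           # documents inserted
print logs.find_one()                        # one parsed line as a document
print logs.find({'status': '404'}).count()   # requests that returned 404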
That is roughly the whole script; the Stackless part is not shown above. For reference, this is how tasklets are used:
import stackless

def print_x(x):
    print x

stackless.tasklet(print_x)('one')
stackless.tasklet(print_x)('two')
stackless.run()
stackless.tasklet() only queues the callables; stackless.run() is what actually executes them. Here this replaces the original threading-based code so that multiple logs can be analyzed in parallel, as sketched below.
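As an illustration (my sketch, not the original author's code), the threading replacement could be as simple as one tasklet per log file. Keep in mind tasklets are cooperatively scheduled: for the files to truly interleave, make_line would need to call stackless.schedule() every so often.

import stackless

for path in ('d:/access1.log', 'd:/access2.log'):   # hypothetical file names
    stackless.tasklet(make_log)(path)                # queue one tasklet per file
stackless.run()                                      # run them all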
Supplement:
The Apache log is 2 GB, about 6.71 million lines; the generated database is 6 GB.
Hardware: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93 GHz desktop
System: RHEL 5.2, ext3 file system
Others: Stackless 2.6.4, MongoDB 1.2
Everything runs normally up to about 3 million entries: CPU, memory, and insertion speed all look fine, perhaps 8,000 entries per second, basically matching my earlier test on the notebook. Beyond that, memory consumption soars and the insertion speed drops. At around 5 million entries the CPU sits at about 40% and memory usage reaches 2.1 GB. Speed and efficiency seem to recover once the second 2 GB data file has been generated, but the end result of the import is still not satisfactory.
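One lever the post does not try: pymongo's insert() also accepts a list of documents, so buffering and inserting in batches cuts round trips to the server. A sketch (the batch size is arbitrary):

batch = []

def make_line_batched(line):
    m = fmt_re.search(line)
    if m:
        batch.append(dict(zip(fmt_name, m.groups())))
        if len(batch) >= 1000:   # flush every 1,000 documents
            logs.insert(batch)
            del batch[:]
# remember to insert whatever is left in batch after the file loop ends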
I also tested 10 million rows on the notebook, and it was noticeably faster than the 6.71 million run above. Two initial guesses as to what affects performance and speed:
- Filesystem differences. The notebook runs Ubuntu 9.10 with ext4; the differences between ext3 and ext4 when reading and writing large files are worth looking into.
- Regular-expression matching. Each line is parsed with a regex match; on a file this large there should be room for optimization, as sketched below.
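For instance (purely illustrative, not from the original), the combined log format is positional enough that str.split and str.partition can often replace the regex on the hot path, falling back to fmt_re for lines that do not split cleanly:

def parse_line_fast(line):
    # '1.2.3.4 - - [time] "METHOD path HTTP/1.x" status size "ref" "agent"'
    ip, _, _, rest = line.split(' ', 3)
    time_part, _, rest = rest[1:].partition('] "')
    request, _, rest = rest.partition('" ')
    status, size = rest.split(' ', 2)[:2]
    return {'ip': ip, 'time': time_part, 'request': request,
            'status': status, 'size': size}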
Original article: Python (Stackless) + MongoDB Apache log (2 GB) analysis. Thanks to the original author for sharing.