Python (Stackless) + MongoDB: Apache Log (2 GB) Analysis

Tags: apache log

Why Stackless? http://www.stackless.com

Stackless can be thought of as an enhanced version of Python, and its most eye-catching feature is none other than the "micro-thread". A micro-thread is a lightweight thread: it consumes fewer resources than an OS thread, makes sharing data between tasks easier, and yields code that is more concise and readable than multi-threaded code. The project was driven by EVE Online and is genuinely strong in concurrency and performance. Since it is compatible with Python, you can even consider it as a drop-in replacement for the stock interpreter. :)

Why MongoDB? http://www.mongodb.org

The official website lists many popular applications using MongoDB, such as SourceForge and GitHub. What advantages does it have over an RDBMS? First, the edge in speed and performance is the most obvious: it can be used not only as a key-value store but also supports a number of database-style queries (distinct, group, random, indexes, and so on). Another selling point is simplicity: the application, the documentation, and the third-party APIs can all be picked up very quickly. The pity is that its data files on disk are large, typically 2-4 times the raw data. The Apache log tested in this article is 2 GB, and the data files produced come to 6 GB. Ouch... I hope newer versions can slim this down. Of course, this is the obvious consequence of trading space for speed.

Besides the two pieces of software above, you also need to install the pymongo module: http://api.mongodb.org/python/

The module can be installed either by compiling from source or via easy_install.

  1. Extract the information to be saved from the Apache logs, such as the IP address, time, GET/POST method, and the returned status code.

import re

# NOTE: the named groups were stripped when this page was rendered; the field
# names below (ip, time, method, path, status, size, referer, ua) are a
# reconstruction matching the fields described above.
fmt_str  = r'(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] "(?P<method>.*?) (?P<path>.*?) HTTP/1.\d" (?P<status>\d+) (?P<size>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"'
fmt_name = re.findall(r'\?P<(.*?)>', fmt_str)
fmt_re   = re.compile(fmt_str)

This regular expression extracts the fields from each log line; fmt_name collects the group names from inside the angle brackets.
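As a quick self-contained check of this approach, the sketch below applies a regex of the same shape to a made-up combined-format log line (the group names ip, time, method, path, status, size, referer, and ua are my reconstruction, since the originals were lost in the page rendering):

```python
import re

# Group names are reconstructed/hypothetical.
fmt_str = (r'(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] '
           r'"(?P<method>.*?) (?P<path>.*?) HTTP/1.\d" '
           r'(?P<status>\d+) (?P<size>.*?) '
           r'"(?P<referer>.*?)" "(?P<ua>.*?)"')
fmt_name = re.findall(r'\?P<(.*?)>', fmt_str)  # ['ip', 'time', ...]
fmt_re = re.compile(fmt_str)

# A made-up log line in Apache "combined" format.
line = ('10.0.0.1 - - [10/Oct/2009:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/" "Mozilla/4.0"')

m = fmt_re.search(line)
record = dict(zip(fmt_name, m.groups()))
print(record['ip'], record['status'], record['path'])
# → 10.0.0.1 200 /index.html
```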

  2. Define the MongoDB-related variables, including the collection the records will be stored in. Connection() uses the default host and port.

from pymongo import Connection  # pymongo 1.x API of the era; newer versions use MongoClient

conn     = Connection()
apache   = conn.apache
logs     = apache.logs

  3. Save each log line.

def make_line(line):
    m = fmt_re.search(line)
    if m:
        logs.insert(dict(zip(fmt_name, m.groups())))
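The dict(zip(...)) call pairs the group names with the captured values, one record per line. In isolation (the names and values here are made up for illustration):

```python
# Hypothetical field names and matched values.
names = ['ip', 'status']
groups = ('10.0.0.1', '200')
record = dict(zip(names, groups))
print(record)  # → {'ip': '10.0.0.1', 'status': '200'}
```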

  4. Read the Apache log file.

def make_log(log_path):
    with open(log_path) as fp:
        for line in fp:
            make_line(line.strip())

  5. Run.

if __name__ == '__main__':
    make_log('d:/apachelog.txt')
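Putting the steps together, here is a self-contained sketch of the same pipeline that runs without MongoDB or Stackless: an in-memory list stands in for the logs collection, and a tiny sample log file is generated on the fly (the field names and sample data are made up for illustration):

```python
import os
import re
import tempfile

fmt_str = (r'(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] '
           r'"(?P<method>.*?) (?P<path>.*?) HTTP/1.\d" '
           r'(?P<status>\d+) (?P<size>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')
fmt_name = re.findall(r'\?P<(.*?)>', fmt_str)
fmt_re = re.compile(fmt_str)

logs = []  # stand-in for the MongoDB collection

def make_line(line):
    m = fmt_re.search(line)
    if m:
        logs.append(dict(zip(fmt_name, m.groups())))

def make_log(log_path):
    with open(log_path) as fp:
        for line in fp:
            make_line(line.strip())

# Generate a tiny sample log file: one valid line, one junk line.
sample = ('1.2.3.4 - - [10/Oct/2009:13:55:36 -0700] '
          '"GET /a.png HTTP/1.1" 200 2326 "-" "Mozilla/4.0"\n'
          'not a log line\n')
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as fp:
    fp.write(sample)

make_log(path)
os.remove(path)
print(len(logs), logs[0]['status'])  # → 1 200
```

Swapping the list back for the real collection only changes the logs.append(...) call to logs.insert(...).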

That is roughly the whole script. Some Stackless-specific code is not shown here; it follows the pattern below:

import stackless

def print_x(x):
    print x

stackless.tasklet(print_x)('one')
stackless.tasklet(print_x)('two')
stackless.run()

tasklet() only queues the operations; run() is what actually executes them. Here this is used to replace the original threading-based approach for analyzing multiple logs in parallel.

Supplement:

The Apache log is 2 GB, about 6.71 million lines; the resulting database is 6 GB.

Hardware: Intel(R) Core(TM) 2 Duo CPU E7500 @ 2.93 GHz desktop

System: RHEL 5.2, ext3 filesystem

Others: Stackless 2.6.4, MongoDB 1.2

Everything works normally up to about the first 3 million records: CPU, memory, and insertion speed are all good, perhaps 8-entries/second, results that basically match the earlier run on my notebook. Beyond that, memory consumption soars and the insertion speed drops; at around 5 million records, CPU usage reaches about 40% and memory consumption 2.1 GB. The slowdown seems to set in when the second 2 GB data file is generated. The final result is not satisfactory.

I re-tested with 10 million records on my notebook, and it was much faster than the 6.71 million-line run above. Two initial guesses at what affects performance and speed:

  1. Filesystem differences. The notebook runs Ubuntu 9.10 with ext4. It is worth looking into how ext3 and ext4 differ when reading and writing large files.

  2. Regular-expression matching. Each line is parsed with a regex match; on large files there should be room for optimization.
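On the second point, one cheap optimization to try is replacing the regex with positional str.split() for the leading fields, which sit at fixed token positions in the combined format (the pattern and sample line here are illustrative; benchmark both with timeit on real data before committing):

```python
import re

line = ('1.2.3.4 - - [10/Oct/2009:13:55:36 -0700] '
        '"GET /a.png HTTP/1.1" 200 2326 "-" "Mozilla/4.0"')

pat = re.compile(r'([.\d]+) - - \[(.*?)\] "(\w+) (\S+) HTTP/1.\d" (\d+)')

def parse_re(s):
    # Regex version: ip, time, method, path, status.
    m = pat.search(s)
    return m.groups() if m else None

def parse_split(s):
    # Split version: the leading fields are whitespace-stable, so fixed
    # token positions work; the quoted trailing fields (referer, user
    # agent) would need more care because they may contain spaces.
    p = s.split()
    return (p[0], p[3][1:] + ' ' + p[4][:-1], p[5][1:], p[6], p[8])

print(parse_re(line) == parse_split(line))  # → True
```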

Original article: "Python (Stackless) + MongoDB Apache Log (2 GB) Analysis"; thanks to the original author for sharing.
