Why Stackless? http://www.stackless.com
Stackless can be thought of simply as an enhanced version of Python, and its most eye-catching feature is without doubt the "micro-thread". A micro-thread is a lightweight thread: it consumes fewer resources than an OS thread, makes sharing data between threads easier, and yields code that is more concise and readable than multi-threaded code. Stackless is famously used by EVE Online and is indeed strong in concurrency and performance. Since it stays compatible with ordinary Python, you can even consider replacing your system's Python with it. :)
Why MongoDB? http://www.mongodb.org
The official website lists many popular applications using MongoDB, such as SourceForge and GitHub. What are its advantages over an RDBMS? First, speed and performance: the edge here is the most obvious. It works not only as a key-value store but also supports database-style queries (distinct, group, random, indexes, and more; a query sketch follows below). The other big feature is simplicity: installation, documentation, and the third-party APIs all get you going in minutes. The pity is that its data files are large, 2 to 4 times the raw data; the 2 GB Apache log tested in this article produced 6 GB of data files. Ouch... I hope newer versions slim this down. Of course, this is the obvious cost of trading space for speed.
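As a rough illustration of those query features (my sketch, not from the original article; it assumes the apache database and logs collection created later in this post, and the pymongo 1.x-era API used below):

from pymongo import Connection

db = Connection().apache                   # default host and port
db.logs.create_index('ip')                 # secondary index on a field
top_ips = db.logs.distinct('ip')           # distinct values of a field
by_status = db.logs.group(                 # group: count requests per status code
    ['status'], None, {'count': 0},
    'function(doc, out) { out.count += 1; }')
print len(top_ips), by_status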
Apart from the two pieces of software above, you also need to install the pymongo module: http://api.mongodb.org/python/
The module can be installed either by compiling from source or via easy_install (e.g. easy_install pymongo).
- Work out which fields to save from the Apache logs, such as the IP address, time, GET/POST method, and return status code.
import re

# NOTE: the named groups below (ip, time, method, path, status, size,
# referer, agent) are reconstructed; any consistent names work, since
# fmt_name is read back out of the pattern itself.
fmt_str = '(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] "(?P<method>.*?) (?P<path>.*?) HTTP/1.\d" (?P<status>\d+) (?P<size>.*?) "(?P<referer>.*?)" "(?P<agent>.*?)"'
fmt_name = re.findall('\?P<(.*?)>', fmt_str)
fmt_re = re.compile(fmt_str)
A regular expression extracts the fields of each log line; fmt_name collects the group names out of the pattern (with the names above: ['ip', 'time', 'method', 'path', 'status', 'size', 'referer', 'agent']).
- Define the MongoDB-related variables, i.e. the database and the collection the documents are stored in. Connection() uses the default host and port.
from pymongo import Connection

conn = Connection()
apache = conn.apache
logs = apache.logs
- Save log lines
def make_line(line):
    m = fmt_re.search(line)
    if m:
        logs.insert(dict(zip(fmt_name, m.groups())))
- Read the Apache log file
def make_log(log_path):
    with open(log_path) as fp:
        for line in fp:
            make_line(line.strip())
- Run.
if __name__ == '__main__':
    make_log('d:/apachelog.txt')
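Once the import finishes, a quick sanity check might look like this (illustrative; note the regex captures everything as strings, so the status code is stored as '404', not 404):

print logs.count()                           # documents inserted
print logs.find_one()                        # one parsed line as a document
print logs.find({'status': '404'}).count()   # requests that returned 404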
That is roughly the whole script; the Stackless part is not shown above. For reference, this is how tasklets are used:
import stackless

def print_x(x):
    print x

stackless.tasklet(print_x)('one')
stackless.tasklet(print_x)('two')
stackless.run()
stackless.tasklet() only queues the callables; stackless.run() is what actually executes them. Here this replaces the original threading-based code so that multiple logs can be analyzed in parallel, as sketched below.
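As an illustration (my sketch, not the original author's code), the threading replacement could be as simple as one tasklet per log file. Keep in mind tasklets are cooperatively scheduled: for the files to truly interleave, make_line would need to call stackless.schedule() every so often.

import stackless

for path in ('d:/access1.log', 'd:/access2.log'):   # hypothetical file names
    stackless.tasklet(make_log)(path)                # queue one tasklet per file
stackless.run()                                      # run them all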
Supplement:
The Apache log is 2 GB, about 6.71 million lines; the generated database is 6 GB.
Hardware: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93 GHz desktop
System: RHEL 5.2, ext3 file system
Others: Stackless 2.6.4, MongoDB 1.2
Everything runs normally up to about 3 million entries: CPU, memory, and insertion speed all look fine, perhaps 8,000 entries per second, basically matching my earlier test on the notebook. Beyond that, memory consumption soars and the insertion speed drops. At around 5 million entries the CPU sits at about 40% and memory usage reaches 2.1 GB. Speed and efficiency seem to recover once the second 2 GB data file has been generated, but the end result of the import is still not satisfactory.
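One lever the post does not try: pymongo's insert() also accepts a list of documents, so buffering and inserting in batches cuts round trips to the server. A sketch (the batch size is arbitrary):

batch = []

def make_line_batched(line):
    m = fmt_re.search(line)
    if m:
        batch.append(dict(zip(fmt_name, m.groups())))
        if len(batch) >= 1000:   # flush every 1,000 documents
            logs.insert(batch)
            del batch[:]
# remember to insert whatever is left in batch after the file loop ends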
I also tested 10 million rows on the notebook, and it was noticeably faster than the 6.71 million run above. Two initial guesses as to what affects performance and speed:
- Filesystem differences. The notebook runs Ubuntu 9.10 with ext4; the differences between ext3 and ext4 when reading and writing large files are worth looking into.
- Regular-expression matching. Each line is parsed with a regex match; on a file this large there should be room for optimization, as sketched below.
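For instance (purely illustrative, not from the original), the combined log format is positional enough that str.split and str.partition can often replace the regex on the hot path, falling back to fmt_re for lines that do not split cleanly:

def parse_line_fast(line):
    # '1.2.3.4 - - [time] "METHOD path HTTP/1.x" status size "ref" "agent"'
    ip, _, _, rest = line.split(' ', 3)
    time_part, _, rest = rest[1:].partition('] "')
    request, _, rest = rest.partition('" ')
    status, size = rest.split(' ', 2)[:2]
    return {'ip': ip, 'time': time_part, 'request': request,
            'status': status, 'size': size}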
Original article: Python (Stackless) + MongoDB Apache log (2 GB) analysis. Thanks to the original author for sharing.