Comparison of text processing performance between awk, a grep/sed/awk combination, and Perl

Source: Internet
Author: User




1. Introduction

The task is to retrieve the various types of data from a raw data file, remove the header from each record, and write the records to different files named after their data type. This kind of sorting and formatting is a basic part of data analysis and processing. It can be implemented with awk alone, with a grep/sed/awk combination, or with Perl.




To compare the performance of these three approaches, we tested them on the same data: we wrote a program in each of the three ways to implement the same function, then compared the running times.

The tools and operating system versions we used are as follows:
grep: grep (GNU grep) 2.5.1
sed: GNU sed version 4.1.2
awk: GNU Awk 3.1.3
Perl: This is perl, v5.8.5 built for i386-linux-thread-multi
Linux: Linux version 2.6.9-42.ELsmp (bhcompile@hs20-bc1-1.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Wed Jul 12 23:27:17 EDT 2006
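
The article does not show how the runs were timed. The following is a minimal sketch of one way to produce "Start... t = (...)" and "End. t = (...)" lines in the format that appears in the running results. The function name timed_run is hypothetical, and the sleep command only stands in for the real program.

```shell
#!/bin/sh
# Print a timestamp, run the given command, then print a second timestamp.
# The output format mimics the "Start..." / "End." lines in the results.
timed_run() {
    echo "Start... t = ($(date +%H:%M:%S))"
    "$@"
    echo "End. t = ($(date +%H:%M:%S))"
}

# Demo: "sleep 1" is a placeholder for the awk, grep/sed/awk, or Perl program.
timed_run sleep 1
```

Wall-clock timestamps like these include everything else the machine was doing during the run, so they are only a rough measure.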

2. Algorithm Overview

Of the three methods, the awk and Perl programs each scan the input files once, while the grep/sed/awk combination scans them multiple times.

The awk program relies on awk's built-in field splitting and uses the built-in function gsub to remove the header.
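
The original program and record layout are not shown, so the following is only a sketch of the single-pass awk approach under an assumed layout: each line starts with a header field of the form DT=<type>, followed by tab-separated data fields. The function name split_awk and the <type>.txt output names are hypothetical.

```shell
#!/bin/sh
# Assumed (hypothetical) layout: "DT=<type><TAB>field1<TAB>field2..."
split_awk() {
    # $1 = input file, $2 = output directory
    awk -F'\t' -v outdir="$2" '
    {
        dtype = $1
        gsub(/^DT=/, "", dtype)          # data type taken from the header field
        line = $0
        gsub(/^[^\t]*\t/, "", line)      # remove the header, keep the data fields
        print line > (outdir "/" dtype ".txt")
    }' "$1"
}

# Demo on a three-line sample
workdir=$(mktemp -d)
printf 'DT=cpu\t10\t20\nDT=mem\t512\t1024\nDT=cpu\t30\t40\n' > "$workdir/raw.txt"
split_awk "$workdir/raw.txt" "$workdir"
cat "$workdir/cpu.txt"
rm -rf "$workdir"
```

Every record is read once, and awk keeps one output file open per data type.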

In the grep/sed/awk approach, each tool does the job it is best at: grep performs the string search, sed performs the text replacement, and awk performs the field splitting and output.
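
As a rough sketch of this division of labor, the pipeline below makes one pass over the input per data type. It assumes a hypothetical layout in which each line starts with a DT=<type> header field followed by tab-separated data, and it assumes the list of data types is known in advance. The function name split_pipeline is illustrative, not the author's.

```shell
#!/bin/sh
# Assumed (hypothetical) layout: "DT=<type><TAB>field1<TAB>field2..."
split_pipeline() {
    # $1 = input file, $2 = output directory, remaining arguments = data types
    input=$1
    outdir=$2
    shift 2
    tab=$(printf '\t')
    for dtype in "$@"; do
        # grep: select the records of this type (one full scan per type)
        # sed:  remove the header from each selected record
        # awk:  split on tabs and write the fields back out
        grep "^DT=${dtype}${tab}" "$input" |
            sed "s/^DT=${dtype}${tab}//" |
            awk -F'\t' -v OFS='\t' '{ $1 = $1; print }' > "$outdir/$dtype.txt"
    done
}

# Demo on a three-line sample
workdir=$(mktemp -d)
printf 'DT=cpu\t10\t20\nDT=mem\t512\t1024\nDT=cpu\t30\t40\n' > "$workdir/raw.txt"
split_pipeline "$workdir/raw.txt" "$workdir" cpu mem
cat "$workdir/cpu.txt"
rm -rf "$workdir"
```

The type names are used directly as regular expressions here, so this sketch only suits type names without special characters.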

The Perl program uses the split function to cut each record in two steps, once to separate the tab-delimited fields and once to separate the header from the data, and then writes the data to different files according to the DT header field.

3. Running results

Method        Running result                                  Time
awk           Start... t = (16:37:19)   End. t = (19:01:53)   144m34s
grep/sed/awk  Start... t = (21:13:56)   End. t = (22:01:47)   47m51s
Perl          Start... t = (11:34:01)   End. t = (12:21:50)   47m49s
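
The elapsed times can be checked from the start and end timestamps. This is a small sketch with hypothetical helper names (to_seconds, elapsed), assuming both timestamps fall on the same day.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds. awk reads "08" as decimal 8,
# whereas shell $(( )) arithmetic would reject it as invalid octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed time between two same-day timestamps, printed as <minutes>m<seconds>s.
elapsed() {
    start=$(to_seconds "$1")
    end=$(to_seconds "$2")
    diff=$((end - start))
    echo "$((diff / 60))m$((diff % 60))s"
}

elapsed 16:37:19 19:01:53   # awk          -> 144m34s
elapsed 21:13:56 22:01:47   # grep/sed/awk -> 47m51s
elapsed 11:34:01 12:21:50   # Perl         -> 47m49s
```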

4. Conclusion

For this text-processing task, the grep/sed/awk combination and the Perl program take almost the same time to run (47m51s and 47m49s), whereas the pure awk program takes about 3 times as long (144m34s).
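
The "about 3 times" figure follows from the measured times once they are converted to seconds (144m34s = 8674 s, 47m51s = 2871 s, 47m49s = 2869 s). The helper name ratio below is hypothetical.

```shell
#!/bin/sh
# Ratio of two running times given in seconds, to two decimal places.
ratio() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

ratio 8674 2871   # awk vs. grep/sed/awk -> 3.02
ratio 8674 2869   # awk vs. Perl         -> 3.02
```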

 
