An ops engineer's comparison of search-tool efficiency: grep, ack, ag (with figures)

Source: Internet
Author: User
Tags: grep, ack, ag, perl, egrep

Objective

I often see programmers using ack, or even ag (the_silver_searcher), for code search, while 95% of my own work uses grep and the rest ag. I think this topic deserves a proper discussion. I used to work in operations, and I too wanted the best and fastest tool for every part of my job. But I was curious why ag and ack aren't built into Linux distributions, while grep always is. My first guess was licensing constraints under the various open source agreements, or the preferences of whoever maintains the distribution. So I ran an experiment and looked into what these tools really are. Back then the test was nothing more than timing a few searches over real production logs. My conclusion at the time: grep is fine most of the time; reach for ag when the log is very large.

ack's original domain was betterthangrep.com; it is now beyondgrep.com. Fair enough. I understand the people who use ack, and I understand the reasons behind ack. There's a story here.

When I started in operations I used shell, often for log-analysis work. I would write fairly complex shell code to meet specific needs. Then a colleague who knew Perl came along. A task I had solved with 20-odd lines of shell that ran for about 5 minutes, he rewrote in 4 lines of Perl that ran in a minute. We were stunned. From then on I felt the need to learn Perl, and later Python.

Perl is a language born for text parsing, and ack really is efficient. I suspect that's why people find it quicker and more fitting. In truth, it all depends on the scenario. So why do I keep comparing against "plain old" grep? Read this article; I hope it gives you some insight.




Experimental conditions

PS: To be clear, this experiment reflects my personal practice, and I have tried to keep it fair. If you see something you disagree with, try other angles and discuss them with me.





I used one of my company's development machines (Gentoo).

I tested two kinds of text, pure English and Chinese. For Chinese I used the jieba ("stutter") segmentation dictionary; for English, the word list provided by GNU miscfiles.





# On Ubuntu: sudo apt-get install miscfiles

wget https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
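For anyone curious about the file just downloaded: each line of dict.txt.big has the form "word frequency POS-tag", and the generator script below takes only the first column. A tiny parsing sketch (the sample lines are invented, but follow that format):

```python
# -*- coding: utf-8 -*-
# Each jieba dictionary line: "word frequency POS-tag" (sample values invented)
lines = ['中国 123 ns\n', '你好 45 l\n']

# The same first-column extraction the generator script uses: line.split()[0]
words = [line.split()[0] for line in lines]
print(words)
```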





Preparation before the experiment

I generated two sets of files, English and Chinese, at sizes of 1MB, 10MB, 100MB, 500MB, 1GB, and 5GB. I stopped there because in real business a single log file rarely gets bigger, so there's no need to test beyond that (and the trend is visible in the results below anyway).





cat make_words.py

# coding=utf-8
import os
import random
from cStringIO import StringIO

en_word_file = '/usr/share/dict/words'
cn_word_file = 'dict.txt.big'
with open(en_word_file) as f:
    en_data = f.readlines()
with open(cn_word_file) as f:
    cn_data = f.readlines()
MB = pow(1024, 2)
size_list = [1, 10, 100, 500, 1024, 1024 * 5]
en_result_format = 'text_{0}_en_mb.txt'
cn_result_format = 'text_{0}_cn_mb.txt'


def write_data(f, size, data, cn=False):
    total_size = 0
    while 1:
        s = StringIO()
        for x in range(10000):
            cho = random.choice(data)
            cho = cho.split()[0] if cn else cho.strip()
            s.write(cho)
        s.seek(0, os.SEEK_END)
        total_size += s.tell()
        contents = s.getvalue()
        f.write(contents + '\n')
        if total_size > size:
            break
    f.close()


for index, size in enumerate([
        MB,
        MB * 10,
        MB * 100,
        MB * 500,
        MB * 1024,
        MB * 1024 * 5]):
    size_name = size_list[index]
    en_f = open(en_result_format.format(size_name), 'a+')
    cn_f = open(cn_result_format.format(size_name), 'a+')
    write_data(en_f, size, en_data)
    write_data(cn_f, size, cn_data, True)





Well, that's inefficient, right? I have no VPS of my own, and I can't just max out every core of a company server's CPU (I wasn't in ops all those years for nothing). If you don't mind htop lighting all the cores red, you can parallelize it; the total time then becomes the time of the slowest file to generate:





# coding=utf-8
import os
import random
import multiprocessing
from cStringIO import StringIO

en_word_file = '/usr/share/dict/words'
cn_word_file = 'dict.txt.big'
with open(en_word_file) as f:
    en_data = f.readlines()
with open(cn_word_file) as f:
    cn_data = f.readlines()
MB = pow(1024, 2)
size_list = [1, 10, 100, 500, 1024, 1024 * 5]
en_result_format = 'text_{0}_en_mb.txt'
cn_result_format = 'text_{0}_cn_mb.txt'

inputs = []


def map_func(args):
    def write_data(f, size, data, cn=False):
        f = open(f, 'a+')
        total_size = 0
        while 1:
            s = StringIO()
            for x in range(10000):
                cho = random.choice(data)
                cho = cho.split()[0] if cn else cho.strip()
                s.write(cho)
            s.seek(0, os.SEEK_END)
            total_size += s.tell()
            contents = s.getvalue()
            f.write(contents + '\n')
            if total_size > size:
                break
        f.close()

    _f, size, data, cn = args
    write_data(_f, size, data, cn)


for index, size in enumerate([
        MB,
        MB * 10,
        MB * 100,
        MB * 500,
        MB * 1024,
        MB * 1024 * 5]):
    size_name = size_list[index]
    inputs.append((en_result_format.format(size_name), size, en_data, False))
    inputs.append((cn_result_format.format(size_name), size, cn_data, True))

pool = multiprocessing.Pool()
pool.map(map_func, inputs, chunksize=1)





After waiting a while, the directory looks like this:





$ ls -lh
total 14G
-rw-rw-r-- 1 vagrant vagrant 2.2K Mar 05:25 benchmarks.ipynb
-rw-rw-r-- 1 vagrant vagrant 8.2M Mar 15:43 dict.txt.big
-rw-rw-r-- 1 vagrant vagrant 1.2K Mar 15:46 make_words.py
-rw-rw-r-- 1 vagrant vagrant 101M Mar 15:47 text_100_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant 101M Mar 15:47 text_100_en_mb.txt
-rw-rw-r-- 1 vagrant vagrant 1.1G Mar 15:54 text_1024_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant 1.1G Mar 15:51 text_1024_en_mb.txt
-rw-rw-r-- 1 vagrant vagrant  11M Mar 15:47 text_10_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant  11M Mar 15:47 text_10_en_mb.txt
-rw-rw-r-- 1 vagrant vagrant 1.1M Mar 15:47 text_1_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant 1.1M Mar 15:47 text_1_en_mb.txt
-rw-rw-r-- 1 vagrant vagrant 501M Mar 15:49 text_500_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant 501M Mar 15:48 text_500_en_mb.txt
-rw-rw-r-- 1 vagrant vagrant 5.1G Mar 16:16 text_5120_cn_mb.txt
-rw-rw-r-- 1 vagrant vagrant 5.1G Mar 16:04 text_5120_en_mb.txt





Confirm versions

➜  test  ack --version  # ack is called 'ack-grep' under Ubuntu
ack 2.12
Running under Perl 5.16.3 at /usr/bin/perl

Copyright 2005-2013 Andy Lester.

This is free software.  You may modify or distribute it
under the terms of the Artistic License v2.0.
➜  test  ag --version
ag version 0.21.0
➜  test  grep --version
grep (GNU grep) 2.14
Copyright (C) Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.





Experimental design

To keep parallel runs from interfering with each other, I chose slow but safe synchronous execution, timed with the %timeit magic that IPython provides. On to the code:





import re
import glob
import subprocess
import cPickle as pickle
from collections import defaultdict

IMAP = {
    'cn': (u'豆瓣', u'小明明'),
    'en': ('Four', 'Python')
}
OPTIONS = ('', '-i', '-v')
FILES = glob.glob('text_*_mb.txt')
en_res = defaultdict(dict)
cn_res = defaultdict(dict)
RES = {
    'en': en_res,
    'cn': cn_res
}
REGEX = re.compile(r'text_(\d+)_(\w+)_mb.txt')
call_str = '{command} {option} {word} {filename} > /dev/null 2>&1'

for filename in FILES:
    size, xn = REGEX.search(filename).groups()
    for word in IMAP[xn]:
        _r = defaultdict(dict)
        for command in ['grep', 'ack', 'ag']:
            for option in OPTIONS:
                rs = %timeit -o -n 10 subprocess.call(call_str.format(command=command, option=option, word=word, filename=filename), shell=True)
                best = rs.best
                _r[command][option] = best
        RES[xn][word][size] = _r

# Save it.

data = pickle.dumps(RES)

with open('result.db', 'w') as f:
    f.write(data)





A friendly reminder: this is an extremely time-consuming test. Once it starts, you can go drink tea for a good long while...





I'm back from my trip to Qinhuangdao (the holiday break), so let's continue our experiment.


What I want to test

In everyday work I generally search with no options, with -i (ignore case), or with -v (show non-matching lines). So the tests are:

English search / Chinese search
2 search terms per language (common, slow-to-search words that are likely to be hit)
Each of the three option sets '' / '-i' / '-v', executed separately
%timeit runs each condition 10 times and keeps the most efficient result
Each figure shows one search term: the 3 search commands with one option, comparing efficiency across file sizes
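As a side note, the same best-of-N timing can be sketched in plain Python without IPython, for anyone reproducing the harness outside a notebook. This is only a sketch; the file contents and search word below are stand-ins, not the benchmark data:

```python
import os
import subprocess
import tempfile
import timeit


def best_of(command, n=10):
    """Run a shell command n times; return the fastest wall-clock time in seconds."""
    timer = timeit.Timer(
        lambda: subprocess.call(command, shell=True,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL))
    return min(timer.repeat(repeat=n, number=1))


if __name__ == '__main__':
    # Build a small stand-in file; the real benchmark uses the generated text_*_mb.txt files.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
        f.write('python is a test line\n' * 1000)
        path = f.name
    for option in ('', '-i', '-v'):
        print('grep %-2s: %.6fs' % (option, best_of('grep %s python %s' % (option, path))))
    os.unlink(path)
```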

Many figures ahead, so I'll start with the conclusions:

When the total amount of data searched is small, there is no perceptible difference between grep, ack, and ag
When the total amount of data is large, grep's efficiency drops off sharply; don't choose it at all
In some scenarios ack is less efficient than grep (for example, searching with -v)
As long as you don't rely on an option ag hasn't implemented, ag can fully replace ack and grep

The gist that renders the figures can be seen at benchmarks.ipynb. Its data comes from the run above, read back from the serialized results file.
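For reference, the serialized results can be read back with pickle before plotting. A minimal round-trip sketch follows; the nested layout mirrors the RES dict built above, but the numbers here are invented:

```python
import pickle

# res[language][word][size][command][option] -> best time in seconds (values invented)
res = {'en': {'Python': {'100': {'grep': {'': 0.40, '-i': 0.42}}}}}

# Serialize, then load back, the same way the benchmark saves result.db
with open('result.db', 'wb') as f:
    pickle.dump(res, f)

with open('result.db', 'rb') as f:
    loaded = pickle.load(f)

assert loaded == res
```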

Figures

The Silver Searcher (ag) is faster than ack-grep








Today, while searching the Android 4.1 source with ag, I found my results constantly matching inside an extremely long single-line .json file (for example when searching for layout), which wrecked the ag.el results view in Emacs.

A look at ag's code shows that suppressing these long-line prints is still an unfinished feature:





src/options.h:48: int print_long_lines; /* TODO: support this in print.c */








Fortunately, the author's C code is concise, so I could quickly write a patch that adds a -M option to suppress long-line output. The patch has been sent upstream. Here is my own backup:











diff --git a/src/options.c b/src/options.c
index b08903f..2f4b18e 100644
--- a/src/options.c
+++ b/src/options.c
@@ -77,6 +77,8 @@ options:\n\
 -p --path-to-agignore STRING\n\
    Use .agignore file at STRING\n\
 --print-long-lines Print matches on very long lines (Default: >2k characters)\n\
+-M --max-printable-line-length NUM\n\
+   Skip printing matching lines that have a length bigger than NUM\n\
 -Q --literal Don't parse PATTERN as a regular expression\n\
 -s --case-sensitive Match case sensitively (Enabled by default)\n\
 -S --smart-case Match case insensitively unless PATTERN contains\n\
@@ -115,6 +117,7 @@ init_options() {
     opts.color_path = ag_strdup(color_path);
     opts.color_match = ag_strdup(color_match);
     opts.color_line_number = ag_strdup(color_line_number);
+    opts.max_printable_line_length = DEFAULT_MAX_PRINTABLE_LINE_LENGTH;
 }

 void cleanup_options() {
@@ -221,6 +224,7 @@ parse_options(int argc, char **argv, char **base_paths[], char **paths[]) {
         {"version", no_argument, &version, 1},
         {"word-regexp", no_argument, NULL, 'w'},
         {"workers", required_argument, NULL, 0},
+        {"max-printable-line-length", required_argument, NULL, 'M'},
         {NULL, 0, NULL, 0}
     };

@@ -253,7 +257,7 @@ parse_options(int argc, char **argv, char **base_paths[], char **paths[]) {
         opts.stdout_inode = statbuf.st_ino;
     }

-    while ((ch = getopt_long(argc, argv, "A:aB:C:DG:g:fhiLlm:nP:QRrSsvVtuUwz", longopts, &opt_index)) != -1) {
+    while ((ch = getopt_long(argc, argv, "A:aB:C:DG:g:fhiLlm:M:nP:QRrSsvVtuUwz", longopts, &opt_index)) != -1) {
         switch (ch) {
             case 'A':
                 opts.after = atoi(optarg);
@@ -305,6 +309,9 @@ parse_options(int argc, char **argv, char **base_paths[], char **paths[]) {
             case 'm':
                 opts.max_matches_per_file = atoi(optarg);
                 break;
+            case 'M':
+                opts.max_printable_line_length = atoi(optarg);
+                break;
             case 'n':
                 opts.recurse_dirs = 0;
                 break;
diff --git a/src/options.h b/src/options.h
index 5049ab5..b4d2468 100644
--- a/src/options.h
+++ b/src/options.h
@@ -7,6 +7,7 @@
 #include <pcre.h>

 #define DEFAULT_CONTEXT_LEN 2
+#define DEFAULT_MAX_PRINTABLE_LINE_LENGTH 2000

 enum case_behavior {
     CASE_SENSITIVE,
@@ -45,6 +46,7 @@ typedef struct {
     int print_heading;
     int print_line_numbers;
     int print_long_lines; /* TODO: support this in print.c */
+    int max_printable_line_length;
     pcre *re;
     pcre_extra *re_extra;
     int recurse_dirs;
diff --git a/src/print.c b/src/print.c
index dc11594..4be9874 100644
--- a/src/print.c
+++ b/src/print.c
@@ -34,6 +34,15 @@ print_binary_file_matches(const char* path) {
     fprintf(out_fd, "Binary file %s matches.\n", path);
 }

+static void check_printable(int len, int *printable) {
+    if (len > opts.max_printable_line_length) {
+        *printable = FALSE;
+        fprintf(out_fd, "+evil+mark+very+long+lines+here\n");
+    } else {
+        *printable = TRUE;
+    }
+}
+
 void print_file_matches(const char* path, const char* buf, const int buf_len, const match matches[], const int matches_len) {
     int line = 1;
     char **context_prev_lines = NULL;
@@ -49,6 +58,7 @@ print_file_matches(const char* path, const char* buf, const int buf_len, co
     int i, j;
     int in_a_match = FALSE;
     int printing_a_match = FALSE;
+    int printable = TRUE;

     if (opts.ackmate) {
         sep = ':';
@@ -129,7 +139,8 @@ print_file_matches(const char* path, const char* buf, const int buf_len, co
             }
             j = prev_line_offset;
             /* Print up to current char */
-            for (; j <= i; j++) {
+            check_printable(i - prev_line_offset, &printable);
+            for (; j <= i && printable; j++) {
                 fputc(buf[j], out_fd);
             }
         } else {
@@ -141,7 +152,8 @@ print_file_matches(const char* path, const char* buf, const int buf_len, co
             if (printing_a_match && opts.color) {
                 fprintf(out_fd, "%s", opts.color_match);
             }
-            for (j = prev_line_offset; j <= i; j++) {
+            check_printable(i - prev_line_offset, &printable);
+            for (j = prev_line_offset; j <= i && printable; j++) {
                 if (j == matches[last_printed_match].end && last_printed_match < matches_len) {
                     if (opts.color) {
                         fprintf(out_fd, "%s", color_reset);

grep: searching for text in files

grep (Global search Regular Expression and Print out the line) is a powerful text-search tool: it uses regular expressions to search text and prints the matching lines.





grep command options

-a   Do not ignore binary data (process a binary file as if it were text).
-A <num>   Besides the matching lines, also show <num> lines of context after each match.
-b   Print the byte offset of each match before the line itself.
-B <num>   Besides the matching lines, also show <num> lines of context before each match.
-c   Count the matching lines instead of printing them.
-C <num> (or -<num>)   Besides the matching lines, also show <num> lines of context before and after each match.
-d <action>   Required when the target is a directory rather than a file; otherwise grep reports an error and stops.
-e <pattern>   Use the given string as the pattern for searching the file contents.
-E   Interpret the pattern as an extended regular expression.
-f <file>   Read one or more patterns from <file>, one pattern per line.
-F   Interpret the pattern as a list of fixed strings.
-G   Interpret the pattern as a basic regular expression.
-h   Do not prefix matching lines with the name of the file they belong to.
-H   Prefix each matching line with the name of the file it belongs to.
-i   Ignore case distinctions.
-l   List the names of files whose contents match the pattern.
-L   List the names of files whose contents do not match the pattern.
-n   Prefix each matching line with its line number.
-q   Do not output any information (quiet mode).
-r / -R   Same effect as specifying "-d recurse": search directories recursively.
-s   Suppress error messages.
-v   Invert the match: select non-matching lines.
-w   Show only lines that match as whole words.
-x   Show only lines that match in their entirety.
-y   Same effect as "-i".
-o   Print only the matched parts of matching lines.





Common grep usage

Search for a word in a file; the command prints the lines of text containing "match_pattern":

# grep match_pattern file_name
# grep "match_pattern" file_name





Search in multiple files:

# grep "match_pattern" file_1 file_2 file_3 ...





Output every line except the matches, with the -v option:

# grep -v "match_pattern" file_name





Highlight the matched text with the --color=auto option:

# grep "match_pattern" file_name --color=auto





Use extended regular expressions with the -E option:

# grep -E "[1-9]+"

or

# egrep "[1-9]+"





Output only the matched parts of the file with the -o option:

# echo this is a test line. | grep -o -E "[a-z]+\."
line.

# echo this is a test line. | egrep -o "[a-z]+\."
line.





Count the lines in a file or text that contain the matching string, with the -c option:

# grep -c "text" file_name





Print the line numbers of lines containing the matching string, with the -n option:

# grep "text" -n file_name

or

# cat file_name | grep "text" -n

# multiple files
# grep "text" -n file_1 file_2





Print the character (byte) offset at which the match occurs:

# echo gun is not unix | grep -b -o "not"
7:not

# The character offset within a line is counted from the line's first character, starting at 0. The -b option is usually combined with -o.





Search multiple files and report which files contain the matching text:

# grep -l "text" file1 file2 file3 ...





grep recursive search

Search for text recursively through a directory tree:

# grep "text" . -r -n

# "." stands for the current directory.





Ignore case in the pattern with the -i option:

# echo "hello world" | grep -i "HELLO"
hello world





Specify multiple patterns with the -e option:

# echo this is a text line | grep -e "is" -e "line" -o
is
is
line

# You can also use the -f option: put the patterns in a file, one pattern per line.
# cat patfile
aaa
bbb

# echo aaa bbb ccc ddd eee | grep -f patfile -o





Include or exclude specified files in a grep search:

# Recursively search for the string "main()" only in .php and .html files under the directory
# grep "main()" . -r --include *.{php,html}

# Exclude all README files from the results
# grep "main()" . -r --exclude "README"

# Exclude the files listed in the file "filelist"
# grep "main()" . -r --exclude-from filelist





Use grep with xargs and NUL-terminated file names:

# Test files:
# echo "aaa" > file1
# echo "bbb" > file2
# echo "aaa" > file3

# grep "aaa" file* -lZ | xargs -0 rm

# This deletes file1 and file3. grep's -Z option terminates each output file name with a NUL byte; xargs -0 splits its input on NUL bytes and then removes the matching files. -Z is usually combined with -l.





grep silent output:

# grep -q "test" filename

# Prints nothing; the command returns 0 on success (a match was found) and non-zero otherwise. Typically used in conditional tests.
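Since -q communicates only through its exit status, a typical conditional use looks like this (the file name and pattern here are made up for illustration):

```shell
# Create a throwaway log for the demonstration
printf 'starting up\nERROR: disk full\n' > demo.log

# grep -q prints nothing; branch on its exit status instead
if grep -q "ERROR" demo.log; then
    echo "errors found"      # exit status 0: at least one match
else
    echo "log is clean"      # non-zero status: no match
fi

rm demo.log
```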





Print lines before or after the matching text:

# Show the 3 lines after each match, with the -A option:
# seq 10 | grep "5" -A 3
5
6
7
8

# Show the 3 lines before each match, with the -B option:
# seq 10 | grep "5" -B 3
2
3
4
5

# Show the 3 lines before and after each match, with the -C option:
# seq 10 | grep "5" -C 3
2
3
4
5
6
7
8

# When there are multiple matches, "--" is printed as a separator between them:
# echo -e "a\nb\nc\na\nb\nc" | grep a -A 1
a
b
--
a
b
