Performance comparison of three regex libraries (C, C++11, Boost), with a strong recommendation for Boost regex




In a recent project we found that our existing regex matching module performed very badly on long strings, so I did a small study of several regex libraries on long inputs, with examples attached. This article draws on the blog post http://www.cnblogs.com/pmars/archive/2012/10/24/2736831.html (hereinafter Article 1), which also compares three regex libraries. I disagree with it in a few places, which are listed below; my thanks to its author.



1. C Regex Library



Because the project has always used the C regex library, I studied C regex first. For its interfaces and parameters see Article 1; for a more authoritative reference, see http://pubs.opengroup.org/onlinepubs/000095399/functions/regcomp.html (hereinafter Article 2).



The performance test for C regex works as follows. By convention, we first wrap the calls provided by C regex in a single function that takes a pattern string and a target string, and returns 1 if they match and 0 otherwise.



As described in Article 1, the function first calls regcomp to compile the pattern string. Flags can be given at compile time: extended or basic regular expression syntax, whether to ignore case, whether to report match positions, and how newlines are treated (see Article 2 for details). It then calls regexec to match the target string against the compiled expression; its parameters control where match positions are stored and whether the start-of-line (^) and end-of-line ($) anchors may match at the beginning and end of the target string (REG_NOTBOL and REG_NOTEOL; see Articles 1 and 2). Finally, it calls regfree to release the memory.





int match(const char* pattern, const char* target)
{
	regex_t oRegex;
	int nErrCode = 0;
	char szErrMsg[1024] = {0};

	/* cflags 0: basic syntax; the pattern is compiled again on every call */
	if ((nErrCode = regcomp(&oRegex, pattern, 0)) != 0)
	{
		/* oRegex is undefined after a failed regcomp, so it is not freed */
		regerror(nErrCode, &oRegex, szErrMsg, sizeof(szErrMsg));
		return 0;
	}

	nErrCode = regexec(&oRegex, target, 0, NULL, 0);
	if (nErrCode != 0)
	{
		/* REG_NOMATCH or a real error; the message is kept for debugging */
		regerror(nErrCode, &oRegex, szErrMsg, sizeof(szErrMsg));
	}

	regfree(&oRegex);
	return nErrCode == 0 ? 1 : 0;
}
After setting the pattern string and the target string, the test program calls this function 10,000 times and uses the difference between the start and end times as the performance figure. The C++11 and Boost tests below use the same simple and effective method.





Since the function is called 10,000 times, a natural objection is that compiling, matching, and freeing the same pattern on every call is wasteful: we only want to measure matching performance, so compiling once should be enough. We can improve on this by wrapping the match in a second function that takes a pointer to an already compiled regex_t, looping 10,000 times, and freeing the regex afterwards.





int match_pre_comp(regex_t * pattern, const char* target)
{
	int nErrCode = 0;
	char szErrMsg[1024] = {0};
	size_t unErrMsgLen = 0;

	if ((nErrCode = regexec(pattern, target, 0, NULL, 0)) == 0)
	{
		return 1;
	}

	unErrMsgLen = regerror(nErrCode, pattern, szErrMsg, sizeof(szErrMsg));
	unErrMsgLen = unErrMsgLen < sizeof(szErrMsg) ? unErrMsgLen : sizeof(szErrMsg) - 1;
	szErrMsg[unErrMsgLen] = '\0';

	return 0;
}
To compare the two methods, we use macro definitions to build different versions of the executable, deliberately testing with a long target string (about 1,000 characters). Defining NOT_PRE_COMP selects the first method and defining PRE_COMP selects the second; exactly one of the two must be defined. #define LOOP_COUNT (XXX) sets the number of iterations. The program has to be recompiled after any of these changes.







/*
 * Program: 
 *   This program for test c regex performance
 * Platform
 *   Ubuntu14.04     gcc-4.8.2
 * History:
 *   weizheng     2014.11.05    1.0
 */

#include <sys/time.h>
#include <stdio.h>
#include <regex.h>
#define LOOP_COUNT ( 10000 )
#define PRE_COMP

/*
 * you must define either PRE_COMP or NOT_PRE_COMP to choose whether the regular expression is precompiled
 */
#if defined(PRE_COMP) && defined(NOT_PRE_COMP)
	#error can not define PRE_COMP and NOT_PRE_COMP at the same time
#elif !defined(NOT_PRE_COMP) && !defined(PRE_COMP)
	#error please define PRE_COMP or NOT_PRE_COMP
#endif

/************************************ main ************************************/

int main(void)
{
	char pattern[] = "ywfuncFlag.*rzgw/rzdk_rzzl.jsp";
	char target[] = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

#ifdef PRE_COMP
	regex_t oRegex;
	if (regcomp(&oRegex, pattern, 0))
	{
		/* without a compiled pattern the timing loop is meaningless */
		printf("regex compile error\n");
		return 1;
	}
#endif

	/*
	 * record the start time
	 */
	struct timeval tv_start, tv_end;
	gettimeofday(&tv_start, NULL);

	/*
	 * matching
	 */
	int count = 0;
	for(int i = 0; i < LOOP_COUNT; i++)
	{
#ifdef PRE_COMP
		if(match_pre_comp(&oRegex, target))
#endif

#ifdef NOT_PRE_COMP
		if(match(pattern, target))
#endif
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	gettimeofday(&tv_end, NULL);
	unsigned long time_used = (tv_end.tv_sec * 1000000 + tv_end.tv_usec - (tv_start.tv_sec * 1000000 + tv_start.tv_usec))/1000;

#ifdef PRE_COMP
	regfree(&oRegex);
#endif
	printf("used:   %lu ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}

Compiling and running the two versions gives the following results.





Without precompiling the pattern string, that is, recompiling it on every loop iteration:





$ ./c_regex_main
used:   418 ms
matched 10000 times





With the pattern string compiled once before the loop and the compiled regular expression passed in on each iteration:





$ ./c_regex_main
used:   404 ms
matched 10000 times





These results show how long C regex takes to match. Whether that is fast or slow depends on what it is compared with; read on to find out. For C regex, though, precompiling makes little difference (418 ms versus 404 ms): compiling the pattern takes far less time than matching such a long string, so compiling it repeatedly has only a small effect.



2. C++11 Regex



I should say that one main purpose of this article is to complain about C++11 regex, or more precisely about g++ 4.8! C++11 provides two functions for regular expression matching, regex_match and regex_search: regex_match returns true if the entire input sequence matches the expression, while regex_search returns true if some substring of the input sequence matches it (see C++ Primer, fifth edition, Chinese version, p. 646). regex_match was tolerable, but regex_search drove me up the wall.



Take regex_search with a pattern that is a single word, for example a substring of the target string above such as "xml". It obviously ought to match, yet the call returns false. Even with the target string set to exactly "xml" it still returns false. Was I using it wrong? Where on earth was the problem?



So I took the official example from cplusplus.com and ran it, expecting that at least to work. It threw a regex error (the pattern string failed to compile). Was the official example wrong? I then turned to the bible, C++ Primer, which surely could not be wrong, and got the same regex error. My world view was in ruins.



Fortunately there was still Google. It is blocked here, but it can still be reached with a workaround. Searching for "regex_search always return false" turned up similar questions on StackOverflow:





http://stackoverflow.com/questions/20027305/strange-results-when-using-c11-regexp-with-gcc-4-8-2-but-works-with-boost-reg
http://stackoverflow.com/questions/12279869/using-regex-search-from-the-c-regex-library
http://stackoverflow.com/questions/11628047/difference-between-regex-match-and-regex-search?lq=1
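That finally gave the answer. As you may have guessed, C++11 was too new and g++ 4.8 did not yet fully support it: the <regex> shipped with GCC 4.8 is incomplete, and a working implementation only arrived in GCC 4.9. Since VS2010 and VS2012 were said to support it already, I booted the Windows side of my dual-boot machine and solved the problem there.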





The C regex code above is changed to use C++11 regex as follows. Note that gettimeofday is not available on Windows, so the timing uses the corresponding Windows calls instead. As before, the constant LOOP_COUNT sets the number of iterations. In C++ a std::regex object is constructed from the pattern string and plays the role of the compiled regex_t in C regex, so there is no need to worry about compiling the pattern repeatedly.





/*
 * Program: 
 *   This program for test c++11 regex performance
 * Platform
 *   windows8.1     VS2012
 * History:
 *   weizheng     2014.11.06    1.0
 */

#include <regex>
#include <windows.h>
#include <stdio.h>

const int LOOP_COUNT = 10000;

/************************************ main ************************************/

int main()
{
	std::regex pattern("ywfuncFlag.*rzgw/rzdk_rzzl.jsp");
	std::string target = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

	/*
	 * record the start time
	 */
	LARGE_INTEGER lFreq, lStart, lEnd;
	QueryPerformanceFrequency(&lFreq);
	QueryPerformanceCounter(&lStart);

	int count = 0;
	for(int i = 0; i < LOOP_COUNT; i++)
	{
		if(std::regex_search(target, pattern))
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	QueryPerformanceCounter(&lEnd);
	/* counter ticks divided by ticks per second, scaled to milliseconds */
	double time_used = (double)(lEnd.QuadPart - lStart.QuadPart) * 1000 / lFreq.QuadPart;

	printf("used:   %.2f ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}





The results are below. It would be hard to find anything slower; a tortoise would be embarrassed. You may now think C regex is fast, but if so, keep reading:





used: 13754.17 ms
matched 10000 times
Please press any key to continue ...
3. Boost Regex





Now for Boost regex, which I strongly recommend. Boost is a collection of libraries that extends the C++ standard library; it is often called the "quasi-standard library" and is highly regarded. As a sophomore I wrote a system I was rather proud of, and a senior student later told me he had done the same thing in 10 lines of code with Boost, which shocked me. After going through a basic introduction to Boost regex and some examples, I found its design to be almost identical to C++11 regex: changing the namespace of the regex calls from std to boost appears to be enough. This assumes the Boost library is installed, and the link option -lboost_regex must be added when compiling.



The test code is just as simple and effective as before, but the result shocked me again.





/*
 * Program: 
 *   This program for test boost regex performance
 * Platform
 *   Ubuntu14.04     g++-4.8.2
 * History:
 *   weizheng     2014.11.06    1.0
 */

#include <boost/regex.hpp>
#include <sys/time.h>
#include <cstdio>

const int LOOP_COUNT = 10000;

/************************************ main ************************************/

int main()
{
	boost::regex pattern("ywfuncFlag.*rzgw/rzdk_rzzl.jsp");
	std::string target = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

	/*
	 * record the start time
	 */
	struct timeval tv_start, tv_end;
	gettimeofday(&tv_start, NULL);

	int count = 0;
	for(int i = 0; i < LOOP_COUNT; i++)
	{
		if(boost::regex_search(target, pattern))
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	gettimeofday(&tv_end, NULL);
	unsigned long time_used = (tv_end.tv_sec * 1000000 + tv_end.tv_usec - (tv_start.tv_sec * 1000000 + tv_start.tv_usec))/1000;

	printf("used:   %lu ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}
The results are as follows:







$ ./boost_regex_main
used:   11 ms
matched 10000 times


This is about 40 times faster than C regex, and that is when the pattern does match the target string. In the non-matching case, Boost was 4,000 times faster than C regex. I could hardly believe this result, so I am posting the pattern and target strings used in that test, in case others see the same thing or there is another explanation, such as the structure of the regular expression or of the target string itself. The test strings are as follows:





boost::regex pattern(".*CSZ.*XTCS.*CSMC.*NEWBB");
	std::string target ="<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?><SOAP-ENV:Envelope xmlns:SOAPSDK1=\"http://www.w3.org/2001/XMLSchema\" xmlns:SOAPSDK2=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:SOAPSDK3=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\"><SOAP-ENV:Body SOAP-ENV:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\"><SOAPSDK4:GetSystemDate xmlns:SOAPSDK4=\"http://svr.blf.common.fmis.ygsoft.com\"/></SOAP-ENV:Body></SOAP-ENV:Envelope>";







The comparisons above suggest that Boost regex performs outstandingly well. Article 1, however, concludes that C regex is the faster one, which differs from my results, so I hope to see more test results that clarify the matter. What I am most looking forward to is seeing what happens when the C regex in our project is replaced with Boost regex; we will have to wait and see.











