Objective
Character encoding conversion errors come up often in i18n-related development. When they do, it is very helpful to print the string in question in hexadecimal form, for example printing "abc" as "\x61\x62\x63". In Python, you just need to use the repr() function. How can you do this in C++?
The following is a simple implementation using ostream's formatting features:
#include <sstream>
#include <string>
std::string get_raw_string(std::string const& s)
{
    std::ostringstream out;
    out << '"';
    out << std::hex;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        out << "\\x" << *it;
    }
    out << '"';
    return out.str();
}
It looks straightforward, but unfortunately the code does not do what we intended: it still prints each character literally. But we explicitly asked for std::hex formatting! The problem turns out to be that std::hex only sets the output format for integer types; when a character type is written, the C++ stream outputs it literally. A look at the ostream documentation shows that the standard C++ output streams offer only weak control over formatting: a limited number of format options, most of them for integer and floating-point types, and nothing that changes how a character itself is rendered. Ironically, ostream uses C++ function overloading and strong typing to avoid the notorious pitfalls of printf and to improve safety, without giving up any of C's expressiveness. Here, though, that strong type safety is exactly what stands in our way: I just want ostream to print the characters as integers! Fortunately, C++ also has explicit type casts, which let us get past the overload matching:
out << std::hex << "\\x" << static_cast<int>(*it);
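The difference between the two overloads can be seen in isolation. The following is a minimal sketch, not part of the original code; the helper names hex_via_char and hex_via_int are made up for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: write a char with std::hex in effect.
// The character overload of operator<< is selected, so std::hex is ignored.
std::string hex_via_char(char c)
{
    std::ostringstream out;
    out << std::hex << c;
    return out.str();
}

// Hypothetical helper: cast to int first, so the integer overload is
// selected and std::hex takes effect.
std::string hex_via_int(char c)
{
    std::ostringstream out;
    out << std::hex << static_cast<int>(c);
    return out.str();
}
```

With these helpers, hex_via_char('a') yields "a", while hex_via_int('a') yields "61".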
Now the character is output as an integer, and std::hex tells ostream to print that integer in hexadecimal. Problem solved. But wait, why does UTF-8 encoded Chinese text come out like this:
"\xffffffe4\xffffffb8\xffffffad" // get_raw_string("中")
All those f's are an eyesore. Can we get rid of them? The cause is that we convert each character to int before printing, and int is wider than 8 bits, so a byte with its high bit set is sign-extended (where char is signed) and the extra bits are printed in front. To get rid of them, we would need to convert to an 8-bit integer instead. Unfortunately, C++ has no 8-bit integer type distinct from the character types; the closest thing is a typedef such as int8_t.
But int8_t does not work either, because in C++ a typedef does not create a new type; it only defines an alias for an existing one, and the alias plays no part in overload resolution. In other words, ostream says: don't think I can't recognize you just because you are wearing an int8_t vest; I will still print you as a char. Dead end.
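Both effects described above can be reproduced on their own. This is a minimal sketch under some assumptions: the helper names are hypothetical, int is assumed to be 32 bits wide, and std::int8_t is assumed to be an alias for signed char, as it is on common platforms.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: widen a signed byte to int and print it in hex.
// A negative byte is sign-extended, which is where the leading f's come from.
std::string hex_sign_extended(signed char c)
{
    std::ostringstream out;
    out << std::hex << static_cast<int>(c);
    return out.str();
}

// Hypothetical helper: print a std::int8_t with std::hex in effect.
// Where int8_t is an alias for signed char, the character overload is
// still selected and the value is written as a character.
std::string hex_via_int8(std::int8_t v)
{
    std::ostringstream out;
    out << std::hex << v;
    return out.str();
}
```

Under those assumptions, the byte 0xe4 stored in a signed char comes out as "ffffffe4", and an int8_t holding 0x61 comes out as the letter "a" rather than "61".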
So do we give up on ostream? Not so fast. By default ostream does not print leading zeros, so if we clear every bit above the low 8 bits, the output is exactly what we want.
Here is the final, working version:
#include <sstream>
#include <string>
std::string get_raw_string(std::string const& s)
{
    std::ostringstream out;
    out << '"';
    out << std::hex;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        // AND with 0xff removes the leading "ff" from the output,
        // so we get "\xab" instead of "\xffab"
        out << "\\x" << (static_cast<short>(*it) & 0xff);
    }
    out << '"';
    return out.str();
}
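One detail to be aware of: because ostream prints no leading zeros, the version above writes bytes below 0x10 with a single digit (a newline becomes "\xa"). If two digits per byte are wanted, a variant along the following lines might work. This is a sketch, not the author's code; the name get_raw_string_padded is made up, and it casts through unsigned char instead of masking.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical variant of get_raw_string: always two hex digits per byte.
std::string get_raw_string_padded(std::string const& s)
{
    std::ostringstream out;
    out << '"' << std::hex << std::setfill('0');
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        // Go through unsigned char so the value is 0..255 whether or not
        // plain char is signed.
        unsigned int byte = static_cast<unsigned char>(*it);
        // setw only applies to the next item written, so it is set per byte.
        out << "\\x" << std::setw(2) << byte;
    }
    out << '"';
    return out.str();
}
```

The fill character stays in effect for the whole stream, while the width has to be set again before every byte.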
After a few twists and turns, we finally managed to print a string in hex using the hexadecimal output that ostream provides. The detour was needed because ostream's control over output formatting is simply too weak. Is there a better tool for this in C++? boost::format looks promising, but it does not get us out of this predicament either. Fortunately, another Boost library has the right answer: boost::spirit::karma.
Karma is part of the boost::spirit library. You may be more familiar with Spirit as a parser library for parsing strings. Karma is its counterpart: it is designed specifically to format C++ data structures into character streams.
That is exactly what we need. Here is the code rewritten with Karma:
#include <boost/spirit/include/karma.hpp>
#include <iterator>
#include <stdexcept>
#include <string>
template <typename OutputIterator>
bool generate_raw(OutputIterator sink, std::string s)
{
    using boost::spirit::karma::hex;
    using boost::spirit::karma::generate;
    // An opening quote, then "\x" plus a hex value for each element, then a closing quote
    return generate(sink, '"' << *("\\x" << hex) << '"', s);
}
std::string get_raw_string_k(std::string const& s)
{
    std::string result;
    if (!generate_raw(std::back_inserter(result), s))
    {
        throw std::runtime_error("Parse error");
    }
    return result;
}
The key here is Karma's built-in generator karma::hex, which does the work for us. This hex is a polymorphic generator: unlike ostream's overloads, which produce hex output only for certain types, it works for every type it accepts, including char. Another advantage is that the code is more expressive; the whole output format is described in a single line of code:
// Output format is "\x61\x62\x63", easy to paste into Python or C++ code
'"' << *("\\x" << hex) << '"'
If you want to change the output format, you only need to change this one line, for example:
// Output format changed to "0x61 0x62 0x63 " (each byte is followed by a space)
'"' << *("0x" << hex << " ") << '"'
So does this cost anything in performance? Here is a test program that converts the same string using both implementations:
#include "boost/test/unit_test.hpp"
#include "boost/../libs/spirit/optimization/measure.hpp"
#include "string.hpp" // the functions under test
static std::string const message = "Hex Output performance test data Chinese";
struct using_karma : test::base
{
    void benchmark() { this->val += get_raw_string_k(message).size(); }
};
struct using_ostream : test::base
{
    void benchmark() { this->val += get_raw_string(message).size(); }
};
BOOST_AUTO_TEST_CASE(teststringperformance)
{
    BOOST_SPIRIT_TEST_BENCHMARK(
        ... // arguments not preserved in the source
    );
    BOOST_CHECK_NE(0, live_code);
}
Here are the results: the time taken by each implementation (smaller is better):
Algorithm | Time (s)
Karma     | 6.97
Ostream   | 14.24
Perhaps surprisingly, Karma is roughly twice as fast as ostream. This is in line with Spirit's official performance figures. Both functions return their result as a std::string by value, and that copy takes up a considerable share of the time, so Karma's advantage would presumably be even larger if only the formatting itself were measured. Other tests suggest that Karma may be the fastest way to format a character stream that you can find in C/C++.
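For readers who want a rough comparison without Spirit's measure.hpp, a timing loop based on std::chrono could look like the sketch below. It is not the harness that produced the figures above, and the helper name time_seconds is made up. It requires C++11, and the ostream version is repeated so that the block stands on its own.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Same approach as the final ostream version in this article.
std::string get_raw_string(std::string const& s)
{
    std::ostringstream out;
    out << '"' << std::hex;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        out << "\\x" << (static_cast<short>(*it) & 0xff);
    }
    out << '"';
    return out.str();
}

// Hypothetical helper: call f(message) `repeats` times and return the
// elapsed wall-clock time in seconds. The result sizes are accumulated
// into `sink` so the compiler cannot discard the calls as dead code.
template <typename F>
double time_seconds(F f, std::string const& message, int repeats, std::size_t& sink)
{
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        sink += f(message).size();
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Any other function with the same signature, such as the Karma-based one, can be passed in place of get_raw_string. Absolute numbers will depend on the compiler, optimization flags, and machine.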