Apache Commons Codec in depth: URLCodec

Source: Internet
Author: User

In the introductory article we wrote a few examples, and it looked as if the library contained only those few classes. Opening the Javadoc shows that the package actually holds a great deal more.

Commons Codec 1.4 API

Packages

org.apache.commons.codec

A small set of interfaces used by the various implementations in the sub-packages.

org.apache.commons.codec.binary

Base64, binary, and hexadecimal string encoding and decoding.

org.apache.commons.codec.digest

Operations to simplify common MessageDigest tasks.

org.apache.commons.codec.language

Language and phonetic encoders.

org.apache.commons.codec.net

Network-related encoding and decoding.

First, some background on URL encoding.

Brief introduction

When we surf the Internet every day, a few technologies are always at work. There is the data itself (the web page), the formatting of the data, the transport mechanism that lets us fetch the data, and the fundamental thing that makes the Web truly a web: links from one page to another. These links are URLs.

HTTP URL Syntax

For HTTP URLs (using the http or https scheme), the scheme-specific part of the URL defines the path to the data, followed by an optional query and fragment.

The path part looks like a hierarchy, similar to the hierarchy of folders and files in a file system. The path starts with a "/" character, each folder is separated from the next by "/", and the file comes last. For example, "/photos/egypt/cairo/first.jpg" has four path segments: "photos", "egypt", "cairo", and "first.jpg". From this we can infer that the file "first.jpg" is in the folder "cairo", which is in the folder "egypt", which in turn is in the folder "photos" at the root of the web site.

URL syntax

The http URL scheme was first defined in RFC 1738 (in fact even earlier, in RFC 1630). Since then the overall URL syntax has been generalized into Uniform Resource Identifiers (URIs), a specification that has been revised several times.

There is a grammar for how URLs are assembled and how their parts are separated. For example, "://" separates the scheme from the host part. The host is separated from the path segments by "/", and the query part follows a "?". This means that some characters are reserved for the syntax. Some are reserved throughout the URI, while others are reserved only in specific parts. Any reserved character that appears in a place where it is not allowed (for example a path segment, such as a file name, may contain "?") must be URL-encoded.
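The way these separators split a URL can be seen with the JDK's java.net.URI class. This is only a minimal sketch: the query "size=large" and the fragment "top" are made up for illustration, and pathSegments is a hypothetical helper name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class UrlPartsSketch {

    // Splits the raw path on "/" and drops the empty piece before the leading slash.
    static List<String> pathSegments(URI uri) {
        List<String> segments = new ArrayList<>();
        for (String segment : uri.getRawPath().split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Query and fragment are assumptions added for this example.
        URI uri = URI.create("http://example.com/photos/egypt/cairo/first.jpg?size=large#top");

        System.out.println("scheme:   " + uri.getScheme());      // http
        System.out.println("host:     " + uri.getHost());        // example.com
        System.out.println("path:     " + uri.getRawPath());     // /photos/egypt/cairo/first.jpg
        System.out.println("query:    " + uri.getRawQuery());    // size=large
        System.out.println("fragment: " + uri.getRawFragment()); // top
        System.out.println("segments: " + pathSegments(uri));    // [photos, egypt, cairo, first.jpg]
    }
}
```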

URL encoding transforms a character into a harmless form that has no meaning for URL parsing. It converts the character into a sequence of bytes in a specific character encoding, then writes each byte in hexadecimal, prefixed with "%". The URL-encoded form of the question mark is "%3F".

We can write the URL of the picture "to_be_or_not_to_be?.jpg" as "http://example.com/to_be_or_not_to_be%3F.jpg", so that nobody will mistake the rest for a query part.
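The byte-by-byte transformation can be sketched in a few lines of Java. This is an illustration only: percentEncodeAll is a hypothetical helper that escapes every byte and assumes UTF-8, whereas a real encoder leaves safe characters untouched.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodeSketch {

    // Escapes every byte of the UTF-8 form of the text as "%" plus two hex digits.
    static String percentEncodeAll(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeAll("?")); // %3F

        // Only the reserved character inside the file name is escaped here.
        String fileName = "to_be_or_not_to_be" + percentEncodeAll("?") + ".jpg";
        System.out.println("http://example.com/" + fileName);
    }
}
```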

Most browsers now decode URLs before displaying them (turning the percent-encoded bytes back into their original characters) and re-encode them when fetching the network resource. As a result, many users never notice that the encoding exists.

Web page authors and developers, on the other hand, must be aware of it, because there are many pitfalls.

Common Pitfalls of URLs

If you work with URLs, it is well worth knowing the common pitfalls so that you can avoid them. The list below covers some of them and is not exhaustive.

What type of character encoding is used?

The URL encoding specification does not define which character encoding is used to turn characters into bytes. Plain ASCII alphanumeric characters do not need to be escaped, but characters outside ASCII do (for example, "œ" in the French word "nœud"). So we have to ask which character encoding should be used to produce the bytes of the URL.

Of course, the world would be simpler if there were only Unicode, because every character is in it. But Unicode is a set, or a list if you like, not an encoding in itself. Unicode can be encoded in several ways, such as UTF-8 or UTF-16 (there are others), so the problem remains: which encoding should URLs (and URIs in general) use?

The standard does not define how a URI specifies its encoding, so it has to be inferred from the surrounding context. For an HTTP URL it may be the encoding of the HTML page, or the one given in an HTTP header. This is often confusing and the source of many mistakes. In fact, the latest version of the URI standard says that new URI schemes use UTF-8, and that host names (even in existing schemes) use UTF-8 as well, which makes me even more suspicious: can the host and the path really use different encodings?
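A small sketch shows why the choice of character encoding matters. It uses the JDK's java.net.URLEncoder rather than Commons Codec, and the word "café" is my own example, chosen because "é" exists in both UTF-8 and ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CharsetMattersSketch {

    // Form-style URL encoding of the text using the named character encoding.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // "café"

        // Same character, different bytes, therefore different URLs.
        System.out.println(encode(word, "UTF-8"));      // caf%C3%A9
        System.out.println(encode(word, "ISO-8859-1")); // caf%E9
    }
}
```

A server that decodes with a different encoding from the one the client used will silently read a different string.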

The reserved characters are different for each part.

Yes, they are, yes, they are, yes, they are ...

For HTTP URLs, a space in a path segment is encoded as "%20" (not "+", absolutely not), while a "+" character in a path segment can be left unencoded.

In the query part, however, a space may be encoded either as "+" (for backwards compatibility: do not try to look for it in the URI standard) or as "%20", while the "+" character (as a result of this ambiguity) has to be escaped to "%2B".

This means that the string "blue+light blue" is encoded differently depending on whether it appears in the path part or in the query part. For example, the encoded form is "http://example.com/blue+light%20blue?blue%2Blight+blue". It follows that a fully assembled URL cannot be encoded correctly without syntactic knowledge of the URL structure.

Consider the following Java snippet that assembles a URL:

String str = "blue+light blue";
String url = "http://example.com/" + str + "?" + str;

Encoding a URL is not a simple character-by-character pass that escapes reserved characters. We need to know exactly which reserved characters apply to which URL part, and encode each part accordingly.

This also means that URL-rewriting filters are usually wrong if they take one part of a URL and substitute it into another part without proper care for encoding. It is impossible to encode a URL without knowing its specific parts.
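One way to encode each part on its own terms, using only the JDK, is sketched below. The helper names encodePathSegment and encodeQueryValue are hypothetical, UTF-8 is assumed, and the path helper relies on the quoting done by the multi-argument java.net.URI constructor, which does not escape a "/" inside the segment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class PartAwareEncodingSketch {

    // Path rules: java.net.URI quotes characters that are illegal in a path
    // (such as the space) and leaves "+" as it is. A "/" is NOT escaped.
    static String encodePathSegment(String segment) {
        try {
            return new URI(null, null, "/" + segment, null).toASCIIString().substring(1);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode segment: " + segment, e);
        }
    }

    // Query rules (form style): space becomes "+", and "+" becomes "%2B".
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String str = "blue+light blue";
        String url = "http://example.com/" + encodePathSegment(str) + "?" + encodeQueryValue(str);
        System.out.println(url); // http://example.com/blue+light%20blue?blue%2Blight+blue
    }
}
```

The same input string yields two different encoded forms, which is exactly why plain string concatenation of unencoded parts goes wrong.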

The reserved characters are not what you think.

Most people do not know that "+" is allowed in the path part and that it means a plus sign there, not a space. Other similar surprises:

· "?" is allowed unescaped anywhere in the query part,

· "/" is allowed unescaped anywhere in the query part,

· "=" is allowed unescaped in a path parameter or query parameter value, and in a path segment,

· ":@-._~!$&'()*+,;=" are allowed unescaped anywhere in a path segment,

· "/?:@-._~!$&'()*+,;=" are allowed unescaped anywhere in the fragment part.

So the following address looks rather confusing: "http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;==#/?:@-._~!$&'()*+,;="

According to the above rules, it is actually a legal address.
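As a rough check, the address can be handed to the JDK's java.net.URI parser, which rejects syntactically invalid input. Note that java.net.URI follows an older revision of the URI grammar, so this is a sanity check under that assumption and not a proof against the current standard.

```java
import java.net.URI;

class ConfusingUrlSketch {

    static final String URL =
            "http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="
            + "?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;=="
            + "#/?:@-._~!$&'()*+,;=";

    public static void main(String[] args) {
        // URI.create throws IllegalArgumentException if the string cannot be parsed.
        URI uri = URI.create(URL);

        System.out.println("host:     " + uri.getHost());
        System.out.println("path:     " + uri.getRawPath());
        System.out.println("query:    " + uri.getRawQuery());
        System.out.println("fragment: " + uri.getRawFragment());
    }
}
```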

There is much more to learn on this topic for those who are interested. From a practical point of view, characters fall into three groups:

1. Characters that must be URL-encoded, such as Chinese characters;

2. Characters that do not need URL encoding, such as 1, 2 and *;

3. Reserved characters that need special handling, such as # and others.

Handling all of these characters by hand would make application development more complex and very tedious. The org.apache.commons.codec.net package provides this processing.

Let's look at the source code:

package test.ffm83.commons.codec;

import org.apache.commons.codec.CharEncoding;
import org.apache.commons.codec.net.URLCodec;
import org.apache.commons.lang.BooleanUtils;
import org.apache.commons.lang.StringUtils;

/**
 * URL-encodes data with the net package of Apache Commons Codec.
 *
 * @author Fan Fangming
 */
public class URLCodecUsage {

    public static void main(String[] args) throws Exception {
        URLCodecUsage usage = new URLCodecUsage();
        usage.useUrlEncode();
        usage.useSafeCharEncodeDecode();
        usage.useUnsafeEncodeDecode();
    }

    // The width argument of StringUtils.center is missing from the listing;
    // 20 dashes on each side matches the output shown for this program.
    private static String banner(String title) {
        return StringUtils.center(title, title.length() + 40, "-");
    }

    // URL encoding of ordinary characters
    public void useUrlEncode() throws Exception {
        System.out.println(banner("URL encoding of ordinary characters"));
        // In English: "Fan Fangming is doing a URL encode test, 123456".
        // The literal is restored from the percent-encoded output of the run.
        final String msg = "范芳铭在做URL encode测试,123456";
        final URLCodec urlCodec = new URLCodec();
        String enMsg = urlCodec.encode(msg, CharEncoding.UTF_8);
        System.out.println(msg + ", after encoding:");
        System.out.println(enMsg);
    }

    // Encoding and decoding of safe characters
    public void useSafeCharEncodeDecode() throws Exception {
        System.out.println(banner("encoding and decoding of safe characters"));
        final URLCodec urlCodec = new URLCodec();
        final String plain = "abc123_-.*";
        final String encoded = urlCodec.encode(plain);
        System.out.println("plain:" + plain);
        System.out.println("encoded:" + encoded);
        // First check: is the encoded form identical to the input?
        System.out.println("Safe chars URL Encoding test:"
                + BooleanUtils.toStringYesNo(plain.equals(encoded)));
        // Second check: does decoding restore the input?
        System.out.println("Safe chars URL Encoding test:"
                + BooleanUtils.toStringYesNo(plain.equals(urlCodec.decode(encoded))));
    }

    // Encoding and decoding of unsafe characters
    public void useUnsafeEncodeDecode() throws Exception {
        System.out.println(banner("encoding and decoding of unsafe characters"));
        final URLCodec urlCodec = new URLCodec();
        // Restored from the encoded output; the listing had been mangled by e-mail obfuscation.
        final String plain = "~!@#$%^&()+{}\"\\;:`,/[]";
        final String encoded = urlCodec.encode(plain);
        System.out.println("plain:" + plain);
        System.out.println("encoded:" + encoded);
        System.out.println("Safe chars URL Encoding test:"
                + BooleanUtils.toStringYesNo(plain.equals(encoded)));
        System.out.println("Safe chars URL Encoding test:"
                + BooleanUtils.toStringYesNo(plain.equals(urlCodec.decode(encoded))));
    }
}

Output after running:

--------------------URL encoding of ordinary characters--------------------
范芳铭在做URL encode测试,123456, after encoding:
%E8%8C%83%E8%8A%B3%E9%93%AD%E5%9C%A8%E5%81%9AURL+encode%E6%B5%8B%E8%AF%95%EF%BC%8C123456
--------------------encoding and decoding of safe characters--------------------
plain:abc123_-.*
encoded:abc123_-.*
Safe chars URL Encoding test:yes
Safe chars URL Encoding test:yes
--------------------encoding and decoding of unsafe characters--------------------
plain:~!@#$%^&()+{}"\;:`,/[]
encoded:%7E%21%40%23%24%25%5E%26%28%29%2B%7B%7D%22%5C%3B%3A%60%2C%2F%5B%5D
Safe chars URL Encoding test:no
Safe chars URL Encoding test:yes
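For comparison, the safe and unsafe character tests can be repeated without any external dependency. The sketch below uses the JDK's java.net.URLEncoder and java.net.URLDecoder instead of URLCodec, and assumes UTF-8. The class name and the helpers encode and decode are my own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class JdkUrlEncoderSketch {

    // Form-style URL encoding with UTF-8.
    static String encode(String plain) {
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverses encode(): "%XX" sequences become bytes again, "+" becomes a space.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String safe = "abc123_-.*";
        String unsafe = "~!@#$%^&()+{}\"\\;:`,/[]";

        System.out.println(encode(safe));   // abc123_-.*
        System.out.println(encode(unsafe)); // every character becomes %XX, hex digits in upper case
        System.out.println(unsafe.equals(decode(encode(unsafe)))); // true
    }
}
```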
