Secure Programming: Validating Input

Last Update:2018-12-08 Source: Internet

Best practices for receiving user data

Level: elementary

David A. Wheeler, Full-time Researcher, Institute for Defense Analyses

July 10, 2003

This article describes how to validate input, one of the first lines of defense in any secure program.

In July 2003, the Computer Emergency Response Team Coordination Center (CERT/CC) reported a set of dangerous vulnerabilities in the Microsoft Windows DirectX MIDI library. The DirectX MIDI library is the underlying Windows library used to play MIDI music. Unfortunately, the library fails to check all data values in a MIDI file; incorrect values in the text, copyright, or MThd track fields can cause the library to fail, and attackers can exploit this failure to make the system execute any code they choose. This is especially dangerous because Internet Explorer automatically loads and plays a MIDI file when a user views a web page that links to one. The result? An attacker only needs to publish a web page; when a user views it, the user's computer could delete all of its files, email confidential files elsewhere, crash, or do anything else the attacker wants.

Check Input

In nearly all secure programs, your first line of defense is to check every piece of data you receive. If you can keep malicious data from entering your program, or at least keep it from being processed, your program becomes much harder to attack. This is very similar to how firewalls protect computers: they cannot prevent all attacks, but they do make a program more resilient. This process is called checking, validating, or filtering your input.

One major question is where to perform the checks: when the data first enters the program, or later, when a lower-level routine actually uses it? Often it is best to check in both places; that way, even if an attacker manages to slip past one line of defense, they will run into another. The most important rule is that all data must be checked before it is used.

The Mistake: Looking for Incorrect Input

One of the biggest mistakes developers of secure programs make is trying to look for "illegal" data values. This is a mistake because attackers are clever and can often think of yet another dangerous value. Instead, determine what data is legal, check that the data matches that definition, and reject everything that does not. For security, it is best to be extremely strict at first and allow only data that you know to be legal. After all, if you are too restrictive, users will quickly report that the program rejects legitimate data. If you are too permissive, on the other hand, you may not find out until after your program has been subverted.

For example, suppose you want to create a filename based on user input. You may know that the input should not include "/", but checking for just that character is probably a mistake. What about control characters? Would spaces cause a problem? What about a leading dash (which can cause trouble in poorly written code)? Could certain phrases cause problems? In most cases, if you build a list of "illegal" characters, an attacker will still find a way to exploit your program. Instead, check that the input matches a specific pattern that you know is safe, and reject anything that does not match it.

It is still a good idea to identify values you know to be illegal: you can use them (mentally) to check your validation routines. If you know that "/" is dangerous, for instance, check your pattern to make sure it would not let that character through.
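The approach described above can be sketched in Python roughly as follows. This is only an illustration: the pattern, the 32-character limit, and the names `VALID_NAME`, `is_valid_name`, `KNOWN_BAD`, and `check_allowlist` are assumptions made for the example, not part of the original article. The list of known-bad values is used only to test the allowlist, never as the filter itself.

```python
import re

# Hypothetical allowlist: 1 to 32 letters, digits, or underscores.
VALID_NAME = re.compile(r"[A-Za-z0-9_]{1,32}")


def is_valid_name(value: str) -> bool:
    """Accept the value only if all of it matches the allowlist pattern."""
    return VALID_NAME.fullmatch(value) is not None


# Values known to be dangerous; kept only to sanity-check the allowlist.
KNOWN_BAD = ["a/b", "../etc", "-rf", "name\n", "", "a b", "nul\x00byte"]


def check_allowlist() -> list:
    """Return every known-bad value that the allowlist wrongly accepts."""
    return [bad for bad in KNOWN_BAD if is_valid_name(bad)]
```

If `check_allowlist()` ever returns a non-empty list, the pattern is too loose and should be tightened before it is relied on.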

Of course, all of this raises the question: what is a legal value? The answer depends on the type of data you expect. The following sections discuss several common data types that programs handle, and what to do with them.

Numbers

Let's start with what seems to be one of the easiest kinds of information to read: a number. If you expect a number, make sure the data is in number format; for example, that it contains only digits and at least one digit (you can check this with the regular expression ^[0-9]+$). In most cases there is a minimum and a maximum value; if so, make sure the data falls within the legal range.
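As a minimal sketch of the format-then-range check, here is one way it might look in Python. The function name `parse_bounded_int` and its parameters are hypothetical. The example uses `[0-9]` rather than `\d` because, for Python text strings, `\d` also matches non-ASCII digits.

```python
import re
from typing import Optional

# ASCII digits only, at least one of them.
DIGITS = re.compile(r"[0-9]+")


def parse_bounded_int(text: str, minimum: int, maximum: int) -> Optional[int]:
    """Return the number if it is well formed and in range, else None."""
    if DIGITS.fullmatch(text) is None:
        return None  # empty, signed, padded, or otherwise not plain digits
    value = int(text)
    if value < minimum or value > maximum:
        return None  # outside the legal range
    return value
```

Returning `None` for every failure keeps the caller from accidentally using a partially validated value.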

Do not assume that a negative number is impossible just because a minus sign is not allowed. Many number-reading routines "roll over" when they read an excessively large number, producing a negative value. In fact, a very clever attack against Sendmail was based on exactly this. Sendmail checked that "debug flag" values were not larger than the legal value, but it did not check whether the value was negative. The Sendmail developers assumed that because they did not allow minus signs, there was no need to check for negative numbers. The problem is that the number-reading routine converted values of 2^31 or larger, such as 4,294,967,269, into negative numbers. Attackers could use this to overwrite critical data and take control of Sendmail.
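The rollover can be reproduced with a little arithmetic. The sketch below is not Sendmail's code; `as_signed_32` simply models how a 32-bit two's-complement integer would hold the value, and `is_valid_flag_index` and the limit of 100 are hypothetical.

```python
def as_signed_32(n: int) -> int:
    """Interpret the low 32 bits of n as a two's-complement signed integer."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def is_valid_flag_index(n: int, limit: int) -> bool:
    """Check both bounds, not just the upper one."""
    return 0 <= n < limit


wrapped = as_signed_32(4294967269)     # the value from the text; becomes -27
upper_bound_only = wrapped < 100       # True: an upper-bound-only check passes
both_bounds = is_valid_flag_index(wrapped, 100)  # False: the lower bound catches it
```

The digits-only input never contained a minus sign, yet the stored value is negative, which is why the lower bound has to be checked explicitly.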

If you read a floating point number, there are further concerns. Many routines designed to read floating point numbers accept values such as NaN ("not a number"). This can really confuse later processing routines, because equality and ordering comparisons involving NaN are false (NaN is not even equal to itself!). You should also be aware of the other special values defined by standard IEEE floating point, such as positive and negative infinity and negative zero (as well as positive zero). Any input that your program is not prepared for may be exploited later.
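Python's built-in `float()` is one such permissive reader: it accepts strings like "nan" and "inf". A minimal sketch of a stricter wrapper, with the hypothetical name `parse_finite_float`, might look like this:

```python
import math
from typing import Optional


def parse_finite_float(text: str, minimum: float, maximum: float) -> Optional[float]:
    """Return a finite float within [minimum, maximum], else None."""
    try:
        value = float(text)  # note: float() itself accepts "nan" and "inf"
    except ValueError:
        return None
    if not math.isfinite(value):
        return None  # rejects NaN, +infinity, and -infinity
    if not (minimum <= value <= maximum):
        return None
    return value
```

The explicit `math.isfinite` test matters because a range check alone would only reject NaN by accident of how the comparison happens to be written.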

Strings

Likewise, with strings you need to determine which strings are legal and reject all others. The easiest way to specify legal strings is usually a regular expression: write a pattern describing which strings are legal, and discard any data that does not match it. For example, ^[A-Za-z0-9]+$ specifies that the string must be at least one character long and may contain only uppercase letters, lowercase letters, and the digits 0 through 9 (in any order). You can use regular expressions to restrict legal strings further (for example, by specifying which characters may come first). Regular expression libraries are available for nearly every language; Perl is built around regular expressions, and for C the functions regcomp(3) and regexec(3) are part of the POSIX.2 standard and widely available.

If you use regular expressions, be sure to indicate that the pattern must match from the beginning of the data (usually ^) to the end (usually $). If you forget to include ^ or $, an attacker can embed legal text inside an attack and slip past your check. If you use Perl with its multi-line option (m), watch out: you must use \A to mark the beginning and \Z to mark the end, because the multi-line option changes the meaning of ^ and $.
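The same pitfalls exist outside Perl. The Python sketch below (function names are hypothetical) compares an unanchored search, a `^...$` search, and a whole-string match. In Python, `$` also matches just before a trailing newline, and `re.MULTILINE` makes `^` and `$` match at every line boundary, so `re.fullmatch` or `\A...\Z` is the safer choice there.

```python
import re

ALNUM = r"[A-Za-z0-9]+"


def loose_check(text: str) -> bool:
    """Unanchored: succeeds if legal text appears anywhere in the input."""
    return re.search(ALNUM, text) is not None


def dollar_check(text: str) -> bool:
    """Anchored with ^ and $: still accepts one trailing newline in Python."""
    return re.search(r"^" + ALNUM + r"$", text) is not None


def multiline_check(text: str) -> bool:
    """With re.MULTILINE, ^ and $ match at the edges of any single line."""
    return re.search(r"^" + ALNUM + r"$", text, re.MULTILINE) is not None


def strict_check(text: str) -> bool:
    """The whole input, and nothing else, must match."""
    return re.fullmatch(ALNUM, text) is not None
```

Only `strict_check` rejects all three of the inputs "; rm -rf / #abc", "abc\n", and a multi-line string that merely contains one legal line.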

The biggest problem is deciding exactly which strings are legal. In general, be as restrictive as you can. Many characters cause special problems; wherever possible, do not allow characters that have special meaning to your program or to its eventual output. This turns out to be surprisingly hard, because so many characters cause problems in some situation.

Here is a partial list of characters that often cause trouble:

General control characters (character values less than 32): This includes character 0, traditionally called NUL; I call it NIL to distinguish it from C's NULL pointer. In C, NIL marks the end of a string. Even if you do not use C directly, many libraries call C routines, and those may misbehave if handed a NIL. Line terminators are another problem, because they may be interpreted as ending a command. Unfortunately, there are several line-ending encodings: UNIX-based systems use line feed (0x0a); CP/M and DOS-based systems (including Windows) use carriage return followed by line feed (0x0d 0x0a); Apple MacOS uses carriage return (0x0d); many IBM mainframes (such as OS/390) use next line (0x85); and some programs even (mistakenly) use the reverse of the CP/M convention (0x0a 0x0d).
Characters with values greater than 127: These are used for international characters, but the problem is that they can have many possible meanings, so you need to make sure they are interpreted correctly. Often these are UTF-8 encoded characters, which have their own complications; see the discussion of UTF-8 later in this article.
Metacharacters: Metacharacters are characters that have special meaning to programs or libraries you depend on, such as a shell or SQL.
Characters with special meaning in your program: For example, characters used as delimiters. Many programs store data in text files and separate the fields with commas, tabs, or colons; you need to reject or encode user data that contains these values. A common problem today is the less-than sign (<), because XML and HTML use it.

This is not an exhaustive list, and you will often need to accept some of these characters. Later articles will discuss how to handle them when you must accept them. The point of the list is to persuade you to accept as little as possible and to think carefully before accepting another character. The fewer characters you accept, the harder you make things for attackers.

More Specialized Data Types

Of course, there are more specialized types of data. Here is a brief look at a few of them.

File Names

If the data is a filename (or will be used to create one), be very restrictive. Ideally, do not let users choose filenames at all; if you must, restrict the characters to a small pattern such as ^[A-Za-z0-9][A-Za-z0-9._\-]*$. Consider excluding from the legal pattern "/", control characters (especially newline), and a leading "." (which marks hidden files on UNIX/Linux systems). A leading "-" is also a bad idea, because poorly written scripts may interpret it as an option: if there is a file named "-rf", the command rm * on UNIX/Linux becomes rm -rf *. It is also a good idea to exclude "../" from the pattern so that attackers cannot "break out" of the current directory. If possible, do not allow globbing (selecting groups of files with the characters *, ?, [], and {}); an attacker can bring a system to a halt by crafting bizarre globbing patterns.

Windows has an additional problem: some filenames (ignoring extension and letter case) are always treated as physical devices. For example, if a program tries to open "COM1" (or even "com1.txt") in any directory, the system will interpret this as an attempt to talk to the serial port. Because this column focuses on UNIX-like systems, I will not go into how to deal with this problem in depth, but it is worth noting as an example of a case where checking for legal characters is not enough.
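A minimal Python sketch of both checks follows. The character pattern is the one given above; the 64-character limit, the names `SAFE_FILENAME`, `WINDOWS_DEVICES`, and `is_safe_filename`, and the particular device-name list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) are assumptions added for illustration and should be checked against the platform's own documentation.

```python
import re

# The pattern from the text, applied to the whole string via fullmatch.
SAFE_FILENAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._\-]*")

# Assumed list of reserved Windows device names (see lead-in).
WINDOWS_DEVICES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_safe_filename(name: str, max_length: int = 64) -> bool:
    """Allow only short names made of safe characters that are not device names."""
    if len(name) > max_length or SAFE_FILENAME.fullmatch(name) is None:
        return False
    # Device names are matched ignoring letter case and any extension.
    stem = name.split(".", 1)[0].upper()
    return stem not in WINDOWS_DEVICES
```

Note that "com1.txt" passes the character pattern and is rejected only by the second test, which is exactly the kind of gap the Windows example illustrates.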

Locales

In today's global economy, many programs must let each user specify the language to display and other culture-specific information (such as number formats and character encodings). Programs get this information through a "locale" value supplied by the user. For example, the locale "en_US.UTF-8" indicates that the language is English, that United States conventions apply, and that the character encoding is UTF-8. Local programs on UNIX-like systems get this information from environment variables (usually LC_ALL, but possibly the more specific LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME; other values to check are NLSPATH, LANGUAGE, LANG, and LINGUAS). Web applications can get it from the Accept-Language request header or by other means.

Because the user may be an attacker, the locale value needs to be validated. I recommend making sure that locales match this pattern:

^[A-Za-z][A-Za-z0-9_,+@\-\.=]*$

How I created this validation pattern is more instructive than the pattern itself. First, I searched the relevant standards and library documentation to determine what correct locales should look like. There are several conflicting standards, so I had to make sure that the final pattern would accept every locale they define. It soon became clear that only the characters listed above were needed, and that restricting the character set (especially the first character) would eliminate many problems. I then thought about values that are commonly dangerous (such as "/", the directory separator; a bare "..", the parent directory; a leading dash; or an empty locale) and confirmed that the pattern would reject them.
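Applied in Python, the pattern above might be used as in this sketch. The function name `safe_locale` and the fallback to the "C" locale are assumptions for the example; the text does not say what to do when validation fails.

```python
import re

# The locale pattern from the text, applied to the whole string via fullmatch.
LOCALE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_,+@\-\.=]*")


def safe_locale(value: str, default: str = "C") -> str:
    """Return the locale if it matches the pattern, else a fixed default."""
    if LOCALE_PATTERN.fullmatch(value) is None:
        return default
    return value
```

A caller could pass in something like `os.environ.get("LC_ALL", "")` and use only the returned value from that point on.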

UTF-8

Internationalization affects programs in another way: character encoding. Handling text requires some convention for converting characters into the numbers that computers actually process; these conventions are called character encodings. A particularly common encoding is UTF-8, an excellent character encoding that can represent essentially any character in any language. UTF-8 is especially attractive because plain ASCII text is a simple subset of it. As a result, programs originally designed for ASCII can often be upgraded to handle UTF-8 easily, and in some cases need no changes at all.

But like all good things, UTF-8 has a weakness. Some UTF-8 characters are represented by one byte, others by two, others by three, and so on, and programs are supposed to always generate the shortest possible representation. However, many UTF-8 readers will accept "overlong" sequences; for example, certain three-byte sequences may be interpreted as a character that should be represented in two bytes. Attackers can exploit this to slip past data validation and attack programs. Your filter may reject the hexadecimal bytes 2F 2E 2E 2F ("/../"), but if it accepts the UTF-8 bytes 2F C0 AE 2E 2F, the program may well interpret those as "/../" too. So if you accept UTF-8 text, make sure that every character uses the shortest possible UTF-8 encoding, and reject any text that is not in shortest form. Many languages have tools for this, and it is not hard to write your own. Note that the sequence C0 80 is an overlong encoding of NIL (character 0); some languages (such as Java) treat this particular sequence as acceptable.
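As one example of such a tool, Python 3's built-in UTF-8 decoder in strict mode rejects overlong sequences. The sketch below wraps it in a hypothetical helper, `decode_strict_utf8`, and assumes the validation of the decoded text (for example against "/../") happens afterward.

```python
from typing import Optional


def decode_strict_utf8(data: bytes) -> Optional[str]:
    """Decode UTF-8, returning None for any malformed or overlong sequence."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


plain = decode_strict_utf8(bytes.fromhex("2f2e2e2f"))       # "/../"
overlong = decode_strict_utf8(bytes.fromhex("2fc0ae2e2f"))  # None: C0 AE is overlong
overlong_nil = decode_strict_utf8(bytes.fromhex("c080"))    # None: overlong NIL
```

Decoding first and filtering second means the filter sees the same characters the rest of the program will see.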

Email Address

Many programs must accept email addresses, but handling every possible legal email address (as specified in RFC 2822 and RFC 822) is surprisingly difficult. Jeffrey Friedl's "short" regular expression for checking email addresses is 4,724 characters long, and even that does not cover every case. However, most programs can be very strict and accept only a tightly restricted subset of addresses and still work well. In most cases, it is fine to reject technically legal addresses such as "John Doe <john.doe@somewhere.com>" as long as the program accepts ordinary Internet addresses of the form "name@domain" (such as "john.doe@somewhere.com"). Viega and Messier's 2003 book includes a subroutine that performs this check.
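A deliberately strict "name@domain" check might look like the following Python sketch. This is not the routine from Viega and Messier's book; the pattern, the 254-character limit, and the name `is_simple_email` are assumptions chosen for illustration, and the pattern knowingly rejects many addresses that the RFCs allow.

```python
import re

# Hypothetical restricted subset: name@domain, where the domain has at
# least one dot and neither part starts with punctuation.
SIMPLE_EMAIL = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._+-]*"
    r"@"
    r"[A-Za-z0-9][A-Za-z0-9-]*(?:\.[A-Za-z0-9][A-Za-z0-9-]*)+"
)


def is_simple_email(address: str, max_length: int = 254) -> bool:
    """Accept only short addresses in the restricted name@domain form."""
    return len(address) <= max_length and SIMPLE_EMAIL.fullmatch(address) is not None
```

Requiring the local part to start with a letter or digit also rules out addresses that begin with a dash, which could otherwise be mistaken for an option if the address is ever passed to a command-line mail program.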

Cookies

Web applications often use cookies for important data. As I will discuss later, it is important to remember that users can reset cookie values and formats to anything they like. However, there is one important validation trick worth mentioning now. If you accept a cookie value, check that its domain value is what you expect (for example, one of your own sites). Otherwise, a (possibly compromised) related site may be able to insert spoofed cookies. For more information, see IETF RFC 2965.

HTML

Sometimes your program takes data from an untrusted user and passes it on to another user. If the second user's program could be harmed by that data, it is your responsibility to protect the second user. Attacks that use a seemingly trustworthy intermediary to deliver malicious data are called "cross-site malicious content" attacks.

This problem is especially difficult for web applications, such as community bulletin boards that let users add a running commentary. In this case, attackers can try to add comments in HTML that contain malicious scripts, image tags, and so on; the goal is to make other users' browsers run the malicious code when they view the text. Because attackers are usually trying to add malicious scripts, this is called a cross-site scripting (XSS) attack.

Generally, the best way to prevent such attacks is to verify that any HTML you accept cannot contain such malicious content. Once again, enumerate what you know to be safe and forbid everything else.

In HTML, you can usually accept at least the following tags, along with their closing tags:

<p> (paragraph)
<b> (bold)
<i> (italic)
<em> (emphasis)
<strong> (strong emphasis)
<pre> (preformatted text)
<br> (forced line break; note that it does not require a closing tag)

Remember that HTML tags are case-insensitive. Do not accept any attributes unless you have checked the attribute type and its value; many attributes support things like JavaScript that can cause trouble for your users.
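One way to sketch such an allowlist in Python is a single whole-string pattern: the input may contain ordinary text, a few named character entities, and the bare tags p, b, i, em, strong, pre, and br, with no attributes at all. The entity list and the names `ALLOWED_HTML` and `is_allowed_html` are assumptions for this example. The check is a validator, not a sanitizer, and it does not verify that tags are balanced.

```python
import re

# Every "<", ">", and "&" must belong to an allowed bare tag or entity.
ALLOWED_HTML = re.compile(
    r"(?:"
    r"[^<>&]"                          # ordinary text
    r"|&(?:amp|lt|gt|quot);"           # a few named entities (assumed list)
    r"|</?(?:p|b|i|em|strong|pre)>"    # allowed tags, no attributes
    r"|<br>"                           # line break has no closing tag
    r")*",
    re.IGNORECASE,                     # HTML tags are case-insensitive
)


def is_allowed_html(fragment: str) -> bool:
    """Accept the fragment only if every piece of markup is on the allowlist."""
    return ALLOWED_HTML.fullmatch(fragment) is not None
```

Because attributes are never matched, inputs such as a bold tag carrying an onmouseover handler are rejected outright rather than repaired.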

You can certainly expand this set, but be careful. Be especially wary of any tag that makes the viewer's browser load another file immediately, such as the image tag; such tags are a natural vehicle for XSS attacks.

Another problem is making sure that an attacker cannot disrupt the rest of the document. In particular, make sure that no comment or fragment can be made to look like official content. One approach is to require that any XML or HTML markup be properly balanced (everything that is opened is also closed); in XML terms, such data is called "well-formed." If you are accepting standard HTML, you probably should not require this of paragraph tags (<p>), because they are often left unclosed.

In many cases you will also want to accept <a> (hyperlink), which requires the "href" attribute. If you must do so, you need to validate the URI/URL being linked to, which is our next topic.

URI/URL

Technically, a hypertext link can be any "uniform resource identifier" (URI), though most people know only one kind of URI, the "uniform resource locator" (URL). Many users will blindly click a hyperlink to a URI, assuming that it is at least safe to display. As a developer, your job is to make sure those expectations are met.

Although URIs are very flexible, if you accept a URI from a potential attacker, you need to check it before passing it on to anyone else. Attackers can put many odd things in a URI to fool users. For example, they can include queries that cause users to do things they did not intend, or fool users into thinking they are going to one site when they are actually going to another. Unfortunately, it is hard to give a single pattern that works in all situations. However, a safe pattern that prevents most attacks while still letting most useful links through (for public web sites) is:

^(http|ftp|https)://[-A-Za-z0-9._/]+$

A more sophisticated pattern is:

^(http|ftp|https)://[-A-Za-z0-9._]+(\/([A-Za-z0-9\-\_\.\!\~\*\'\%\?]+))*/?$

If your needs are more complex, you will need a more sophisticated pattern to check the data; you can find alternatives in my book (listed in the References).
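For illustration, the two URL patterns above can be tried out in Python as follows. The character classes are the same as in the text, with redundant backslashes dropped and the ^ and $ anchors replaced by `re.fullmatch`; the function names are hypothetical.

```python
import re

# Simple pattern: scheme, then host and path from one small character set.
SIMPLE_URL = re.compile(r"(http|ftp|https)://[-A-Za-z0-9._/]+")

# More sophisticated pattern: host, then zero or more "/segment" parts.
COMPLEX_URL = re.compile(
    r"(http|ftp|https)://[-A-Za-z0-9._]+"
    r"(/([A-Za-z0-9\-_.!~*'%?]+))*/?"
)


def is_simple_url(url: str) -> bool:
    """True if the whole string matches the simple pattern."""
    return SIMPLE_URL.fullmatch(url) is not None


def is_complex_url(url: str) -> bool:
    """True if the whole string matches the more sophisticated pattern."""
    return COMPLEX_URL.fullmatch(url) is not None
```

Both patterns reject other schemes such as "javascript:" and reject "@" in the host part, which blocks some common ways of disguising where a link really goes.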

Data Files

Complex data files and data structures are usually made up of many smaller components. Simply break the file or structure down and check each piece. If there are specific dependencies among the components, check those as well. Writing this code can be a little tedious at first, but it pays off in reliability: many mysterious problems simply disappear once illegal data is rejected at the outset.
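As a small sketch of this piece-by-piece approach, consider a made-up record format of three colon-separated fields, "name:low:high". The format, the field patterns, and the name `parse_record` are all hypothetical. Each field is checked on its own, and then the dependency between the two numbers is checked.

```python
import re
from typing import Optional, Tuple

NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")  # hypothetical field rules
NUMBER = re.compile(r"[0-9]{1,5}")


def parse_record(line: str) -> Optional[Tuple[str, int, int]]:
    """Validate each field, then the relationship between fields."""
    fields = line.split(":")
    if len(fields) != 3:
        return None  # wrong number of components
    name, low_text, high_text = fields
    if NAME.fullmatch(name) is None:
        return None
    if NUMBER.fullmatch(low_text) is None or NUMBER.fullmatch(high_text) is None:
        return None
    low, high = int(low_text), int(high_text)
    if low > high:
        return None  # dependency between components: low must not exceed high
    return name, low, high
```

A record such as "ports:2048:1024" passes every individual field check and is rejected only by the cross-field test.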

Conclusion

Clearly, there are many different kinds of data to check. But where does that data enter your program? The answer depends on the kind of program. In fact, your program may receive an attacker's data through channels you never considered. I will discuss that in the next installment.

References

For more information, see the original article on the developerWorks global site.
Read the first installment of David's secure programming series, "Secure Programming: Developing secure programs".
David's Secure Programming for Linux and Unix HOWTO (Wheeler, September 2003) gives a detailed account of how to develop secure software.
CERT Advisory CA-2003-18, "Integer Overflows in Microsoft Windows DirectX MIDI Library", describes the MIDI library vulnerabilities and links to more details.
Jeffrey Friedl's Mastering Regular Expressions (O'Reilly & Associates, 1997) is a good book on how to write regular expressions.
"[RHSA-2000:057-04] glibc vulnerabilities in ld.so, locale and gettext" describes how a local user could escalate privileges through a flaw in locale handling.
Matt Bishop's "How Attackers Break Programs, and How To Write Programs More Securely" is a set of slides from SANS 2002 on writing secure programs; slides 64-66 cover the integer overflow in Sendmail.
John Viega and Matt Messier's Secure Programming Cookbook (O'Reilly & Associates, February 2003) contains many code snippets that can be used to validate data.
IETF Request for Comments (RFC) 2965, "HTTP State Management Mechanism", by Kristol and Montulli, discusses some of the security issues related to web cookies.
"Server clinic: Linux security" lists some ways to keep user accounts secure.
"Software security principles" discusses the most important issues to keep in mind when building a secure system.
"Building secure software" focuses on techniques that help you create secure code. The series has two parts: the first covers the choice of programming language and deployment platform, and the second covers operating systems and authentication technologies.
For more information about IBM security, visit the IBM Research Security group home page.
"Secure Internet applications on the AS/400 system" gives an overview of SSL and covers OS/400 applications that use SSL.
The IBM Linux Technology Center supports several security projects on Linux, including Linux Security Modules and a GCC extension for protecting applications from stack-smashing attacks.
"Enterprise Security for Linux" is a white paper on IBM Tivoli Access Manager for Linux. You can find more information about Tivoli in the developerWorks Tivoli Developer Domain.
Find more Linux articles in the developerWorks Linux zone.

About the author


David A. Wheeler is an expert in computer security who has long worked on improving development techniques for large and high-risk software systems. He is the author of the book Secure Programming for Linux and Unix HOWTO and is a validated Common Criteria evaluator. David also wrote the article "Why Open Source Software/Free Software? Look at the Numbers!" and the Springer-Verlag book Ada95: The Lovelace Tutorial, and is co-author and lead editor of the IEEE book Software Inspection: An Industry Best Practice. This developerWorks article presents the opinions of the author and does not necessarily represent the position of the Institute for Defense Analyses. You can contact David at dwheelerNOSPAM@dwheeler.com.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
