Validating Input: Best Practices for Receiving User Data

Last Update:2017-10-08 Source: Internet

Author: User

In July 2003, the Computer Emergency Response Team Coordination Center (CERT/CC) reported a set of dangerous vulnerabilities in the Microsoft Windows DirectX MIDI library. The DirectX MIDI library is the underlying Windows library used to play MIDI music. Unfortunately, this library failed to check all the data values in a MIDI file; incorrect values in the text, copyright, or MThd track fields could cause the library to fail, and attackers could exploit this failure to make the system execute any code they wanted. This was especially dangerous because Internet Explorer automatically loads and plays a MIDI file when the user views a webpage that links to one. The result? An attacker only needed to publish a webpage; when a user viewed it, the attacker could make the user's computer delete all its files, email confidential files elsewhere, crash, or do anything else the attacker wanted.
　　
　　 Check Input
In nearly every secure program, your first line of defense is to check every piece of data you receive. If you can keep malicious data out of your program, or at least keep the program from processing it, your program becomes much harder to attack. This is similar to how a firewall protects a computer: it cannot prevent every attack, but it does make the program more resilient. This process is called checking, validating, or filtering your input.
　　
An important question is where to perform the check: when the data first enters the program, or when a low-level routine actually uses it? Generally, it is best to check in both places. That way, even if an attacker gets past one line of defense, they run into another. The most important rule is that all data must be checked before it is used.
　　
　　 Mistake: Looking for Incorrect Input
One of the biggest mistakes developers of secure programs make is trying to look for "invalid" data values. This is a mistake because attackers are clever and can often think of yet another dangerous value. Instead, determine which data is valid, check that the data matches that definition, and reject everything that does not. For security, be especially strict at first and allow only data you know to be valid. After all, if your restrictions are too strict, users will quickly report that the program rejects legitimate data. If your restrictions are too loose, on the other hand, you may not find out until after the program has been compromised.
　　
For example, suppose you want to create a file name based on user input. You may know that the input should not contain "/", but checking for that one character is probably not enough. What about control characters? Will spaces cause a problem? What if the name begins with a dash (which can cause trouble in poorly written code)? Will certain special phrases cause problems? In most cases, if you build a list of "invalid" characters, attackers will still find a way to exploit your program. Instead, check that the input matches a specific pattern that you believe is safe, and reject any input that does not match it.
　　
It is still a good idea to identify values you know are dangerous: you can use them (in your head) to test your validation routines. For example, if you know that "/" is dangerous, you can check your pattern to make sure it does not let that character through.
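The contrast between the two approaches can be sketched in Python. This is only an illustration under stated assumptions: the function names, the sample inputs, and the particular allowed character set are invented for this example and are not taken from any specific program.

```python
import re

def denylist_ok(name: str) -> bool:
    """Hypothetical flawed check: rejects only the one character we thought of."""
    return "/" not in name

# Hypothetical allowlist: the accepted character set is an assumption.
ALLOWED = re.compile(r"[A-Za-z0-9]+")

def allowlist_ok(name: str) -> bool:
    """Accept only input that matches the trusted pattern from start to end."""
    return ALLOWED.fullmatch(name) is not None

# Inputs a denylist author might never have considered:
# a leading dash, a trailing newline, a space, and a NUL character.
suspicious = ["-rf", "report\n", "a b", "\x00"]

for s in suspicious:
    print(repr(s), "denylist:", denylist_ok(s), "allowlist:", allowlist_ok(s))
```

The known-dangerous value "/" is still useful here as a test case: feeding it to the allowlist check confirms that the pattern rejects it.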
　　
Of course, all of this raises the question: what is a legal value? The answer depends on the type of data you expect. The following sections discuss some common data types and how to handle them.
　　
　　 Numbers
Let's start with what seems like the easiest kind of information to read: numbers. If you expect a number, make sure the data is in numeric format; for example, that it consists only of Arabic digits and contains at least one of them (you can check this with the regular expression ^[0-9]+$). In most cases there is also a minimum and a maximum value; if so, make sure the data falls within the valid range.
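A minimal Python sketch of this two-step check (format first, then range) might look as follows. The function name, the MAX_DIGITS limit, and the choice to return None on failure are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed cap on input length so that absurdly long digit strings
# are rejected before any conversion is attempted.
MAX_DIGITS = 10

# Explicit [0-9] is used instead of \d, because \d in Python also
# matches digits from other scripts.
DIGITS = re.compile(r"[0-9]{1,%d}" % MAX_DIGITS)

def parse_bounded_int(text: str, minimum: int, maximum: int) -> Optional[int]:
    """Return the integer if text is 1..MAX_DIGITS ASCII digits within
    [minimum, maximum]; otherwise return None."""
    # fullmatch requires the entire string to match the pattern.
    if DIGITS.fullmatch(text) is None:
        return None
    value = int(text)
    if value < minimum or value > maximum:
        return None
    return value
```

For example, parse_bounded_int("42", 0, 100) returns 42, while "101", "-5", "42\n", and the empty string are all rejected.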
　　
Do not assume that a number cannot be negative just because the input has no minus sign. In many number-reading routines, a very large number "overflows" and becomes negative. In fact, one very clever attack against Sendmail was based on this principle. Sendmail checked that a "debug flag" value was not larger than the legal maximum, but did not check whether the value was negative. Sendmail's developers assumed that because they did not allow a minus sign, there was no need to check for negative input. The problem is that the number-reading routine converted numbers larger than 2^31, such as 4,294,967,269, into negative numbers. Attackers could exploit this to overwrite critical data and take control of Sendmail.
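The wraparound can be reproduced with a few lines of arithmetic. The sketch below simulates how a 32-bit signed integer would store a large value; it is not Sendmail's code, and MAX_FLAG and the two check functions are hypothetical names chosen for this example.

```python
def as_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement integer."""
    return ((value + 2**31) % 2**32) - 2**31

# Hypothetical upper limit for an index into a small table.
MAX_FLAG = 100

def upper_bound_only(value: int) -> bool:
    """Flawed check: tests only the maximum, like the check described above."""
    return value <= MAX_FLAG

def both_bounds(value: int) -> bool:
    """Safer check: tests the minimum as well as the maximum."""
    return 0 <= value <= MAX_FLAG

wrapped = as_int32(4294967269)
print(wrapped, upper_bound_only(wrapped), both_bounds(wrapped))
```

Here 4,294,967,269 is 2^32 - 27, so it wraps to -27: the input contains no minus sign, yet the stored value is negative and slips past the upper-bound-only check.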
　　
If you read a floating-point number, there are further things to watch for. Many routines designed to read floating-point numbers accept values such as NaN ("not a number"). This can cause real problems for later processing, because any ordered or equality comparison involving NaN is false (NaN is not even equal to itself!). You also need to be aware of the other special values defined by standard IEEE floating point, such as positive infinity, negative infinity, and negative zero (as well as positive zero). Any input your program has not accounted for may be exploited later.
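As a rough Python sketch of these concerns: the built-in float() accepts strings such as "nan" and "inf", so a validator has to reject them explicitly. The function name and the None-on-failure convention are assumptions for this example; note also that float() is lenient about surrounding whitespace, which a stricter validator might want to screen out with a pattern first.

```python
import math
from typing import Optional

def parse_finite_float(text: str, minimum: float, maximum: float) -> Optional[float]:
    """Return a finite float within [minimum, maximum], or None if the
    input is not a number, is NaN, or is infinite."""
    try:
        value = float(text)
    except ValueError:
        return None
    # Rejects NaN, positive infinity, and negative infinity.
    if not math.isfinite(value):
        return None
    if not (minimum <= value <= maximum):
        return None
    return value

def is_negative_zero(value: float) -> bool:
    """Negative zero compares equal to zero, so inspect the sign bit instead."""
    return value == 0.0 and math.copysign(1.0, value) < 0

nan = float("nan")
# Both comparisons below are False, which is why range checks alone miss NaN.
print(nan == nan, nan < 1.0)
```

Negative zero passes a range check such as 0.0 to 10.0, so a program that cares about the sign needs a separate test like is_negative_zero.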
　　
　　 Strings
Likewise, you must determine which strings are valid and reject all others. The simplest way to specify valid strings is usually a regular expression: write a regular expression that describes which strings are valid, and discard any data that does not match the pattern. For example, ^[A-Za-z0-9]+$ specifies that the string must be at least one character long and may contain only uppercase letters, lowercase letters, and the digits 0 through 9 (in any order). You can use regular expressions to restrict the allowed strings further (for example, you can specify which characters may appear first). Perl has regular expressions built in. For C, the regcomp(3) and regexec(3) functions are part of the POSIX.2 standard and are widely available.
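For instance, a rule that also restricts the first character could be sketched in Python as follows. The rule itself (a letter first, at most 32 characters in total) and the function name are assumptions chosen only to illustrate the idea.

```python
import re

# Assumed rule: first character must be a letter, followed by up to
# 31 letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]{0,31}")

def is_valid_identifier(text: str) -> bool:
    """Return True only if the whole string matches the allowed pattern."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["alice", "user42", "42user", "bob smith", ""]:
    print(repr(candidate), is_valid_identifier(candidate))
```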
　　
If you use a regular expression, be sure to mark both the start (usually ^) and the end (usually $) of the data to be matched. If you forget to include ^ or $, attackers can embed legal text inside their attack to get past your check. If you are using Perl with its multi-line option (m), note that you must use \A to mark the start and \Z to mark the end, because multi-line mode changes the meaning of ^ and $.
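The same pitfalls can be demonstrated in Python, whose re module behaves similarly here. The function names and the sample attack string are invented for this sketch. One difference worth knowing: in Python, \Z matches only at the very end of the string, which corresponds to \z rather than \Z in Perl.

```python
import re

ALNUM = r"[A-Za-z0-9]+"

def unanchored_ok(text: str) -> bool:
    """Flawed: succeeds if any part of the text matches."""
    return re.search(ALNUM, text) is not None

def multiline_ok(text: str) -> bool:
    """Flawed: with re.MULTILINE, ^ and $ match at every line boundary,
    so a single clean line is enough to pass."""
    return re.search(r"^" + ALNUM + r"$", text, re.MULTILINE) is not None

def strict_ok(text: str) -> bool:
    """\\A and \\Z anchor to the start and end of the whole string,
    even when re.MULTILINE is set."""
    return re.search(r"\A" + ALNUM + r"\Z", text, re.MULTILINE) is not None

# Legal text on the first line, hostile text on the second.
attack = "ok\nrm -rf /"
print(unanchored_ok(attack), multiline_ok(attack), strict_ok(attack))
```

Even without the multi-line flag, $ in Python also matches just before a trailing newline, so "ok\n" passes ^[A-Za-z0-9]+$ but not the \A...\Z version; re.fullmatch is another way to get the strict behavior.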
　　
The biggest problem is deciding exactly which strings are valid. In general, you should be as strict as possible. Many characters can cause particular problems; wherever possible, do not allow characters that have a special meaning within the program or in its eventual output. This turns out to be quite difficult, because in some situations so many characters can cause problems.
　　
Here is a list of characters that often cause problems:
　　
Normal control characters (character values less than 32): These include character 0, traditionally called NUL; I call it NIL to distinguish it from the NULL pointer in C. In C, NIL marks the end of a string. Even if you do not use C directly, many libraries call C routines, and they may misbehave if given NIL. Another problem is line terminators, which can be interpreted as ending a command. Unfortunately, there are several line-ending encodings: UNIX-based systems use line feed (0x0a); CP/M and DOS-based systems (including Windows) use carriage return followed by line feed (0x0d 0x0a); Apple MacOS uses carriage return (0x0d); many IBM mainframes (such as OS/390) use next line (0x85); and some programs even (mistakenly) use the reverse of the CP/M marker (0x0a 0x0d).
Character values greater than 127: These are used for international characters, but the problem is that they can have many possible meanings, so you need to make sure they are interpreted correctly. Often they are UTF-8 encoded characters, which bring complications of their own; see the discussion of UTF-8 later in this article.
Metacharacters: These are characters that have a special meaning to the programs or libraries you depend on, such as a shell or SQL.
Characters with a special meaning in your own program: for example, characters used as delimiters. Many programs store data in text files and separate the data fields with commas, tabs, or colons. You need to reject or encode data that contains these values. A common problem today is the less-than sign (<), because XML and HTML use it.
This is not an exhaustive list, and you will often need to accept some of these characters. Later articles will discuss how to handle them when you must accept them. The purpose of this list is to persuade you to accept as little data as possible, and to think carefully before accepting another character. The fewer characters you accept, the harder you make it for attackers.
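The categories above can be turned into a small screening helper. The sketch below is illustrative only: the METACHARACTERS set is an assumed sample, not a complete list for any particular shell or database, and the function name and category labels are invented for this example.

```python
from typing import List, Tuple

# Assumed sample of characters that are special to shells, SQL, or HTML/XML.
# A real program would derive this set from the components it actually uses.
METACHARACTERS = set("<>&;|`$'\"\\*?")

def classify_risky_chars(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, character, category) for each character that falls
    into one of the problem categories described above."""
    findings = []
    for index, ch in enumerate(text):
        code = ord(ch)
        if code < 32:
            findings.append((index, ch, "control"))
        elif code > 127:
            findings.append((index, ch, "non-ascii"))
        elif ch in METACHARACTERS:
            findings.append((index, ch, "metacharacter"))
    return findings

for sample in ["plain", "a\x00b", "line\r\n", "1 < 2", "caf\u00e9"]:
    print(repr(sample), classify_risky_chars(sample))
```

A report like this is useful for testing and logging; for the actual accept-or-reject decision, matching against a pattern of allowed characters remains the safer approach.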
　　
　　 More Special Data Types
Of course, there are more special data types. Here is a brief introduction to some of them.
　　
　　 File Name
If the data is a file name (or will be used to create one), restrict it strictly. It is better not to let the user choose the file name at all; if you must, restrict the characters to a small pattern such as ^[A-Za-z0-9][A-Za-z0-9._\-]*$. You should consider excluding "/", control characters (especially newline), and a leading "." (which marks hidden files on UNIX/Linux systems) from the valid pattern. A leading "-" is also a bad idea, because poorly written scripts will interpret it as an option: if there is a file named "-rf", then on UNIX/Linux the command rm * becomes rm -rf *. Excluding "../" from the pattern is also a good idea, so attackers cannot "break out" of the current directory. If possible, do not allow globbing (using the characters *, ?, [], and {} to select a group of files). Attackers can craft bizarre glob patterns that the system handles poorly, bringing it to a halt.
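Applied in Python, the file name pattern suggested above could be used like this. The function name is an assumption, and re.fullmatch is used in place of ^ and $ so that a trailing newline cannot slip through.

```python
import re

# The pattern from the text: first character alphanumeric, then
# alphanumerics, dot, underscore, or dash.
SAFE_FILENAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._\-]*")

def is_safe_filename(name: str) -> bool:
    """Return True only if the entire name matches the restricted pattern."""
    return SAFE_FILENAME.fullmatch(name) is not None

candidates = [
    "report.txt",      # ordinary name
    "-rf",             # would be read as an option by careless scripts
    ".hidden",         # leading dot
    "../etc/passwd",   # directory traversal
    "a*b",             # glob character
    "notes\n",         # trailing newline
]
for name in candidates:
    print(repr(name), is_safe_filename(name))
```

Because "/" is never allowed, a name such as "a..b" is accepted but cannot be used to leave the current directory.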
　　
Windows has an additional problem: some file names (ignoring extension and letter case) are always treated as physical devices. For example, if a program tries to open COM1, or even com1.txt, in any directory, the system assumes the program is trying to talk to the serial port. Because I am focusing on UNIX-like systems, I will not go into how to solve this problem; it is worth noting, though, as an example of a case where checking for valid characters alone is not enough.
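Although the text does not go into a solution, the shape of such a check can be sketched briefly. The list of reserved names below is a commonly cited set and should be treated as illustrative rather than exhaustive; the function name is an assumption.

```python
# Commonly cited Windows device names; treat this set as an assumption
# and consult the platform documentation for the authoritative list.
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)

def is_windows_device_name(filename: str) -> bool:
    """Return True if the part before the first dot, ignoring case and
    trailing spaces, is a reserved device name."""
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED_DEVICE_NAMES

for name in ["COM1", "com1.txt", "Nul.tar.gz", "command.txt"]:
    print(repr(name), is_windows_device_name(name))
```

Every one of these names consists entirely of "safe" characters, which is exactly why a character-level pattern alone does not catch them.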
　　
　　 Localization
In today's global economy, many programs let each user see information presented according to their language and other conventions (such as number formats and character encoding). The program receives this information as a "locale" value. For example, the locale value "en_US.UTF-8" indicates that the language is English, that United States conventions are used, and that the character encoding is UTF-8. Local programs on UNIX-like systems obtain this information from environment variables (usually LC_ALL, though it can be set more specifically through LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME; other values to check are NLSPATH, LANGUAGE, LANG, and LINGUAS). Web applications can obtain this information from the Accept-Language request header or by other means.
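Since a locale value arrives from the environment or from a request, it can be screened like any other string. The sketch below is one possible approach under stated assumptions: the pattern is a conservative guess at what locale names look like, not a rule given in this text, and the function names are invented. The environment is passed in as a dictionary so the example can be run without changing real settings.

```python
import re
from typing import Dict, Optional

# Assumed conservative pattern: starts with a letter, then letters, digits,
# and a few punctuation characters that appear in names like "en_US.UTF-8".
# Notably, "/" is not allowed.
LOCALE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_,+@\-.=]*")

def is_plausible_locale(value: str) -> bool:
    """Return True only if the whole value matches the assumed pattern."""
    return LOCALE_PATTERN.fullmatch(value) is not None

def locale_from_env(env: Dict[str, str]) -> Optional[str]:
    """Return the first of LC_ALL or LANG that passes the check, else None."""
    for name in ("LC_ALL", "LANG"):
        value = env.get(name)
        if value is not None and is_plausible_locale(value):
            return value
    return None

print(locale_from_env({"LC_ALL": "../../etc/passwd", "LANG": "en_US.UTF-8"}))
```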
　　
Because the user may be an attacker, we need
