Thoughts and conclusions on XSS prevention
I recently read a number of articles on web security. Most topics have systematic and complete solutions, but the material on XSS (cross-site scripting) attacks is messy; even the cases that HTML entity escaping can solve are not clearly explained.
After going through a pile of material, I thought I had better record some of my thoughts on it.
Note that the way to prevent XSS differs depending on where the user input is output:
Text section of the HTML Tag, for example:
<a href="{user_input}">{user_input}</a>
<button onclick="{user_input}">
<link ref="{user_input}" href="{user_input}">
<script src="{user_input}">
If user_input contains HTML tags, both the appearance of the page (an img tag can be added) and its logic (a script tag can be added) can be tampered with. Therefore, at least convert the < character to &lt; so that the input cannot open a new tag or close an existing one.
A lone > character cannot close a tag, so it can be left unescaped. However, when the output is XML (such as RSS) or XHTML, a lone > may cause parsing to fail, so it needs to be escaped in that case.
In addition, when a user types &amp;, they expect the page to show &amp;, but it shows & instead, which is not what they intended. So the & character should also be converted to &amp;.
Another small problem is that consecutive spaces and line breaks are collapsed into a single space. This can be handled with the white-space: pre-wrap style, by replacing them with &nbsp; and <br>, or simply ignored.
In summary, the solution here is to escape at least the < and & characters:
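def escape_html_text(string):
    # Escape & first so that the & in &lt; is not escaped a second time.
    return string.replace('&', '&amp;').replace('<', '&lt;')

function escape_html_text(string) {
  // Use global regular expressions: a string pattern only replaces the first match.
  return string.replace(/&/g, '&amp;').replace(/</g, '&lt;');
}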
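As a rough sketch of the whitespace handling mentioned above, the escaping step can be combined with a conversion of spaces and line breaks. The helper name text_to_html and the choice of &nbsp; and <br> as replacements are my own assumptions for illustration; the point is only that escaping has to happen before any markup is inserted.

```python
def escape_html_text(string):
    # Escape & first so that the & in &lt; is not escaped a second time.
    return string.replace('&', '&amp;').replace('<', '&lt;')


def text_to_html(string):
    # Hypothetical helper: escape first, then add markup. Doing it the
    # other way round would escape the <br> tags inserted below.
    escaped = escape_html_text(string)
    # Keep runs of spaces visible: every pair of spaces becomes a normal
    # space followed by a non-breaking space.
    escaped = escaped.replace('  ', ' &nbsp;')
    # Normalize line endings, then turn each line break into <br>.
    escaped = escaped.replace('\r\n', '\n').replace('\r', '\n')
    return escaped.replace('\n', '<br>')


print(text_to_html('a <b>\nc  d & e'))
# prints: a &lt;b><br>c &nbsp;d &amp; e
```

If the page can use the white-space: pre-wrap style instead, the second helper is unnecessary and only the escaping step is needed.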
The attribute section of the HTML Tag, for example:
<input value="{user_input}">
Here too, the < and & characters must be escaped.
In addition, if > can be entered, it may close the input tag, so > needs to be converted to &gt;.
If quotation marks are allowed, the user can close this attribute and insert other attributes. Therefore, " and ' need to be converted to &quot; and &#39; (&apos; is an XML entity that was not defined in HTML 4, so the numeric form is recommended).
In addition, IE allows the ` character to be used as an attribute delimiter, so you can also escape the ` character if necessary. But if you can ensure that your own code only uses " and ' as attribute delimiters, you can leave it alone.
Another, more important point is that attribute values must be enclosed in quotation marks. An unquoted value such as <input value={user_input}> is accepted by browsers, but it is difficult to protect against XSS, because a space and many other characters can end the value.
For this reason, OWASP recommends that every character with an ASCII value below 256, except letters and digits, be encoded in the &#xHH; form.
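The stricter encoding can be sketched as follows. This is my own minimal reading of the OWASP recommendation, not code taken from OWASP or ESAPI; the function name is hypothetical, and characters above 255 are also written as numeric references here for simplicity.

```python
def encode_html_attr_strict(string):
    # Hypothetical helper: keep ASCII letters and digits, and write every
    # other character as a hexadecimal character reference (&#xHH;).
    out = []
    for ch in string:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append('&#x{:02X};'.format(ord(ch)))
    return ''.join(out)


print(encode_html_attr_strict('a b"onclick=x'))
# prints: a&#x20;b&#x22;onclick&#x3D;x
```

With this encoding, even an unquoted attribute value cannot be terminated by a space or a quotation mark, at the cost of a much longer output.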
Back to our solution: since there are now quite a few replacements, a single pass with a regular expression may be more efficient:
import re

escape_pattern = re.compile(r'[&<>"\']')
escape_map = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
}

def replacer(match):
    # Look up the entity for the single character that matched.
    return escape_map[match.group(0)]

def escape_html_attr(string):
    return escape_pattern.sub(replacer, string)

var escape_map = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

function replacer(char) {
  // Look up the entity for the single character that matched.
  return escape_map[char];
}

function escape_html_attr(string) {
  return string.replace(/[&<>"']/g, replacer);
}
In fact, Python already provides a similar function: cgi.escape in Python 2, or html.escape in Python 3 (cgi.escape was removed in Python 3.8), so you do not have to write this code yourself.
In addition, this replacement can also be used for the text part of an HTML tag; the only side effect is that the output may be a few bytes longer.
The URL attribute of the HTML Tag, for example:
<a href="{user_input}">{user_input}</a>
<script src="{user_input}">
Their processing rules are actually different:
The URL of a link or image must first be encoded as an HTML attribute. It must also be checked to be a valid URL (preferably prompting the user at input time); otherwise the user can enter code such as javascript:alert(0). For the script src, user input should not be allowed at all.
URL parameters, for example:
...
The parameter must be percent-encoded. JavaScript has the built-in encodeURIComponent function (which leaves letters, digits, and -_.!~*'() unescaped); Python can use urllib.quote(url, '-_.!~*()') (urllib.parse.quote in Python 3). I think the ' character is still dangerous and should not be left unescaped, which is why it is missing from the safe list above.
Text section of the script Tag:
<script>
var value = {user_input};
var value = "{user_input}";
</script>
In the first case, user input is not allowed.
In the second case, you need to consider how the value will be used later (for example, whether it is passed to eval or used to generate HTML). Even if it is never used in a dangerous way, handling it properly is troublesome. The recommendation is to convert every ASCII character other than letters and digits into the \xHH form.
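A minimal sketch of that conversion is shown below. It is not the ESAPI implementation; the function name is hypothetical, and writing characters above 255 as \uHHHH escapes (with surrogate pairs for characters outside the Basic Multilingual Plane) is my own assumption.

```python
def escape_js_string(string):
    # Hypothetical helper: keep ASCII letters and digits and write
    # everything else as a JavaScript escape sequence.
    out = []
    for ch in string:
        code = ord(ch)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif code < 0x100:
            out.append('\\x%02X' % code)
        elif code < 0x10000:
            out.append('\\u%04X' % code)
        else:
            # Characters beyond the BMP need a UTF-16 surrogate pair.
            code -= 0x10000
            out.append('\\u%04X\\u%04X' % (0xD800 + (code >> 10),
                                            0xDC00 + (code & 0x3FF)))
    return ''.join(out)


print(escape_js_string('";alert(1)//'))
# prints: \x22\x3Balert\x281\x29\x2F\x2F
```

Because < and / are escaped as well, a value containing </script> cannot end the script block early. The result is only safe as the content of a quoted string literal, that is, the second case above.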
I will not write out a full implementation for this case. If you are interested, you can refer to the ESAPI implementations, which are available in many languages.
Text section of the style tag:
User input is not allowed. Websites that let users customize styles should, if the styles are visible to other users, offer only templates and a fixed set of selectable values, because there are too many ways to mount an XSS attack through CSS.
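One way to read "templates and selectable values" is an allowlist: the user's choice is only ever used as a key, and the CSS itself comes from the server. The theme names and rules below are made up for illustration.

```python
# Hypothetical theme templates; only these strings ever reach the page.
THEMES = {
    'light': 'body { background: #fff; color: #222; }',
    'dark': 'body { background: #222; color: #eee; }',
}
DEFAULT_THEME = 'light'


def style_for(user_choice):
    # The user input selects a template but is never copied into the CSS,
    # so an unknown or malicious value just falls back to the default.
    return THEMES.get(user_choice, THEMES[DEFAULT_THEME])


print(style_for('dark'))
# prints: body { background: #222; color: #eee; }
```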
Most of the cases above involve only one context, but a browser parses HTML, JavaScript, CSS, and URLs together. So sometimes a value needs both HTML entity encoding and JavaScript hex encoding, and percent-encoding has to be considered as well.
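As a sketch of how the encodings stack, consider a link whose URL carries a user-supplied parameter. The function name, the q parameter, and the http/https allowlist are assumptions for illustration: the parameter is percent-encoded first, the URL scheme is checked, and the finished URL is then escaped as an HTML attribute.

```python
import html
from urllib.parse import quote, urlsplit

# Assumed allowlist: anything else (javascript:, data:, relative URLs)
# is rejected in this sketch.
ALLOWED_SCHEMES = {'http', 'https'}


def build_search_link(base_url, query, label):
    if urlsplit(base_url).scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError('unsupported URL scheme')
    # URL context: percent-encode the parameter value (nothing is "safe").
    url = base_url + '?q=' + quote(query, safe='')
    # HTML context: escape the whole URL as an attribute and the label as text.
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True),
                                    html.escape(label, quote=True))


print(build_search_link('https://example.com/search', 'a&b="x"', 'a & b'))
# prints: <a href="https://example.com/search?q=a%26b%3D%22x%22">a &amp; b</a>
```

The order follows the nesting of the contexts: the innermost one (the URL parameter) is encoded first and the outermost one (the HTML attribute) last.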
In addition, there are some easier solutions, such as using a well-tested template engine. However, use them on the basis of understanding how XSS is prevented; otherwise you may still introduce XSS vulnerabilities.
Many XSS risks come from the wish to save effort. Creating elements with document.createElement and setting their attributes and child elements by hand is naturally much safer. Splicing everything into a string and passing it straight to jQuery() or innerHTML is undoubtedly more convenient, but the risk rises accordingly.
Finally, this article only briefly introduces some common solutions; there are far more kinds of XSS than these.
Every place that outputs untrusted data may be attacked through XSS. Think more about whether user input can produce unexpected results, and over time you will get into the habit of checking these potential weak points. Even so, there may well be attack methods that have not been discovered yet (for example, ones that rely on browser bugs or an incorrectly configured web server).