Python has become one of the mainstream choices for web development, but it still has fewer relevant third-party modules and libraries than PHP and Node.js.
Take XSS filtering components as an example. PHP has the famous "HTML Purifier" (http://htmlpurifier.org/), as well as the lesser-known filter component "XssHtml" (http://phith0n.github.io/XssHtml ).
Python's pip can also install a library called "html-purifier", but it is very different from the PHP one. This library is only responsible for removing tags and attributes in HTML that are not in the whitelist.
Note that it does not filter XSS; it only removes tags and attributes that are not in the whitelist. In other words, JavaScript carried inside whitelisted tags and attributes is not filtered.
So I had to develop a Python XSS filter myself, for use in my own future Python projects.
Here is how it is implemented.
I. Parsing HTML
HTML is parsed with the HTMLParser class from Python's standard library. In Python 2 the module is named HTMLParser; in Python 3 it is html.parser.
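The module rename mentioned above can be handled with a guarded import. This is a minimal sketch of one common pattern, not necessarily how the project itself does it:

```python
# Import HTMLParser under either major Python version.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# The same class name is now usable regardless of the interpreter.
parser = HTMLParser()
parser.feed("<p>hello</p>")
parser.close()
```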
To use HTMLParser, you define your own class that inherits from it and implement handle_starttag, handle_startendtag, handle_endtag, handle_data, and so on.
For example, handle_starttag is called whenever the parser enters a tag. By implementing this method we get the tag currently being processed and all of its attributes (attrs).
There we can check whether the tag and its attrs are in the whitelist, and give certain tags and attributes special treatment.
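The whitelist check described above might look roughly like the following. This is a simplified sketch for Python 3, not the project's actual code: the ALLOWED table, the WhitelistFilter class, and the filter_html helper are all hypothetical names chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser  # Python 3; the module is HTMLParser in Python 2

# Hypothetical whitelist for illustration: tag -> allowed attribute names.
ALLOWED = {"a": ["href", "title"], "p": [], "b": [], "img": ["src", "alt"]}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.result = []  # pieces of the filtered output

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop tags that are not whitelisted
        # Keep only whitelisted attributes; valueless attributes arrive as None.
        kept = [(k, v or "") for k, v in attrs if k in ALLOWED[tag]]
        attr_text = "".join(' %s="%s"' % (k, escape(v, quote=True)) for k, v in kept)
        self.result.append("<%s%s>" % (tag, attr_text))

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> like an opening tag without emitting a closing tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.result.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped because the parser hands it over unescaped.
        self.result.append(escape(data, quote=False))


def filter_html(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.result)


print(filter_html('<p onclick="x()">hi <script>alert(1)</script><a href="http://a.b/" style="c">link</a></p>'))
```

Whitelisting alone still lets a javascript: value through in href, which is why links need the extra handling discussed in the next section.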
II. Special handling of links
Some attributes can execute JavaScript code through the javascript: pseudo-protocol, such as the href of a and the src of embed, so they need special handling: check whether the value starts with http://, https://, or ftp://, and if it does not, forcibly prepend http://.
This defends against potential XSS injection through those attributes.
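The prefix check described above can be sketched as a small function. The name safe_url is hypothetical, and the case-insensitive match and whitespace stripping are my own assumptions rather than something stated about the project:

```python
import re

# Schemes accepted as-is, taken from the text: http, https, ftp.
SAFE_SCHEME = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)


def safe_url(url):
    """Return url unchanged if it has an accepted scheme, else prepend http://."""
    url = url.strip()
    if SAFE_SCHEME.match(url):
        return url
    # Anything else, including javascript: URLs, is turned into a plain http URL.
    return "http://" + url


print(safe_url("https://example.com/page"))
print(safe_url("javascript:alert(1)"))
```

After the rewrite, a javascript: value no longer starts with the pseudo-protocol, so the browser treats it as an ordinary (broken) http link instead of running it.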
III. Special handling of embed
The embed tag embeds media files such as SWF, and a rich-text editor sometimes needs to allow inserting Flash. However, we must make sure that the Flash content cannot execute JavaScript code and cannot send HTTP requests (which could easily lead to CSRF attacks).
So the filter forces allowscriptaccess=never and allownetworking=none on the embed tag.
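Forcing those two attributes could be done along these lines. This is a sketch under the assumption that the attributes have already been collected into a dict; harden_embed is a hypothetical helper name:

```python
def harden_embed(attrs):
    """Return a copy of an <embed> tag's attributes with the two
    Flash security settings forced, whatever the user supplied."""
    hardened = dict(attrs)
    # HTMLParser lowercases attribute names, so a user-supplied
    # allowScriptAccess="always" arrives under this same key and is overwritten.
    hardened["allowscriptaccess"] = "never"
    hardened["allownetworking"] = "none"
    return hardened


print(harden_embed({"src": "http://example.com/a.swf", "allowscriptaccess": "always"}))
```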
IV. Escaping double quotes when concatenating tags and attributes
I previously found an XSS vulnerability (CVE-2015-1433) in Roundcube Webmail. After checking against the whitelist, it concatenated HTML tags and attributes without escaping double quotes, so part of an attribute value could break out and become a new attribute name, resulting in XSS.
So I use self.__htmlspecialchars to process attribute values and prevent this.
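To illustrate why the escaping matters, here is a minimal sketch modeled on PHP's htmlspecialchars. It is not the project's own __htmlspecialchars method, and build_tag is a hypothetical helper:

```python
def htmlspecialchars(value):
    """Escape the characters that could end an attribute value or open a tag."""
    return (value.replace("&", "&amp;")  # must run first to avoid double escaping
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace('"', "&quot;"))


def build_tag(tag, attrs):
    """Concatenate a start tag from (name, value) pairs, escaping every value."""
    parts = ['%s="%s"' % (name, htmlspecialchars(value)) for name, value in attrs]
    return "<%s>" % " ".join([tag] + parts)


# A value that tries to close the quote and smuggle in an event handler.
payload = 'http://x/" onmouseover="alert(1)'
print(build_tag("a", [("href", payload)]))
```

With the double quotes turned into &quot;, the whole payload stays inside the href value and no new attribute is created.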
Finally, the module is easy to use. The simplest demo is as follows:
import pxfilter
parser = pxfilter.XssHtml()
parser.feed('')  # pass the HTML to be filtered here
parser.close()
html = parser.getHtml()
print(html)
After that, adjust it as needed by following the comments in the source code. GitHub project: https://github.com/phith0n/python-xss-filter
I also built a demo with web.py. You are welcome to test it and submit issues: http://python-xss-filter.leavesongs.com/. Both the functionality and the security still need everyone's suggestions.