Code Analysis of browser Lexer and XSS-HTML
0 × 00 Introduction
0 × 01 decoding process overview
0 × 02 lexical analysis in browsers
0 × 03 HTML encoding and HTML Parsing
0 × 04 common mistakes
0 × 05 interesting Fault Tolerance behavior of browsers
0 × 06 conclusion
0 × 00 Introduction
Encoding has always been a pain point. WooYun already hosts an article on XSS encoding that covers some of these pain points. This article is another attempt at explaining XSS encoding, and it tries to live up to its title: it systematically explains how the browser lexer processes HTML and the logic of HTML parsing as it relates to XSS.
0 × 01 decoding process overview
If we do not understand the encoding and decoding process before attempting XSS, we will run into many difficulties. With an automated tool you may get by, but when testing by hand you will suffer, and unless you are lucky you will not get past the encoding problem.
To understand the encoding process, let's start with how the browser parses a page.
If you have some knowledge of HTML parsing in a browser, the principles behind it should already be clear to you. Generally, the browser uses a lexer, so why do we need to talk about this part? The reason is that it determines the decoding order!
For example, in an HTML (non-XHTML) environment, if your XSS output point is inside a <script> tag and the payload is HTML entity encoded, how can the vulnerability be triggered? If you do not understand this problem, you may get nowhere.
0 × 02 lexical analysis in browsers
Readers who are familiar with compiler principles can skim this section as a brief review or skip it.
Everyone has their own opinion on whether people who work with computers need to learn compiler principles. I believe that a good programmer or IT worker does not have to be proficient in compiler principles, but should at least understand them. Because of limited space, I will not say much about compiler principles here. I only want to mention briefly what they are and how they are applied in a browser.
Parser-Lexer combination (a parser plus a lexical analyzer)
This structure is responsible for parsing HTML documents. Parsing is divided into two phases: lexical analysis and syntax analysis.
This section focuses on lexical analysis.
Lexical analysis breaks the input (a sentence, statement, or other content) into an ordered sequence of words and symbols. For example, if the input is 1 + 2-3, lexical analysis should produce five tokens in order: 1 (int), + (operator), 2 (int), - (operator), 3 (int). The result is then handed to syntax analysis, which checks it against a context-free grammar.
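The arithmetic example above can be sketched in a few lines of Python. This is only an illustrative tokenizer, not code from any browser; the names TOKEN_RE and tokenize are made up for this sketch, and the token kinds "int" and "operator" simply follow the example.

```python
import re

# One token is either a run of digits (int) or a single operator character.
# Leading whitespace is skipped, just as a lexer skips insignificant blanks.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([+\-*/]))")

def tokenize(text):
    """Split an arithmetic expression into ordered (value, kind) tokens."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize input at position {pos}")
        number, operator = match.groups()
        if number is not None:
            tokens.append((number, "int"))
        else:
            tokens.append((operator, "operator"))
        pos = match.end()
    return tokens

if __name__ == "__main__":
    for token in tokenize("1 + 2-3"):
        print(token)
```

Note that the spaces around + disappear during tokenization while the token order is preserved, which is the same property the following paragraphs rely on for HTML.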
If you are interested in how lexical analysis is implemented, refer to a textbook on compiler principles and practice.
In the browser, lexical analysis has some features worth noting. For example, it automatically skips spaces, line breaks, and tabs in HTML, so under some conditions inserting extra spaces or line breaks used to be enough to bypass a WAF. (This bypass method is now outdated.) In addition, comments may be ignored during lexical analysis. Does that give you some ideas? Next, based on earlier XSS experience, the author briefly describes the tokenization algorithm, so you can check whether your guess is correct.
As we all know, when the browser parses HTML, a tag is split into a sequence of symbols (tokens), six in this example.
Is it that simple? Of course, the answer is no.
A simple example of the parsing process:
1. Before the < symbol is parsed, the state is the data state.
2. When < is parsed, the state changes to the tag open state and the parser starts looking for the tag name. (While it looks for the tag name, consider a question: < and the tag name img are not the same token, and they are obviously parsed separately. So is it possible for spaces or line breaks between < and img to be ignored? I believe the answer is easy to find.)
3. Once the tag name is found, the state changes to the tag name state, which indicates that the tag name has been recognized.
4. Then, when the nearest > is read, the tag name state ends and the parser re-enters the data state.
If tags are nested, the preceding parsing steps are repeated. Parsing a closing tag is similar, except that the / symbol creates an end tag token to indicate that the tag is closed.
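The four steps can be sketched as a small state machine. This is a heavily simplified illustration, not a browser implementation: it handles only text, start tags, and end tags, it ignores attributes, comments, and entities, and the names State and tokenize_tags are hypothetical. As in the tag open state of the HTML standard, a < that is not followed by a letter (or by / and a letter) is treated here as plain text.

```python
from enum import Enum, auto

class State(Enum):
    DATA = auto()      # ordinary text between tags
    TAG_OPEN = auto()  # just saw "<"
    TAG_NAME = auto()  # reading the tag name until ">"

def tokenize_tags(markup):
    """Emit ("text", s), ("start", name) and ("end", name) tokens."""
    tokens, text, name = [], "", ""
    state, is_end_tag = State.DATA, False
    i = 0
    while i < len(markup):
        ch = markup[i]
        if state is State.DATA:
            if ch == "<":
                state, is_end_tag = State.TAG_OPEN, False
            else:
                text += ch
        elif state is State.TAG_OPEN:
            if ch == "/" and not is_end_tag:
                is_end_tag = True            # "</" starts an end tag
            elif ch.isalpha():
                if text:
                    tokens.append(("text", text))
                    text = ""
                name, state = ch.lower(), State.TAG_NAME
            else:
                # Not a tag after all: keep "<" as text and reconsume ch.
                text += "</" if is_end_tag else "<"
                state = State.DATA
                continue
        else:  # State.TAG_NAME
            if ch == ">":
                tokens.append(("end" if is_end_tag else "start", name))
                name, state = "", State.DATA
            else:
                name += ch.lower()
        i += 1
    if state is State.TAG_OPEN:              # input ended right after "<"
        text += "</" if is_end_tag else "<"
    if text:
        tokens.append(("text", text))
    return tokens

if __name__ == "__main__":
    print(tokenize_tags("<b>hi</b>"))
    print(tokenize_tags("1 < 2"))
```

Because the sketch has no attribute states, everything between the tag name and > is folded into the name; a real tokenizer has many more states for attributes, quoting, and error recovery.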
But note: this parsing process is quite loose. Because of HTML's history, markup in the wild is quirky and error-prone, developers' skill levels are uneven, and HTML features were long poorly standardized, so the parsing process is bound to be quite complex. The above is only the simplest example.
Do you think this element can be recognized? Will the alert box pop up? In this case, the answer must be yes.
And what about this variant?
In addition, there are many amazing tricks worth exploring.
0 × 03 HTML encoding and HTML Parsing
HTML entity encoding emerged to solve a problem: characters such as < and > are unsafe in HTML documents because they mark the beginning and end of a tag. For security, developers use an encoding scheme called entity encoding, in which each entity starts with & and ends with a semicolon.
However, with the knowledge from 0 × 02, we can guess that ";" is just a delimiter. In practice, because the browser parser is highly fault tolerant, any symbol that can act as a delimiter can take its place at the end, and the entity is still split off correctly. This is the principle behind the well-known trick that the trailing semicolon of an HTML entity can be omitted. Now let's look at some interesting things about HTML entity encoding.
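The optional semicolon is easy to observe outside a browser. Python's standard html.unescape function is documented to follow the HTML5 rules for both valid and invalid character references, so it can serve as a rough stand-in here; it is not a browser, and the helper name decode and the sample strings are chosen only for this sketch.

```python
import html

def decode(text):
    """Decode character references using Python's HTML5-based rules."""
    return html.unescape(text)

if __name__ == "__main__":
    samples = [
        "&#60;",    # decimal reference with semicolon
        "&#60 ",    # decimal reference ended by a space instead
        "&#x3c;",   # hexadecimal reference with semicolon
        "&#x3C!",   # hexadecimal reference ended by "!"
        "&lt;",     # named reference with semicolon
        "&lt",      # legacy named reference without semicolon
    ]
    for sample in samples:
        print(repr(sample), "->", repr(decode(sample)))
```

For numeric references, the first character that cannot belong to the number ends the reference. Only a fixed set of legacy named references, such as lt, is accepted without the semicolon.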
In HTML encoding, we all know that a numeric entity is written as "&#" followed by a decimal number or "&#x" followed by a hexadecimal number. In The Tangled Web, the author gives an example of an interesting symbol:
So what does this mean? Next we address something that should be clear but is widely misunderstood: where are HTML entities decoded? Are they decoded throughout the entire HTML document?
Specifically, HTML entities are decoded in the data state (the text segments between tags) and in attribute values inside a tag. However, in the special modes (explained in the next section), HTML entities are not decoded even in the data state. From 0 × 02 we know that, between a pair of < and >, the parser moves through the tag open, tag name, and tag close states. In these three states HTML entities are not decoded. Decoding applies again only once the parser reaches the value of an attribute, which for our purposes behaves like the data state.
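A rough way to see these decoding positions is Python's standard html.parser module. It is not a browser and differs from one in many details, but with convert_charrefs=True it decodes entities in text and in attribute values while leaving the content of a script element untouched. The class name Recorder and the helpers parse and text_of are hypothetical names for this sketch.

```python
from html.parser import HTMLParser

class Recorder(HTMLParser):
    """Record start tags (with attributes) and text as the parser reports them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("data", data))

def parse(markup):
    recorder = Recorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events

def text_of(events):
    """Join all text chunks, since a parser may report text in pieces."""
    return "".join(event[1] for event in events if event[0] == "data")

if __name__ == "__main__":
    # Entities in text and in an attribute value are decoded.
    print(parse('<p title="&#97;">&#98;</p>'))
    # Inside <script> the same entity is passed through as literal text.
    print(parse("<script>&#97;</script>"))
```

The second call shows why an entity-encoded payload placed directly inside a script element is not decoded before the script is read.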
Next, let's take Firefox as an example:
Some readers, on seeing the following tags, will ask why there is no closing slash. The answer is simple. As mentioned in section 0 × 02, the browser only needs < to start a tag and > to end it. A closing slash is not required, and the tag is still recognized under HTML parsing rules.
We enter these six tags in Notepad. Can the reader guess which tags trigger the pop-up and which do not?
Obviously, the first and second both work.
The third does not explicitly mark the boundary of the attribute value, but it is handled just like the second: a default boundary applies. (We can assume that = and > are independent tokens, so even without single quotes, double quotes, or backticks they still act as boundaries. In fact, this is the case.)
The fourth adds an explicit boundary, so of course it works.
In the fifth, the entity decodes to a quotation mark. Because that quotation mark is in the data area (the attribute value), the attribute is equivalent to onerror="'alert(1)'".
In the sixth, the entity appears outside the data state, where HTML entities are not decoded, so the sixth tag breaks the attribute structure.
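The boundary and decoding behaviour discussed above can be explored with Python's html.parser. The tags below are hypothetical stand-ins written for this sketch, not the six tags from the original article, and Python's parser only approximates what a browser does. AttrCollector and attributes_of are made-up names.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Keep the attributes of the last start tag seen."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs = dict(attrs)

def attributes_of(tag_markup):
    collector = AttrCollector()
    collector.feed(tag_markup)
    collector.close()
    return collector.attrs

if __name__ == "__main__":
    hypothetical_tags = [
        '<img src=x onerror="alert(1)">',          # double-quoted value
        "<img src=x onerror='alert(1)'>",          # single-quoted value
        "<img src=x onerror=alert(1)>",            # unquoted: whitespace or > ends it
        "<img src=x onerror=&#39;alert(1)&#39;>",  # encoded quotes become data
        "<img src=x onerror&#61;alert(1)>",        # encoded "=" is not a separator
    ]
    for markup in hypothetical_tags:
        print(markup, "->", attributes_of(markup))
```

In the fourth tag the encoded quotes end up inside the attribute value as ordinary characters. In the last one the encoded equals sign is never decoded into a separator, so no attribute named onerror is produced at all.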
Special mode: