Jsoup Code Interpretation VII: Implementing a CSS Selector
Ta-da! We have finally reached Jsoup's signature feature: the CSS selector. Selectors are also one of the key parts of webmagic, the crawler framework I develop. Here is a picture from Street Fighter; I hope webmagic will be able to challenge Jsoup one day!
Select mechanism
In the Jsoup select package, the class structure is as follows:
NodeVisitor and Selector were already mentioned at the start of this series. Selector is the external facade of the select package, while NodeVisitor is the underlying API for traversing the tree; the CSS selector is also implemented on top of NodeVisitor traversal.
The core of Jsoup's select package is Evaluator. The expression passed to Selector goes through QueryParser and is eventually compiled into an Evaluator. Evaluator is an abstract class with a single method:
public abstract boolean matches(Element root, Element element);
Note that root is passed in as well, because some evaluators need it when they traverse the tree.
Evaluator's design is simple and direct: every term of a selector expression is compiled into a corresponding Evaluator. For example, #xx corresponds to Id, .xx corresponds to Class, and [] corresponds to Attribute. For reference, here is the CSS selector specification: http://www.w3.org/TR/CSS2/selector.html
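To make this mapping concrete, here is a minimal, self-contained sketch. It does not use Jsoup's real classes: El, Eval, IdEval, ClassEval and AttrEval are hypothetical stand-ins that only mirror the shape of Element and Evaluator described above, and the attribute check covers only the simplest [attr] form.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class SimpleEvaluatorSketch {
    public static void main(String[] args) {
        // Roughly <div id="main" class="box wide" title="x">
        El e = new El("div").attr("id", "main").attr("class", "box wide").attr("title", "x");
        System.out.println(new IdEval("main").matches(e, e));    // "#main"
        System.out.println(new ClassEval("wide").matches(e, e)); // ".wide"
        System.out.println(new AttrEval("href").matches(e, e));  // "[href]"
    }
}

/** Hypothetical stand-in for an element: a tag name plus attributes. */
class El {
    final String tag;
    final Map<String, String> attrs = new HashMap<>();

    El(String tag) { this.tag = tag; }

    /** Sets an attribute and returns this element so calls can be chained. */
    El attr(String key, String value) { attrs.put(key, value); return this; }
}

/** Hypothetical stand-in with the same single-method shape as Evaluator. */
abstract class Eval {
    abstract boolean matches(El root, El element);
}

/** "#xx": the id attribute must equal the given id. */
class IdEval extends Eval {
    private final String id;
    IdEval(String id) { this.id = id; }
    boolean matches(El root, El element) { return id.equals(element.attrs.get("id")); }
}

/** ".xx": the whitespace-separated class attribute must contain the name. */
class ClassEval extends Eval {
    private final String name;
    ClassEval(String name) { this.name = name; }
    boolean matches(El root, El element) {
        String cls = element.attrs.get("class");
        return cls != null && Arrays.asList(cls.trim().split("\\s+")).contains(name);
    }
}

/** "[xx]": the attribute only has to be present. */
class AttrEval extends Eval {
    private final String key;
    AttrEval(String key) { this.key = key; }
    boolean matches(El root, El element) { return element.attrs.containsKey(key); }
}
```

None of these simple evaluators needs root; it is part of the signature only so that structural evaluators can use it.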
Of course, this alone is not enough. Jsoup also defines CombiningEvaluator (and/or combinations of evaluators) and StructuralEvaluator (evaluators that take the DOM tree structure into account).
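The and/or combination can be sketched in a few lines. Again, this is only an illustration under my own assumptions rather than Jsoup's code: El, Eval, TagEval, IdEval, And and Or are hypothetical stand-ins, with And playing the role of a compound selector such as div#main and Or the role of a selector group such as span, #main.

```java
import java.util.Arrays;
import java.util.List;

class CombiningSketch {
    public static void main(String[] args) {
        El e = new El("div", "main");
        Eval divAndMain = new And(new TagEval("div"), new IdEval("main")); // like "div#main"
        Eval spanOrMain = new Or(new TagEval("span"), new IdEval("main")); // like "span, #main"
        System.out.println(divAndMain.matches(e, e));
        System.out.println(spanOrMain.matches(e, e));
    }
}

/** Hypothetical stand-in for an element with just a tag and an id. */
class El {
    final String tag;
    final String id;
    El(String tag, String id) { this.tag = tag; this.id = id; }
}

abstract class Eval {
    abstract boolean matches(El root, El element);
}

class TagEval extends Eval {
    private final String tag;
    TagEval(String tag) { this.tag = tag; }
    boolean matches(El root, El element) { return tag.equals(element.tag); }
}

class IdEval extends Eval {
    private final String id;
    IdEval(String id) { this.id = id; }
    boolean matches(El root, El element) { return id.equals(element.id); }
}

/** Matches only if every wrapped evaluator matches. */
class And extends Eval {
    private final List<Eval> parts;
    And(Eval... parts) { this.parts = Arrays.asList(parts); }
    boolean matches(El root, El element) {
        for (Eval part : parts) {
            if (!part.matches(root, element)) return false;
        }
        return true;
    }
}

/** Matches if at least one wrapped evaluator matches. */
class Or extends Eval {
    private final List<Eval> parts;
    Or(Eval... parts) { this.parts = Arrays.asList(parts); }
    boolean matches(El root, El element) {
        for (Eval part : parts) {
            if (part.matches(root, element)) return true;
        }
        return false;
    }
}
```

Because And and Or are themselves evaluators, they can be nested freely, which is what lets a whole selector expression collapse into a single object.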
What we probably care about most here is how an ancestor-descendant structure such as div ul li is implemented. This is done in StructuralEvaluator.Parent; here is the code:
static class Parent extends StructuralEvaluator {
    public Parent(Evaluator evaluator) {
        this.evaluator = evaluator;
    }

    public boolean matches(Element root, Element element) {
        if (root == element)
            return false;

        Element parent = element.parent();
        while (parent != root) {
            if (evaluator.matches(root, parent))
                return true;
            parent = parent.parent();
        }
        return false;
    }
}
Here Parent holds an evaluator property and checks each ancestor node below root against it. Note that Parent can be nested, so the expression "div ul li" is eventually compiled into an evaluator combination like And(Parent(And(Parent(Tag("div")),Tag("ul"))),Tag("li")).
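The nesting is easier to follow when it is run against a small tree. The sketch below builds the combination for "div ul li" by hand and collects the matches with a recursive walk. It is an illustration under my own assumptions: El, Eval, TagEval, And, Parent, divUlLi and collect are hypothetical stand-ins, the walk is a simplification of the NodeVisitor traversal, and the argument order inside And follows the notation used in this article rather than anything verified against QueryParser.

```java
import java.util.ArrayList;
import java.util.List;

class DescendantSketch {
    public static void main(String[] args) {
        // <html><div><ul><li/></ul></div><ul><li/></ul></html>
        El html = new El("html");
        El innerLi = html.add(new El("div")).add(new El("ul")).add(new El("li"));
        html.add(new El("ul")).add(new El("li")); // this li has no div ancestor

        List<El> hits = collect(divUlLi(), html);
        System.out.println(hits.size());            // 1
        System.out.println(hits.get(0) == innerLi); // true
    }

    /** Hand-built equivalent of the compiled form of "div ul li". */
    static Eval divUlLi() {
        Eval ulInsideDiv = new And(new Parent(new TagEval("div")), new TagEval("ul"));
        return new And(new Parent(ulInsideDiv), new TagEval("li"));
    }

    /** Depth-first walk from root, keeping every element the evaluator accepts. */
    static List<El> collect(Eval eval, El root) {
        List<El> out = new ArrayList<>();
        walk(eval, root, root, out);
        return out;
    }

    private static void walk(Eval eval, El root, El current, List<El> out) {
        if (eval.matches(root, current)) out.add(current);
        for (El child : current.children) walk(eval, root, child, out);
    }
}

/** Hypothetical stand-in for an element in a tree. */
class El {
    final String tag;
    El parent;
    final List<El> children = new ArrayList<>();

    El(String tag) { this.tag = tag; }

    /** Appends child and returns it, so a chain of calls builds downwards. */
    El add(El child) { child.parent = this; children.add(child); return child; }
}

abstract class Eval {
    abstract boolean matches(El root, El element);
}

class TagEval extends Eval {
    private final String tag;
    TagEval(String tag) { this.tag = tag; }
    boolean matches(El root, El element) { return tag.equals(element.tag); }
}

class And extends Eval {
    private final Eval left, right;
    And(Eval left, Eval right) { this.left = left; this.right = right; }
    boolean matches(El root, El element) {
        return left.matches(root, element) && right.matches(root, element);
    }
}

/** Same loop as the quoted code: test each ancestor, stopping before root. */
class Parent extends Eval {
    private final Eval inner;
    Parent(Eval inner) { this.inner = inner; }
    boolean matches(El root, El element) {
        if (root == element) return false;
        El p = element.parent;
        while (p != null && p != root) { // null check guards elements outside root
            if (inner.matches(root, p)) return true;
            p = p.parent;
        }
        return false;
    }
}
```

Only the li nested under div is collected. For the other li, the search for a ul that itself has a div ancestor reaches root without finding one, which is exactly the case the root parameter exists for.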
The select package is simpler than you might expect, and the code is very readable. After studying the parser, this part should feel quite familiar.
A follow-up plan for WebMagic
WebMagic is a crawler framework. Its Selector is used to extract specified text from HTML, and its mechanism is very similar to Jsoup's Evaluator. The difference is that, for now, WebMagic wraps Selector in a simpler API, whereas Evaluator is driven directly by expressions. I had previously considered designing a custom DSL for HTML extraction. Having read the Jsoup source, I now have the ability to implement one, but implementation is only a small part of introducing a DSL; the hard part is making the DSL easy to write and easy to understand.
Having actually read the Jsoup source, I find it more fine-grained than WebMagic: almost every class corresponds to a real conceptual abstraction. I may put more work into this aspect later.