This is an original article; please credit the source when reprinting. Scan the QR code to follow the flysnow_org WeChat official account, or visit http://www.flysnow.org/, to read new articles as soon as they are published. If you find this useful, please share it with your friends. Thank you for your support.
I have recently been studying web crawlers in Go and use the goquery library a lot, especially for selecting and finding matching content in crawled HTML. goquery selectors come up constantly, and there are many lesser-known but very useful ones, so I have summarized them here for reference.
If you have done front-end development, jQuery will be familiar. goquery is similar: it is a Go implementation of jQuery, and it makes HTML easy to process.
Element Selector
This is the simplest kind: it selects by basic HTML element name, such as div, a, or p, using the element name directly as the selector, for example dom.Find("div").
    package main
    import (
        "fmt"
        "log"
        "strings"
        "github.com/PuerkitoBio/goquery"
    )
    func main() {
        html := `<body>
            <div>DIV1</div>
            <div>DIV2</div>
            <span>SPAN</span>
        </body>`
        // Parse the HTML string into a goquery document.
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Print the text of every div element.
        dom.Find("div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
In the example above, the div elements are selected, while body and span are not.
ID Selector
This is the most frequently used selector. In an example like the one above there are two div elements, but we only need one of them. Give that tag a unique id and the id selector can locate it precisely.
    func main() {
        html := `<body>
            <div id="div1">DIV1</div>
            <div>DIV2</div>
            <span>SPAN</span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the element whose id is div1.
        dom.Find("#div1").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
Element ID Selector
An id selector starts with #, followed by the element's id value; the syntax is dom.Find("#id"). In the rest of this article I abbreviate this to Find("#id"); it always stands for a goquery selector.
What if the same id appears on different kinds of HTML elements? You can combine it with the element name. For example, to select elements that are div elements and have the id div1, use Find("div#div1").
The syntax of this kind of selector is Find("element#id"). This is a common way of combining selectors, and the selectors described below can be combined in the same way.
Class Selector
class is another commonly used HTML attribute, and the class selector quickly selects the elements we need. It is used like the id selector: Find(".class").
    func main() {
        html := `<body>
            <div id="div1">DIV1</div>
            <div class="name">DIV2</div>
            <span>SPAN</span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element whose class list contains name.
        dom.Find(".name").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
In the example above, the div whose class is name is selected.
Element Class Selector
Like id selectors, class selectors can be combined with an element name. The syntax is similar: Find("element.class") selects elements of the given type that have the given class.
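To make the combination rule concrete, here is a minimal plain-Go sketch. It does not call goquery: the element type and the matches helper are hypothetical names invented for this illustration. It only shows that a combined selector such as div#div1 or div.name requires every part to hold for the same element.

```go
package main

import "fmt"

// element is a hypothetical, simplified stand-in for an HTML element.
type element struct {
	tag     string
	id      string
	classes []string
}

// matches reports whether e satisfies every non-empty condition.
// An empty tag, id, or class means "no condition on this part".
func matches(e element, tag, id, class string) bool {
	if tag != "" && e.tag != tag {
		return false
	}
	if id != "" && e.id != id {
		return false
	}
	if class == "" {
		return true
	}
	for _, c := range e.classes {
		if c == class {
			return true
		}
	}
	return false
}

func main() {
	div := element{tag: "div", id: "div1", classes: []string{"name"}}
	span := element{tag: "span", id: "div1"}

	fmt.Println(matches(div, "div", "div1", ""))  // div#div1 against the div: true
	fmt.Println(matches(span, "div", "div1", "")) // div#div1 against the span: false
	fmt.Println(matches(span, "", "div1", ""))    // #div1 against the span: true
	fmt.Println(matches(div, "div", "", "name"))  // div.name against the div: true
}
```

Adding the element name narrows the match: #div1 alone would accept the span, while div#div1 does not.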
Attribute Selector
An HTML element has attributes and attribute values, so we can also select elements by attribute and value.
    func main() {
        html := `<body>
            <div>DIV1</div>
            <div class="name">DIV2</div>
            <span>SPAN</span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that have a class attribute, whatever its value.
        dom.Find("div[class]").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
In this example, the div[class] selector selects div elements that have a class attribute, so the first div is not selected.
The example above selects on the existence of an attribute. In the same way, we can select elements whose attribute has a particular value.
    // Select div elements whose class attribute equals name.
    dom.Find("div[class=name]").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
This selects the div whose class attribute has the value name.
class is only an example here; other attributes, such as href, and custom attributes work too.
Besides exact equality there are other ways of matching. They are used the same way, so they are listed together here without separate examples.
| Selector | Description |
| Find("div[lang]") | Selects div elements that have a lang attribute |
| Find("div[lang=zh]") | Selects div elements whose lang attribute equals zh |
| Find("div[lang!=zh]") | Selects div elements whose lang attribute is not equal to zh |
| Find("div[lang|=zh]") | Selects div elements whose lang attribute is zh or begins with zh- |
| Find("div[lang*=zh]") | Selects div elements whose lang attribute contains the string zh |
| Find("div[lang~=zh]") | Selects div elements whose lang attribute contains the word zh, with words separated by spaces |
| Find("div[lang$=zh]") | Selects div elements whose lang attribute ends with zh, case-sensitive |
| Find("div[lang^=zh]") | Selects div elements whose lang attribute starts with zh, case-sensitive |
The examples above each use a single attribute selector. You can also combine several, such as Find("div[id][lang=zh]"), by chaining the brackets. With several attribute selectors, only elements that satisfy all of them are selected.
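The matching operators listed above can be modelled with ordinary string functions. The following is a minimal plain-Go sketch, not goquery's implementation: matchAttr is a hypothetical helper that takes an operator, the value of an attribute that is present on the element, and the value from the selector, which is assumed to be non-empty.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAttr is a hypothetical helper (not part of goquery) that models the
// attribute operators for one attribute value. It assumes the attribute is
// present on the element and that want is not empty.
func matchAttr(op, value, want string) bool {
	switch op {
	case "=": // exactly equal
		return value == want
	case "!=": // not equal
		return value != want
	case "|=": // equal, or starts with want followed by a hyphen
		return value == want || strings.HasPrefix(value, want+"-")
	case "*=": // contains the substring
		return strings.Contains(value, want)
	case "~=": // contains the whitespace-separated word
		for _, word := range strings.Fields(value) {
			if word == want {
				return true
			}
		}
		return false
	case "$=": // ends with
		return strings.HasSuffix(value, want)
	case "^=": // starts with
		return strings.HasPrefix(value, want)
	}
	return false
}

func main() {
	// Compare every operator against the same attribute value.
	for _, op := range []string{"=", "!=", "|=", "*=", "~=", "$=", "^="} {
		fmt.Printf("[lang%szh] against lang=%q: %v\n", op, "zh-CN", matchAttr(op, "zh-CN", "zh"))
	}
}
```

The sketch makes the difference between |= and ^= visible: zh-CN satisfies both, while a value such as zhong would satisfy only ^=.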
parent>child Selector
To select matching child elements under an element, use the child selector. The syntax is Find("parent>child"): it selects the direct (first-level) children of the parent element that match child.
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div>DIV4</div>
            </span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select only the div elements that are direct children of body.
        dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
The example above selects the div elements that are direct children of body, so the result is DIV1, DIV2, DIV3. DIV4 is also a descendant of body, but it is not a direct child, so it is not selected.
What if we also want to select DIV4, that is, every div under body, whether it is one, two, or n levels down? goquery covers this too: replace the greater-than sign (>) with a space. For the example above, the selector becomes:
    // Select every div below body, at any depth.
    dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
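The difference between the two forms is a difference in how deep the search goes. Here is a minimal plain-Go sketch of that idea. It does not use goquery: node, childTexts, descendantTexts, and sampleBody are hypothetical names, and the tree is written by hand to mirror the body in the example above.

```go
package main

import "fmt"

// node is a hypothetical minimal element tree used only for this sketch.
type node struct {
	tag      string
	text     string
	children []*node
}

// childTexts models "parent>child": only direct children named tag.
func childTexts(parent *node, tag string) []string {
	var out []string
	for _, c := range parent.children {
		if c.tag == tag {
			out = append(out, c.text)
		}
	}
	return out
}

// descendantTexts models "parent child": elements named tag at any depth,
// collected in document order.
func descendantTexts(parent *node, tag string) []string {
	var out []string
	for _, c := range parent.children {
		if c.tag == tag {
			out = append(out, c.text)
		}
		out = append(out, descendantTexts(c, tag)...)
	}
	return out
}

// sampleBody rebuilds the body of the example as a literal tree.
func sampleBody() *node {
	return &node{tag: "body", children: []*node{
		{tag: "div", text: "DIV1"},
		{tag: "div", text: "DIV2"},
		{tag: "div", text: "DIV3"},
		{tag: "span", children: []*node{{tag: "div", text: "DIV4"}}},
	}}
}

func main() {
	body := sampleBody()
	fmt.Println(childTexts(body, "div"))      // [DIV1 DIV2 DIV3]
	fmt.Println(descendantTexts(body, "div")) // [DIV1 DIV2 DIV3 DIV4]
}
```

The only change between the two helpers is the recursive call, which is what lets DIV4 inside the span be reached.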
prev+next Adjacent Selector
Suppose the element we want has no distinguishing pattern of its own, but the element just before it does. Then we can use the adjacent (next sibling) selector.
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div>DIV4</div>
            </span>
            <p>P2</p>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the p that immediately follows a div with lang="zh".
        dom.Find("div[lang=zh]+p").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
This example demonstrates the usage. We want to select <p>P1</p>, but it has nothing distinctive to select on. The element just before it, <div lang="zh">DIV1</div>, is easy to select, so Find("div[lang=zh]+p") selects the p element we want.
The syntax of this selector is Find("prev+next"): the two parts are joined with a plus sign (+), and each part is itself a selector.
prev~next Sibling Selector
Besides adjacent siblings there is the general sibling selector. The siblings do not have to be adjacent; they only need to share a parent element.
    // Select every p that comes after a div with lang="zh" under the same parent.
    dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
Take the previous example and change + to ~. P2 is now selected as well, because P2, P1, and DIV1 are all siblings.
The syntax of the sibling selector is Find("prev~next"): the same as the adjacent selector, with + replaced by ~.
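The two sibling selectors differ only in how far after prev the next element may sit. The following plain-Go sketch models that rule without using goquery. The elem type and the adjacent, following, bodyChildren, and isZhDiv helpers are hypothetical, and the sibling list is written by hand to mirror the children of body in the example above.

```go
package main

import "fmt"

// elem is a hypothetical flat record for one child of a shared parent.
type elem struct {
	tag  string
	lang string
	text string
}

// adjacent models "prev+next": a next element qualifies only when the
// sibling immediately before it is a prev element.
func adjacent(siblings []elem, isPrev func(elem) bool, nextTag string) []string {
	var out []string
	for i := 1; i < len(siblings); i++ {
		if siblings[i].tag == nextTag && isPrev(siblings[i-1]) {
			out = append(out, siblings[i].text)
		}
	}
	return out
}

// following models "prev~next": a next element qualifies when any earlier
// sibling is a prev element.
func following(siblings []elem, isPrev func(elem) bool, nextTag string) []string {
	var out []string
	seenPrev := false
	for _, e := range siblings {
		if seenPrev && e.tag == nextTag {
			out = append(out, e.text)
		}
		if isPrev(e) {
			seenPrev = true
		}
	}
	return out
}

// bodyChildren lists the direct children of body from the example.
func bodyChildren() []elem {
	return []elem{
		{"div", "zh", "DIV1"},
		{"p", "", "P1"},
		{"div", "zh-cn", "DIV2"},
		{"div", "en", "DIV3"},
		{"span", "", ""},
		{"p", "", "P2"},
	}
}

// isZhDiv plays the role of the div[lang=zh] part of the selector.
func isZhDiv(e elem) bool { return e.tag == "div" && e.lang == "zh" }

func main() {
	fmt.Println(adjacent(bodyChildren(), isZhDiv, "p"))  // [P1]
	fmt.Println(following(bodyChildren(), isZhDiv, "p")) // [P1 P2]
}
```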
Content Filter
Sometimes, after selecting elements, we want to filter them further. That is what filters are for. There are many of them; let us start with the content filters.
    // Select div elements that contain the text DIV2.
    dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
Find(":contains(text)") selects elements that contain the specified text. In our example the selected div must contain the text DIV2, and only the DIV2 element satisfies that.
There is also Find(":empty"), which selects elements that have no children, text nodes included. Only elements with no child nodes at all are selected.
Find(":has(selector)") is similar to contains, except that what must be contained is an element node.
    // Select span elements that contain a div element.
    dom.Find("span:has(div)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
The example above selects the span elements that contain a div element.
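As a rough illustration of what :empty and :has check, here is a minimal plain-Go sketch on a hand-built tree. It does not call goquery. The node type and the isEmpty and hasDescendant helpers are hypothetical, and :has is simplified to a tag-name test rather than a full selector.

```go
package main

import "fmt"

// node is a hypothetical minimal element tree used only for this sketch.
type node struct {
	tag      string
	text     string // direct text content; "" means there is no text node
	children []*node
}

// isEmpty models :empty: no child elements and no text.
func isEmpty(n *node) bool {
	return len(n.children) == 0 && n.text == ""
}

// hasDescendant models :has(tag): some element named tag exists below n.
func hasDescendant(n *node, tag string) bool {
	for _, c := range n.children {
		if c.tag == tag || hasDescendant(c, tag) {
			return true
		}
	}
	return false
}

func main() {
	span := &node{tag: "span", children: []*node{{tag: "div", text: "DIV4"}}}
	blank := &node{tag: "div"}
	filled := &node{tag: "div", text: "DIV1"}

	fmt.Println(hasDescendant(span, "div"))   // true: the span contains a div
	fmt.Println(hasDescendant(filled, "div")) // false: it contains only text
	fmt.Println(isEmpty(blank))               // true: no children, no text
	fmt.Println(isEmpty(filled))              // false: it has text
}
```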
:first-child Filter
The :first-child filter, written Find(":first-child"), selects elements that are the first child of their parent; elements that are not are left out.
    func main() {
        html := `<body>
            <div lang="en">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div style="display:none;">DIV4</div>
                <div>DIV5</div>
            </span>
            <p>P2</p>
            <div></div>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the first child of their parent.
        // Html returns (string, error); Println prints both values.
        dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Html())
        })
    }
Previously Find("div") selected all the div elements. After adding :first-child, only DIV1 and DIV4 remain, because those two are the only ones that are the first child of their parent; none of the other div elements qualify.
:first-of-type Filter
The :first-child selector is rigid: the element must be the very first child. If another element comes before it, :first-child cannot be used. This is where :first-of-type comes in handy: it only requires the element to be the first of its type. Let us adjust the example above.
    func main() {
        html := `<body>
            <div lang="en">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <p>P2</p>
                <div>DIV5</div>
            </span>
            <div></div>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the first div among their siblings.
        dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Html())
        })
    }
The change is small: the original DIV4 is replaced with P2. If we still used :first-child, DIV5 would not be selected, because it is not the first child; P2 comes before it. With :first-of-type we get what we want, because it only requires being first among elements of the same type. DIV5 is the first div in its parent, and P2, not being a div, is ignored.
:last-child and :last-of-type Filters
These two are the exact opposites of :first-child and :first-of-type: they match the last one instead of the first. No example is given here; try them yourself.
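As a starting point for experimenting, the rule behind the two filters can be modelled in a few lines of plain Go. This sketch does not call goquery: the isLastChild and isLastOfType helpers and the sibling list are hypothetical, and each sibling is represented only by its tag name.

```go
package main

import "fmt"

// isLastChild models :last-child for the sibling at index i.
func isLastChild(siblings []string, i int) bool {
	return i == len(siblings)-1
}

// isLastOfType models :last-of-type: no later sibling has the same tag.
func isLastOfType(siblings []string, i int) bool {
	for _, tag := range siblings[i+1:] {
		if tag == siblings[i] {
			return false
		}
	}
	return true
}

func main() {
	// Tag names of the children of a hypothetical parent, in document order.
	siblings := []string{"div", "p", "div", "span"}
	for i, tag := range siblings {
		fmt.Printf("%d %-4s last-child=%-5v last-of-type=%v\n",
			i+1, tag, isLastChild(siblings, i), isLastOfType(siblings, i))
	}
}
```

In this list the second div is the last of its type even though the span, not the div, is the last child.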
:nth-child(n) Filter
This selects elements that are the nth child of their parent, with n starting at 1. So :first-child and :nth-child(1) are equivalent. By choosing n, we can flexibly select the elements we need.
    func main() {
        html := `<body>
            <div lang="en">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <p>P2</p>
                <div>DIV5</div>
            </span>
            <div></div>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the third child of their parent.
        dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Html())
        })
    }
This example selects DIV2, because DIV2 is the third child of its parent element, body.
:nth-of-type(n) Filter
:nth-of-type(n) is similar to :nth-child(n), except that it counts only elements of the same type. So :nth-of-type(1) and :first-of-type are equivalent. Try it yourself; no example is given here.
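The difference between the two counting rules is easy to see on a list of sibling tag names. The following is a minimal plain-Go sketch, not goquery code: nthChild and nthOfType are hypothetical helpers, and the list is written by hand to mirror the children of body in the :nth-child example above.

```go
package main

import "fmt"

// nthChild returns the 1-based position of the sibling at index i,
// counting every sibling (the number :nth-child compares against).
func nthChild(i int) int {
	return i + 1
}

// nthOfType returns the 1-based position of the sibling at index i,
// counting only siblings with the same tag (the number :nth-of-type
// compares against).
func nthOfType(siblings []string, i int) int {
	n := 0
	for _, tag := range siblings[:i+1] {
		if tag == siblings[i] {
			n++
		}
	}
	return n
}

func main() {
	// Children of body: DIV1, P1, DIV2, DIV3, span, empty div.
	siblings := []string{"div", "p", "div", "div", "span", "div"}
	for i, tag := range siblings {
		fmt.Printf("%-4s nth-child=%d nth-of-type=%d\n", tag, nthChild(i), nthOfType(siblings, i))
	}
}
```

DIV2 sits at index 2, so it is the third child but only the second div: it would match div:nth-child(3) and div:nth-of-type(2).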
:nth-last-child(n) and :nth-last-of-type(n) Filters
These two are like the ones above, except that they count backwards, so the last element is number 1. Try them yourself and the behavior will be obvious.
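Counting from the end can be sketched the same way. This is a plain-Go model that does not call goquery: nthLastChild and nthLastOfType are hypothetical helpers, and the sibling list is a hand-written stand-in for the children of body in the :nth-child example above.

```go
package main

import "fmt"

// nthLastChild returns the 1-based position of the sibling at index i,
// counted from the end over every sibling.
func nthLastChild(siblings []string, i int) int {
	return len(siblings) - i
}

// nthLastOfType returns the 1-based position of the sibling at index i,
// counted from the end over siblings with the same tag only.
func nthLastOfType(siblings []string, i int) int {
	n := 0
	for _, tag := range siblings[i:] {
		if tag == siblings[i] {
			n++
		}
	}
	return n
}

func main() {
	// Children of body: DIV1, P1, DIV2, DIV3, span, empty div.
	siblings := []string{"div", "p", "div", "div", "span", "div"}
	for i, tag := range siblings {
		fmt.Printf("%-4s nth-last-child=%d nth-last-of-type=%d\n",
			tag, nthLastChild(siblings, i), nthLastOfType(siblings, i))
	}
}
```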
:only-child Filter
As the name suggests, Find(":only-child") selects an element only if it is the sole child of its parent, that is, the parent has no other child elements.
    func main() {
        html := `<body>
            <div lang="en">DIV1</div>
            <span>
                <div>DIV5</div>
            </span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the only child of their parent.
        dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Html())
        })
    }
In the example, DIV5 is selected because it is the only child of its parent, span. DIV1 is not an only child, so it is not selected.
:only-of-type Filter
In the example above, what if we want to select DIV1? Use Find(":only-of-type"), because DIV1 is the only div in its parent element. That is exactly what the :only-of-type filter does: an element is selected as long as it is the only one of its type among its siblings. Change the example above to :only-of-type and check whether DIV1 appears.
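Before running the experiment, the expected result can be reasoned out with a small plain-Go model. It does not call goquery: isOnlyChild and isOnlyOfType are hypothetical helpers, and the two sibling lists are written by hand to mirror the children of body and of span in the example above.

```go
package main

import "fmt"

// isOnlyChild models :only-child: the parent has exactly one child element.
func isOnlyChild(siblings []string) bool {
	return len(siblings) == 1
}

// isOnlyOfType models :only-of-type: no other sibling shares the tag of
// the sibling at index i.
func isOnlyOfType(siblings []string, i int) bool {
	count := 0
	for _, tag := range siblings {
		if tag == siblings[i] {
			count++
		}
	}
	return count == 1
}

func main() {
	bodyChildren := []string{"div", "span"} // DIV1 and the span
	spanChildren := []string{"div"}         // DIV5

	// DIV1 is at index 0 of bodyChildren, DIV5 at index 0 of spanChildren.
	fmt.Println("DIV1 only-child:  ", isOnlyChild(bodyChildren))
	fmt.Println("DIV1 only-of-type:", isOnlyOfType(bodyChildren, 0))
	fmt.Println("DIV5 only-child:  ", isOnlyChild(spanChildren))
	fmt.Println("DIV5 only-of-type:", isOnlyOfType(spanChildren, 0))
}
```

Under this model DIV1 fails the only-child test because of the span next to it, but passes the only-of-type test, while DIV5 passes both.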
Selector OR (|) Operation
What if we want to select both div and span elements? Combine several selectors, separated by commas (,): Find("selector1, selector2, selectorN"). An element is selected if it matches any one of the selectors; this is the OR (|) operation on selectors.
    func main() {
        html := `<body>
            <div lang="en">DIV1</div>
            <span>
                <div>DIV5</div>
            </span>
        </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element that is either a div or a span.
        dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Html())
        })
    }
Summary
goquery is an essential tool for parsing HTML pages. When crawling web pages, flexible use of its different selectors makes crawling work much more efficient.