Golang goquery Selectors: A Complete Collection of Examples

Last Update:2018-01-21 Source: Internet

Author: User


This is an original article; please credit the source when reprinting. To be among the first to read new articles, follow the flysnow_org official account or visit http://www.flysnow.org/. If you find the article useful, please share it with your friends. Thank you for your support.

I have recently been studying web crawlers in Go and use the goquery library a lot, especially its selectors, to pick out and match content in the HTML I fetch. goquery supports many selectors, including a number that are less common but very useful, so I have summarized them here for reference.

If you have done front-end development, you will be familiar with jQuery. goquery is similar: it is the Go counterpart of jQuery, and it makes processing HTML easy.

Selectors based on HTML elements

This is the simplest kind: select by a basic HTML element such as a, p, or div, using the element name itself as the selector. For example, dom.Find("div").

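    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<body>
                    <div>DIV1</div>
                    <div>DIV2</div>
                    <span>SPAN</span>
                </body>`
        // Parse the HTML string into a goquery document.
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Print the text of every div element.
        dom.Find("div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

The remaining examples assume the same package clause and imports.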

In the example above, the div elements are selected, while body and span are not.

ID Selector

This is the most frequently used selector. In the example above there are two div elements, but suppose we need only one of them. If that tag has a unique id, we can use the id selector to locate it precisely.

    func main() {
        html := `<body>
                    <div id="div1">DIV1</div>
                    <div>DIV2</div>
                    <span>SPAN</span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select only the element whose id is div1.
        dom.Find("#div1").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

Element ID Selector

An id selector starts with #, followed by the element's id value; the syntax is dom.Find("#id"). From here on I will abbreviate this to Find("#id"); just remember that it stands for a goquery selector.

What if the same id appears on different kinds of HTML elements? In that case, combine the id with the element name. For example, to select elements that are div elements and have the id div1, use Find("div#div1").

The syntax for this kind of selector is Find("element#id"). This is a common way of combining selectors, and the selectors described below can be combined with an element name in the same way.

Class Selector

class is another commonly used HTML attribute, and the class selector lets us quickly select the elements we need. Its usage is similar to the id selector: Find(".class").

    func main() {
        html := `<body>
                    <div id="div1">DIV1</div>
                    <div class="name">DIV2</div>
                    <span>SPAN</span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element whose class is name.
        dom.Find(".name").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

In the example above, the div element whose class is name is selected.

Element Class Selector

Like the id selector, the class selector can be combined with an HTML element name. The syntax is similar: Find("element.class") selects elements of the given type that have the given class.

Attribute Selector

An HTML element has its own attributes and attribute values, so we can also select elements by attribute and value.

    func main() {
        html := `<body>
                    <div>DIV1</div>
                    <div class="name">DIV2</div>
                    <span>SPAN</span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that have a class attribute, whatever its value.
        dom.Find("div[class]").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

In this example, the div[class] selector selects div elements that have a class attribute, so the first div, which has none, is not selected.

The example above uses the mere presence of an attribute as the condition. In the same way, we can select elements whose attribute has a particular value.

    dom.Find("div[class=name]").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

This selects the div element whose class attribute has the value name.

class is used here only as an example; any other attribute, such as href, works the same way, and so do custom attributes.

Besides exact equality there are other ways of matching. They are used in the same way, so they are simply listed here without separate examples.

Selector	Description
Find("div[lang]")	div elements that have a lang attribute
Find("div[lang=zh]")	div elements whose lang attribute equals zh
Find("div[lang!=zh]")	div elements whose lang attribute does not equal zh
Find("div[lang|=zh]")	div elements whose lang attribute is zh or starts with zh-
Find("div[lang*=zh]")	div elements whose lang attribute contains the string zh
Find("div[lang~=zh]")	div elements whose lang attribute contains the word zh, where words are separated by spaces
Find("div[lang$=zh]")	div elements whose lang attribute ends with zh (case-sensitive)
Find("div[lang^=zh]")	div elements whose lang attribute starts with zh (case-sensitive)

The above covers attribute selectors one condition at a time. You can also combine several attribute conditions by chaining brackets, for example Find("div[id][lang=zh]"). When several conditions are given, only elements that satisfy all of them are selected.

parent>child Selector

To select the matching elements directly under a given element, use the child selector. Its syntax is Find("parent>child"), which selects the elements matching child that are direct (first-level) children of parent.

    func main() {
        html := `<body>
                    <div lang="zh">DIV1</div>
                    <div lang="zh-cn">DIV2</div>
                    <div lang="en">DIV3</div>
                    <span>
                        <div>DIV4</div>
                    </span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select only the div elements that are direct children of body.
        dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

The example above selects the div elements that are direct children of body, so the result is DIV1, DIV2, and DIV3. DIV4 is also a descendant of body, but it is not a direct child, so it is not selected.

So what if we also want DIV4, that is, all div elements under body regardless of whether they are one, two, or N levels deep? goquery handles this too: just replace the greater-than sign (>) with a space. For the example above, change the selector as follows.

    dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

prev+next Adjacent Selector

Suppose the element we want has no distinguishing feature of its own, but the element just before it does. In that case we can use the adjacent sibling selector.

    func main() {
        html := `<body>
                    <div lang="zh">DIV1</div>
                    <p>P1</p>
                    <div lang="zh-cn">DIV2</div>
                    <div lang="en">DIV3</div>
                    <span>
                        <div>DIV4</div>
                    </span>
                    <p>P2</p>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the p element that immediately follows div[lang=zh].
        dom.Find("div[lang=zh]+p").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

This example demonstrates the usage. We want to select the <p>P1</p> element, but it has nothing distinctive to match on. The element right before it, <div lang="zh">DIV1</div>, is easy to match, so we can use Find("div[lang=zh]+p") to select the p element.

The syntax is Find("prev+next"): a plus sign (+) between two parts, where prev and next are themselves selectors.


prev~next Selector

Alongside the adjacent sibling selector there is the general sibling selector, which does not require the elements to be adjacent; they only need to share the same parent element.

    dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

Taking the previous example, just change + to ~ and P2 is selected as well, because P2, P1, and DIV1 are all siblings.

The syntax of the sibling selector is Find("prev~next"), that is, the adjacent selector with + replaced by ~.

Content Filter

Sometimes, after selecting elements with a selector, we want to narrow the result further. That is what filters are for. There are many filters; we start with content filters.

    dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

Find(":contains(text)") selects elements that contain the specified text. In our example the selected div must contain the text DIV2, and only one element, DIV2, meets that requirement.

There is also Find(":empty"), which selects elements that have no children at all (text nodes count as children); only elements that contain nothing are selected.

Find(":has(selector)") is similar to contains, except that what must be contained is an element node.

    dom.Find("span:has(div)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

The example above selects the span elements that contain a div element.

:first-child Filter

The :first-child filter, with the syntax Find(":first-child"), selects an element only if it is the first child of its parent; otherwise it is not selected.

    func main() {
        html := `<body>
                    <div lang="en">DIV1</div>
                    <p>P1</p>
                    <div lang="zh-cn">DIV2</div>
                    <div lang="en">DIV3</div>
                    <span>
                        <div style="display:none;">DIV4</div>
                        <div>DIV5</div>
                    </span>
                    <p>P2</p>
                    <div></div>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
            // Html returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

In the example above, Find("div") would select all div elements, but with :first-child added only DIV1 and DIV4 remain, because only those two are the first child of their parent element; none of the other div elements qualify.

:first-of-type Filter

The :first-child selector is quite strict: the element must be the very first child, so it cannot be used if any other element comes before it. This is where :first-of-type comes in handy: it only requires the element to be the first of its type. Let's adjust the example above slightly.

    func main() {
        html := `<body>
                    <div lang="en">DIV1</div>
                    <p>P1</p>
                    <div lang="zh-cn">DIV2</div>
                    <div lang="en">DIV3</div>
                    <span>
                        <p>P2</p>
                        <div>DIV5</div>
                    </span>
                    <div></div>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
            // Html returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

The change is simple: the original DIV4 is replaced with P2. If we still used :first-child, DIV5 would not be selected, because it is not the first child; P2 comes before it. With :first-of-type we get what we want, because it only requires the element to be the first of its type. DIV5 is the first div under its parent, and P2, not being a div, is ignored.

:last-child and :last-of-type Filters

These two are the exact opposites of :first-child and :first-of-type: they match the last element instead of the first. No example is given here; you can try them yourself.

:nth-child(n) Filter

This selects an element that is the nth child of its parent, where n starts at 1. It follows that :first-child and :nth-child(1) are equivalent. By choosing n, we can flexibly select exactly the element we need.

    func main() {
        html := `<body>
                    <div lang="en">DIV1</div>
                    <p>P1</p>
                    <div lang="zh-cn">DIV2</div>
                    <div lang="en">DIV3</div>
                    <span>
                        <p>P2</p>
                        <div>DIV5</div>
                    </span>
                    <div></div>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the div that is the third child of its parent.
        dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
            // Html returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

This example selects DIV2, because DIV2 is the third child of its parent element, body.

:nth-of-type(n) Filter

:nth-of-type(n) is similar to :nth-child(n), except that it counts only elements of the same type, so :nth-of-type(1) and :first-of-type are equivalent. No example is given here; you can try it yourself.

:nth-last-child(n) and :nth-last-of-type(n) Filters

These two are similar to the ones above, except that they count from the end, so the last element is number 1. Try them yourself; the behavior is easy to see.

:only-child Filter

As the name suggests, Find(":only-child") selects an element that is the only child of its parent, that is, its parent has no other child elements.

    func main() {
        html := `<body>
                    <div lang="en">DIV1</div>
                    <span>
                        <div>DIV5</div>
                    </span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
            // Html returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

In this example DIV5 is selected, because it is the only child of its parent, span. DIV1 is not the only child of body, so it is not selected.

:only-of-type Filter

In the example above, what if we also want to select DIV1? We can use Find(":only-of-type"), because DIV1 is the only div element under its parent. That is what the :only-of-type filter does: an element is selected as long as it is the only one of its type within its parent. Change the example above to :only-of-type and check whether DIV1 shows up.

Selector OR (|) Operation

What if we want to select both div and span elements? We can combine several selectors, separated by commas: Find("selector1, selector2, selectorN"). An element is selected if it matches any one of them, which makes this an OR (|) operation on selectors.

    func main() {
        html := `<body>
                    <div lang="en">DIV1</div>
                    <span>
                        <div>DIV5</div>
                    </span>
                </body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element that is either a div or a span.
        dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
            // Html returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

Summary

goquery is an essential tool for parsing HTML pages. Using its various selectors flexibly while crawling makes the extraction work simpler and can greatly improve a crawler's efficiency.


This article is an English translation of an article originally published in Chinese on aliyun.com.
