This is an original article; please credit the source when reprinting. To see new articles as soon as they are published, scan the QR code to follow the WeChat public account flysnow_org, or visit http://www.flysnow.org/. If you find the article useful, please share it with your friends. Thank you for your support.
I have recently been learning to write crawlers in Go and have used the goquery library a lot, especially for selecting and finding matching content in crawled HTML. goquery selectors come up constantly, and many of them are less common but very useful, so I have summarized them here for reference.
If you have done front-end development, jQuery will be familiar. goquery is similar: it is a Go implementation of jQuery's features, and it makes processing HTML easy.
Selectors Based on HTML Elements
This is the simplest kind: selection based on basic HTML elements such as a and p, using the element name directly as the selector. For example, dom.Find("div").

    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<body><div>DIV1</div><div>DIV2</div><span>SPAN</span></body>`
        // Parse the HTML string into a goquery document.
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Print the text of every div element.
        dom.Find("div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

In the example above, the div elements are selected, while body and span are not.
ID Selector
This is the most frequently used selector. The example above has two div elements, but suppose we need only one of them. We just give that element a unique id, and then we can locate it precisely with an id selector.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div id="div1">DIV1</div><div>DIV2</div><span>SPAN</span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the element whose id is div1.
        dom.Find("#div1").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

Element ID Selector
An id selector starts with # followed by the element's id value; the syntax is dom.Find("#id"). In what follows I abbreviate this to Find("#id"); just remember that it stands for a goquery selector.
What if the same id appears on different kinds of HTML elements? We can combine the id with an element name. For example, to select the element that is a div and has the id div1, use Find("div#div1").
So the syntax for this kind of selector is Find("element#id"). This is a common way of combining selectors, and the selectors described below can be combined in the same way.
Class Selector
class is another commonly used HTML attribute, and a class selector lets us quickly select the elements we need. Its usage is similar to the id selector: Find(".class").

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div id="div1">DIV1</div><div class="name">DIV2</div><span>SPAN</span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element whose class is name.
        dom.Find(".name").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

In the example above, the div element whose class is name is selected.
Element Class Selector
Like id selectors, class selectors can be combined with an HTML element name. The syntax is similar: Find("element.class") selects elements of the given type that have the given class.
Attribute Selector
An HTML element has its own attributes and attribute values, so we can also select elements by attribute and value.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div>DIV1</div><div class="name">DIV2</div><span>SPAN</span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that have a class attribute, whatever its value.
        dom.Find("div[class]").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

In this example, the selector div[class] selects elements that are div elements and have a class attribute, so the first div, which has no class attribute, is not selected.
The example above uses the mere presence of an attribute as the condition. In the same way, we can select elements whose attribute has a specific value.

    dom.Find("div[class=name]").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

This selects the div element whose class attribute has the value name.
class is only the example here; other attributes such as href work too, and so do custom attributes.
Besides exact equality there are other ways of matching, all used in a similar manner. They are listed together here without further examples.
Selector | Description
Find("div[lang]") | Selects div elements that have a lang attribute
Find("div[lang=zh]") | Selects div elements whose lang attribute equals zh
Find("div[lang!=zh]") | Selects div elements whose lang attribute is not equal to zh
Find("div[lang|=zh]") | Selects div elements whose lang attribute is zh or begins with zh-
Find("div[lang*=zh]") | Selects div elements whose lang attribute contains the string zh
Find("div[lang~=zh]") | Selects div elements whose lang attribute contains the word zh, with words separated by spaces
Find("div[lang$=zh]") | Selects div elements whose lang attribute ends with zh (case-sensitive)
Find("div[lang^=zh]") | Selects div elements whose lang attribute starts with zh (case-sensitive)
The above covers attribute selectors with a single condition. You can also combine several attribute conditions, for example Find("div[id][lang=zh]"), by chaining multiple pairs of brackets. When there are several conditions, only elements that satisfy all of them are selected.
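To make the operators in the list above concrete, here is a minimal sketch in plain Go of the comparison each one performs on an attribute that is present. The function name attrMatches is hypothetical, and this only models the matching rules as described in the list; it is not goquery's actual implementation and uses only the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// attrMatches is a hypothetical helper that models how each attribute
// operator compares an attribute value (got) with the selector value (want).
// It assumes the attribute is present on the element.
func attrMatches(op, got, want string) bool {
	switch op {
	case "=": // exactly equal
		return got == want
	case "!=": // not equal
		return got != want
	case "|=": // equal, or starts with want followed by a hyphen
		return got == want || strings.HasPrefix(got, want+"-")
	case "*=": // contains want as a substring
		return strings.Contains(got, want)
	case "~=": // contains want as a whitespace-separated word
		for _, word := range strings.Fields(got) {
			if word == want {
				return true
			}
		}
		return false
	case "$=": // ends with want
		return strings.HasSuffix(got, want)
	case "^=": // starts with want
		return strings.HasPrefix(got, want)
	}
	return false
}

func main() {
	// Compare the attribute value "zh-CN" against "zh" with each operator.
	for _, op := range []string{"=", "!=", "|=", "*=", "~=", "$=", "^="} {
		fmt.Printf("[lang%szh] against lang=%q: %v\n", op, "zh-CN", attrMatches(op, "zh-CN", "zh"))
	}
}
```

For the value zh-CN, this model reports a match for !=, |=, *=, and ^=, and no match for =, ~=, and $=.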
Parent>Child Selector
To select the elements that match a condition directly under some element, use the child selector. Its syntax is Find("parent>child"), which selects the direct (first-level) children of the parent element that match child.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="zh">DIV1</div><div lang="zh-cn">DIV2</div><div lang="en">DIV3</div><span><div>DIV4</div></span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select only the div elements that are direct children of body.
        dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

The example above selects the matching direct children of the body element, so the result is DIV1, DIV2, and DIV3. DIV4 is also a descendant of body, but it is not a first-level child, so it is not selected.
So what if we want DIV4 as well, that is, every div under body regardless of whether it is one, two, or N levels deep? goquery has this covered: just replace the greater-than sign (>) with a space. The example above becomes the following selector.

    dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

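The difference between the two forms can be sketched with a tiny tree in plain Go. The node type and the children and descendants functions are hypothetical names used only for illustration; this models the idea of "direct children" versus "any depth" and is not how goquery is implemented.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for an HTML element.
type node struct {
	tag  string
	text string
	kids []*node
}

// children models "parent>child": only direct children with the given tag.
func children(parent *node, tag string) []string {
	var out []string
	for _, k := range parent.kids {
		if k.tag == tag {
			out = append(out, k.text)
		}
	}
	return out
}

// descendants models "parent child": matching elements at any depth,
// collected in document order.
func descendants(parent *node, tag string) []string {
	var out []string
	for _, k := range parent.kids {
		if k.tag == tag {
			out = append(out, k.text)
		}
		out = append(out, descendants(k, tag)...)
	}
	return out
}

// exampleBody builds the same structure as the HTML in the
// parent>child example: three div children and a span wrapping DIV4.
func exampleBody() *node {
	return &node{tag: "body", kids: []*node{
		{tag: "div", text: "DIV1"},
		{tag: "div", text: "DIV2"},
		{tag: "div", text: "DIV3"},
		{tag: "span", kids: []*node{{tag: "div", text: "DIV4"}}},
	}}
}

func main() {
	body := exampleBody()
	fmt.Println(children(body, "div"))    // models Find("body>div")
	fmt.Println(descendants(body, "div")) // models Find("body div")
}
```

The first call lists DIV1, DIV2, and DIV3; the second also includes DIV4, because it walks into the span.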
Prev+Next Adjacent Selector
Suppose the element we want has no distinguishing pattern of its own, but the element immediately before it does. In that case we can use the next-adjacent selector.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="zh">DIV1</div><p>P1</p><div lang="zh-cn">DIV2</div><div lang="en">DIV3</div><span><div>DIV4</div></span><p>P2</p></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select the p element that immediately follows <div lang="zh">.
        dom.Find("div[lang=zh]+p").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }

This example demonstrates the usage. We want to select the element <p>P1</p>, but it has no distinguishing feature. The element before it, <div lang="zh">DIV1</div>, is easy to select, so we can use Find("div[lang=zh]+p") to select the p element.
The syntax of this selector is Find("prev+next"), with a plus sign (+) in between; what comes before and after the + are themselves selectors.
Prev~Next Selector
Besides the adjacent-sibling selector there is the general sibling selector, which does not require the two elements to be adjacent, only that they share the same parent element.

    dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

Taking the previous example, we only need to change + to ~ and P2 is selected as well, because P2, P1, and DIV1 are all siblings.
The syntax of the sibling selector is Find("prev~next"); that is, the + of the adjacent selector is replaced by ~.
Content Filter
Sometimes, after selecting elements with a selector, we want to filter them further. That is what filters are for. There are many filters; we start with the content filters.

    dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

Find(":contains(text)") selects elements that contain the specified text. In our example, the selected div elements must contain the text DIV2, and only the DIV2 element meets that requirement.
There is also Find(":empty"), which selects elements that have no children (text nodes count as children); only elements with no child nodes at all are selected.
Find(":has(selector)") is similar to contains, except that what must be contained is an element node.

    dom.Find("span:has(div)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

The example above selects span elements that contain a div element.
:first-child Filter
The :first-child filter, with the syntax Find(":first-child"), selects elements that are the first child of their parent element; elements that are not are left out.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="en">DIV1</div><p>P1</p><div lang="zh-cn">DIV2</div><div lang="en">DIV3</div><span><div style="display:none;">DIV4</div><div>DIV5</div></span><p>P2</p><div></div></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the first child of their parent.
        dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
            // Html() returns (string, error), so Println prints the inner
            // HTML followed by the error value (<nil> on success).
            fmt.Println(selection.Html())
        })
    }

In the example above, Find("div") would select all the div elements, but once we add :first-child only DIV1 and DIV4 remain, because only these two are the first child of their parent element; none of the other div elements qualify.
:first-of-type Filter
The :first-child selector is quite strict: the element must be the very first child. If another element precedes it, :first-child cannot be used. This is where :first-of-type comes in handy: it only requires the element to be the first of its type. Let's adjust the example above slightly.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="en">DIV1</div><p>P1</p><div lang="zh-cn">DIV2</div><div lang="en">DIV3</div><span><p>P2</p><div>DIV5</div></span><div></div></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the first div under their parent.
        dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
            // Html() returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

The change is simple: the original DIV4 is replaced with P2. If we still used :first-child, DIV5 would not be selected, because it is not the first child; P2 comes before it. With :first-of-type we can reach the goal, because it only requires the element to be the first of its type. DIV5 is the first div element under its parent; P2 is not a div, so it is ignored.
:last-child and :last-of-type Filters
These two are the exact opposites of :first-child and :first-of-type above: they select the last one. No example is given here; you can try them yourself.
:nth-child(n) Filter
This selects elements that are the nth child of their parent element, with n starting from 1. So :first-child is equivalent to :nth-child(1). By specifying n, we can flexibly select the elements we need.

    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="en">DIV1</div><p>P1</p><div lang="zh-cn">DIV2</div><div lang="en">DIV3</div><span><p>P2</p><div>DIV5</div></span><div></div></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the third child of their parent.
        dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
            // Html() returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }

This example selects DIV2, because DIV2 is the third child of its parent element, body.
:nth-of-type(n) Filter
:nth-of-type(n) is similar to :nth-child(n), except that it counts only elements of the same type, so :nth-of-type(1) is equivalent to :first-of-type. You can try it yourself; no example is given here.
:nth-last-child(n) and :nth-last-of-type(n) Filters
These two are similar to the ones above, except that they count backwards, treating the last element as the first. Try them out yourself; the behavior is easy to see.
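Since the positional filters above differ only in what they count and from which end, a small plain-Go model may help. It represents the children of one parent as a slice of tag names; the four function names are hypothetical, and the code only models the counting rules described above rather than goquery's implementation. It checks position only; the element-name part of a selector such as div:nth-child(3) is a separate condition.

```go
package main

import "fmt"

// nthChild models :nth-child(n): the sibling at index i (0-based)
// is the nth child (1-based) of its parent.
func nthChild(siblings []string, i, n int) bool {
	return i+1 == n
}

// nthOfType models :nth-of-type(n): count only siblings with the
// same tag, from the start up to and including index i.
func nthOfType(siblings []string, i, n int) bool {
	count := 0
	for j := 0; j <= i; j++ {
		if siblings[j] == siblings[i] {
			count++
		}
	}
	return count == n
}

// nthLastChild models :nth-last-child(n): like nthChild, but
// counting from the end, so the last child is number 1.
func nthLastChild(siblings []string, i, n int) bool {
	return len(siblings)-i == n
}

// nthLastOfType models :nth-last-of-type(n): count only siblings
// with the same tag, from the end back to index i.
func nthLastOfType(siblings []string, i, n int) bool {
	count := 0
	for j := len(siblings) - 1; j >= i; j-- {
		if siblings[j] == siblings[i] {
			count++
		}
	}
	return count == n
}

func main() {
	// Children of body in the :nth-child example:
	// DIV1, P1, DIV2, DIV3, the span, and the trailing empty div.
	tags := []string{"div", "p", "div", "div", "span", "div"}
	fmt.Println(nthChild(tags, 2, 3))      // DIV2 as div:nth-child(3)
	fmt.Println(nthOfType(tags, 2, 2))     // DIV2 as div:nth-of-type(2)
	fmt.Println(nthLastChild(tags, 5, 1))  // empty div as div:nth-last-child(1)
	fmt.Println(nthLastOfType(tags, 3, 2)) // DIV3 as div:nth-last-of-type(2)
}
```

All four calls report true. In this model, :first-child and :last-child correspond to nthChild and nthLastChild with n set to 1, and likewise for the of-type variants.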
:only-child Filter
As the name suggests, the Find(":only-child") filter selects an element only if it is the sole child of its parent element, that is, the parent has no other child elements.
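
    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="en">DIV1</div><span><div>DIV5</div></span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select div elements that are the only child of their parent.
        dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
            // Html() returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }
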
In this example DIV5 is selected because it is the only child of its parent element span. DIV1 is not the only child of body, so it is not selected.
:only-of-type Filter
What if we want the example above to select DIV1 as well? We can use Find(":only-of-type"), because DIV1 is the only div element under its parent. That is what the :only-of-type filter does: an element is selected as long as it is the only one of its type under its parent. Change the example above to :only-of-type and check whether DIV1 appears.
Selector OR (|) Operation
What if we want to select both div and span elements? We can combine several selectors, separated by commas (,): Find("selector1, selector2, selectorN"). An element is selected if it matches any one of the selectors, which amounts to an OR (|) operation on selectors.
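
    // Uses the imports fmt, log, strings, and github.com/PuerkitoBio/goquery.
    func main() {
        html := `<body><div lang="en">DIV1</div><span><div>DIV5</div></span></body>`
        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }
        // Select every element that is either a div or a span.
        dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
            // Html() returns (string, error); Println prints both values.
            fmt.Println(selection.Html())
        })
    }
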
Summary
goquery is an essential tool for parsing HTML pages. When crawling web pages, making flexible use of its different selectors makes extracting the content we need much more efficient.