Last week I added a search feature to my website that automatically crawls my own blog and other people's CSDN blogs. The crawler reads RSS feeds: the format is standardized and easy to parse, but it carries little information. I later noticed that the HTML source of each page contains meta fields, which are there to help search engines index the page and which give the author and a description. Taking my post "Sending gzip HTTP requests in Golang" as an example, its meta information is as follows:
    <meta charset="utf-8">
    <meta name="description" content="beego的httplib不支持发送gzip请求,自己研究了一下。">
    <meta name="author" content="Bryce">
    <meta name="google-translate-customization" content="a4136e955b3e09f2-45a74b56dc13e741-gf616ffda6e6360e0-11">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
I looked into it, and most people do this kind of parsing with XPath. There is a ready-made package, https://github.com/go-xmlpath/xmlpath, but following its instructions did not work for me. Reading the source code, I found that the package is implemented on top of encoding/xml internally, so if the HTML is not well-formed, with tags not written strictly according to the specification, parsing fails. In other words, HTML cannot simply be treated as XHTML.
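To make the failure concrete, here is a minimal sketch that feeds two small documents to encoding/xml directly. It only illustrates the standard library's default strict mode, not xmlpath's own code paths; `parseStrict` and the sample strings are hypothetical names made up for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// parseStrict walks every token of doc with encoding/xml in its default
// strict mode and returns the first error, or nil if doc is well-formed.
func parseStrict(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil // reached the end without a syntax error
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Typical HTML: <meta> is a void element and is never closed.
	html := `<html><head><meta charset="utf-8"></head></html>`
	// The same markup written as XHTML, with a self-closing tag.
	xhtml := `<html><head><meta charset="utf-8"/></head></html>`

	fmt.Println("HTML: ", parseStrict(html))
	fmt.Println("XHTML:", parseStrict(xhtml))
}
```

The first document should be rejected because the decoder reaches `</head>` while `<meta>` is still open, while the XHTML variant parses cleanly. Real-world pages are full of such unclosed tags, which is why an XML-based parser is a poor fit here.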
Later I found a magical tool, https://github.com/EricChiang/pup, which can be installed with the command go get github.com/ericchiang/pup. It reads HTML from standard input, so it can sit at the end of a shell pipeline, and with a suitable selector it can fetch the author and the description directly.
This tool solves my problem perfectly. But when I looked at its source code, I found that the package name is main, so it cannot be imported as a library, and its HTML parsing code is not convenient to reuse on its own either. After some thought, I somewhat awkwardly settled on running it as an external command (cmd) and feeding it through a pipe.
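    package main

    import (
        "bytes"
        "fmt"
        "os/exec"

        "github.com/astaxie/beego/httplib"
    )

    func main() {
        // Fetch the raw HTML of the page.
        req := httplib.Get("http://blog.cyeam.com/golang/2014/11/29/golang_gzip/")
        res, err := req.Bytes()
        if err != nil {
            panic(err)
        }

        // Equivalent to `... | pup 'head meta'` in a shell.
        cmd := exec.Command("pup", "head meta")
        stdin, err := cmd.StdinPipe()
        if err != nil {
            panic(err)
        }

        var output bytes.Buffer
        cmd.Stdout = &output

        // Use Start, not Run: Run would wait for pup to exit before any input is written.
        if err := cmd.Start(); err != nil {
            panic(err)
        }

        // Write the HTML to pup's stdin, then close it so pup sees end-of-input.
        if _, err := stdin.Write(res); err != nil {
            panic(err)
        }
        stdin.Close()

        if err := cmd.Wait(); err != nil {
            panic(err)
        }
        fmt.Println(output.String()) // for debugging
    }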
In a shell, a pipeline is written with |. To get the same effect from Golang code, use the Stdin support provided by the os/exec package: StdinPipe returns a writer connected to the command's standard input, and writing content to it is equivalent to feeding the pipeline. Once everything has been written, close the input stream with stdin.Close(). If it is not closed, pup never sees the end of its input and cmd.Wait() keeps blocking ...
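As a variation on the same idea, the sketch below lets os/exec manage the pipe: when Stdin is set to an io.Reader, the package copies the data to the child process and closes the pipe once the reader is exhausted, so there is no explicit Close to forget. This is an alternative to the StdinPipe approach, not the method used in the original code; `pipeThrough` and the sample page are hypothetical, and the example assumes pup is installed and on PATH.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// pipeThrough is a hypothetical helper: it runs name with args, feeds input
// to the command's standard input, and returns what the command wrote to
// standard output.
func pipeThrough(input []byte, name string, args ...string) ([]byte, error) {
	cmd := exec.Command(name, args...)
	// With an io.Reader as Stdin, os/exec copies the data in a goroutine and
	// closes the pipe when the reader is exhausted, so the child sees EOF.
	cmd.Stdin = bytes.NewReader(input)
	return cmd.Output()
}

func main() {
	// A tiny stand-in for a downloaded page.
	page := []byte(`<html><head><meta name="author" content="Bryce"></head></html>`)

	// Assumes pup is installed and on PATH; "head meta" is the selector
	// used earlier in this post.
	out, err := pipeThrough(page, "pup", "head meta")
	if err != nil {
		fmt.Println("running pup failed:", err)
		return
	}
	fmt.Print(string(out))
}
```

StdinPipe remains the better choice when the input is produced incrementally; for a page that is already fully in memory, the reader form is shorter and harder to get wrong.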
Please refer to the complete source code in this article.
Original link: Parsing HTML in Golang with pup. Please credit the source when reposting!