Golang HTML parsing via pup

Source: Internet
Author: User

Last week I added a search feature to my website that automatically crawls my own blog and other people's CSDN blogs. At first it crawled RSS feeds. RSS is a well-specified format and easy to parse, but it carries little information. I later noticed that the HTML source of a page contains meta fields, intended to help search engines index the page, that give the author and a description. Take my post "Sending gzip requests over HTTP in Golang" as an example. Its meta information looks like this:

<meta charset="utf-8">
<meta name="description" content="beego的httplib不支持发送gzip请求,自己研究了一下。">
<meta name="author" content="Bryce">
<meta name="google-translate-customization" content="a4136e955b3e09f2-45a74b56dc13e741-gf616ffda6e6360e0-11">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

I looked around and found that most people do this kind of analysis with XPath. There is a ready-made package, https://github.com/go-xmlpath/xmlpath, but following its instructions did not work for me. Reading the source code, I saw that the package is implemented on top of encoding/xml internally, so if the HTML has problems, such as tags that are not written strictly according to the specification, parsing fails. In other words, HTML cannot simply be treated as XHTML.
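The failure mode can be reproduced with the standard library alone. The following is a minimal sketch, not a reproduction of xmlpath itself: `sampleHTML` and `metaTags` are hypothetical names introduced for illustration. It shows that the default, strict encoding/xml decoder rejects unclosed `<meta>` tags, and that the decoder's own lenient settings (`Strict`, `AutoClose`, `Entity`) can get past this particular problem.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleHTML is a hypothetical page head. Like most real HTML, its <meta>
// tags are never closed, so it is not well-formed XML.
const sampleHTML = `<html><head><meta charset="utf-8"><meta name="description" content="demo page"><meta name="author" content="Bryce"></head><body></body></html>`

// metaTags walks the token stream and collects name/content pairs from
// <meta> elements. With strict set it uses the decoder defaults; otherwise
// it enables the HTML-oriented leniency that encoding/xml offers.
func metaTags(doc string, strict bool) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	if !strict {
		dec.Strict = false
		dec.AutoClose = xml.HTMLAutoClose // treat <meta>, <br>, ... as self-closing
		dec.Entity = xml.HTMLEntity       // accept entities such as &nbsp;
	}
	tags := map[string]string{}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return tags, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "meta" {
			continue
		}
		var name, content string
		for _, attr := range start.Attr {
			switch attr.Name.Local {
			case "name":
				name = attr.Value
			case "content":
				content = attr.Value
			}
		}
		if name != "" {
			tags[name] = content
		}
	}
}

func main() {
	// The strict decoder fails when </head> arrives while <meta> is still open.
	if _, err := metaTags(sampleHTML, true); err != nil {
		fmt.Println("strict parse failed:", err)
	}
	tags, err := metaTags(sampleHTML, false)
	if err != nil {
		fmt.Println("lenient parse failed:", err)
		return
	}
	fmt.Println("author:", tags["author"])
	fmt.Println("description:", tags["description"])
}
```

The lenient mode only covers a few common deviations such as void elements and HTML entities. Badly broken markup can still fail, which is why a real HTML parser is the safer choice.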

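Later I found a wonderful tool, https://github.com/EricChiang/pup, which is installed with the command go get github.com/ericchiang/pup. It is used in a pipeline: the HTML is piped into its standard input and a CSS selector is passed as the argument, for example pup 'head meta'.

The author and the description can also be fetched directly with a more specific selector.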

This tool solves my problem perfectly. I went in to look at its source code and found that the package name is main, so it cannot be imported as a library. On top of that, the code it uses to parse HTML is not very convenient to reuse. After thinking it over, I settled, a little sheepishly, on running it as an external command and feeding it input through a pipe.

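package main

import (
	"bytes"
	"fmt"
	"os/exec"

	"github.com/astaxie/beego/httplib" // beego's httplib; import path assumed
)

func main() {
	// Download the page.
	req := httplib.Get("http://blog.cyeam.com/golang/2014/11/29/golang_gzip/")
	res, err := req.Bytes()
	if err != nil {
		panic(err)
	}

	// Run pup with the selector "head meta"; pup must be on the PATH.
	cmd := exec.Command("pup", "head meta")
	stdin, err := cmd.StdinPipe()
	if err != nil {
		panic(err)
	}
	var output bytes.Buffer
	cmd.Stdout = &output

	// Use Start, not Run: Run would wait for pup to exit before any input is written.
	if err := cmd.Start(); err != nil {
		panic(err) // replace with a logger, or anything you want
	}

	// Write the HTML to pup's standard input, then close it to signal end of input.
	if _, err := stdin.Write(res); err != nil {
		panic(err)
	}
	stdin.Close()

	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	fmt.Println(output.String()) // for debugging
}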

On the shell command line, a pipeline is written with |. To do the same thing from Golang code, use the Stdin support provided by the exec package. Writing content to the command's standard input stream is equivalent to feeding it through a pipeline. When you have finished writing, close the input stream with stdin.Close(); if it is not closed, the input stream will not be written ...
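The same pipeline can be expressed more compactly by assigning a reader to the command's Stdin field and letting os/exec do the copying and closing. The following is a minimal sketch of that alternative, not the code used above: `pipeThrough` is a hypothetical helper, and `cat` stands in for pup so that the example runs on a Unix-like system without pup installed.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// pipeThrough runs the named command, feeds input to its standard input,
// and returns whatever the command wrote to standard output.
func pipeThrough(input []byte, name string, args ...string) ([]byte, error) {
	cmd := exec.Command(name, args...)
	// With a non-file reader as Stdin, os/exec copies the data into the
	// child's stdin and closes the pipe once the reader is drained, so no
	// explicit Close call is needed.
	cmd.Stdin = bytes.NewReader(input)
	return cmd.Output()
}

func main() {
	// cat is only a stand-in; for the use case in this article the call
	// would be pipeThrough(html, "pup", "head meta").
	out, err := pipeThrough([]byte("<title>demo</title>\n"), "cat")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Print(string(out))
}
```

The trade-off is control: StdinPipe lets you write the input in several steps and decide when to close it, while the Stdin field is simpler when the whole input is already in memory.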

Please refer to the complete source code in this article.


Original article: Golang HTML parsing via pup. Please credit the source when reposting.
