"C language" parsing HTML using Gumbo

Source: Internet
Author: User
Tags benchmark oauth strcmp git clone

I used to parse HTML with PHP's Simple HTML DOM, but PHP is not a language I know well (not that I am dogmatic about languages; believe it or not, I believe it anyway). You could get by with regular expressions, but I wanted a proper HTML parsing library. Searching online for C libraries that parse HTML did not turn up many options; the one I settled on is Google's Gumbo. Gumbo is open source and can be obtained here:

    https://github.com/google/gumbo-parser

We need to download it and then compile and install it by hand. Taking Debian Linux as an example:

    git clone https://github.com/google/gumbo-parser
    cd gumbo-parser
    ./autogen.sh
    ./configure

These steps generally go smoothly, so there is nothing to say about them. Next comes make. For me, make failed with a build error (I don't know whether you will hit it too): benchmarks/benchmark.cc uses an undefined function, clock_gettime. According to man, the function needs the time.h header; I opened benchmark.cc and it does already include time.h. I was stumped for a moment, but then I noticed that the man page also says "Link with -lrt (only for glibc versions before 2.17)". So my guess was that a library was missing at link time. I opened the Makefile in vim. The file is huge and took some effort to read, but searching for the keyword "benchmark" quickly led me to the benchmark_LDADD variable. I appended -lrt to it (note the space in front of it), ran make again, and sure enough, the build went through.
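To make the link error above concrete: clock_gettime is declared in time.h, but on glibc older than 2.17 its definition lives in librt, so code like the following sketch compiles and then fails to link unless -lrt is added. The helper names monotonic_now and elapsed_seconds are made up for illustration; they are not taken from benchmark.cc.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime under strict -std= modes */
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds elapsed between two timespec readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: read the monotonic clock.
   This call is the one that needs -lrt on glibc before 2.17. */
struct timespec monotonic_now(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc; /* keep the variable used when assertions are disabled */
    return ts;
}
```

Calling monotonic_now() before and after a piece of work and passing both readings to elapsed_seconds() gives the elapsed wall time. On an old glibc the build line would be something like gcc timing.c -lrt (the file name is an assumption); on newer versions the extra flag is unnecessary.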
Once the build is complete, you can install with make install. You may need root privileges, because the default installation directory is under /usr/local/.

The Gumbo source provides a few sample programs: one in C that gets the page title, and three others written in C++. I read them all (see, I told you I am not dogmatic about languages; unfortunately, I could follow every one of them). Simply put, using Gumbo is simple: call gumbo_parse or gumbo_parse_with_options to get a GumboOutput data structure, then look for whatever you want in that structure.

Let's start with a simple example: getting the title. I decided to use my own parsing code instead of the example from the Gumbo source, because I found that program could not parse the sample HTML file I was using, so I wrote it myself.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <stdlib.h>
    #include <string.h>
    /* Gumbo header */
    #include <gumbo.h>

    void get_title(GumboNode *node)
    {
        GumboVector *children;
        unsigned int i;

        /* If the current node is not an element, return directly */
        if (node->type != GUMBO_NODE_ELEMENT)
            return;

        /* Get all child nodes of this node */
        children = &node->v.element.children;

        /* If this node is the <title> tag, print the text of its first
           child. The length and type checks guard against an empty title. */
        if (node->v.element.tag == GUMBO_TAG_TITLE && children->length > 0) {
            GumboNode *first = (GumboNode *)children->data[0];
            if (first->type == GUMBO_NODE_TEXT)
                printf("%s\n", first->v.text.text);
        }

        /* Recurse into all child nodes of this node */
        for (i = 0; i < children->length; ++i)
            get_title((GumboNode *)children->data[i]);
    }

    int main(int argc, char **argv)
    {
        struct stat buf;
        GumboOutput *output;
        FILE *fp;
        char *data;
        size_t n;

        if (argc < 2)
            return -1;

        /* Read the HTML text file */
        if (!(fp = fopen(argv[1], "rb")))
            return -1;
        if (stat(argv[1], &buf) != 0) {
            fclose(fp);
            return -1;
        }
        data = malloc((size_t)buf.st_size + 1);
        if (!data) {
            fclose(fp);
            return -1;
        }
        n = fread(data, sizeof(char), (size_t)buf.st_size, fp);
        fclose(fp);
        data[n] = 0;

        /* Parse the HTML text */
        output = gumbo_parse(data);
        /* Get the title */
        get_title(output->root);
        /* Destroy the output and free memory */
        gumbo_destroy_output(&kGumboDefaultOptions, output);
        free(data);
        return 0;
    }

The comments spell things out, but the rhythm is this: first, load the HTML text file and read it into a buffer; second, parse it into a GumboOutput data structure; third, search that structure for the title tag; finally, print the content. Working with Gumbo basically always looks like this. When compiling with gcc, add -lgumbo.

Another example. The HTML file in this example lists the IP addresses and physical addresses of each country's DNS servers. The approximate format is:

    <dt>
      <dd class="ipstart">Start IP address</dd>
      <dd class="ipend">End IP address</dd>
      <dd class="address">Physical address</dd>
    </dt>

Our parsing steps are: find all dt tags, then find all dd tags, then print the content of the dd tags whose class attribute is ipstart, ipend, or address. The code follows. The original HTML file is too large to reproduce here; the program fetches the HTML text from the web instead, so the URL of the page is given in the code. At the time of writing the program works; whether it will still work if the page is later changed is unknown.
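Before the full program, the class-to-label dispatch it relies on can be sketched on its own, without Gumbo. This is only an illustration: the table LABELS and the function label_for_class are hypothetical names, and the lowercase class names assume the page markup matches the format shown above. Note that strcmp is case-sensitive, so "Ipstart" would not match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: class attribute value -> label to print. */
struct class_label {
    const char *class_name;
    const char *label;
};

static const struct class_label LABELS[] = {
    { "ipstart", "Start IP" },
    { "ipend",   "End IP" },
    { "address", "Physical address" },
};

/* Returns the label for a class value, or NULL if the class is not
   one of the three we care about. class_name must not be NULL. */
const char *label_for_class(const char *class_name)
{
    size_t i;

    assert(class_name != NULL);
    for (i = 0; i < sizeof LABELS / sizeof LABELS[0]; ++i) {
        if (strcmp(class_name, LABELS[i].class_name) == 0)
            return LABELS[i].label;
    }
    return NULL;
}
```

A table keeps the three comparisons in one place; an if/else chain of strcmp calls does the same job and is just as valid for three entries.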
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <oauth.h>
    /* Gumbo header */
    #include <gumbo.h>

    #define URL "http://ip.yqie.com/dns_usa.htm"

    void print_dns(GumboNode *node, GumboAttribute *attr)
    {
        GumboVector *children = &node->v.element.children;
        GumboNode *ip;

        /* Nothing to print for an element without children */
        if (children->length == 0)
            return;

        /* Get the first child node; we only print text nodes */
        ip = (GumboNode *)children->data[0];
        if (ip->type != GUMBO_NODE_TEXT)
            return;

        /* Print the result according to the value of the class attribute */
        if (strcmp(attr->value, "ipstart") == 0)
            printf("Start IP:%s ", ip->v.text.text);
        else if (strcmp(attr->value, "ipend") == 0)
            printf("End IP:%s ", ip->v.text.text);
        else if (strcmp(attr->value, "address") == 0)
            printf("Physical address:%s\n", ip->v.text.text);
    }

    void get_dns(GumboNode *node, GumboTag tag)
    {
        GumboVector *children;
        GumboAttribute *attr;
        unsigned int i;

        if (node->type != GUMBO_NODE_ELEMENT)
            return;

        /* Get the class attribute of the current node, if it has one */
        if ((attr = gumbo_get_attribute(&node->v.element.attributes, "class")))
            print_dns(node, attr);

        /* Child nodes of the current node */
        children = &node->v.element.children;

        /* If the current node is a <dt>, look for <dd> tags below it;
           otherwise keep looking for <dt> tags. Each child is visited
           exactly once. The tag argument only records what we are
           looking for; it is not used for filtering. */
        if (node->v.element.tag == GUMBO_TAG_DT) {
            for (i = 0; i < children->length; ++i)
                get_dns((GumboNode *)children->data[i], GUMBO_TAG_DD);
        } else {
            for (i = 0; i < children->length; ++i)
                get_dns((GumboNode *)children->data[i], GUMBO_TAG_DT);
        }
    }

    int main(void)
    {
        GumboOutput *output;
        char *buf;

        /* Download the HTML text */
        buf = oauth_http_get(URL, NULL);
        if (!buf)
            return -1;

        /* Parse it */
        output = gumbo_parse(buf);
        if (!output) {
            free(buf);
            return -1;
        }

        /* Get the content we want, starting from <dt> */
        get_dns(output->root, GUMBO_TAG_DT);

        /* Free resources */
        gumbo_destroy_output(&kGumboDefaultOptions, output);
        free(buf);
        return 0;
    }

Because liboauth is used for the download, the program must be compiled with gcc using the -loauth option in addition to -lgumbo.
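One loose end from the first example: it sizes its buffer with stat and relies on fread to fill it. A more defensive way to load the file reads in chunks and checks every call. The sketch below uses only the C standard library; read_text_file and CHUNK are illustrative names and are not part of Gumbo.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4096 /* illustrative read size */

/* Hypothetical helper: read a whole file into a freshly allocated,
   NUL-terminated buffer. Returns NULL on failure. The caller frees
   the result. If out_len is not NULL it receives the byte count. */
char *read_text_file(const char *path, size_t *out_len)
{
    FILE *fp;
    char *data = NULL;
    size_t len = 0;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (!fp)
        return NULL;

    for (;;) {
        size_t n;
        /* Keep room for one more chunk plus the terminating NUL. */
        char *grown = realloc(data, len + CHUNK + 1);

        if (!grown) {
            free(data);
            fclose(fp);
            return NULL;
        }
        data = grown;

        n = fread(data + len, 1, CHUNK, fp);
        len += n;
        if (n < CHUNK)
            break; /* short read: end of file or a read error */
    }

    if (ferror(fp)) {
        free(data);
        fclose(fp);
        return NULL;
    }
    fclose(fp);

    data[len] = '\0';
    if (out_len)
        *out_len = len;
    return data;
}
```

The returned buffer can be passed to gumbo_parse as before. Because the length is also known, gumbo_parse_with_options(&kGumboDefaultOptions, data, len) is an alternative that does not depend on the terminating NUL.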