Web Content Parsing Based on HtmlParser (Theme Crawler)


Web page parsing means having a program automatically analyze the content of a web page and extract information from it, so that the information can be processed further.

Web page parsing is an indispensable and very important part of a web crawler. Since my experience is limited, I will only discuss how to implement page parsing in terms of our team's experience developing a theme crawler based on keyword matching and template matching.

First of all, a word about the tool we use: HtmlParser.

Briefly, the HtmlParser package provides a convenient and concise way to handle HTML files: it parses each tag in an HTML page into a node of a tree structure, each type of node corresponds to a class, and the contents of a tag can be accessed easily by calling that class's methods.
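For instance, here is a minimal sketch of that idea (the HTML fragment and the demo class are mine, not from the original project): an <a> tag is parsed into a LinkTag node, and its contents are reached through plain method calls.

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class NodeDemo {
    public static void main(String[] args) throws ParserException {
        // parse an in-memory HTML fragment; each tag becomes a node in the parse tree
        String html = "<html><body><a href=\"http://example.com/\">Example</a></body></html>";
        Parser parser = Parser.createParser(html, "UTF-8");
        // LinkTag is the node class that corresponds to <a> tags
        NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            // the tag's attributes and text are exposed as ordinary method calls
            System.out.println(link.getLink() + " : " + link.getLinkText());
        }
    }
}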

What I am using is HtmlParser 2.0, the latest version at the time of writing. Highly recommended.

OK, let's get to the point.

For a theme crawler, the task is to download the pages related to the topic to the local machine and store the relevant information about each page in a database.

The page parsing module therefore implements two major functions: 1. extract the child links from a page and add them to the queue of URLs to crawl; 2. parse the page content and compute its relevance to the topic.

Because parsing page content requires frequent access to the page file, fetching the file over the network through its URL each time would cost too much. Our approach is therefore to download all the pages in the crawl queue to the local machine first, parse the page content from the local files, and finally delete the pages that do not match the topic. Child link extraction, by contrast, is relatively simple and works directly on the page file obtained over the network. HtmlParser supports both: given a URL it can access a web page over the network, and given a file path it can access a local web page file.
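As a minimal sketch of the two access modes (the URL and the local path below are placeholders):

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ParserAccessDemo {
    public static void main(String[] args) throws ParserException {
        // over the network: the Parser constructor accepts a URL string
        Parser networkParser = new Parser("http://example.com/index.html");
        networkParser.setEncoding("UTF-8");
        NodeList all = networkParser.parse(null); // a null filter keeps every node
        System.out.println("nodes fetched over the network: " + all.size());

        // locally: the same constructor also accepts a file path (placeholder path)
        Parser fileParser = new Parser("pages/12345.html");
        System.out.println("nodes read from the local file: " + fileParser.parse(null).size());
    }
}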

1. Child link extraction:

The basic steps of extracting links from a page are:

1. Instantiate a parser with the URL of the page whose links are to be extracted

2. Instantiate a filter and set the filtering condition: keep only the <a> tags and <frame> tags

3. Use the parser to extract all nodes of the page that pass the filter, obtaining a NodeList

4. Traverse the NodeList, call the appropriate method on each node to get its link, and add the link to the set of child links

5. Return the set of child links

OK, on to the code:

package Crawler;

import java.util.HashSet;
import java.util.Set;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlLinkParser {
    // extract the child links; url is the page URL, filter is a link filter; returns the set of links found on the page
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("UTF-8");
            // <frame> tag filter, used to extract the link held in the src attribute of a frame tag
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // OrFilter accepts <a> tags or <frame> tags; note that NodeClassFilter can filter a whole class of tags, and LinkTag corresponds to <a> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
            // get all tags that pass the filter; the result is a NodeList
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // getLink() returns the link of the <a> tag
                    if (filter.accept(linkUrl)) // keep links that satisfy the filter condition
                        links.add(linkUrl);
                } else { // <frame> tag: extract the link from the src attribute, e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) { // catch parser exceptions
            e.printStackTrace();
        }
        return links;
    }
}
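Note that LinkFilter in the signature above is not an HtmlParser class; it is a small interface of our own that makes the filtering condition pluggable. The original post does not show it, but a minimal sketch consistent with the usage above would be:

public interface LinkFilter {
    // return true if the URL should be kept
    boolean accept(String url);
}

It could then be called, for instance, with an anonymous filter that keeps the crawl inside one site (the host below is a placeholder):

Set<String> links = HtmlLinkParser.extracLinks("http://example.com/", new LinkFilter() {
    public boolean accept(String url) {
        return url.startsWith("http://example.com");
    }
});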

At this point some readers may be thinking: oh, the author has ignored the problem of relative URL links (-.-)

In fact, I did think of it, and at first I wrote a private method specifically to convert any URL into an absolute URL link. Later I found that my method was useless, because HtmlParser already handles the conversion for you.
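A quick sketch of this behavior (the site and paths below are made up for illustration): if a page at http://example.com/dir/index.html contains <a href="../about.html">, getLink() should already report the resolved absolute URL.

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class AbsoluteLinkDemo {
    public static void main(String[] args) throws ParserException {
        Parser parser = new Parser("http://example.com/dir/index.html"); // placeholder URL
        NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < list.size(); i++) {
            LinkTag link = (LinkTag) list.elementAt(i);
            // relative hrefs come back resolved against the page URL,
            // e.g. ../about.html -> http://example.com/about.html
            System.out.println(link.getLink());
        }
    }
}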

In addition, the parser needs its encoding set; in this program I simply set it to UTF-8. In practice web pages come in all sorts of encodings, and the <meta> tag carries information about the encoding; if the encoding is not right, the text content of the page may come out garbled. However, in the child link extraction part we only handle the contents inside tags, which are written according to HTML syntax, so no encoding problems arise. (Encoding detection matters again when parsing content; see the readHtmlFile sketch after the second listing.)

2. Parsing web page content:

The basic steps:

1. Read the HTML file, determine the page encoding, and get the file contents as a string

2. Instantiate a parser for the HTML file using the page encoding

3. Set up the appropriate filter for the nodes that need to be extracted

4. Parse the HTML file with the parser according to the given filter

5. Extract the text content from the matched nodes and process it (in our case: match keywords and compute the topic relevance)

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.tags.ParagraphTag;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import multi.patt.match.ac.*; // our own AC (Aho-Corasick) automaton package

public class HtmlFileParser {
    String filePath = new String(); // path of the HTML file
    private static String[] keyWords; // keyword list
    /*static{
        keyWords = read("filePath"); // read the keyword list from a given file
    }*/

    public HtmlFileParser(String filePath) {
        this.filePath = filePath;
    }

    public String getTitle() { // get the page title
        FileAndEnc fae = readHtmlFile();
        int i = 0;
        try {
            // instantiate a parser for the local HTML file
            Parser titleParser = Parser.createParser(fae.getFile(), fae.getEnc());
            NodeClassFilter titleFilter = new NodeClassFilter(TitleTag.class);
            NodeList titleList = titleParser.extractAllNodesThatMatch(titleFilter);
            // a page should have only one <title> tag, but extractAllNodesThatMatch can only return a NodeList
            for (i = 0; i < titleList.size(); i++) {
                TitleTag title_tag = (TitleTag) titleList.elementAt(i);
                return title_tag.getTitle();
            }
        } catch (ParserException e) {
            return null;
        }
        return null;
    }

    public String getEncoding() { // get the page encoding
        FileAndEnc fae = readHtmlFile();
        return fae.getEnc();
    }

    public float getRelatGrade() { // compute the topic relevance of the page
        FileAndEnc fae = readHtmlFile();
        String file = fae.getFile();
        String enc = fae.getEnc();
        String curString;
        int curWordWei = 1;  // current keyword weight
        int curTagWei = 0;   // current tag weight
        float totalGra = 0;  // total relevance
        int i;
        AcApply obj = new AcApply(); // instantiate the AC automaton
        Pattern p = null;
        Matcher m = null;
        try { // compute the relevance tag by tag, in order
            // the title tag <title>
            curTagWei = 5;
            Parser titleParser = Parser.createParser(file, enc);
            NodeClassFilter titleFilter = new NodeClassFilter(TitleTag.class);
            NodeList titleList = titleParser.extractAllNodesThatMatch(titleFilter);
            for (i = 0; i < titleList.size(); i++) {
                TitleTag titleTag = (TitleTag) titleList.elementAt(i);
                curString = titleTag.getTitle();
                Set result = obj.findWordsInArray(keyWords, curString); // the AC automaton returns the set of matched words
                totalGra = totalGra + result.size() * curTagWei;        // update the relevance
            }
            // description and keywords in <meta> tags
            curTagWei = 4;
            Parser metaParser = Parser.createParser(file, enc);
            NodeClassFilter metaFilter = new NodeClassFilter(MetaTag.class);
            NodeList metaList = metaParser.extractAllNodesThatMatch(metaFilter);
            p = Pattern.compile("\\b(description|keywords)\\b", Pattern.CASE_INSENSITIVE);
            for (i = 0; i < metaList.size(); i++) {
                MetaTag metaTag = (MetaTag) metaList.elementAt(i);
                curString = metaTag.getMetaTagName();
                if (curString == null) {
                    continue;
                }
                m = p.matcher(curString); // keep only <meta> tags whose name is description or keywords
                if (m.find()) {
                    curString = metaTag.getMetaContent(); // extract their content
                    Set result = obj.findWordsInArray(keyWords, curString);
                    totalGra = totalGra + result.size() * curTagWei;
                } else {
                    curString = metaTag.getMetaContent();
                    Set result = obj.findWordsInArray(keyWords, curString);
                    totalGra = totalGra + result.size() * 2;
                }
            }
            // heading tags <h1>-<h6> come next; the listing in the source breaks off at this point
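FileAndEnc and readHtmlFile() above are our own helpers, not shown in the original post (as is AcApply, the Aho-Corasick keyword matcher). Judging from the usage, readHtmlFile() loads the local file and detects its encoding so that Parser.createParser can be given the right charset. A minimal sketch of what this pair might look like (the implementation here is my own guess; in the class above readHtmlFile() would be an instance method reading this.filePath):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// holder for the file contents plus the encoding they were decoded with
class FileAndEnc {
    private final String file;
    private final String enc;
    FileAndEnc(String file, String enc) { this.file = file; this.enc = enc; }
    public String getFile() { return file; }
    public String getEnc() { return enc; }
}

class HtmlFileReader {
    // read the raw bytes, look for charset=... declared in the page, then decode accordingly
    static FileAndEnc readHtmlFile(String filePath) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        FileInputStream in = new FileInputStream(filePath);
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        in.close();
        // first pass with a byte-preserving charset, only to spot the declared encoding
        String probe = new String(buf.toByteArray(), "ISO-8859-1");
        Matcher m = Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)",
                Pattern.CASE_INSENSITIVE).matcher(probe);
        String enc = m.find() ? m.group(1) : "UTF-8"; // fall back to UTF-8 if none declared
        return new FileAndEnc(new String(buf.toByteArray(), enc), enc);
    }
}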
