A typical webpage carries a lot of surrounding "chrome": navigation bars, sidebars, footers, and other boilerplate, while readers mostly care about the main content. This boilerplate has little value on its own; on handheld devices in particular, rendering large amounts of it is actively annoying, and for a search engine that only needs the subject content of a page, indexing it adds little.
That covers why we want to extract the main content of a page; now for the how. Extracting it accurately is hard because webpage structures vary so widely that fully automatic extraction by a program remains a difficult problem.
While reading up on this topic online, I came across an article with code that takes the approach described below.
Navigation areas consist mostly of links, so the proportion of link text in those regions of the page is comparatively high. If we remove the blocks whose link-text ratio is well above the page average, we can strip away most of this boilerplate.
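Before the full implementation, the core "link density" idea can be sketched on its own. The snippet below is my own simplified illustration (in Java, using regex-based tag stripping rather than a real HTML parser, which would not survive malformed markup in practice): it computes the fraction of visible text that sits inside `<a>` tags, which is high for navigation blocks and low for body prose.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkDensity {
    // Fraction of visible text that sits inside <a>...</a> tags.
    // Illustrative only: regexes are a crude stand-in for a real parser.
    static double linkDensity(String html) {
        Pattern anchor = Pattern.compile("<a[^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = anchor.matcher(html);
        int linkLen = 0;
        while (m.find()) {
            // Count only the text of the link, with any inner tags stripped.
            linkLen += m.group(1).replaceAll("<[^>]+>", "").length();
        }
        // Total visible text: everything left after stripping all tags.
        String text = html.replaceAll("<[^>]+>", "");
        return text.isEmpty() ? 0.0 : (double) linkLen / text.length();
    }

    public static void main(String[] args) {
        String nav = "<ul><li><a href=\"/\">Home</a></li>"
                   + "<li><a href=\"/news\">News</a></li></ul>";
        String body = "<p>This paragraph is mostly plain prose "
                    + "with one <a href=\"#\">link</a> in it.</p>";
        System.out.printf("nav  density: %.2f%n", linkDensity(nav));  // prints 1.00
        System.out.printf("body density: %.2f%n", linkDensity(body)); // prints 0.07
    }
}
```

A navigation list scores close to 1.0, while body text scores near 0.0; the extractor below applies the same measurement per container node and deletes the ones that score far above the page average.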
The following code is implemented in C#:
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using Sgml;

namespace Soso.Youwang
{
    class HtmlMainContentExtractor
    {
        // Container tags that may hold other content blocks.
        private HashSet<string> containerTags = new HashSet<string>()
        {
            "div", "table", "td", "th", "tbody", "thead", "tfoot", "col",
            "colgroup", "ul", "ol", "html", "center", "span", "form"
        };

        // Tags that carry no visible main content and are always removed.
        private HashSet<string> removeTags = new HashSet<string>()
        {
            "script", "noscript", "style", "meta", "input", "iframe", "embed",
            "hr", "img", "#comment", "link", "label"
        };

        // Tags exempt from the minimum-length check.
        private HashSet<string> ignoreLenTags = new HashSet<string>() { "span" };

        /// <summary>Total length of link text.</summary>
        private int totalLinkLen = 0;

        /// <summary>Total text length.</summary>
        private int totalLen = 0;

        /// <summary>A container whose link density exceeds the page average by this factor is deleted.</summary>
        private double rate = 1.1;

        /// <summary>Average link density of the page.</summary>
        private double avgLinkRate;

        /// <summary>Containers with fewer characters than this are deleted.</summary>
        private int minLen = 20;

        public string Extract(string html)
        {
            html = html.ToLower();
            XmlDocument doc = ConvertHtml2Xhtml(html);
            // First pass: drop non-content tags and tally text/link lengths.
            HeuristicRemove(doc.DocumentElement);
            avgLinkRate = 1.0 * totalLinkLen / totalLen;
            // Second pass: drop containers that are too short or too link-dense.
            int total, link;
            ContainerRemove(doc.DocumentElement, out total, out link);
            return doc.OuterXml;
        }

        private bool ContainerRemove(XmlNode node, out int total, out int link)
        {
            total = 0;
            link = 0;
            List<XmlNode> toRemove = new List<XmlNode>();
            foreach (XmlNode el in node.ChildNodes)
            {
                int t;
                int l;
                if (ContainerRemove(el, out t, out l))
                {
                    toRemove.Add(el);
                }
                else
                {
                    total += t;
                    link += l;
                }
            }
            foreach (XmlNode el in toRemove)
            {
                node.RemoveChild(el);
            }
            if (containerTags.Contains(node.Name))
            {
                // Delete containers that are too short, or whose link density
                // is at least rate times the page average.
                if ((!ignoreLenTags.Contains(node.Name) && total <= minLen)
                    || 1.0 * link / total >= rate * avgLinkRate)
                {
                    return true;
                }
            }
            else if (node.NodeType == XmlNodeType.Text)
            {
                total += node.Value.Length;
            }
            else if (node.Name == "a")
            {
                link += node.InnerText.Length;
            }
            return false;
        }

        private bool HeuristicRemove(XmlNode node)
        {
            if (removeTags.Contains(node.Name))
            {
                return true;
            }
            List<XmlNode> toRemove = new List<XmlNode>();
            foreach (XmlNode el in node.ChildNodes)
            {
                if (HeuristicRemove(el))
                {
                    toRemove.Add(el);
                }
            }
            foreach (XmlNode el in toRemove)
            {
                node.RemoveChild(el);
            }
            if (node.Name == "a")
            {
                totalLinkLen += node.InnerText.Length;
            }
            else if (node.NodeType == XmlNodeType.Text)
            {
                totalLen += node.Value.Length;
            }
            return false;
        }

        private XmlDocument ConvertHtml2Xhtml(string html)
        {
            using (SgmlReader reader = new SgmlReader())
            {
                reader.DocType = "HTML";
                reader.InputStream = new StringReader(html);
                reader.WhitespaceHandling = WhitespaceHandling.None;
                XmlDocument doc = new XmlDocument();
                doc.Load(reader);
                return doc;
            }
        }
    }
}
The code uses SgmlReader.dll, a tool that normalizes HTML and converts it into XML.