Text Extraction algorithms published on the Internet

Source: Internet
Author: User

Text extraction algorithms published on the Internet. You can test them and judge which one is better:

- Word Network (Beijing Word Network Technology Co., Ltd.): http://demo.cikuu.com/cgi-bin/cgi-contex
- Lietu ("hunting rabbit") web page text extraction: http://www.lietu.com/extract/
- PHP version of web page text extraction: http://www.woniu.us/get_content_demo/
- Web page text extraction analysis (demo): http://61.128.196.27/txt

Personally I think http://61.128.196.27/txt extracts best: it handles almost any page and effectively preserves the original style, images, and links.

http://code.google.com/p/joyhtml/
Look at this:
http://www.likeshow.net/article.asp?id=92
Although what I wrote a year ago is incomplete, it can still extract the body text from news sites, blogs, and forums. I also wrote a clear explanation of the principles behind identifying comments and replies on blogs and BBSes.
For example, suppose you want to extract the body content from HTML source code, but the content between <p> and </p> is not written in a regular format. Is there any extraction method other than regular expressions? Thank you!
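One regex-free answer is to use a real HTML parser instead of pattern matching. A minimal sketch using Python's standard-library html.parser (the class name and sample markup are my own illustration):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p>...</p> pairs without regular expressions."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":          # the parser lowercases tag names for us
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

extractor = ParagraphExtractor()
extractor.feed("<html><body><P>First paragraph.</P><div>menu</div><p>Second.</p></body></html>")
print(extractor.paragraphs)  # ['First paragraph.', 'Second.']
```

Unlike a regular expression, the parser copes with attributes, mixed tag case, and irregular surrounding markup.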
Online demo and latest download:
http://www.shoula.net/ParseContent

http://www.pudn.com/downloads152/sourcecode/internet/search_engine/detail668443.html
Open-source web page text extractor on Google Code: cx-extractor (2010-05-19). General web page text extraction based on a line-block distribution function: linear time, no DOM tree, independent of HTML tags.
Description:
For Web information retrieval, web page body extraction is the key to all subsequent processing. Although regular expressions can accurately extract pages with a fixed template, rule-based methods are impractical in the face of the wide variety of HTML on the Web. Whether page body text can be extracted efficiently and accurately at the scale of the whole Web is a challenge that directly affects upper-layer applications.
The author proposes a general web page body extraction algorithm based on a line-block distribution function. For the first time, the body extraction problem is transformed into computing a line-block distribution function over the page. The method needs no DOM tree and is not troubled by malformed HTML (in fact, it is completely independent of HTML tags). The line-block distribution function, built in linear time, directly locates the page body, and a combination of statistics and rules handles the general cases. I believe simple problems should be solved in the simplest way: the entire algorithm is implemented in fewer than a hundred lines of code.
Project web site: http://code.google.com/p/cx-extractor/
Algorithm description: general web page body extraction based on the line-block distribution function.
Comments and feedback are welcome!
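As I understand the description, the line-block idea can be sketched as follows. The block size, the threshold, and the run-finding logic are my own illustrative choices, not the paper's tuned values:

```python
import re

def extract_body(html, block_size=3, threshold=86):
    """Sketch of line-block body extraction: no DOM tree, linear time."""
    # Strip scripts, styles, comments, then every remaining tag.
    text = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", "", html)
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    text = re.sub(r"(?s)<[^>]+>", "", text)
    lines = [ln.strip() for ln in text.splitlines()]

    # Line-block distribution: text length of block_size consecutive lines.
    blocks = [sum(len(lines[j]) for j in range(i, min(i + block_size, len(lines))))
              for i in range(len(lines))]

    # The body is the heaviest run of blocks above the threshold.
    best, cur, start = (0, 0, 0), 0, 0
    for i, b in enumerate(blocks):
        if b > threshold:
            if cur == 0:
                start = i
            cur += b
        elif cur:
            if cur > best[0]:
                best = (cur, start, i)
            cur = 0
    if cur > best[0]:
        best = (cur, start, len(blocks))
    _, s, e = best
    return "\n".join(ln for ln in lines[s:e] if ln)

demo = """<html><body>
<a href="#">home</a> <a href="#">news</a>
<div>
<!-- ad slot -->
This is the first long sentence of the article body, padded out to a comfortable length for the demo.
This is the second long sentence of the article body, also padded out to a comfortable length here.
This is the third long sentence of the article body, once again padded out to a comfortable length.
</div>
<a href="#">footer link</a>
</body></html>"""
body = extract_body(demo)
```

Real pages also need HTML entity decoding and smarter run detection; see the project site for the actual implementation.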

http://www.ngiv.cn/post/204.html
The significance of the VIPS algorithm for search engines
http://blog.csdn.net/tingya/archive/2006/02/18/601954.aspx
 
Implementation of the vision-based page segmentation algorithm VIPS: source code download
http://blog.csdn.net/tingya/archive/2006/04/28/694651.aspx
Author information: tingya, blogger of the Technology House blog

http://www.madcn.net/?p=791

There is an open-source project that is quite good; search for joyhtml on Google Code.
http://gfnpad.blogspot.com/2009/11/blog-post.html
The following are some open-source programs:
1. A Python program based on text density:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
PS: it has a bug that needs a small fix, and it does not handle HTML comments.
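The text-density idea behind that program can be sketched roughly like this. This is a simplified illustration, not the linked code; the 0.5 cutoff is my own assumed value:

```python
import re

def dense_text_chunks(html, min_density=0.5):
    """Keep chunks whose plain text makes up a large share of the raw HTML.

    The page is split on block-level tags and each chunk is scored by
    len(text) / len(chunk); min_density is an assumed threshold.
    """
    chunks = re.split(r"(?i)</?(?:div|td|table|tr|p)[^>]*>", html)
    kept = []
    for chunk in chunks:
        text = re.sub(r"(?s)<[^>]+>", "", chunk).strip()
        if not text:
            continue
        if len(text) / len(chunk) >= min_density:
            kept.append(text)
    return kept

page = ('<div><a href="http://portal.example.com/channel/list">more news</a></div>'
        '<div>This long sentence has no markup inside it at all, so its density is high.</div>')
print(dense_text_chunks(page))
```

A navigation chunk is mostly markup (long URLs, anchor tags), so its density falls below the cutoff; the body chunk is mostly text and survives.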
2. Java open-source project: Gate
http://gate.ac.uk/

In fact, you can program against the DHTML object model for analysis to obtain the data you want. For details, see my program:
http://www.vbgood.com/thread-94788-1-1.html
http://download.csdn.net/source/568439

I. Title block
- Segmentation nodes: td, div, h*, span
- Generally located near the page's head/title
- The node's style or class attribute generally contains strings such as "title" or "head"
- Text length: generally more than 3 and fewer than 35 characters

II. Posting-time block
- Segmentation nodes: td, div, span
- Text length: generally fewer than 50 characters
- Contains a date-formatted string
- Contains keywords such as "source" and "table"

III. Body blocks
- Segmentation nodes: td, div
- HTML pages have some special tags that usually appear only in the body block, such as <p> and <br>, so body blocks often contain these special tags
- The body block contains many sentences, and therefore many punctuation marks such as commas and periods (> 5)
- In terms of information content, the body block generally carries more text
- Tag density of the body block = 1000 * (number of tags) / (length of text), which should fall within a certain range
- Text density of the body block = len(text) / len(HTML code), which is relatively large
- It should not contain "previous page" or "next page"
- Blocks containing strings such as "ICP filing No. 04000001" or "Copyright" are judged to contain copyright information and are removed
- The body block lies below the title block
- The body block lies below the posting-time block
- The body block lies above the related-links block

IV. Related-links block
- Segmentation nodes: td, div
- The text consists of cue words such as "related links", "related news", or "related reports", and the proportion of links is very high
- Fewer than 20 links

Implementation:
Based on the information-block features above, a feature-extraction algorithm was implemented in C# (.NET 3.5) as a component named the QD body extractor. In tests, the correct extraction rate is above 85% for text-oriented HTML content pages and above 95% for news pages of major portals. Download the example (requires Microsoft .NET Framework 3.5).

Note: the QD text extraction component is not open source. If you need the source code, you can pay for it.
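For illustration, the features in the list above can be combined into a rough classifier. This is my own sketch, not the QD component: the check order and the 0.5 text-density cutoff are assumptions, and the tag-density test is omitted because the list gives no concrete bounds:

```python
import re

DATE_RE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")  # e.g. 2010-05-19

def classify_block(text, html):
    """Classify one segmented block (td/div/span) by the listed heuristics.

    text is the block's plain text, html its raw markup. The trailing '?'
    marks every label as a guess: these are heuristics, not proofs.
    """
    text_density = len(text) / max(len(html), 1)
    punctuation = sum(text.count(c) for c in ",.;?!")
    links = html.lower().count("<a")

    if len(text) < 50 and DATE_RE.search(text):
        return "post-time?"
    if 3 < len(text) < 35:
        return "title?"
    if punctuation > 5 and text_density > 0.5:
        return "body?"
    if links >= 5 and "related" in text.lower():
        return "related-links?"
    return "other"
```

The positional rules (title above body, related links below) would be applied afterwards, across blocks, once each block has a tentative label.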

The selected body text is generally right, but some advertisement blocks may be left at the beginning and end. These ad blocks differ noticeably from the body when you look at their parent nodes: the parent either contains the region where the body sits (the ad is at the same level as the body) or is itself a child node inside the body region; it is hard to tell from the ad node alone. So I run one more scan over the suspected body nodes: remove blocks whose parent node contains too much text (those parents hold both the ad and the body, meaning the ad sits at the same level as the body), and also remove blocks whose parent node contains too little text.
After this processing, what remains is basically the body we need. Next comes extracting the title.
Scan the document representing the whole page once, find nodes with font, strong, h1, and title tags, and extract their text. Then split each candidate into words and count how many of those words also appear in the body; the candidate with the highest proportion of matched words is the title. Note, however, that sometimes the candidate node is itself a child of the body node, so its words are fully contained no matter how you split them; such suspected titles that are part of the body must be excluded. This works for most pages, but I have no good idea yet for pages whose only title sits inside the body node.
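The word-overlap step for title selection might look like this. It is a simplified sketch of the procedure just described; stop-word handling and the containment check for titles inside the body are omitted:

```python
def pick_title(candidates, body_text):
    """Pick the candidate whose words appear most often in the body.

    candidates: texts taken from <font>, <strong>, <h1>, <title> nodes.
    Each candidate is scored by the fraction of its words found in the body.
    """
    body_words = set(body_text.lower().split())
    best, best_score = None, -1.0
    for cand in candidates:
        words = cand.lower().split()
        if not words:
            continue
        score = sum(w in body_words for w in words) / len(words)
        if score > best_score:
            best, best_score = cand, score
    return best

headline = pick_title(
    ["Site Navigation", "Fox Jumps Over Lazy Dog", "Contact Us"],
    "the quick brown fox jumps over the lazy dog near the river",
)
print(headline)  # Fox Jumps Over Lazy Dog
```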
Over the past few days I have also read some other people's papers. There are many good ideas in them; many people use Markov models or artificial neural networks for training. Maybe I will try that later. For now this works well enough.
I have also implemented this algorithm, but in C++.
I don't quite understand what page segmentation means here. I analyzed the DOM tree and applied the rules mentioned in the article to the DOM nodes, then did the follow-up processing.
My approach is mainly to separate the page content according to the page layout, merge the body parts, and then use Bayesian decision theory to compute the support rate of the body features
in order to extract the page content.
VIPS is now basically written.
But I have also found some problems.
For example, when the coordinates of some nodes are extracted, no separator is found, because a few coordinates overlap; this affects how coordinates are determined.
Then there is the problem of node segmentation rules. Most pages today are laid out with div, while VIPS seems better suited to table-based pages; I have tried it on table-based pages and it works quite well.
Also, tingya's translation above seems to have altered some rules, and some parts are not translated very accurately; for example, the definition of virtual text differs somewhat from the original paper. I wonder whether tingya has noticed this.
Finally, thanks to tingya for introducing this algorithm.
If you are interested in this algorithm, I hope we can discuss it together.
