Analyzing the Body Text of a Web Page (1)

Source: Internet
Author: User
Tags: idf

Friends who run their own websites will inevitably need to capture some information from other sites.

1. A common method is to use regular expressions to extract the content.

Advantage: it is relatively accurate, and once the pattern is written it captures exactly the content you need without missing anything.

Disadvantage: there are so many different websites that writing a regular expression for each one will drive you crazy! And whenever a target page is redesigned, the regular expression has to be rewritten by hand, so the maintenance cost is too high.

2. So let's discuss whether the webpage body can be analyzed automatically (the title will be described in the next chapter).

There are two ideas. The first is the keyword matching method (this article mainly covers this method; only the idea is given, and for commercial reasons no source code is provided). The second is the text-area acquisition method.

Keyword Matching Method

Every news page is filled with a large number of text elements and many interfering factors. I have read a lot of articles on body-text extraction; they are very complicated and not well suited to grassroots programmers like us, and even after implementing those algorithms the accuracy is not great and the efficiency is not high. So I put together a keyword matching method of my own.

Preparations:

1. Prepare a word segmentation class library. ShootSeg 1.0 is used here; it is not perfect, but it works.

2. Take a look at the concept of TF-IDF. (TF-IDF is a statistical method used to evaluate how important a word is to a document in a collection or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases proportionally with how often it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines to measure or rank the relevance of a document to a user query. Besides TF-IDF, Internet search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.)

3. Write your own code to fetch the HTML source based on a URL; this is too simple to discuss here (a minimal sketch appears after this list).

4. Filter out all hyperlinks, scripts, and images from the HTML source (image filtering is not discussed in this article and will be covered later). The hyperlinks on a news detail page are not needed; we only want the text. See the sketch below.
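
Here is a minimal sketch of steps 3 and 4, using .NET's built-in WebClient and Regex classes. This is my own illustration, not the author's code; the encoding and patterns are assumptions that you would adjust for the target site.

using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class PageCleaner
{
    // Step 3: download the HTML source of a page (assumes the page is UTF-8 encoded).
    public static string GetHtml(string url)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(url);
        }
    }

    // Step 4: strip scripts, styles, hyperlinks (with their anchor text) and images,
    // then turn every remaining tag into a line break so only plain text is left.
    public static string FilterHtml(string html)
    {
        html = Regex.Replace(html, @"<script[^>]*>[\s\S]*?</script>", "", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, @"<style[^>]*>[\s\S]*?</style>", "", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, @"<a[^>]*>[\s\S]*?</a>", "", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, @"<img[^>]*>", "", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, @"<[^>]+>", "\n", RegexOptions.IgnoreCase);
        return WebUtility.HtmlDecode(html);
    }
}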

Next, let's explain the ideas:

1. After filtering, split the text on carriage returns/newlines, which turns it into an array of strings containing both blank lines and lines of text (a splitting sketch follows the example array below). The blank lines are useless in this article, but they will be useful in the next chapter.

string[] strlist = {
    "",
    "\t",
    "The temperature has reached 40 degrees today",
    "Abstract: The temperature in the Jiangnan region is rising rapidly",
    "Comment",
    "Content: the weather station predicts that the temperature in Jiangsu may rise to about 40 degrees today.",
    "Internet users are asked to take heatstroke prevention seriously and to reduce physical strength spent in high temperatures",
    "\t",
    "\t",
    "\t",
    "All Rights Reserved: Test news",
};
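
As promised above, a minimal sketch of the splitting step. It assumes the filtered plain text produced in step 4; the method and variable names are my own placeholders.

using System;

class LineSplitter
{
    // Split the filtered text on line breaks; StringSplitOptions.None keeps the
    // empty lines, which the next chapter will make use of.
    public static string[] SplitLines(string filteredText)
    {
        return filteredText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
    }
}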

For this text, what we actually want is: "The temperature has reached 40 degrees today. The weather station predicts that the temperature in Jiangsu may rise to about 40 degrees today. Internet users are asked to take heatstroke prevention seriously and to reduce physical strength spent in high temperatures." No other content should be kept.

Running the text through ShootSeg (with some light post-processing) yields the following keywords: temperature, today, abstract, Jiangnan region, temperature, comment, content, weather station, prediction, Jiangsu, region, netizens, heatstroke prevention, reduction, high temperature, copyright, news.

Then calculate the TF-IDF value of each of these keywords. (You have to prepare the corpus yourself; the more documents it contains, the more comprehensive it is.)

Take the keywords with the highest TF-IDF values as the "most valuable" ones: temperature, heatstroke prevention, and physical strength. In practice you need to tune a threshold yourself, i.e. the recognition coefficient applied to the TF-IDF value.
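
A minimal sketch of the scoring step follows; it is my own illustration, not the author's commercial code. It assumes the keyword list from the segmenter and a home-made corpus represented as a word-to-document-frequency dictionary; names such as corpusDocFrequency, totalCorpusDocs and the threshold parameter are placeholders.

using System;
using System.Collections.Generic;
using System.Linq;

class KeywordScorer
{
    // Return the keywords whose TF-IDF value exceeds the recognition threshold.
    public static List<string> MostValuableKeywords(
        List<string> keywords,                      // output of the word segmenter for this page
        Dictionary<string, int> corpusDocFrequency, // word -> number of corpus documents containing it
        int totalCorpusDocs,                        // total number of documents in the corpus
        double threshold)                           // the "recognition coefficient" mentioned above
    {
        // Term frequency: how often each word occurs among this page's keywords.
        var termCount = keywords.GroupBy(k => k).ToDictionary(g => g.Key, g => g.Count());

        var result = new List<string>();
        foreach (var pair in termCount)
        {
            double tf = (double)pair.Value / keywords.Count;
            int df;
            corpusDocFrequency.TryGetValue(pair.Key, out df);
            // Classic TF-IDF: tf * log(N / (1 + df)); the +1 avoids division by zero.
            double tfidf = tf * Math.Log((double)totalCorpusDocs / (1 + df));
            if (tfidf > threshold)
                result.Add(pair.Key);
        }
        return result;
    }
}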

The next step is simple: loop through the whole line array, keep the lines that contain any of the "most valuable" keywords, and splice them together to get the final webpage body.
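
A minimal sketch of that loop (again my own illustration):

using System.Collections.Generic;
using System.Linq;
using System.Text;

class BodyExtractor
{
    // Keep only the lines that contain at least one "most valuable" keyword
    // and splice them together into the final body text.
    public static string ExtractBody(string[] strlist, List<string> valuableKeywords)
    {
        var body = new StringBuilder();
        foreach (string line in strlist)
        {
            if (valuableKeywords.Any(k => line.Contains(k)))
                body.AppendLine(line.Trim());
        }
        return body.ToString();
    }
}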

As for the title, here we simply use a regular expression to grab the content between <title> and </title>, which effectively filters out the interference.
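
A minimal sketch of the title extraction, applied to the raw HTML from step 3 (my own illustration of the regex approach described above):

using System.Text.RegularExpressions;

class TitleExtractor
{
    // Capture the text between <title> and </title>; Singleline lets "." cross line breaks.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }
}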

In my testing, the program captures most news websites correctly, because the body of a news page is usually directly related to the page's keywords. In some cases, however, content gets left out.

For example, part of an announcement page reads as follows:

4. Surrounding Environment
1. Business: Changzhou Shopping Center, the Cultural Palace, and Tianning Shangdu
2. Entertainment: Golden Hotel, Oriental Pearl KTV, and Wenbi Villa
3. Others: Hongmei Park, Changzhou Railway Station, and Changzhou Coach Passenger Station

5. Notes
1. Bidding qualification: a natural person or legal person with full civil capacity
2. The lot must be paid for in one lump sum. The cooperating loan bank is the business department of the China Construction Bank Changzhou branch.

6. Contact Information
Mr. Wang 1399xxx

The full body text is not pasted here; it is an introduction to the property in the announcement, and it can be extracted correctly with the method above. However, the sections quoted above do not look much like a news body: the TF-IDF values of the keywords in those paragraphs are very low, so they get dropped even if we want to keep them. The next chapter describes how to obtain such text using the text-area method.
To be continued...

Please follow my personal website: www.shenyisyn.org
