Key implementation points for semantic analysis applications

Zheng @ playfun RT 20090703

Public opinion monitoring and word-of-mouth monitoring are both viable application paths. People often ask how natural language processing technology can be applied to them. The following is a brief introduction.

I. Dictionaries and algorithms

The main problem in the early stage is building the dictionaries for word segmentation and classification. These dictionaries differ from application to application, and there may be many of them: personal names, place names, organization names, frequently used abbreviations, and so on. There are also stop-word lists, that is, lists of filler words such as "ah" and "oh".

Example:

Should both "Fish Head King" and "Fish Head" appear in the dictionary of a local-life search engine? What about "Grand Hotel" and "Hotel"? The trade-off depends on your word segmentation algorithm, and even on your application. If you use forward maximum matching in a search application, you should clearly remove "Grand Hotel" and "Fish Head King": otherwise the segmenter emits only the longer word, and a query for "Fish Head" or "Hotel" will not match it.
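The trade-off above can be illustrated with a minimal sketch of forward maximum matching. The function name `forward_max_match` and the Chinese sample strings are assumptions made for illustration (the strings are a guessed back-translation of "Fish Head King" and "Grand Hotel", not taken from the original text).

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy segmentation: at each position take the longest dictionary
    word that starts there; fall back to a single character."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical sample: "Fish Head King Grand Hotel" written without spaces.
text = "鱼头王大酒店"

# With the long entries present, the short words never surface as tokens.
print(forward_max_match(text, {"鱼头王", "鱼头", "大酒店", "酒店"}))
# ['鱼头王', '大酒店']

# Without them, "鱼头" and "酒店" become searchable tokens.
print(forward_max_match(text, {"鱼头", "酒店"}))
# ['鱼头', '王', '大', '酒店']
```

For a search index the second segmentation is usually the more useful one, which is why the dictionary contents matter more than the choice of algorithm.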

In the later stage, the algorithms and efficiency of automatic deduplication, tag extraction, and automatic clustering come into play. Extracting acronyms and tags is also a dictionary problem.

Which word segmentation algorithm you use matters little; many open-source, reliable ones are available. What matters is a set of specialized dictionaries. They must keep up with the times (automatically, if possible) rather than being old dictionaries that have not been updated for years.

In other words, the main problems of the whole application are dictionaries and algorithms.

II. Clarify requirements

Without clear requirements for the semantic features, effort is easily wasted. Researching an algorithm means running costly comparison experiments. Building dictionaries means collecting and carefully curating them yourself, which takes a lot of labor. These are all costs.

Therefore, the requirements must be clarified first. Without clear requirements, a lot of work will be done in vain.

III. Advanced mining

Deep text mining:
1: Descriptive feature extraction, such as evaluating and scoring a vehicle's handling, fuel consumption, and comfort;
2: Sentiment analysis, that is, positive/negative judgment, which basically relies on dictionaries and pattern matching;
3: Automatic hotspot discovery, which is a variant of clustering;
4: Statistics based on acronyms and tags;
5: Analysis of propagation channels;
6: Automatic extraction of viewpoints in a specific field, again basically by dictionary and pattern matching;
7: Automatic summary generation. Note that it is not a "summary".
Everything else is simple manipulation of keywords.
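Item 2 above (sentiment judgment by dictionary and pattern matching) can be sketched as follows. This is only an illustration under stated assumptions: the word lists `POSITIVE`, `NEGATIVE`, and `NEGATORS` are hypothetical entries for a car-review setting, and the single pattern handled is a negator directly before a sentiment word.

```python
import re

# Hypothetical dictionary entries; a real system would curate these per field.
POSITIVE = {"comfortable", "smooth", "quiet"}
NEGATIVE = {"noisy", "thirsty", "stiff"}
NEGATORS = {"not", "never", "hardly"}


def sentiment_score(text):
    """Sum +1 / -1 per sentiment word, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # 1, -1, or 0
        # Pattern: "<negator> <sentiment word>" reverses the polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


print(sentiment_score("The ride is smooth and quiet"))                    # 2
print(sentiment_score("The cabin is not quiet and the engine is noisy"))  # -2
```

A positive score suggests a favorable text and a negative score an unfavorable one; accuracy depends almost entirely on how well the dictionaries fit the monitored field.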

Social networking

A ReadWriteWeb article also mentions several social networking applications. Let's take a look:

    • Semantic Link sharing
    • Network Mining
    • News sharing
    • Tweet Mining

As for the semantic/contextual advertising mentioned later in that article, it is a game for the giants. Do not enter it lightly.

Vertical fields

If the customer is tracking a vertical field rather than monitoring generic content, there is much more room to work with and to keep quality under control.
In word-of-mouth or public opinion monitoring, the most troublesome thing is not knowing what content will be monitored or whether it has clear language features. That makes it hard to accumulate dictionaries and to cross-validate and tune algorithms, so shortcuts are hard to find.

IV. What can be done well

In vertical fields such as automobiles, tourism, restaurants, hotels, and stocks, word-of-mouth monitoring technology can reliably deliver:
1: Accurate word segmentation and classification;
2: Accurate extraction of tags and acronyms;
3: Descriptive feature extraction;
4: Automatic hotspot discovery.

These can be done without restricting to a vertical field:
1: Automatic deduplication;
2: Establishing associations between acronyms and tags.

V. Dictionaries

Tasks that require curated proprietary dictionaries:
1: Word segmentation and classification (depends on whether the training corpus is accurate);
2: Acronym and tag extraction;
3: Sentiment analysis;
4: Descriptive feature extraction.

Tasks that can be done without a dictionary:
1: Automatic deduplication;
2: Automatic hotspot discovery (a dictionary is still needed in the end, but the requirements are less strict);
3: Monitoring of fast-spreading events (in effect the mirror image of automatic deduplication: the duplicates are counted instead of discarded).
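Dictionary-free deduplication can be sketched with character n-gram "shingles" compared by Jaccard similarity. This is one common technique chosen for illustration, not necessarily what the author used; the function names, the shingle size `n=3`, and the `threshold=0.6` are all assumptions. The per-group count doubles as a crude signal for fast-spreading content.

```python
def shingles(text, n=3):
    """Set of character n-grams of text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) <= n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.6, n=3):
    """Keep the first text of each near-duplicate group and count its copies."""
    groups = []  # each entry: [representative text, shingle set, copy count]
    for text in texts:
        sig = shingles(text, n)
        for group in groups:
            if jaccard(sig, group[1]) >= threshold:
                group[2] += 1
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([text, sig, 1])
    return [(rep, count) for rep, _, count in groups]


posts = [
    "CCTV admitted the existence of GFW",
    "CCTV admitted the existence of GFW!",
    "Fuel consumption is too high",
]
for representative, copies in deduplicate(posts):
    print(copies, representative)
```

The pairwise comparison is quadratic in the number of groups, so a production system would need an index or hashing scheme on top of this idea.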

VI. Application process

1: Determine the vertical fields to monitor;
2: Collect and curate proprietary dictionaries;
3: Prepare enough corpus for classification. Each class needs at least three hundred to five hundred texts for training;
4: To go vertical, as CIC or isouche do, collect at least enough of the field's distinctive Chinese vocabulary, such as product nicknames: for BlackBerry there are BB, Benben, xiao, Xiaojie, etc.;
5: Collect the corpus and the various dictionaries needed for descriptive feature extraction;
6: Collect a corpus for sentiment analysis and build its dictionary;
7: Experiment with various algorithms and tune them repeatedly until the accuracy is commercially acceptable;
8: Combine the various semantic processing steps into an integrated application.
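The nickname dictionary mentioned in the steps above can be applied as a simple alias-normalization pass before counting mentions. The sketch below is illustrative only: the `ALIASES` table holds just the nicknames named in the text, the helper names are hypothetical, and the `\b` word-boundary matching is an assumption that suits space-delimited text but not unsegmented Chinese, where the segmenter's tokens would be looked up instead.

```python
import re

# Canonical product name -> known nicknames (entries taken from the text above).
ALIASES = {
    "BlackBerry": ["BlackBerry", "BB", "Benben"],
}


def build_alias_pattern(aliases):
    """Compile one regex for all nicknames plus a lookup to the canonical name."""
    lookup = {}
    for canonical, names in aliases.items():
        for name in names:
            lookup[name.lower()] = canonical
    # Longest alias first so that a long name wins over a shorter prefix.
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup


def normalize(text, pattern, lookup):
    """Replace every nickname in text with its canonical product name."""
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


pattern, lookup = build_alias_pattern(ALIASES)
print(normalize("My BB is faster than your benben", pattern, lookup))
# My BlackBerry is faster than your BlackBerry
```

After normalization, mention counts and sentiment scores for all nicknames accumulate under one product name.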

 

Example:

Let's look at a simple semantic application process, taking the wengrui list as the example:

1: Clarify the requirement: collect the RT (retweet, or "forward") messages of Twitter and Fanfou in near real time, and merge messages with similar content into one; once a message has been forwarded enough times, it goes onto the wengrui list and is published through our official microblog account and RSS.

2: Determine the key functions and the natural language processing capability each one relies on:

A. Merge similar forwarded messages: based on word segmentation;

B. Automatically extract tags, associated tags, and trending tags on the list: based on tag extraction;

C. Prevent entries with similar content from appearing on the list twice: based on tags; (

For example, one of the following two list entries should be blocked, but this is actually very difficult, because the language features give little to judge by, even though a person can see at a glance that they are duplicates:

RT: @jason5ng32: the "door" I saw in the past few days: the classrooms of Handan University have sex doors, Cixi Vocational College touch the milk door, Beijing Shunyi exit the trousers door, Shanghai Metro hand washing door, Hunan kindergarten teacher touch the bird door, library airplane door, a school swing door in Hunan.

And

RT @yeluchow. /// No. This is also creating a "Green Dam" trend ???

A pair like the following is easier to block:

RT @flypig: your country's CCTV is finally on the website (page address: http://is.gd/16cfg) Admitted the existence of GFW for the teacher Qin Gang, please refer: http://twitpic.com/7silp Let's cheer for this responsible media! (CCTV 'admitted)

And

# RT: @David FENG: your country's CCTV is finally on the website http://is.gd/16cfg Teacher Qin Gang admitted the existence of GFW, http://twitpic.com/7silp

)

The main reason is that the texts are too short, often only a dozen or so characters. Many of the conventional methods designed for full articles are useless here and need to be adjusted.

3: Compile and continuously update your own proprietary stop-word dictionary for this kind of language behavior;

4: Compile your own tag-specific dictionaries, which may include general-purpose dictionaries;

5: Tune the parameters repeatedly to make the list richer and more interesting;

6: Refresh every 5 minutes, detecting the news, jokes, and quotations that are popular in the Chinese microblog world, 7 × 24.
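The merge-and-threshold part of this process (merging forwards of the same message, then listing only those forwarded often enough) can be sketched as below. It is a simplified illustration, not the actual wengrui implementation: the regular expressions, the idea of keying on shared URLs before falling back to normalized text, the `min_forwards` threshold, and the sample messages with `example.com` links are all assumptions.

```python
import re
from collections import defaultdict

# Leading "RT @user:" markers, possibly chained ("RT @a: RT @b: ...").
RT_PREFIX = re.compile(r"^(?:\s*RT\s*:?\s*@\w+\s*:?\s*)+", re.IGNORECASE)
URL = re.compile(r"https?://[^\s)]+", re.IGNORECASE)


def merge_key(message):
    """Key that is identical for forwards of the same content."""
    body = RT_PREFIX.sub("", message)
    urls = frozenset(URL.findall(body))
    if urls:
        # Short texts that share the same links are treated as one message.
        return urls
    # Otherwise fall back to the case- and whitespace-normalized body.
    return " ".join(body.lower().split())


def rank_forwards(messages, min_forwards=2):
    """Group forwards and return (first message, count), most forwarded first."""
    groups = defaultdict(list)
    for message in messages:
        groups[merge_key(message)].append(message)
    ranked = sorted(groups.values(), key=len, reverse=True)
    return [(group[0], len(group)) for group in ranked if len(group) >= min_forwards]


messages = [
    "RT @flypig: finally on the website (http://example.com/a) see http://example.com/b",
    "RT: @someone: it is finally on the website http://example.com/a , http://example.com/b",
    "RT @other: an unrelated message",
]
for text, count in rank_forwards(messages):
    print(count, text)
```

Keying on URLs handles the easy pair shown earlier; the hard pair, where the wording differs and no link is shared, still needs tag-based or similarity-based comparison.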
