Machine learning and text analysis

Keywords: cloud computing, machine learning

The following post is by Ashok Chandra, a Distinguished Scientist at Microsoft, and Dhyanesh Narayanan, a Program Manager.

When I (Ashok) was a student at Stanford University's Artificial Intelligence Lab in the 1970s, I was optimistic that human-level machine intelligence was imminent. In the meantime, computers have become increasingly capable thanks to machine learning (ML). As a result, almost all of Microsoft's new products use ML to some degree to analyze speech, data, or text. In this post, we focus mainly on text.

As computers come to understand natural language better, new frontiers are opening up: improved user interfaces, better search engines, personal assistants such as Cortana and Siri, and tools for working out what a given document is about. For example, a news site becomes more engaging if the people mentioned in its articles are algorithmically linked to Wikipedia, so that readers can easily find out more about them. In addition, by exploiting additional signals in the text, one can determine the salient entities (such as athletes and teams) that an article is about, as shown in Figure 1:

Figure 1 Vision for text analysis

Text analysis has long been an active field of research. After all, building a model of all human knowledge (as represented in text) is not an easy task. Early work in the 1990s includes the Brill tagger [1], which determines the part of speech of each word in a sentence, and more recent work [2] gives a sense of the newer directions. Microsoft Research has been keen to create new ideas in this field, but we go further by putting the new science into practice and creating production-level technology.

In this post, we briefly show how AI techniques can be applied to text analysis, using named entity recognition (NER) as an example. As a platform that provides complete and easy-to-use machine learning capabilities, Microsoft Azure ML includes basic text analysis capabilities and specifically supports NER, so we can relate the general concepts to specific design choices.

NER is the task of identifying mentions of entities such as people, places, organizations, and sports teams in text. Let's look at how "supervised learning" can be used to solve this problem:

Figure 2 Named entity recognition workflow

At design time, or "learning time", the system uses training data to create a "model" of what is to be learned. The idea is that the system generalizes from a small set of examples so that it can handle any new text.

The training data consists of text in which the named entities have been labeled by humans. An annotated sentence might look like: "When Chris Bosh excels, the Miami Heat become very powerful", with "Chris Bosh" labeled as an athlete and "Miami Heat" labeled as a team. The expectation is that a model that learns from examples of this nature will be able to identify athlete entities and team entities in new input text.
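To make the idea of human-labeled training data concrete, here is a minimal sketch of one common way to store such annotations: one tag per token, using the BIO scheme (B = beginning of an entity, I = inside an entity, O = outside any entity). This is an assumption for illustration only, not the format used by Azure ML; the helper `spans_to_bio` and the label names `PLAYER` and `TEAM` are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Turn labeled spans (start, end_exclusive, label) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# A hand-annotated example sentence (the annotation is illustrative).
tokens = ["When", "Chris", "Bosh", "excels", ",", "the",
          "Miami", "Heat", "become", "very", "powerful", "."]
spans = [(1, 3, "PLAYER"), (6, 8, "TEAM")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each (token, tag) pair then becomes one training example for the learning step.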

The effectiveness of the design-time workflow depends on the feature extraction stage: generally, the more carefully engineered the features, the stronger the model. For example, the local context of a word in a text (say, the k words before it and the k words after it) is a powerful feature that we humans also use to connect words to entities. In the sentence "San Francisco beat the Cardinals in an intense match yesterday", the context makes it clear that "San Francisco" refers to a sports team rather than to the city. Capitalization is another useful feature for identifying named entities, such as people and places, that appear in the text.
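The two kinds of features mentioned here, a window of k surrounding words and capitalization, can be sketched in a few lines of Python. This is only an illustration under our own assumptions; the function name `extract_features` and the feature names are hypothetical and do not describe the features the Azure ML NER module actually computes.

```python
def extract_features(tokens, i, k=2):
    """Return a sparse feature dict for the token at position i."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word=" + word.lower(): 1.0,
        "is_capitalized": float(word[:1].isupper()),
        "is_all_caps": float(word.isupper()),
    }
    # Local context: the k words before and the k words after the token.
    for offset in range(1, k + 1):
        prev_word = tokens[i - offset].lower() if i - offset >= 0 else "<START>"
        next_word = tokens[i + offset].lower() if i + offset < len(tokens) else "<END>"
        feats[f"prev{offset}={prev_word}"] = 1.0
        feats[f"next{offset}={next_word}"] = 1.0
    return feats


sentence = "San Francisco beat the Cardinals in an intense match yesterday".split()
features = extract_features(sentence, 1, k=2)   # features for "Francisco"
for name in sorted(features):
    print(name, features[name])
```

For "Francisco", the context feature `next1=beat` is the kind of signal that lets a model prefer a team reading over a place reading.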

Model training is what machine learning is all about: producing a good model. The model is typically a complex combination of the selected features. Many machine learning techniques can be used, including perceptrons and conditional random fields (CRFs). The choice of technique depends on how accurate the model can be with limited training data, the processing speed, and the number of named entity types to be learned at the same time. For example, the Azure ML NER module supports three types of entities by default: people, places, and organizations.
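As a rough illustration of what "training" means for one of the techniques named above, the sketch below trains a multiclass perceptron that tags each token independently. It is a toy under stated assumptions: the training sentences, the label set (`PER`, `LOC`, `ORG`, `O`), and the function names are all made up for this example, and it is not how the Azure ML NER module is implemented.

```python
from collections import defaultdict

LABELS = ["O", "PER", "LOC", "ORG"]

# Tiny hand-made training set (illustrative only).
TRAIN = [
    (["Alice", "works", "at", "Contoso", "in", "Seattle"],
     ["PER", "O", "O", "ORG", "O", "LOC"]),
    (["Bob", "lives", "in", "London"], ["PER", "O", "O", "LOC"]),
    (["Carol", "joined", "Fabrikam"], ["PER", "O", "ORG"]),
]


def token_features(tokens, i):
    """Word identity, capitalization, and one word of context on each side."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return ["bias", "word=" + word.lower(), "cap=" + str(word[:1].isupper()),
            "prev=" + prev_word, "next=" + next_word]


def predict(weights, feats):
    """Pick the label whose summed feature weights are highest."""
    return max(LABELS, key=lambda label: sum(weights[(f, label)] for f in feats))


def train_perceptron(data, max_epochs=200):
    weights = defaultdict(float)
    for _ in range(max_epochs):
        mistakes = 0
        for tokens, tags in data:
            for i, gold in enumerate(tags):
                feats = token_features(tokens, i)
                guess = predict(weights, feats)
                if guess != gold:
                    # Perceptron update: reward the gold label, penalize the guess.
                    mistakes += 1
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
        if mistakes == 0:      # every training token is now tagged correctly
            break
    return weights


weights = train_perceptron(TRAIN)
new_tokens = ["Dave", "moved", "to", "Seattle"]
predicted = [predict(weights, token_features(new_tokens, i))
             for i in range(len(new_tokens))]
print(list(zip(new_tokens, predicted)))
```

With so little data, predictions on unseen text should not be trusted; the point is only to show that the learned "model" is a set of weights over the extracted features.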

The goal of the run-time workflow is to take unlabeled input text and produce the corresponding output text with the entities recognized by the model that was created at design time. As one can observe, the run-time workflow reuses the feature extraction module from the design-time workflow. Consequently, if an application needs high-throughput entity recognition, the pipeline has to rely on relatively lightweight yet high-value features. As an illustrative example, the Azure ML NER module uses a small number of easily computed features, based primarily on local context, which have proven to be very effective. Ambiguity during processing is usually resolved with Viterbi decoding, which assigns entity labels to the sequence of input words.
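To show how Viterbi decoding resolves ambiguity, the following sketch finds the highest-scoring label sequence given per-token label scores and label-to-label transition scores. All numbers are invented for illustration, and the function signature is our own assumption; it is not the Azure ML implementation. In the example, tagging each token on its own would label "Miami" as I-ORG directly after an O tag, which is not a valid sequence; decoding the sentence as a whole avoids that.

```python
def viterbi(emissions, transitions, labels):
    """Return the best label sequence.

    emissions:   one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score (default 0.0)
    """
    # best[i][label] = score of the best path that ends in `label` at token i
    best = [{label: emissions[0][label] for label in labels}]
    back = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[i][cur])
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-ORG", "I-ORG"]
tokens = ["the", "Miami", "Heat", "won"]
emissions = [
    {"O": 2.0, "B-ORG": 0.0, "I-ORG": 0.0},   # the
    {"O": 0.0, "B-ORG": 1.0, "I-ORG": 1.2},   # Miami (locally ambiguous)
    {"O": 0.5, "B-ORG": 0.2, "I-ORG": 1.0},   # Heat
    {"O": 2.0, "B-ORG": 0.0, "I-ORG": 0.0},   # won
]
transitions = {("O", "I-ORG"): -10.0,         # an entity cannot start with I-
               ("B-ORG", "I-ORG"): 1.0}

greedy = [max(labels, key=scores.get) for scores in emissions]
path = viterbi(emissions, transitions, labels)
print("greedy: ", greedy)   # ['O', 'I-ORG', 'I-ORG', 'O']
print("viterbi:", path)     # ['O', 'B-ORG', 'I-ORG', 'O']
```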

It is worth noting that NER is only the beginning, but it is an important first step toward capturing "knowledge" from raw text. A recent blog post describes how NER plus a range of related technologies is improving the experience of the Bing Sports app, and a very similar NER stack is available for you to use in Azure ML. Beyond NER, natural language parsing, entity linking and salience, sentiment analysis, and fact extraction are further important steps toward enhancing the user experience of text-based applications; these additional techniques can help you make your text "come alive".

We hope you enjoyed this post, and we look forward to your suggestions.

References

[1] Eric Brill, 1992, A Simple Rule-Based Part of Speech Tagger, Proceedings of the Third Conference on Applied Natural Language Processing (ANLC '92)

[2] Li Deng, Dong Yu, 2014, Deep Learning: Methods and Applications
