Sediment Dragon Notes: Revisiting sparse data, or why parsing is the nuclear weapon of NLP applications

White: On parsing accuracy: if every unresolved problem is simply pushed off to semantics and pragmatics, it has the flavor of talking to oneself; end users feel nothing.

Wei: Whether end users perceive it directly does not matter much; the key is that parsing saves development effort at the semantic and pragmatic levels.

Without parsing, extraction has to be carried out on surface forms, and there it runs into the twin dilemmas of sparse data and the long tail.

Surface forms vary endlessly, and learning them is a treadmill; with deep parsing support, one extraction rule can stand in for a hundred. At least in terms of rule count, that is no exaggeration (see the sketch below). This is point one.
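
To make the "one rule for a hundred" point concrete, here is a minimal sketch with invented data: a toy triple representation stands in for real parser output (this is not the author's actual system). Once deep parsing has reduced different surface shapes of the same fact to one logical subject-predicate-object triple, a single structure-level rule covers them all.

```python
# A minimal sketch with invented data; the triples stand in for the
# output of a deep parser. Four different surface shapes of the same
# fact all reduce to one logical subject-predicate-object triple.
parsed = [
    ("Google", "acquire", "YouTube"),  # "Google acquired YouTube."
    ("Google", "acquire", "YouTube"),  # "YouTube was acquired by Google."
    ("Google", "acquire", "YouTube"),  # "Google's acquisition of YouTube ..."
    ("Google", "acquire", "YouTube"),  # "YouTube, which Google acquired, ..."
]

def extract_acquisitions(triples):
    """One rule at the structural level: predicate == 'acquire'."""
    return [(s, o) for (s, p, o) in triples if p == "acquire"]

print(extract_acquisitions(parsed))
# Four surface shapes (active, passive, nominalized, relativized),
# one rule. On the surface, each shape would need its own pattern,
# multiplied again by lexical variants ("buy", "purchase", ...).
```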

Second, deep parsing makes domain portability much stronger.

Without parsing, whenever the extraction task changes, everything must be redone from the beginning.

For a rule system with deep parsing, the extraction task can change with the domain without demanding nearly as much rework. Parsing absorbs roughly 90% of the repetitive work (linguistic knowledge and structure are inherently cross-domain), leaving less than 10% to redo; a toy illustration follows.
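
This sketch of the 90/10 split uses invented predicate lexicons and the same toy triple representation as above; it illustrates the idea, not the author's actual system. The structure-level matcher is reused verbatim across domains, and porting reduces to swapping a thin domain lexicon.

```python
# Toy illustration (invented lexicons): the structure-level matcher
# is the reusable ~90%; the per-domain predicate lexicon is the ~10%
# that actually changes when the extraction task moves domains.

def match(triples, predicates):
    """Domain-independent matcher over parser output (reused as-is)."""
    return [(s, p, o) for (s, p, o) in triples if p in predicates]

FINANCE_PREDS = {"acquire", "merge", "invest"}   # domain layer 1
MEDICAL_PREDS = {"treat", "cause", "inhibit"}    # domain layer 2

triples = [("Google", "acquire", "YouTube"),
           ("aspirin", "inhibit", "COX-2")]
print(match(triples, FINANCE_PREDS))  # [('Google', 'acquire', 'YouTube')]
print(match(triples, MEDICAL_PREDS))  # [('aspirin', 'inhibit', 'COX-2')]
```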

That is where the significance of parsing lies.

For machine learning, the knowledge bottleneck of NLP applications lies in (1) sparse data, and (2) re-annotation when the task changes: labels produced for a previous task can rarely be reused for the next one, because the annotations live at the pragmatic layer.

With parsing support, machine learning can in theory better overcome sparse data. In practice, however, throwing structural features and keyword features into one pot for machine learning is still at the exploratory research stage, with few mature cases. We have tried this before, and parsing's involvement does seem to have the potential to raise system quality, but it remains hard to pull off: the model gets complicated, the features are heterogeneous, and they are not easy to coordinate well.
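
For the flavor of the "one pot" experiment, here is an exploratory sketch using scikit-learn; the features, paths, and labels are invented for illustration, and this is emphatically not the mature recipe the paragraph says is still missing. Surface keywords and parse-derived path features are vectorized together and fed to one linear model.

```python
# Exploratory sketch with scikit-learn; features and labels invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each instance mixes surface keywords with dependency-path features
# that a parser would supply (the paths here are made up).
X_dicts = [
    {"kw=acquired": 1, "path=nsubj>acquire>dobj": 1},
    {"kw=bought": 1,   "path=nsubj>buy>dobj": 1},
    {"kw=visited": 1,  "path=nsubj>visit>dobj": 1},
    {"kw=met": 1,      "path=nsubj>meet>dobj": 1},
]
y = [1, 1, 0, 0]  # 1 = acquisition event, 0 = not

vec = DictVectorizer()
model = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

# An unseen keyword, but a known structural path: the structural
# feature can rescue the prediction (likely [1] here) -- provided the
# two feature families are weighted coherently, which is exactly the
# coordination problem described above.
test = vec.transform([{"kw=purchased": 1, "path=nsubj>buy>dobj": 1}])
print(model.predict(test))
```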

In fact, for a rule system to do extraction without parsing is all but hopeless: the number of surface-level rules the human brain would have to write is simply too large. Only machine learning can bypass parsing and learn that huge number of extraction rules or models, and only on the condition that a massive annotated corpus exists. Otherwise, sparse data is still unavoidable.

Sparse data does not refer only to the pile-up of low-frequency surface ngrams (habitual usages, idioms, and the like). That kind of sparseness is relatively simple: an expert can write the entries into a dictionary one by one, and in the end persistence moves mountains. And if the training data is huge, as in machine translation, such sparse data is no problem for machine learning at all. Of course, in most scenarios the training data never reaches that scale, and this knowledge bottleneck is what kills ML.
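
The "dictionary, entry by entry" remedy for that simpler kind of sparseness can look like this minimal sketch (the idiom entries and tags are invented for illustration): a longest-match lookup against an expert lexicon, with no frequency statistics involved.

```python
# Minimal sketch of the expert-lexicon remedy (entries invented):
# longest-match lookup, one hand-written entry per long-tail item.
IDIOMS = {
    "kick the bucket": "DIE",
    "spill the beans": "REVEAL",
    "by and large": "GENERALLY",
}

def tag_idioms(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in (4, 3, 2):              # try the longest span first
            span = " ".join(tokens[i:i + n])
            if span in IDIOMS:
                out.append((span, IDIOMS[span]))
                i += n
                break
        else:
            i += 1                       # no entry starts here
    return out

print(tag_idioms("he may kick the bucket soon".split()))
# [('kick the bucket', 'DIE')]
```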

The more important kind of sparse data stems from a lack of structure, and without parsing this kind is almost hopeless. Surface forms vary endlessly, roughly following a normal distribution, and before structure is captured there is no effective way to get at the long tail. Once parsing regularizes the surface variation, the sparse phenomena of the surface are no longer sparse: at the structural level, sparse patterns are normalized. This is the root of why parsing can be called a nuclear weapon in NLP applications.
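
A back-of-the-envelope sketch of that normalization effect, with invented counts: each surface pattern is seen only once (the long tail), but parsing pools them all into one structural pattern, which is no longer sparse.

```python
# Invented counts: surface patterns seen once each (the long tail)
# pool into a single, well-attested structural pattern.
from collections import Counter

surface = Counter({               # every surface pattern is rare
    "Google acquired YouTube": 1,
    "YouTube was acquired by Google": 1,
    "the acquisition of YouTube by Google": 1,
    "YouTube, which Google acquired": 1,
})

structural = Counter()
for pattern, n in surface.items():
    # pretend the parser maps each variant to the same deep structure
    structural[("Google", "acquire", "YouTube")] += n

print(max(surface.values()))      # 1: sparse at the surface
print(structural.most_common(1))  # count 4: dense at the structural level
```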

Without parsing, structurally sparse data simply cannot be handled.

Chomsky may have ten thousand faults and a thousand misleading claims, but the idea of surface structure versus deep structure that the old man put forward is immortal. Parsing digests all manner of surface structures and produces a logical deep structure. Doing extraction, or other semantic and pragmatic application work, on top of that deep structure multiplies your effectiveness.

Deep parsing consumes the variations of surface patterns; that is why it is as powerful as a nuclear bomb in NLP.

Never mind the endless surface variety of full natural language sentences; just look at some simple language subtasks, such as automatic tagging of data entities, to see how troublesome surface-level sparse data is: think of the many ways of expressing a "time", or an "e-mail address", and so on. These can be covered by a regular-expression parse, whereas learning ngrams over the surface makes the long-tail problem a disaster.
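
For instance, a couple of deliberately simplified regular expressions (real coverage needs many more alternations than shown here) already cover surface variants of "time" and "e-mail address" that an ngram learner would have to see one by one in training data:

```python
# Deliberately simplified patterns; enough to show the contrast.
import re

TIME  = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\s*(?:[ap]\.?m\.?)?", re.I)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

text = "Mail info@example.com before 9:30 am, or call after 17:45."
print(TIME.findall(text))   # ['9:30 am', '17:45']
print(EMAIL.findall(text))  # ['info@example.com']

# One pattern covers head and tail alike ('9:30 am', '09:30',
# '17:45:59', ...); an ngram learner only covers the variants it
# happened to see in training data.
```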

Natural language needs parsing for the same reason that regular expressions beat ngram learning for tagging data entities: the principle is one and the same.


Original address: http://blog.sciencenet.cn/blog-362400-908894.html. This article is from the Levi blog on ScienceNet; please indicate the source when reproducing.
