Problems in Current Web Information Extraction (Web IE) Research and Future Research Trends
Source: Internet
Author: User
Web information extraction (Web IE) technology is now fairly mature, but knowledge acquisition is still not fully automatic. Most information extraction systems merely turn tasks previously done by domain experts into tasks for the user. There have been useful attempts to build a general-purpose knowledge learner, but the results are not yet satisfactory. Current Web IE systems can process only specific types of text and achieve only limited accuracy, and many problems remain:
(1) The two main factors limiting the wide application of Web information extraction are system performance and system portability. How these two problems are solved will determine how far Web information extraction systems can develop. AI researchers have long been committed to building systems that can grasp the precise content of an entire document. Such systems usually run well only within a narrow knowledge domain and port poorly to other domains [41].
(2) The extraction efficiency and accuracy of Web information extraction systems need further improvement.
(3) English-language systems have reached, or are close to, a practical level in recognizing named entities and relations between objects. However, many problems remain to be explored in genuine information extraction, and most of them involve the core difficulties of natural language processing.
(4) Defining templates that capture the important information to be extracted from texts is a difficult and complex problem. Texts of specific genres (such as medical conclusions, scientific papers, and policy reports) have their own vocabulary, syntax, and discourse structure. Ambiguity arises during word segmentation and part-of-speech tagging, and semantic feature tagging and discourse-level syntactic analysis also need further research.
(5) Systems need to adapt better to different sublanguage features and different types of text, and should be able to process specific language structures and multilingual texts. Web documents may differ considerably from texts such as news and newspaper articles, so a system must be able to adapt to different situations [20].
(6) Compared with Web information extraction systems developed abroad, research on Chinese information extraction systems still lags far behind [8].
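Item (2) speaks of extraction accuracy without saying how it is measured. A common convention in information extraction, assumed here rather than taken from the text, is to compare the extracted items against a hand-labeled gold set and report precision, recall, and F1. The function name `extraction_scores` and the (field, value) representation below are illustrative choices only.

```python
def extraction_scores(extracted, gold):
    """Return (precision, recall, f1) for extracted items against a gold set.

    Both arguments are iterables of hashable items, e.g. (field, value) pairs.
    An item counts as correct only if it matches a gold item exactly.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; defined as 0 when nothing is correct.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: three extracted fields, four fields in the gold annotation.
extracted = {("title", "A"), ("price", "10"), ("author", "X")}
gold = {("title", "A"), ("price", "10"), ("author", "Y"), ("date", "2020")}
precision, recall, f1 = extraction_scores(extracted, gold)
```

Exact matching is the strictest choice. A real evaluation might also give credit for partial matches, which this sketch does not attempt.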
Future research
In view of these problems, future research on Web information extraction will need to address several questions. How can the comprehensiveness of Web information extraction systems be improved? How can the learning process be simplified and made more automatic, and how can systems adapt better to new web pages? How can existing extraction rules be generalized by induction to improve extraction efficiency and accuracy? Information on the Web and the structure of web pages change constantly, so how should a system detect updates and changes to Web information and structure? Current Web information extraction tools, once trained, are generally used to extract from a class of web pages with similar structure, so how should structural similarity be determined? How can system performance, portability, and support for multiple languages be improved? Finally, how can research on Chinese Web information extraction learn from mature system-building techniques developed abroad and, taking the particularities of Chinese into account, make full use of existing basic research on Chinese to build an efficient and accurate Chinese Web information extraction system? These questions will be hot topics in future research on Web information extraction.
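One question raised above is how to decide whether two web pages have similar structure. The following is a minimal sketch of one possible measure, not a method taken from the cited work: each page is reduced to the set of its root-to-element tag paths, and two pages are compared by the Jaccard similarity of those sets. The names `TagPathCollector`, `tag_paths`, and `structure_similarity` are illustrative, and the list of void elements is deliberately incomplete.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in an HTML page."""

    # Void elements have no closing tag, so they must not stay on the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') found in html."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets, in [0, 1]."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under this measure, two pages generated from the same template but with different text score close to 1, while a page built from a table and a page built from headings and paragraphs share only their outer `html/body` paths. The measure ignores attributes, element order, and repetition counts, all of which a practical system would probably need to consider.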