Natural Language Processing Task Data Sets
Keywords: NLP, data set
AI Challenger: English-Chinese Translation Evaluation
Applicable field: Machine translation
The largest English-Chinese bilingual data set for spoken language, with more than 10 million English-Chinese sentence pairs. All sentence pairs are manually checked, so the data set is assured in terms of size, relevance, and quality.
Training set: 10,000,000 sentences
Validation set (simultaneous interpretation): 934 sentences
Validation set (text translation): 8,000 sentences
https://challenger.ai/datasets/translation
UN Parallel Corpus (United Nations Parallel Corpus)
Applicable field: Machine translation
The United Nations Parallel Corpus consists of official United Nations records and other parliamentary documents that are in the public domain. The corpus contains text written and manually translated between 1990 and 2014, including sentence-aligned text.
The corpus aims to provide multilingual language resources that facilitate research and progress in natural language processing tasks such as machine translation. For ease of use, the corpus is also available as ready-made bilingual texts for specific language pairs and as a six-language parallel corpus.
Description: https://conferences.unite.un.org/UNCorpus/zh#introduction
Download: https://conferences.unite.un.org/UNCorpus/zh/DownloadOverview
(not currently downloaded)
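Sentence-aligned bilingual text is often distributed as two plain-text files in which line i of one file is the translation of line i of the other. The following is a minimal sketch under that assumption; the `read_parallel` helper and any file names passed to it are hypothetical and are not taken from the corpus documentation.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator, Optional, Tuple


def read_parallel(src_path: Path, tgt_path: Path,
                  limit: Optional[int] = None) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: line i of src_path is the translation of line i of tgt_path.
    `limit` caps the number of raw line pairs read (before empty pairs are
    skipped); None reads until the shorter file ends.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for s, t in islice(zip(src, tgt), limit):
            s, t = s.strip(), t.strip()
            # Skip pairs in which either side is empty.
            if s and t:
                yield s, t
```

Reading lazily keeps memory use flat, which matters for corpora with millions of sentence pairs.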
2nd International Chinese Word Segmentation Bakeoff
Applicable field: Chinese word segmentation
This directory contains the training, test, and gold-standard data used in the 2nd International Chinese Word Segmentation Bakeoff.
http://sighan.cs.uchicago.edu/bakeoff2005/
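Segmentation data of this kind is typically used by converting each segmented sentence into per-character labels. The sketch below assumes the gold segmentation is stored as plain text with words separated by whitespace, and it uses the common B/M/E/S scheme (begin, middle, end, single-character word). The `to_bmes` helper is a hypothetical name, and this is not the bakeoff's own tooling.

```python
from typing import List, Tuple


def to_bmes(segmented_line: str) -> List[Tuple[str, str]]:
    """Convert a whitespace-segmented sentence to (character, tag) pairs.

    Assumption: words in the gold data are separated by whitespace.
    """
    tagged: List[Tuple[str, str]] = []
    for word in segmented_line.split():
        if len(word) == 1:
            tagged.append((word, "S"))  # single-character word
        else:
            tagged.append((word[0], "B"))  # first character of the word
            tagged.extend((ch, "M") for ch in word[1:-1])  # inner characters
            tagged.append((word[-1], "E"))  # last character of the word
    return tagged
```

For example, `to_bmes("我 喜欢 自然语言")` labels 我 as S, 喜欢 as B E, and 自然语言 as B M M E.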
20 Newsgroups
Applicable field: Text classification
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
http://qwone.com/~jason/20Newsgroups/
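To check how evenly the newsgroup documents are spread across categories, one can count files per category after unpacking an archive. This is a minimal sketch that assumes one subdirectory per newsgroup with one file per document; the `count_documents` helper is a hypothetical name, and the directory passed to it depends on where the archive was unpacked locally.

```python
from collections import Counter
from pathlib import Path


def count_documents(root: Path) -> Counter:
    """Count the files in each immediate subdirectory of root.

    Assumption: each subdirectory is one category and each file is one document.
    """
    counts: Counter = Counter()
    for group_dir in sorted(root.iterdir()):
        if group_dir.is_dir():
            counts[group_dir.name] = sum(
                1 for f in group_dir.iterdir() if f.is_file()
            )
    return counts
```

Calling `count_documents(...).most_common()` on the unpacked directory lists categories from largest to smallest, which makes any imbalance easy to spot before training a classifier.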
NLPCC 2017 News Headline Categorization
Applicable field: Text classification
http://tcci.ccf.org.cn/conference/2017/taskdata.php
Reuters-21578 Text Categorization Collection
Applicable field: Text classification
This is a collection of documents that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories.
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Whole-web news data (SogouCA)
Applicable field: Text classification, event detection and tracking, new word discovery, named entity recognition, automatic summarization
News data collected from a number of news sites during June-July 2012, covering 18 channels including domestic, international, sports, society, and entertainment. URL and body text are provided.
http://www.sogou.com/labs/resource/ca.php
CMU World Wide Knowledge Base (Web->KB) Project
Applicable field: Knowledge extraction
The goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this would make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/