Notes on commonly used databases.
- TIMIT
I forget where this description originally came from, and I couldn't find a good link on the Internet.
TIMIT, in full the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, is an acoustic-phonetic continuous speech corpus built by Texas Instruments (TI), MIT, and the Stanford Research Institute (SRI). The speech in the TIMIT dataset is sampled at 16 kHz, and the corpus consists of 6,300 sentences: each of the 630 speakers, drawn from the eight major dialect regions of the United States, reads ten given sentences, and every sentence is manually segmented and labeled at the phoneme level. 70% of the speakers are male, and most speakers are white adults. A minimal sketch of reading the phoneme-level annotations follows.
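As a sketch of how the phoneme-level segmentation can be consumed, assuming the standard TIMIT layout where each `.WAV` file has a companion `.PHN` file whose lines read `start_sample end_sample phone` (the path below is hypothetical):

```python
# Minimal sketch: read TIMIT's phoneme-level segmentation.
# Each line of a .PHN file is "start_sample end_sample phone", sampled at 16 kHz.

def read_phn(path):
    """Parse a TIMIT .PHN file into (start_sec, end_sec, phone) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, phone = line.split()
            # Dividing sample indices by the 16 kHz rate gives seconds.
            segments.append((int(start) / 16000.0, int(end) / 16000.0, phone))
    return segments

# Hypothetical path into the corpus's TRAIN portion.
for start, end, phone in read_phn("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
    print(f"{start:.3f}-{end:.3f}s {phone}")
```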
- THCHS30
THCHS30 is an open speech dataset released by Dong Wang, Xuewei Zhang, and Zhiyong Zhang that can be used to develop Chinese speech recognition systems.
- CSTR VCTK Corpus
This is the database used by Google's WaveNet.
The CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise contextual and phonetic coverage. The Rainbow Passage and the elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the Speech Accent Archive (http://accent.gmu.edu). Details of the Speech Accent Archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009aacl.pdf
All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035) at a 96 kHz sampling frequency and 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. All recordings were converted into 16 bits, were downsampled to 48 kHz based on STPK, and were manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. The conversion step can be sketched as follows.
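A minimal sketch of that 96 kHz/24-bit to 48 kHz/16-bit conversion, using the librosa and soundfile packages, which are my own choice rather than the tools used by the corpus authors (the file names are hypothetical):

```python
# Minimal sketch: convert a 96 kHz/24-bit recording to 48 kHz/16-bit,
# mirroring the conversion described for the VCTK corpus.
import librosa
import soundfile as sf

# librosa resamples to the requested rate while loading (hypothetical file name).
audio, sr = librosa.load("p225_001_96k.wav", sr=48000)

# subtype="PCM_16" writes 16-bit samples.
sf.write("p225_001_48k.wav", audio, sr, subtype="PCM_16")
```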
- VoxForge (open-source speech recognition corpus)
VoxForge was created to collect transcribed recordings for free and open-source speech recognition engines (on Linux/Unix, Windows, and Mac platforms).
All submitted recordings are released under the GPL and used to produce acoustic models for open-source speech recognition engines such as CMU Sphinx, ISIP, Julius (GitHub), and HTK (note: HTK has distribution restrictions).
- OpenSLR (hosts audiobook datasets, among other resources)
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition and software related to speech recognition. It is intended to be a convenient place for anyone to put resources that they have created, so that they can be downloaded publicly.
The following is excerpted from: http://www.cnblogs.com/AriesQt/articles/6742721.html
Text Classification Datasets
From the paper by Zhang et al., 2015: a large collection of eight text classification datasets, and the most common benchmark for new text classification methods. The sample sizes range from 120K to 3.6M, with between 2 and 14 classes. The datasets come from DBpedia, Amazon, Yelp, Yahoo!, Sogou, and AG.
Address: https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M
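A minimal loading sketch, assuming the AG News layout of header-less CSV rows (class index, title, description) as distributed in that folder (file path hypothetical):

```python
# Minimal sketch: load one of the Zhang et al. (2015) CSVs with pandas.
# Assumes the AG News layout: header-less rows of (class index, title, description).
import pandas as pd

train = pd.read_csv("ag_news_csv/train.csv", header=None,
                    names=["label", "title", "description"])
print(train["label"].value_counts())  # AG News has four classes
```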
WikiText
Tags: practical academic benchmarks
Large language modeling corpus built from high-quality Wikipedia articles. Maintained by Salesforce MetaMind.
Address: http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/
Question Pairs
Tags: practical
The first dataset released by Quora, containing duplicate / semantic similarity labels.
Address: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
SQuAD
Tags: practical academic benchmarks
The Stanford Question Answering Dataset: a broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span (segment) of text.
Address: https://rajpurkar.github.io/SQuAD-explorer/
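To make the "answer as a span" structure concrete, here is a minimal sketch that walks the SQuAD v1.1 JSON and checks that each answer is a character-indexed span of its paragraph's context:

```python
# Minimal sketch: walk the SQuAD v1.1 JSON and show that every answer
# is a character-indexed span of the paragraph context.
import json

with open("train-v1.1.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                span = context[start:start + len(answer["text"])]
                assert span == answer["text"]  # the answer is literally a span
```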
CMU Q/A Dataset
Tags: none
Manually created factoid question/answer pairs from Wikipedia articles, together with difficulty ratings.
Address: http://www.cs.cmu.edu/~ark/QA-data/
Maluuba Datasets
Tags: practical
A set of sophisticated, manually created datasets for NLP research.
Address: https://datasets.maluuba.com/
Billion Words
Tags: practical academic benchmarks
Large, general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.
Address: http://www.statmt.org/lm-benchmark/
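As a minimal sketch of that typical use, training word2vec on one shard of the benchmark with gensim (assuming gensim 4.x and the corpus's pre-tokenized, one-sentence-per-line shard files; the shard name is illustrative):

```python
# Minimal sketch: train word2vec on one shard of the benchmark with gensim.
# Assumes gensim 4.x; the corpus ships pre-tokenized, one sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("news.en-00001-of-00100")  # illustrative shard name
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(model.wv.most_similar("king", topn=5))
```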
Common Crawl
Tags: practical academic benchmarks
Petabyte-scale web crawl data. Most often used for learning word embeddings. Available for free from Amazon S3. As a collection of information from the World Wide Web, it is a broadly useful network dataset.
Address: http://commoncrawl.org/the-data/
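Common Crawl ships as WARC archives; a minimal sketch of iterating one locally downloaded archive with the warcio package (my own choice of tool; the file name is hypothetical):

```python
# Minimal sketch: iterate over a locally downloaded Common Crawl WARC file
# with the warcio package.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html))
```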
bAbI
Tags: Academic benchmark Classic
Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
Address: https://research.fb.com/projects/babi/
The Children's Book Test
Tags: Academic benchmark
A benchmark of (question + context, answer) pairs extracted from children's books available through Project Gutenberg (a free digital book sharing project). Useful for question answering, reading comprehension, and factoid queries.
Address: https://research.fb.com/projects/babi/
Stanford Sentiment Treebank
Tags: Academic benchmark Classic older
The standard sentiment dataset, with fine-grained sentiment annotations at every node of each sentence's parse tree.
Address: http://nlp.stanford.edu/sentiment/code.html
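The trees are distributed in bracketed, Penn-Treebank-style form, so a minimal sketch of reading the per-node labels with NLTK looks like this (the tree string here is a made-up example, not taken from the dataset):

```python
# Minimal sketch: parse one bracketed sentiment tree with NLTK.
# Node labels are sentiment classes from 0 (very negative) to 4 (very positive).
from nltk import Tree

# Made-up example tree, not taken from the dataset.
tree = Tree.fromstring("(3 (2 (2 An) (2 example)) (4 (3 works) (2 .)))")
for subtree in tree.subtrees():
    print(subtree.label(), " ".join(subtree.leaves()))
```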
20 Newsgroups
Tags: classic older
A fairly classic text classification dataset. Often useful as a benchmark for pure classification or for validating IR/indexing algorithms.
Address: http://qwone.com/~jason/20Newsgroups/
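Because the dataset is bundled with scikit-learn, a minimal benchmark sketch takes only a few lines (TF-IDF plus naive Bayes, one standard baseline among many):

```python
# Minimal sketch: the classic 20 Newsgroups text classification baseline
# with scikit-learn (bag-of-words TF-IDF plus multinomial naive Bayes).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train.data), train.target)
print(clf.score(vectorizer.transform(test.data), test.target))
```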
Reuters
Tags: classic older
An older, purely classification-based dataset with text from the Reuters newswire. Commonly used in tutorials.
Address: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
IMDB
Tags: classic older
An older, relatively small dataset for sentiment classification. It has gradually fallen out of favor as a benchmark in the literature in favor of larger datasets.
Address: http://ai.stanford.edu/~amaas/data/sentiment/
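One convenient way to obtain the dataset is via Keras, which ships it as pre-tokenized integer sequences; a minimal sketch (assuming TensorFlow/Keras is installed):

```python
# Minimal sketch: load the IMDB sentiment dataset as pre-tokenized integer
# sequences via Keras (one of several ways to obtain this dataset).
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), "training reviews; labels are 0/1 sentiment")
```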
UCI's Spambase
Tags: classic older
An older, classic spam dataset from the UCI Machine Learning Repository. Because of details of how the dataset was curated, it can be an interesting baseline for learning personalized spam filtering.
Address: https://archive.ics.uci.edu/ml/datasets/Spambase
Speech
Most speech recognition databases are proprietary; the data is of great value to the companies that own it. The vast majority of public datasets in this area are quite old.
HUB5 English
Tags: Academic benchmark older
Contains English-only speech data, most recently used in Baidu's Deep Speech paper.
Address: https://catalog.ldc.upenn.edu/LDC2002T43
LibriSpeech
Tags: Academic benchmark
Audiobook dataset of text and speech. Nearly 500 hours of clean speech from multiple readers and multiple audiobooks, organized by book chapter.
Address: http://www.openslr.org/12/
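A minimal sketch of downloading and iterating one split via torchaudio, which is one of several packaged loaders for this corpus (the root directory is hypothetical):

```python
# Minimal sketch: download and iterate one LibriSpeech split with torchaudio.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, speaker_id, chapter_id, transcript)
```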
VoxForge
Tags: practical academic benchmarks
A clean speech dataset of accented English. Useful if you need robust recognition of strong accents and intonations.
Address: http://www.voxforge.org/
TIMIT
Tags: Academic benchmark Classic
English-only speech recognition data set.
Address: https://catalog.ldc.upenn.edu/LDC93S1
CHiME
Tags: practical
A noisy speech recognition challenge dataset. It contains real, simulated, and clean recordings: real recordings of four speakers in four different noisy environments, totaling nearly 9,000 recording segments; simulated recordings generated by combining multiple environments with speech; and clean recordings without noise. A simple mixing sketch follows the address below.
Address: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html
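As a sketch of what the "simulated" condition means, here is a minimal NumPy mixer that adds an environment noise recording to clean speech at a chosen signal-to-noise ratio; this is illustrative, not the official CHiME simulation tooling:

```python
# Minimal sketch: mix clean speech with environment noise at a target SNR.
# Illustrative only; this is not the official CHiME simulation tooling.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add."""
    noise = noise[:len(speech)]  # assume the noise clip is at least as long
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```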
TED-LIUM
Tags: none
Audio transcriptions of TED talks. Includes 1,495 TED talks along with their full subtitle texts.
Address: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus