Common database records

Source: Internet
Author: User
Tags: benchmark, UCI machine learning repository

A record of commonly used databases.

    • TIMIT
      I have forgotten where I originally got this one, and I have not found any good download links online.
      TIMIT, whose full name is the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, is an acoustic-phonetic continuous speech corpus built by Texas Instruments (TI), MIT, and SRI International (Stanford Research Institute). The TIMIT data set has a speech sampling frequency of 16 kHz and consists of 6,300 sentences: each of 630 speakers from the eight major dialect regions of the United States reads ten given sentences, and all sentences are manually segmented and labelled at the phoneme level. 70% of the speakers are male, and most speakers are white adults.
    • THCHS30
      THCHS30 is an open speech data set released by Dong Wang, Xuewei Zhang, and Zhiyong Zhang that can be used to develop Chinese speech recognition systems.
    • CSTR VCTK Corpus

The database used by Google's WaveNet.
The CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise contextual and phonetic coverage. The Rainbow Passage and the elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the Speech Accent Archive (http://accent.gmu.edu). Details of the Speech Accent Archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009aacl.pdf

All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035) at a 96 kHz sampling frequency and 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. All recordings were converted to 16 bits, downsampled to 48 kHz based on STPK, and manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies.
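
For illustration only, here is a minimal sketch of the kind of conversion described above (resampling a high-rate recording to 48 kHz and writing it back as 16-bit PCM), using the librosa and soundfile libraries; the file names are hypothetical and this is not the corpus authors' actual pipeline.

    # Minimal sketch: resample a (hypothetical) 96 kHz studio recording to 48 kHz
    # and store it as 16-bit PCM, roughly the distributed format described above.
    import librosa          # pip install librosa
    import soundfile as sf  # pip install soundfile

    waveform, sample_rate = librosa.load("p225_001_raw.wav", sr=48000)  # resample on load
    sf.write("p225_001_48k.wav", waveform, 48000, subtype="PCM_16")     # write 16-bit output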

    • VoxForge (open-source speech recognition corpus)

VoxForge was created to collect transcribed recordings for free and open source speech recognition engines (on Linux/Unix, Windows, and Mac platforms).
All submitted recordings are released under the GPL, and acoustic models are produced from them for use by open source speech recognition engines such as CMU Sphinx, ISIP, Julius (GitHub), and HTK (note: HTK has distribution restrictions).

    • OpenSLR

OpenSLR hosts audio book data sets, among other speech and language resources.

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition and software related to speech recognition. It is intended to be a convenient place for anyone to put resources that they have created, so that they can be downloaded publicly.

The following is excerpted from: http://www.cnblogs.com/AriesQt/articles/6742721.html

From Zhang et al., 2015: a large collection of eight text classification data sets, currently the most common benchmark for new text classification work. Sample sizes range from 120K to 3.6M, with between 2 and 14 classes. The data sets are drawn from DBpedia, Amazon, Yelp, Yahoo!, Sogou, and AG.

Address: https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M
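
These archives are typically distributed as train.csv/test.csv files whose rows hold a class index followed by text fields (for the AG News release: class, title, description). A minimal reading sketch under that assumption follows; the path and column layout are assumptions, not guaranteed by this page.

    # Minimal sketch: read one of the text-classification CSVs
    # (assumed layout: class index, title, description, as in AG News).
    import csv

    samples = []
    with open("ag_news_csv/train.csv", newline="", encoding="utf-8") as f:  # hypothetical path
        for row in csv.reader(f):
            label = int(row[0])       # class index
            text = " ".join(row[1:])  # concatenate the text fields
            samples.append((label, text))

    print(len(samples), samples[0])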

WikiText

Tags: practical academic benchmarks

A large language modeling corpus built from high-quality Wikipedia articles. Maintained by Salesforce MetaMind.

Address: http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/

Question Pairs

Tags: practical

The first data set released by Quora, containing duplicate/semantic-similarity labels for pairs of questions.

Address: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

SQuAD

Tags: practical academic benchmarks

The Stanford Question Answering Dataset: a broad question answering and reading comprehension data set, in which every answer to a question is a span (segment) of the source text.

Address: https://rajpurkar.github.io/SQuAD-explorer/
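
To make the "answer is a span" structure concrete, here is a minimal sketch of walking SQuAD v1.1-style JSON (data → paragraphs → qas → answers, each answer carrying a text and a character-level answer_start); the local file name is an assumption.

    # Minimal sketch: iterate over SQuAD-style JSON and recover each answer span
    # from its character offset into the paragraph context (file name is hypothetical).
    import json

    with open("train-v1.1.json", encoding="utf-8") as f:
        squad = json.load(f)

    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    span = context[start:start + len(answer["text"])]
                    assert span == answer["text"]  # every answer is a span of the context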

CMU Q/A Dataset

Tags: none

Manually generated factoid question/answer pairs from Wikipedia articles, together with difficulty ratings.

Address: http://www.cs.cmu.edu/~ark/QA-data/

Maluuba Datasets

Tags: practical

Sophisticated, manually created data sets for NLP research.

Address: https://datasets.maluuba.com/

Billion Words

Tags: practical academic benchmarks

A large, general-purpose language modeling data set. Often used to train distributed word representations such as word2vec or GloVe.

Address: http://www.statmt.org/lm-benchmark/
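
As an illustration of training distributed word representations of the kind mentioned above, the sketch below fits a small word2vec model with gensim (4.x API); the shard path and hyperparameters are placeholders rather than a prescribed setup.

    # Minimal sketch: train word2vec embeddings on one tokenized shard of a corpus
    # using gensim 4.x. Path and hyperparameters are placeholders.
    from gensim.models import Word2Vec

    with open("news.en-00001-of-00100", encoding="utf-8") as f:  # one sentence per line
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # embedding dimensionality
        window=5,         # context window size
        min_count=5,      # drop rare words
        workers=4,
    )
    print(model.wv.most_similar("king", topn=5))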

Common Crawl

Tags: practical academic benchmarks

A petabyte-scale web crawl. Most often used to learn word embeddings. Freely available from Amazon S3. As a collection of information from across the World Wide Web, it is a very useful web data set.

Address: http://commoncrawl.org/the-data/

bAbI

Tags: Academic benchmark Classic

Synthetic reading comprehension and question answering data sets released by Facebook AI Research (FAIR).

Address: https://research.fb.com/projects/babi/

The Children's Book Test

Tags: Academic benchmark

A benchmark of (question + context, answer) triples extracted from children's books that are freely available through Project Gutenberg (a free digital book sharing project). Useful for question answering, reading comprehension, and factoid queries.

Address: https://research.fb.com/projects/babi/

Stanford Sentiment Treebank

Tags: Academic benchmark Classic older

A standard sentiment data set with fine-grained sentiment annotations at every node of each sentence's parse tree.

Address: http://nlp.stanford.edu/sentiment/code.html
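
The treebank distributes each sentence as a PTB-style bracketed tree whose node labels are sentiment scores from 0 (very negative) to 4 (very positive). Below is a minimal sketch of reading such a tree with NLTK; the example string is invented for illustration.

    # Minimal sketch: parse one PTB-style sentiment tree (node labels are 0-4 scores).
    # The example string is invented; real trees ship in the treebank's PTB files.
    from nltk.tree import Tree

    line = "(3 (2 (2 A) (2 film)) (4 (3 worth) (4 watching)))"
    tree = Tree.fromstring(line)

    print(" ".join(tree.leaves()))  # the sentence tokens
    print(tree.label())             # sentiment of the whole sentence
    for subtree in tree.subtrees():
        print(subtree.label(), " ".join(subtree.leaves()))  # per-node annotations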

20 Newsgroups

Tags: classic older

A fairly classic text classification data set. Often useful as a benchmark for pure classification or for validating IR/indexing algorithms.

Address: http://qwone.com/~jason/20Newsgroups/
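
As a usage sketch (not part of the data set itself), scikit-learn ships a loader for this corpus; the snippet below fits a simple TF-IDF plus logistic regression baseline on it.

    # Minimal sketch: a TF-IDF + logistic regression baseline on 20 Newsgroups
    # using scikit-learn's built-in loader.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

    vectorizer = TfidfVectorizer(max_features=50000)
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)

    clf = LogisticRegression(max_iter=1000).fit(X_train, train.target)
    print("test accuracy:", clf.score(X_test, test.target))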

Reuters

Tags: classic older

An older, purely classification-based data set with text from the Reuters newswire. Often used in tutorials.

Address: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

IMDB

Tags: classic older

An older, relatively small data set used for sentiment classification. It has gradually fallen out of favour as a benchmark in the literature, giving way to larger data sets.

Address: http://ai.stanford.edu/~amaas/data/sentiment/

UCI's Spambase

Tags: classic older

An older, classic spam data set from the UCI Machine Learning Repository. Because of how the data set was curated, it is an interesting baseline for learning about personalized spam filtering.

Address: https://archive.ics.uci.edu/ml/datasets/Spambase
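
For reference, spambase.data is a plain comma-separated file with 57 numeric features followed by a 0/1 spam label per row; the sketch below loads it and fits a simple baseline (the local path and model choice are illustrative).

    # Minimal sketch: load spambase.data (57 numeric features + 0/1 spam label)
    # and fit a simple baseline classifier. Path and model choice are illustrative.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    data = np.loadtxt("spambase.data", delimiter=",")
    X, y = data[:, :-1], data[:, -1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = GaussianNB().fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))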

Speech

Most speech recognition data sets are proprietary; the data has great value to the companies that own it. The vast majority of public data sets in this area are quite old.

HUB5 English

Tags: Academic benchmark older

Contains English-only speech data. It was most recently used in Baidu's Deep Speech paper.

Address: https://catalog.ldc.upenn.edu/LDC2002T43

LibriSpeech

Tags: Academic benchmark

An audio book data set of text and speech. Nearly 500 hours of clean speech from multiple readers and multiple audio books, organized by book chapter.

Address: http://www.openslr.org/12/
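
One convenient way to load it (not part of the data set itself) is torchaudio's built-in dataset class; a minimal sketch, assuming torchaudio is installed, is shown below.

    # Minimal sketch: download the small "dev-clean" split via torchaudio and
    # inspect one utterance (waveform, transcript, and speaker/chapter metadata).
    import torchaudio

    dataset = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]

    print(sample_rate, waveform.shape)  # 16 kHz mono audio tensor
    print(speaker_id, chapter_id, transcript)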

VoxForge

Tags: practical academic benchmarks

A clean speech data set of accented English. Useful if you need robust recognition of strong accents and intonations.

Address: http://www.voxforge.org/

TIMIT

Tags: Academic benchmark Classic

English-only speech recognition data set.

Address: https://catalog.ldc.upenn.edu/LDC93S1

CHIME

Tags: practical

A noisy speech recognition challenge data set. It contains real, simulated, and clean recordings: real meaning nearly 9,000 recordings of four speakers in four different noisy environments; simulated meaning recordings generated by combining speech utterances with multiple environments; and clean meaning recordings without noise.

Address: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html

TED-LIUM

Tags: none

Audio transcriptions of TED talks: 1,495 TED talk recordings along with their full text transcriptions.

Address: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
