Big data: More than 70 sites give you free access to large data repositories

Source: Internet
Author: User

Do you need a lot of data to test your app's performance? The easiest way to get it is to download sample data from one of the free data repositories on the web. The biggest drawback of this approach is that the data rarely has unique content and may not produce the results you want. Here are more than 70 sites that offer free large data repositories.

Wikipedia:Database: Provides free copies of all available content to interested users. The data is available in multiple languages, and content can be downloaded together with images.

Common Crawl: Builds and maintains an open crawl of the web that is accessible to anyone. The data is stored in an Amazon S3 bucket, and requesters may incur some cost to access it.

EDRM File Formats Data Set: Consists of 381 folders covering 200 file formats.

Apache Mahout: An Apache top-level project (TLP) that builds scalable machine learning algorithms. Mahout lists a number of free and paid corpora.

EDRM Enron Email Data Set v2: Consists of Enron e-mail messages and attachments, available as two sets of downloadable compressed files: XML and PST.

ClueWeb09: A dataset created to support research on information retrieval and related human language technologies. It contains about 1 billion web pages in 10 languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

DMOZ: The largest and most comprehensive human-edited open directory of the Web. It collects links to many different kinds of websites and is a major data source for Internet search engines.

theinfo.org: A site for large datasets where academics, designers, artists, and others can exchange tips and tricks, develop and share tools together, and begin to integrate their individual projects.

Project Gutenberg: Offers more than 36,000 free e-books that can be downloaded to a PC, Kindle, Android, iOS, or other portable device.

Technologists song data set: Data related to tracks and artists

AWS (Amazon Web Services) Public Data Sets: Provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications.

BigML: A big list of public data sources.

Bioassay Data: Described in the research article "Virtual screening of bioassay data" by Amanda Schierz; 21 bioassay datasets (active/inactive compounds) are available for download.

bitly 1.usa.gov Data: Anonymized click data for government links

Canada Open Data: A pilot project with many government and geospatial datasets


Causality Workbench: Data repository

Corral Big Data repository: Provides data-centric technology at the Texas Advanced Computing Center.

Data Source Handbook: a Guide

Datacatalogs.org: Open government data from the United States, the European Union, Canada, CKAN, and others

Data.gov.uk: Publicly available data from the UK (London Datastore)

Data.gov/education: A key guide to educational data resources, including high-value datasets, data visualization, classroom resources, applications to create private data, and more.

DataMarket: Visualization of the world economy, society, nature and industry, with 100 million time series from the United Nations, the World Bank, Eurostat and other key data providers.

Datamob: Public data put to good use

DataSF.org: A clearinghouse of datasets available from the City & County of San Francisco, CA.

DataFerrett: A data mining tool for accessing and working with TheDataWeb, a collection of many online U.S. government datasets.

EconData: Many economic time series, compiled by a number of U.S. government agencies.

Enron Email Dataset: Data from approximately 150 users, most of whom were senior managers at Enron

Europeana Data: Contains open metadata on 20 million texts, images, videos, and sounds collected by European digital libraries, a trustworthy and comprehensive resource for European cultural heritage content.


FedStats: A comprehensive resource for U.S. statistics and more

FIMI repository for frequent itemset mining: tools and datasets

Financial Data Finder at OSU: A large directory of financial datasets

GDELT: Global Data on Events, Location and Tone, described by the British newspaper The Guardian as "The history of Life, the universe, and everything."

GEO (Gene Expression Omnibus): A gene expression/molecular abundance repository that supports MIAME-compliant data submissions, and a curated online resource for browsing, querying, and retrieving gene expression data.

GeoDa Center: Geographic and spatial data

Google Ngrams Datasets: Text from millions of books scanned by Google

Grain Market Research: Financial data, including stocks, futures, etc.

Hilary Mason's research-quality big data sets: A collection of many text and image datasets

HitCompanies Datasets: Comprehensive data on 10,000 randomly sampled UK companies, updated automatically using artificial intelligence/machine learning.

ICWSM-2009 Dataset: Contains 44 million posts made between August 1 and October 1, 2008

Infochimps: An open directory and collection of data that lets you share, sell, and download data about anything.

Investor Links: Contains property data

KDD Cup Center: Data, tasks, and results

Kevin Chai List of Datasets: text, SNA and other fields

KONECT: The Koblenz Network Collection, a large set of network datasets of various types, intended for research in network mining.

Linking Open Data project: Makes data freely available to everyone

MIT Cancer Genomics gene expression datasets and publications: From the MIT Whitehead Center for Genome Research

ML Data: Data repository of the European Union's PASCAL2 network

NASDAQ Data Store: Market information available

National Government Statistical Web Sites: Data, reports, statistical yearbooks, news, and more from approximately 70 websites, including those of countries in Africa, Europe, Asia, and Latin America.

National Space Science Data Center (NSSDC): NASA datasets covering planetary exploration, space and solar physics, life sciences, astrophysics, and more.

Open Data Census: Assesses the state of open data around the world.

OpenData from Socrata: Allows access to more than 10,000 datasets, including business, education, government, and entertainment

Open Source Sports: A large number of sports databases, including baseball, football, basketball, and hockey

Peter Skomoroch's Dataset Bookmarks

PubGene(TM) Gene Database and Tools: Databases of genome-related publications

Quandl, a collaboratively curated portal to millions of financial and economic time-series.

Qunb: A platform for discovering and visualizing data

Robert Shiller Data: Data on housing, stock markets, and more from his book Irrational Exuberance

SMD: The Stanford Microarray Database, which stores raw and normalized data from microarray experiments

Jerry Smith DataSet Collection: Finance, Government, machine learning, science and other data

SourceForge.net Research data: A project management site that contains statistics on the history and status of the activities of approximately 100,000 projects and over 1 million registered users.

StatLib: Data archive at Carnegie Mellon University

Statoo Datasets Part 1 and Statoo Datasets Part 2

Time Series Data Library

Visual Analytics Benchmark Repository.

UCI KDD Database Repository: Large datasets for machine learning and knowledge discovery research

UCI Machine Learning Repository.

UCR Time Series Data Archive: Provides datasets, papers, links, and code

United States Census Data.

Wikiposit: A (virtual) fusion of data from many different websites (mostly financial), allowing users to merge data from different sources

Wolfram Alpha disease and marginalised level dat.

Yahoo Sandbox Datasets: Languages, charts, ratings, advertising and marketing, competitions

Yelp Academic Dataset: Data and reviews for the 250 businesses closest to each of 30 universities, for students and academics to explore and study
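
The introduction warns that downloaded sample data rarely has unique content. One rough way to check a sample before using it for performance testing is to measure how many of its records are distinct. The sketch below is only an illustration: the function name `uniqueness_report` and the idea of treating each text line as one record are assumptions made here, not something any of the listed sites prescribes.

```python
from collections import Counter
from typing import Iterable


def uniqueness_report(records: Iterable[str]) -> dict:
    """Summarize how much of a sample is unique content.

    Hypothetical helper: each record is assumed to be one line of text.
    """
    # Normalize lightly so whitespace and case differences count as duplicates;
    # blank records are ignored.
    counts = Counter(r.strip().lower() for r in records if r.strip())
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        "unique_ratio": unique / total if total else 0.0,
        "most_repeated": counts.most_common(3),
    }


if __name__ == "__main__":
    sample = ["alpha", "beta", "Alpha ", "gamma", "beta", ""]
    print(uniqueness_report(sample))
```

A low `unique_ratio` suggests the sample will exercise caches and indexes unrealistically, so a different repository or a larger extract may be a better fit.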

199IT compiled from http://www.bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/

(Responsible editor: Mengyishan)
