Need a lot of data to test your app's performance? The easiest way to get it is to download sample data from free data repositories on the web. The biggest drawback of this approach is that the data rarely has unique content and may not produce the results you need. Here are more than 70 sites that offer large data repositories for free.
Wikipedia:Database: provides free copies of all available content to interested users. The data is available in multiple languages, and content can be downloaded together with images.
Common Crawl: builds and maintains an open crawl of the web that is accessible to everyone. The data is stored in an Amazon S3 bucket, and requesters may incur some cost to access it.
EDRM File Formats Data Set: consists of 381 folders in 200 file formats.
Apache Mahout: a top-level Apache project (TLP) that builds scalable machine learning algorithms. Mahout lists a number of free and paid corpora.
EDRM Enron Email Data Set v2: consists of Enron email messages and attachments, available as two sets of downloadable compressed files: XML and PST.
ClueWeb09: a dataset created to support research on information retrieval and related human language technologies. It contains about 1 billion web pages in 10 languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
DMOZ: the largest and most comprehensive human-edited open directory of the web. It collects links to many different types of websites and is a major data source for Internet search engines.
theinfo.org: a site for large datasets where academics, designers, artists, and others can trade tips and tricks, develop and share tools together, and begin to integrate their individual projects.
Project Gutenberg: offers more than 36,000 free e-books that can be downloaded to a PC, Kindle, Android, iOS, or other portable device.
Technologists song data set: Data related to tracks and artists
AWS (Amazon Web Services) Public Data Sets: provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications.
BigML: a big list of public data sources.
Bioassay Data: described in the research article "Virtual Screening of Bioassay Data" by Amanda Schierz, with 21 bioassay datasets (active/inactive compounds) available for download.
bitly 1.usa.gov Data: anonymized click data for government links.
Canada Open Data: A pilot project with many government and geospatial datasets
Causality Workbench: Data repository
Corral Big Data repository: Provides data-centric technology at the Texas Advanced Computing Center.
Data Source Handbook: a Guide
Datacatalogs.org: open government data from the United States, the European Union, Canada, CKAN, and others.
Data.gov.uk: publicly available data from the UK (see also London Datastore).
Data.gov/education: a central guide to education data resources, including high-value datasets, data visualization tools, classroom resources, applications built from open data, and more.
DataMarket: visualizes the world's economy, societies, nature, and industries, with 100 million time series from the United Nations, the World Bank, Eurostat, and other key data providers.
Datamob: open data put to good use.
DataSF.org: a clearinghouse of datasets available from the City & County of San Francisco, CA.
DataFerrett: a data mining tool for accessing and working with TheDataWeb, a collection of many online U.S. government datasets.
EconData: a large number of economic time series, produced by many U.S. government agencies.
Enron Email Dataset: data from approximately 150 users, most of whom were senior management at Enron.
Europeana Data: contains open metadata on 20 million texts, images, videos, and sounds gathered by Europeana from European digital libraries, a trusted and comprehensive resource for European cultural heritage content.
FedStats: a comprehensive source of U.S. statistics and more.
FIMI repository for frequent itemset mining: tools and datasets
Financial Data Finder at OSU: a large catalog of financial datasets.
GDELT: Global Data on Events, Location and Tone, described by the Guardian as "The history of Life, the universe, and everything."
GEO (Gene Expression Omnibus): a gene expression/molecular abundance repository that supports MIAME-compliant data submissions, and a curated online resource for browsing, querying, and retrieving gene expression data.
GeoDa Center: geographic and spatial data.
Google Ngrams datasets: text from millions of books scanned by Google.
Grain Market Research: financial data, including stocks, futures, etc.
Hilary Mason's research-quality big data sets: a collection of many text and image datasets.
HitCompanies Datasets: a random sample of 10,000 UK companies with comprehensive data, updated automatically using artificial intelligence/machine learning.
ICWSM-2009 Dataset: contains 44 million blog posts made between August 1 and October 1, 2008.
Infochimps: an open catalog and marketplace for data, where you can share, sell, or download data on any topic.
Investor Links: Contains property data
KDD Cup Center: all data, tasks, and results.
Kevin Chai's list of datasets: text, SNA, and other fields.
KONECT: the Koblenz Network Collection, with large network datasets of all types for research in the field of network mining.
Linking Open Data project: aims to make data freely available to everyone.
MIT Cancer Genomics gene expression datasets and publications: from the MIT Whitehead Center for Genome Research.
ML Data: data repository of the European Union PASCAL2 network.
NASDAQ Data Store: provides access to market data.
National Government Statistical Web Sites: data, reports, statistical yearbooks, press releases, and more from approximately 70 websites, including countries in Africa, Europe, Asia, and Latin America.
National Space Science Data Center (NSSDC): NASA datasets covering planetary exploration, space and solar physics, life sciences, astrophysics, and more.
Open Data Census: assesses the state of open data around the world.
OpenData from Socrata: Allows access to more than 10,000 datasets, including business, education, government, and entertainment
Open Source QSL: A large number of sports databases, including baseball, football, basketball and hockey
Peter Skomoroch's dataset bookmarks
PubGene(TM) Gene Database and Tools: a database of genome-related publications
Quandl, a collaboratively curated portal to millions of financial and economic time-series.
Qunb: A platform for discovering and visualizing data
Robert Shiller Data: housing, stock market, and other data from his book Irrational Exuberance
SMD: Stanford Microarray Database, which stores raw and normalized data from microarray experiments
Jerry Smith DataSet Collection: Finance, Government, machine learning, science and other data
SourceForge.net Research Data: historical and status statistics on the activity of approximately 100,000 projects and over 1 million registered users of the project management site.
Statlib, Carnegie Mellon University Data Archive
Statoo Datasets Part 1 and Statoo Datasets Part 2
Time Series Data Library
Visual Analytics Benchmark Repository.
UCI KDD Database Repository: Large datasets for machine learning and knowledge discovery research
UCI Machine Learning Repository.
UCR Time Series Data Archive: Provides datasets, papers, links, and code
United States annually according.
Wikiposit: A (virtual) fusion of data from many different websites (mostly financial), allowing users to merge data from different sources
Wolfram Alpha disease and marginalised level dat.
Yahoo Sandbox Datasets: language, graph, ratings, advertising and marketing, and competition data
Yelp Academic Dataset: data and reviews for the 250 businesses closest to each of 30 universities, for students and academics to explore and research
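A full dump from one of the repositories above is often far larger than a performance test needs. The following is a minimal sketch, not part of the original list, assuming the download is a line-oriented text file with one record per line. The function name `sample_lines` is hypothetical. It uses reservoir sampling (Algorithm R) to draw a fixed-size uniform sample in a single pass, without loading the whole file into memory.

```python
import random
from typing import Iterable, List, Optional


def sample_lines(lines: Iterable[str], k: int,
                 seed: Optional[int] = None) -> List[str]:
    """Return up to k items chosen uniformly at random in one pass.

    `lines` can be any iterable, for example an open file handle.
    `seed` makes the sample reproducible between test runs.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep item i with probability k / (i + 1) by picking a
            # slot in [0, i] and replacing only if it falls inside
            # the reservoir.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

To use it on a downloaded file, pass an open file handle, for example `sample_lines(open("dump.txt", encoding="utf-8"), k=1000, seed=42)`, where `dump.txt` is a placeholder for whatever file you downloaded. Fixing the seed keeps the test data identical from one run to the next.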
199IT compiled from http://www.bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/
(Responsible editor: Mengyishan)