A dataset (or data set) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Each value is known as a datum. Data sets can also consist of a collection of documents or files.
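To make the row-and-column idea concrete, here is a minimal Python sketch. The object names and measurements are made up for illustration, and a list of dictionaries is just one simple way to hold a small table in memory, not the only one.

```python
# Hypothetical example data: each dict is one record (a row),
# and each key is one variable (a column).
records = [
    {"name": "object_a", "height_cm": 12.0, "weight_kg": 1.5},
    {"name": "object_b", "height_cm": 30.5, "weight_kg": 4.2},
    {"name": "object_c", "height_cm": 7.25, "weight_kg": 0.8},
]


def column(rows, variable):
    """Return every value (datum) recorded for one variable."""
    return [row[variable] for row in rows]


variables = list(records[0].keys())
heights = column(records, "height_cm")

print(variables)
print(heights)
print(len(records), "records,", len(variables), "variables")
```

Reading across one dictionary gives you a single record; pulling the same key out of every dictionary gives you a single variable.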
In the open data discipline, a data set is the unit used to measure the information released in a public open data repository. The European Open Data portal aggregates more than half a million data sets. Some other issues (real-time data sources, non-relational data sets, etc.) make it harder to reach a consensus on how this unit should be defined.
How are datasets created?
Different datasets are created in different ways. In this post, you’ll find links to sources with all kinds of datasets. Some of them will be machine-generated data. Some will be data that’s been collected via surveys. Some may be data that’s recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs.
Whenever you’re working with a dataset, it’s important to consider: how was this dataset created? Where does the data come from? Don’t jump right into the analysis; take the time to first understand the data you are working with.
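One way to take that first look is to count the rows, list the columns, and check for missing values before doing anything else. The sketch below uses only Python's standard library. The sample data, its column names, and the `source` column recording how each row was collected are all hypothetical, inlined so the example runs on its own.

```python
import csv
import io
from collections import Counter

# Hypothetical export, inlined so the sketch is self-contained.
SAMPLE_CSV = """respondent_id,age,city,source
1,34,Boston,survey
2,,Chicago,survey
3,29,,web_scrape
4,41,Boston,survey
"""


def summarize(csv_text):
    """Return row count, column names, missing values per column,
    and how many rows came from each collection method."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # A cell counts as missing when it is empty or only whitespace.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    sources = Counter(row["source"] for row in rows)
    return {
        "rows": len(rows),
        "columns": columns,
        "missing": missing,
        "sources": dict(sources),
    }


summary = summarize(SAMPLE_CSV)
for key, value in summary.items():
    print(key, "->", value)
```

With a real file you would pass its contents instead of `SAMPLE_CSV`. The point is the habit: know how many records you have, which fields have gaps, and where the rows came from before you start analyzing.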
5 websites to find free, interesting datasets
1. FiveThirtyEight
FiveThirtyEight is an interactive news and sports site that has some incredible data visualizations (which you should totally check out). It makes a lot of its data open to the public, meaning you can download and play with the source data yourself!
Here are some examples:
Airline Safety — contains information on accidents from each airline
US Weather History — historical weather data for the US.
Study Drugs — data on who's taking Adderall in the US.
2. BuzzFeed News
BuzzFeed makes the data sets, analysis, libraries, tools, and guides used in its articles available on GitHub. Check them out to learn from some of the best!
Here are some examples:
Federal Surveillance Planes — contains data on planes used for domestic surveillance.
Zika Virus — data about the geography of the Zika virus outbreak.
Firearm background checks — data on background checks of people attempting to buy firearms.
3. Kaggle
Kaggle, recently acquired by Google, is a place where you can learn, practice, and fine-tune your data science and analytics skills. It has tons of data that’s open to the public, and it lets users of the platform share code so you can learn best practices within the data space. It also hosts competitions where you can win real money if you have a top-ranking model!
4. Socrata
Socrata hosts cleaned, open data sources spanning government, business, and education.
Here are some examples:
White House staff salaries — data on what each White House staffer made in 2010.
Radiation Analysis — data on which milk products in which US locations were radioactive.
Workplace fatalities by US state — the number of workplace deaths across the US.
5. Awesome-Public-Datasets on GitHub
This GitHub repository hosts a library of awesome, public datasets! They are all sorted by category, and each entry links you straight to the hosting website.
Here are some examples:
Global Climate Data — climate information for every country in the world, with historical data that in some cases dates back to 1929.
Heart rate time series data — two data series, each containing 1800 evenly spaced measurements of instantaneous heart rate from a single subject.
Plane crash database — plane crash data dating from 1929 to now.