7 Free of Charge Online Data Sets Every Data Scientist Should Know

“Practice makes perfect”: this statement is true for virtually any occupation. And what can data scientists practice on? Right – on various datasets. There are quite a few things that an aspiring professional or even a curious enthusiast can do with them to improve their skills: 

  • Visualization;
  • Cleaning;
  • Machine learning. 

The more you do it, the better you become at it. Of course, it would be a waste to spend money on practice when there is so much free data available on the Internet. But time is always short, too, so you hardly want to spend hours searching for the ideal dataset for your purpose.

Don’t worry, you won’t have to. We’ve done the job for you and found some of the best free online datasets covering a wide range of topics. Enjoy! And if you need to free up some time for your science projects, hand your writing tasks to professionals at https://essaypro.com. Don’t bite off more than you can chew: outsource!

1.Data.gov

This is the U.S. Government’s open data hub and the biggest U.S. database. The site is user-friendly: you can browse via categories or search for keywords. All the information stored here is public and free of charge. The vast majority of datasets here are submitted by the Federal Government’s departments.

The database covers 14 topics, including finance, climate, education, health, and science and research. You can see highlights on each topic, download the corresponding data in different formats, or read data-driven articles.

The website contains lots of information, from government budgets to school performance scores. That’s why the data may require additional research and cleaning. So, if you’re looking for free datasets to start a data cleaning project with, this is one of the best choices.

2.World Bank Open Data

The World Bank provides free and open access to global data on its website. You can learn loads from this database, including the forest area and how it has changed, the percentage of firms with female participation in ownership, and the literacy rate among the world’s adult population. Information and figures are available on a total of 10 topics, but some of the statistics may not be quite up to date.

You can view the site in 5 languages, including Spanish and Chinese. The main page is easy to navigate – here, too, you can search for keywords or browse the datasets using hyperlinks. Sometimes the sets have missing values, which is great for cleaning practice.

Besides the databases, there are interesting visual pieces, and some data-driven articles, too. This makes the World Bank site a useful resource not only for scientists but also for journalists and researchers.
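The missing values mentioned above make for a simple first cleaning exercise. Here is a minimal Python sketch, using only the standard library, that counts the gaps in one column and fills them. The sample rows, the column names (`country`, `year`, `literacy_rate`), and the decision to fill gaps with the column mean are all assumptions made for illustration; they are not taken from the World Bank site.

```python
import csv
import io
from statistics import mean

# Hypothetical sample in CSV form; the values are made up for illustration.
SAMPLE_CSV = """country,year,literacy_rate
A,2015,80.0
B,2015,
C,2015,90.0
D,2015,
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, turning empty cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]

def count_missing(rows, column):
    """Count how many rows have no value in the given column."""
    return sum(1 for row in rows if row[column] is None)

def fill_with_mean(rows, column):
    """Return new rows where gaps in `column` are replaced by the mean of the known values."""
    known = [float(row[column]) for row in rows if row[column] is not None]
    column_mean = mean(known)
    return [
        {**row, column: float(row[column]) if row[column] is not None else column_mean}
        for row in rows
    ]

rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows, "literacy_rate")
cleaned = fill_with_mean(rows, "literacy_rate")
print(missing)                                     # 2
print([row["literacy_rate"] for row in cleaned])   # [80.0, 85.0, 90.0, 85.0]
```

Filling with the mean is only one option. Depending on the project, dropping incomplete rows or leaving the gaps marked may be the better choice.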

3.U.S. Bureau of Labor Statistics

Bls.gov is another big source of hard data from the U.S. Government. Though this website is not quite as easy to navigate as data.gov, it is a gold mine of information on labor-related subjects.

You can look at how the Consumer Price Index changes, what the national employment rates are, what the wages are in different areas, how Americans spend their time, and much more. There are charts, articles, tables, and a column with the latest numbers on the chosen subject.

The problem is, you’ll have to spend quite some time filtering to get through to the info that you’re searching for. However, it gets easier after you get used to the website’s interface. 
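Once you have found a series such as the Consumer Price Index, a typical first exercise is to work out how much it changes from one period to the next. The Python sketch below shows one way to do that. The index values are invented for illustration and are not actual BLS figures, and `percent_changes` is a hypothetical helper name.

```python
def percent_changes(values):
    """Return the percentage change between each pair of consecutive values.

    Assumes no value in the series is zero, since each one is used as a divisor.
    """
    return [
        round((current - previous) / previous * 100, 2)
        for previous, current in zip(values, values[1:])
    ]

# Hypothetical index levels for four consecutive periods (not real BLS data).
index_levels = [100.0, 102.0, 105.06, 105.06]
changes = percent_changes(index_levels)
print(changes)  # [2.0, 3.0, 0.0]
```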

4.NSSDCA – NASA Data Archive

The NASA Space Science Data Coordinated Archive is the permanent archive for NASA space science mission data. There are terabytes of accessible information, which come from more than 500 space science spacecraft.

The NASA archive is perfect for researchers and enthusiasts of lunar and planetary science, astronomy and astrophysics, solar and space plasma physics. Some of the data here are free of charge, but some can be obtained by request only and require payment.

NSSDCA is also a great place to look for information on NASA flight missions, Earth, lunar and planetary research. There’s also a photo gallery with mesmerizing high-resolution pictures from various space missions. You can search the Master catalog if you know exactly what you’re looking for, or explore the website using hyperlinks.

5.Yelp Dataset 

Yelp.com, an official website of Yelp, Inc., is a crowd-sourced local business review and social networking site. You can do many things via Yelp – write a review of a place you visited, book a hotel, check some info about a nearby restaurant or just have a nice chat.

What’s interesting for IT professionals and enthusiasts is the dataset section of this platform. The repository contains over 6.5 million reviews, nearly 200,000 pictures and businesses, plus info about 10 metropolitan areas. This information can be used without any fee for educational, academic, and personal projects.

Everything in this repository is free of charge, but you need to fill in some personal info (including a valid e-mail address) and agree to the Dataset License.

There’s also the Yelp Dataset Challenge section here. This is a chance for students to not only gain the much-needed practice but to win some money, too. 

6.UNICEF Data

UNICEF is the United Nations Children’s Fund, a U.N. agency providing aid to children all over the world. It has its own database, which is the largest one about children, nutrition, education, and gender equality. UNICEF works in over 190 countries and territories, collecting and storing statistics from all of them.

You can search the directory by topic or by country directly from the main page. There are 15 topics, including child and adolescent health, early childhood development, child nutrition, and climate change.

All the information is up-to-date and well-structured. There’s a detailed overview of every topic, precise and colorful visualization, links to publications and journal articles, plus some notes.

There’s also a beta version of the UNICEF Data Warehouse available on the site, where you can search the repository by entering keywords and then filtering the results. Streaming data is available on UNICEF Data’s Twitter account, too.

7.FiveThirtyEight

FiveThirtyEight is one of the most well-known, best-established data journalism outlets in the world. There you can find professional data-driven articles covering a wide range of topics: politics, sports, science, health, and culture. You can not only read their stories but also watch and listen to them on their YouTube channel or via podcasts.

But the most valuable thing for scientists and researchers here is that FiveThirtyEight makes the information used in their articles available to the public. One can access it directly from the main page or via GitHub. All of this can be obtained free of charge.
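CSV files published on GitHub can usually be read straight from a URL. The sketch below is a minimal example using only the Python standard library. The URL in the usage comment is a placeholder rather than a real FiveThirtyEight file path, and the function names `parse_csv` and `fetch_csv` are hypothetical.

```python
import csv
import io
from urllib.request import urlopen

def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_csv(url):
    """Download a CSV file over HTTP(S) and parse it (assumes UTF-8 text)."""
    with urlopen(url) as response:
        return parse_csv(response.read().decode("utf-8"))

# Offline demonstration with made-up rows, so the example runs without a network.
demo = parse_csv("team,wins\nX,10\nY,7\n")
print(len(demo), demo[0]["team"], demo[1]["wins"])  # 2 X 7

# Usage with a real file (placeholder URL; substitute the raw link of the dataset you want):
# rows = fetch_csv("https://example.com/path/to/dataset.csv")
```

Note that every value comes back as a string, so numeric columns need an explicit conversion before any analysis.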

Are there more?

Sure. There are more websites that are well known among data geeks. But the free datasets listed above are mostly official and well curated, so you can be sure that you are using a credible source. That matters unless you want to practice fact-checking as well. The information stored in those seven repositories is more than enough for practice and for building a solid portfolio.

 
