In this blog post, we will explore big data sets that are publicly available and look at the challenges involved in making use of them. We are doing this to test a near-real-time big data infrastructure and its code, and also to explore the challenges around capacity planning and infrastructure setup.
GitHub hosts four data sets under Big Open Data. The data sets are the following.
1,000 Genomes – 260 terabytes
This data set provides a resource on human genetic variation, built by sequencing the genomes of a large number of people. Since the data is very large, there are challenges in downloading it, as well as in understanding and retrieving specific parts of it.
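One practical way to handle a download of this size is to fetch it in resumable byte ranges, assuming the hosting server supports HTTP Range requests. The helper below is a minimal sketch of how such ranges could be computed; the chunk size is an illustrative assumption, not something prescribed by the data set's documentation.

```python
def byte_ranges(total_size, chunk_size):
    """Yield (start, end) byte offsets, inclusive, covering total_size bytes.

    Each pair can be turned into a header like 'Range: bytes=start-end',
    so an interrupted download can resume from the last completed chunk
    instead of restarting from scratch.
    """
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)
        start = end + 1

# Example: split a 10-byte file into 4-byte ranges.
print(list(byte_ranges(10, 4)))  # → [(0, 3), (4, 7), (8, 9)]
```

Downloading chunk by chunk also makes it possible to retrieve only the parts of the data set that are actually needed for a given test.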
Tiny Images Data Set – approximately 300 GB
This data set consists of 80 million images stored in the form of large binary files. It also comes with metadata, which can be used for retrieving the data, and MATLAB functions for operating on the data set.
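Because the images are packed into large binary files as fixed-size records, an individual image can be read without loading the whole file by seeking to a fixed offset. The sketch below assumes 32×32 RGB images stored back to back as 3,072-byte records (32 × 32 × 3 bytes), which is how this data set is commonly described; the file path is illustrative.

```python
IMAGE_BYTES = 32 * 32 * 3  # one 32x32 RGB image = 3,072 raw bytes

def read_image(path, index):
    """Read the raw bytes of the image at `index` from a packed binary file.

    Images are assumed to be fixed-size records laid out contiguously,
    so image i starts at byte offset i * IMAGE_BYTES.
    """
    with open(path, "rb") as f:
        f.seek(index * IMAGE_BYTES)
        data = f.read(IMAGE_BYTES)
    if len(data) != IMAGE_BYTES:
        raise IndexError(f"image {index} is out of range")
    return data
```

In practice, the accompanying metadata would map keywords to image indices, and that index is what gets passed to a reader like this one.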
BioTorrent
This data set consists of 20 data sets pertaining to bio research. It is shared by around 1,850 scientists who are part of BioTorrent, a platform that allows sharing scientific data over BitTorrent.
Measurement Lab – 747 terabytes
This data set provides internet performance data from across the world.
Since these data sets are open source, there is minimal or no documentation. We will need to spend a significant amount of time understanding the data before we use it for testing our big data solution.
You can find more details and the download links here.