Week 4: Thoughts on Open Data

It is not always easy to open a data file (Source: xkcd)

Last semester I took a machine learning course, and one of our homework assignments was about using NYC Open Data to find relationships between various factors. For example, my team worked on “Predicting Taxi Ridership and Hottest Pickup Zones in January in New York City Based on Weather Data”. (If you are interested in reading the paper, you can find it here.) We trained our machine learning models using the open data provided by the City of New York. There is a generous amount of taxi data (about 6 GB) available on NYC Open Data, and it is all free to use. Our project wouldn’t have been possible without this open data set. Of course, one might question why this data is available free of charge. I think the city’s motivation is to foster innovation and produce beneficial products and projects for New York City. That is the power of open data and open source. This week’s presentation by Professor Deena Engel and Vicky Steeves emphasized this and many other aspects of open data.

Open data comes in many different formats, like CSV, TXT, XML, TSV, and others. Professor Deena Engel explained that different formats are useful for different purposes. Some data is more structured than other data, and these kinds of data might require a particular format. For example, the NYC taxi data we used in our project was in a CSV file, and there was a fixed number of columns because all taxi rides follow a similar pattern. She mentioned that XML is more useful for research in the humanities because each data point might have a different shape. I didn’t know about this distinction, and it will definitely be useful for me in the future. Later, they mentioned different tools for formatting data and presented a few GUI-based tools. I personally prefer to use Python, specifically the pandas library, to deal with data, and most of the time it is faster than GUI-based tools. There are a lot of built-in functions, and they are so well optimized that processing the data takes significantly less time than with my handwritten functions.
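To give a rough idea of what I mean, here is a minimal sketch of the kind of pandas workflow we used. The file name and column names are assumptions for illustration; the actual NYC Open Data export may label things differently.

```python
import pandas as pd

# Load one month of taxi trip records from a CSV export.
# File name and column names are assumed for this example.
trips = pd.read_csv(
    "yellow_tripdata_2020-01.csv",
    parse_dates=["tpep_pickup_datetime"],
)

# Count pickups per zone with a single optimized built-in call,
# instead of looping over millions of rows by hand.
pickups_per_zone = (
    trips.groupby("PULocationID")
    .size()
    .sort_values(ascending=False)
)

print(pickups_per_zone.head(10))  # the ten busiest pickup zones
```

A few lines like these replace what would otherwise be a slow, error-prone manual pass over a multi-gigabyte file.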

Data, data everywhere

Data, data everywhere meme (Source)

After Professor Deena Engel, Vicky Steeves talked about finding and evaluating open data. As I mentioned above, there is a lot of data on the internet, and not all of it is suitable for conducting research. As Vicky Steeves explained, data without metadata is not useful for research. Yes, we might have an extensive data set, but if we don’t know how it was populated, how can we trust it? Let’s assume we have found a good data set with extensive metadata; what about the license? Can we use this data set in our research? Do we need to pay? If there is no license attached to the data set, there is not much we can legally do with it. It is similar to open source software: a license needs to be attached to the data set or the software so that we can build something from it. Before listening to this talk, I wasn’t aware of the licensing issues related to open data. Now I know that licensing is just as important for open data as it is for open source software.

Written before or on February 23, 2020