Week 4 - Open Data as a catalyst towards Change
This past week, we had guest speakers Deena Engel and Vicky Steeves talk to our class about open data. From how data is organized to where to look for open data sets, each one of us became equipped with valuable knowledge that we’ll use in future data-driven projects. Now, what exactly is open data? How is it any different than proprietary data?
Proprietary data is owned by a private business or organization and is collected through their means. On the other hand, open data is open source; It can be freely used, modified, and redistributed by anyone. Most importantly, it encourages transparency. With open data initiatives from the Government, Cities, and Corporations, individuals are able to vocalize issues and improve the quality of life of citizens.
Before diving into data, we have to understand how data is formatted, the different ways we can manipulate data, and the importance of cleaning data. During the first part of class, Professor Engel highlighted the different tools and issues that arise when working with data.
Data can be formatted in several ways depending on one’s needs. According to Professor Engel, one can organize data in a CSV file, a TSV file, an XML file, or in JSON format. Although I’ve worked with data before, I’ve never heard of an XML file. XML (or Extensible Markup language) has an uncanny resemblance to HTML Code but one can actually customize the tags the data is enclosed in. Apparently it’s an old-school method, but people still use it today. One format I’m familiar with is JSON. Professor Engel emphasized that JSON is not a database, which many people still fall for. Instead, JSON is a way to format data and for us computer scientists, the key-value pairs makes it easier to parse through.
Now that we know the different ways data can be formatted, we can go over the technologies that can be used to clean data. First off, data is messy. Prior to analyzing data, you have to make sure to clean it and format it in a way that works for you. For instance, one of the challenges Vicky faced when working with data is date formatting. In Vicky’s case, dates can either be numerical (1/01/2020) or they can be worded like so-“January 1, 2020.” Therefore, if you write a script that parses through your data set, you have to make sure that edge cases are accounted for. If you don’t, then most likely, you are losing important data without knowing. Another data scheme that can skew your results is UTF-8 Encoding. Professor Engel mentioned that some data sets may use special characters that have to be encoded for proper use. If cleaning data is new to an individual, Professor Engel highly praised the application OpenRefine-an Open Source data tool that cleans data with ease. However, if you have programming knowledge, it’d be more beneficial to just clean it yourself!
Finding Data sets
During Vicky’s portion of the talk, Vicky went over different resources to find open data sets. One of the sources she’s mentioned is NYC Open Data, which has a collection of data sets on everything NYC-related. Another resource she discussed was Kaggle. Kaggle allows you to search for data you want, and it actually resembles NYU’s advanced search for articles. Except, Kaggle has a much cleaner interface. Nevertheless, searching for data with these resources is easy. But, you have to make sure to always cite your sources.
Future Project with Open Data
I found this Open Data talk with Deena and Vicky beneficial, especially since I’ve wanted to work with open data for so long. As a NYC native, I’ve wanted to unravel trends that’s been going on in the city. I discovered NYC Open Data almost a year ago, but I’ve had trouble finding data that would answer the changes within my neighborhood. Now that I know of other resources where I can find data and the technologies I can use to analyze data, I will start planning this personal project and refine my question that the data can answer.