Week 4 -- Open Data Talk with Deena Engel and Vicky Steeves
This past Wednesday, Deena Engel and Vicky Steeves came to our class to give a presentation on data processing and licensing. The presentation revolved around their own experiences with open data. I found the talk very interesting because I was a data science intern this past summer. It exposed me to a lot of new technology stacks and several public data resources. I will definitely end up using, or at the very least looking into, almost all of the resources mentioned in the presentation.
The resources that specifically piqued my interest were Kaggle, NYC Open Data, and OpenRefine. As Steeves mentioned, a lot of data in the world is wasted because of licensing disputes. This is probably because obtaining data requires a large investment of resources, so many companies choose to keep their data private. With Kaggle and NYC Open Data, however, one can easily find an open data set to work with. Even so, one should always read the license associated with a data set and check for any usage restrictions. OpenRefine was another interesting resource because it’s a great tool for dealing with messy data (null values, incorrect date formats, etc.). It could be very useful when I deal with messy data sets in the future because, in the real world, data sets often aren’t as organized as they should be.
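To make the kind of cleanup mentioned above (null values, incorrect date formats) a bit more concrete, here is a minimal Python sketch. It is not OpenRefine and was not shown in the talk; it only illustrates the same idea with the standard library. The list of null markers, the date formats, and the function names are all assumptions made up for illustration, and a real data set would likely need different ones.

```python
from datetime import datetime

# Assumed for illustration: strings that should be treated as missing values.
NULL_MARKERS = {"", "null", "n/a", "na", "none"}

# Assumed for illustration: date formats we expect to see in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def clean_value(value):
    """Return None for null-like input, otherwise the stripped string."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in NULL_MARKERS else text


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = clean_value(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    return None


if __name__ == "__main__":
    raw_dates = ["2021-03-15", "03/15/2021", " N/A ", "not a date"]
    for raw in raw_dates:
        print(repr(raw), "->", normalize_date(raw))
```

Values that cannot be parsed are returned as None rather than guessed at, so they can be reviewed by hand later instead of silently becoming wrong dates.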
Overall, the talk was very useful and informative. I’m looking forward to seeing the slides because I couldn’t write down all the links and other resources during the presentation, so having the speakers share their slides with our professor to redistribute will be very helpful. The one thing I wish the talk had covered in more depth is the common mistakes associated with the featured tools. As the speakers mentioned, choosing the right tools is not as important as having the right data. For that reason, I would have liked to see a general guideline or procedure the speakers follow when they collect data, since they’re far more experienced than I am.