The World of Open Data
Wednesday’s Open Data Lecture
In Wednesday’s lecture, we had two wonderful guest speakers, Professor Deena Engel and Vicky Steeves, who gave an overview of what open data is and how to use it. Both Engel and Steeves were clearly very knowledgeable about open data and were effective at demonstrating the uses of proprietary data. Going into this lecture, I was not very familiar with what open data is or with the protocols for using open data and databases. Before taking this Open Source Software Development course, I had no experience with open source projects or open data and had very limited knowledge of either subject.
What I Learned About Open Data
Professor Engel went over the many different types of data, ways of processing data, and the tools that are needed. One tool I had heard of but never used was MongoDB. I learned that some languages are better suited to handling specific data sets. Professor Engel also mentioned OpenRefine, which is used for cleaning data. I had heard the term “clean data” before but never actively thought about what it meant. Data cleaning, as I learned, is the process of refining data by detecting and then correcting anything that may be wrong with the data set, such as formatting errors and incorrect delimiters. OpenRefine is used to reformat and transform data. I believe she mentioned that OpenRefine is like a spreadsheet application that acts as a database.
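To make the idea of data cleaning more concrete, here is a minimal sketch in Python. It is not how OpenRefine works internally and it was not shown in the lecture. The data set, its column names, and the `clean_rows` helper are all made up for illustration. The sketch assumes a small export that uses semicolons as delimiters and contains stray spaces, inconsistent capitalization, and a missing value.

```python
import csv
import io

# Hypothetical raw export (invented for this example): semicolon
# delimiters, stray spaces, inconsistent capitalization, and a
# missing visitor count in the last row.
RAW = """name;borough;visitors
 Central Library ;manhattan;1200
North Branch;MANHATTAN; 950
East Branch;brooklyn;
"""


def clean_rows(text, delimiter=";"):
    """Parse delimited text and return a list of cleaned dict rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    cleaned = []
    for row in reader:
        # Remove leading and trailing whitespace from every field.
        name = row["name"].strip()
        # Normalize capitalization so "manhattan" and "MANHATTAN" match.
        borough = row["borough"].strip().title()
        visitors_text = row["visitors"].strip()
        # A missing count becomes None instead of a guessed number.
        visitors = int(visitors_text) if visitors_text else None
        cleaned.append({"name": name, "borough": borough, "visitors": visitors})
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Each step maps to one kind of problem described above: the `delimiter` argument handles the unusual delimiter, and the `strip` and `title` calls correct formatting errors. Leaving the missing count as `None` keeps the gap visible instead of hiding it behind an invented value.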
Vicky Steeves went over the many resources available for finding data sets and how to use them properly. She mentioned several sources where we can get open data sets. One example she gave was a GitHub repository named awesome-public-datasets, which contains many data sets on a wide range of subjects. GitHub was the only source she mentioned that I was really familiar with. After the lecture, I went to the repository to see the different data sets and what they contained. I looked through the government data sets. There were over 100 data sets categorized by their respective countries. I think it is very useful to have a collection of data sets on GitHub because many people can contribute useful data sets to one location for others to use. Another data set source Ms. Steeves mentioned is Kaggle, which I had never heard of before. After class, I researched Kaggle further and learned that it is an online community of data scientists and machine learning practitioners. Kaggle allows users to use and publish data sets. It also hosts competitions, which I found really interesting.
Takeaways
Wednesday’s presentations were really helpful in introducing me to the world of open data. I learned a lot of useful things that will help me both in this course and in my professional life. An important takeaway from the presentations was that open data is not used exclusively by computer scientists. Working on projects is a collaborative experience that involves people from multiple disciplines.