Week 4 - Open Data Talk with Deena Engel and Vicky Steeves

This past week, we were lucky enough to have two guest speakers, both experts in the field, come in to talk about open data. Prior to this talk, I had some experience working with data: I took an online course where I learned some MongoDB, did some web scraping with Python and SQLite, and took the Intro to Data Science course here at NYU, where we used pandas. So, going into this talk, I was excited to learn more about the tools these two experts use in their fields and the processes they use to analyze and work with data.

Some things that I found interesting

During the presentation, Professor Engel mostly focused on the technical side of data analysis, such as programming languages, databases, and data formats. One of the interesting tools that Professor Engel brought up was OpenRefine. OpenRefine is an open source (!!!) tool for dealing with messy data. By messy data, they meant things like null values, incorrect date formats, and general formatting errors that make a data set hard to analyze. They described how you could use OpenRefine if you do not know much about cleaning messy data in code, so I think it will be a useful tool in the future if I need to get a data set ready for analysis quickly and simply.

Going off of the concept of messy data, I was really surprised by how important dealing with it is in the world of data. In the homework assignments for Intro to Data Science, the data sets were usually very easy to work with. However, after this talk, I gained new insight into how dealing with messy data is basically half the battle. It takes a while to learn how to prepare a data set so that you can start analyzing it effectively.
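To make the idea of messy data concrete, here is a minimal sketch of what handling null values and inconsistent date formats might look like in pandas. This is my own illustration, not something shown in the talk: the column names, the sample values, and the helper names `parse_date` and `clean_messy` are all made up.

```python
from datetime import datetime

import pandas as pd

# Hypothetical messy data set; columns and values are invented for illustration.
raw = pd.DataFrame({
    "name": ["  alice", "Bob ", None, "dana"],
    "signup_date": ["2020-02-24", "02/24/2020", "not a date", None],
    "score": ["10", "n/a", "7", "3"],
})


def parse_date(value):
    """Try a couple of known date formats; return NaT if none of them match."""
    if pd.isna(value):
        return pd.NaT
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return pd.NaT


def clean_messy(df):
    """Return a cleaned copy: trimmed names, parsed dates, numeric scores."""
    out = df.copy()
    # Strip stray whitespace and normalize capitalization.
    out["name"] = out["name"].str.strip().str.title()
    # Parse each date individually, since the formats are mixed.
    out["signup_date"] = pd.to_datetime(out["signup_date"].apply(parse_date))
    # Turn anything that is not a number (like "n/a") into NaN.
    out["score"] = pd.to_numeric(out["score"], errors="coerce")
    return out


clean = clean_messy(raw)
print(clean)
```

The choice here is to keep bad values as missing (NaN or NaT) rather than drop the rows right away, so you can decide afterwards whether to fill them in or remove them.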

During Ms. Steeves’ half of the presentation, she mostly focused on how to use open data. She introduced us to many different websites that provide open data sets, such as Kaggle and NYC Open Data. Upon further research, I found that Kaggle has courses and competitions, which would be great for getting more practical experience with open data! Additionally, I was shocked to learn about the different licenses that these data sets have. Going into this Open Source course, I never really had any experience paying attention to licenses and citing my sources (it’s a bad habit), so licensing is one of the big things I have learned from this course. This applies to the talk too: I hadn’t even thought about the licenses associated with these data sets.

Conclusion

I really enjoyed the talk, especially because fields related to data are growing so much now! One question I had going into the talk was the difference between SQL and NoSQL. Professor Engel very clearly explained that you would probably want to use SQL when having a structured database is very important, in fields like medicine, while you would use NoSQL for things like social media sites, where a record missing a field is not a big deal. As much as I enjoyed the presentation, I wish that we had more time for the technical aspects of data analysis, because there was so much to cover and so little time that it was impossible to elaborate on all of it.
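To make the SQL vs. NoSQL distinction a bit more concrete, here is a rough sketch of my own (not from the talk). It uses Python's built-in sqlite3 module for the SQL side and a plain list of dictionaries as a stand-in for a document store like MongoDB. The table, the field names, and the sample records are all hypothetical.

```python
import sqlite3

# SQL side: the schema says every patient record MUST have a blood type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "blood_type TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO patients (name, blood_type) VALUES (?, ?)",
    ("Alice", "O+"),
)

# A record with a missing required field is rejected by the database.
try:
    conn.execute("INSERT INTO patients (name) VALUES (?)", ("Bob",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

patient_count = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]

# NoSQL-style side: documents in the same collection can have different fields.
# A list of dicts stands in for a real document database here.
posts = [
    {"user": "alice", "text": "hello", "location": "NYC"},
    {"user": "bob", "text": "no location on this one"},
]

# Reading a field that may be absent just yields None instead of an error.
locations = [post.get("location") for post in posts]

print(rejected, patient_count, locations)
```

The structured table refuses the incomplete record, which is what you would want for something like medical data, while the document-style collection happily stores a post without a location.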

Written before or on February 24, 2020