(Week 4) Post-talk thoughts on Open Data
I found the speakers interesting, especially Vicky’s insights into data engineering and Deena’s opinions on some of the software that data engineers (Python) and archivists (OpenRefine) use. I had Deena my freshman year, three years ago, when I took intro to programming, so it was nice to hear her give a talk. Vicky had some insightful ideas on what it means to be a data engineer, where a statistician’s work begins, and where a data engineer’s ends.
I had never considered that missing data can be information in itself, not just a mistake. Vicky gave an interesting example of a dataset where the missing pieces were what the researchers were actually looking for, so removing the blanks would have reduced the information available.
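A toy sketch of what I took that to mean (the data and column names here are made up, not from the talk): if you drop the blank rows during cleaning, you throw away the very pattern someone might be studying.

```python
# Toy example: a blank response can itself be the signal.
import pandas as pd

responses = pd.DataFrame({
    "participant": ["a", "b", "c", "d"],
    "income": [52000, None, 61000, None],  # None = declined to answer
})

# Dropping the blanks discards the pattern of who declined to report income.
cleaned = responses.dropna()
print(cleaned)

# Keeping them lets you study the non-response itself.
responses["declined"] = responses["income"].isna()
print(responses)
```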
I hadn’t realized how many difficulties there are in actually acquiring and sifting through data. Now that I think about it, it makes sense that the hardest part of a project with lots of data is cleaning it, although this makes me skeptical about how well the process scales if a human has to manage it first. How can a dataset be useful if it was collected differently at different times? For me, the talk was interesting because every question answered raised a few more: What toolset would you choose for infographics versus deep analysis? What kinds of use cases are there for large data projects? Are most projects you work on non-profit or for-profit? Why is it that data you can cite and use freely for academic purposes is restricted for commercial use? And if I analyze and act on the data in a commercial context purely for research, does that still infringe on the intellectual property of the author? I now understand why public domain is trickier than strictly open-source.
After someone asked why a person would use BeautifulSoup versus Scrapy, I did some research. It seems that Scrapy is a more full-fledged web scraper with far more functionality than BeautifulSoup, although it would definitely be overkill for simpler projects or plain HTML. While researching, I also discovered a library called Selenium, which accomplishes similar things but focuses more on application testing and browser automation.
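To make the comparison concrete, here is a minimal sketch of the BeautifulSoup side (the URL is just a placeholder): requests fetches one page, BeautifulSoup parses it, and you pull out what you need with a few selector calls. Scrapy wraps the same basic idea in a whole crawling framework, with spiders, request scheduling, and item pipelines, which is why it feels like overkill for a single static page.

```python
# Minimal BeautifulSoup sketch: fetch one page and list its links.
# The URL is a placeholder; requests + bs4 is the usual pairing.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the destination and text of every link on the page.
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))
```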