The Case for Web Archives (in research)

This is a proposal I submitted on behalf of the Documenting the Now project for a workshop about APIs and research being held after the 2023 Association of Internet Researchers in Philadelphia. I’m obviously not the first person to beat the drum of using web archives in research–and hopefully I won’t be the last. I think this “post-API” moment is a good time to continue productive conversations about how we can make things easier to create and use web archives at different scales and support methods for exploring and answering research questions using them. Clearly more could be said but the proposal had to be limited to one page. Thanks for some initial feedback from Bergis Jules and Laura Wrubel.

Since 2014 the Documenting the Now project has worked to develop tools and cultivate a community of practice for collecting social media. It began as a collective memory project amidst the protests following the murder of Mike Brown in Ferguson. In the following years the project hosted workshops, a Slack forum for discussing how to collect social media data, and developed several specialized tools specifically for Twitter, including twarc.

According to Google Scholar twarc has been cited over 500 times as a tool for collecting and hydrating Twitter datasets. It has been starred 1,300 times and forked 250 times on GitHub as it has been made part of other software tools such as the Social Feed Manager. Numerous guides have been independently prepared by data scientists, Software Carpentry and even Twitter themselves as part of their Twitter Research outreach.

After the acquisition of Twitter by Elon Musk, the changes to the Twitter’s API have resulted in the complete dismantling of DocNow’s tools, including twarc. New API rules and prices are designed for commercial use, and effectively terminate academic research and independent study of the platform. Twitter has revoked the keys for many twarc users, as well as the Hydrator and the instance of the DocNow tool that we have been running at community.docnow.io.

The truth is that the writing has been on the wall for access to corporate social media platform APIs for some time (Bruns, 2019; Freelon, 2018). Twitter and Reddit have been exceptions to the norm when it comes to access to data. Scraping practices are commonplace, but are fraught with fragility due to interface changes, lack of documentation, provenance and reproducibility difficulties. While social media APIs have typically been layered on top of the web, their key management, authentication, URL routes, and data representations have required the development of specialized tools which present significant maintenance challenges.

With these issues, and the shuttering of API access except for the most powerful, we believe that the social media and web research community needs to embrace the use of general purpose web archiving technology and practices to preserve and study social media platforms and the web.

Service providers such as the Internet Archive’s Wayback Machine, and open source tools such as those provided by the Webrecorder project need support from the research community. Tools for analyzing web archive data such as those by the Archives Unleashed Project (now ARCH), and specialized methodological approaches to using WARC data (Ogden, Summers, & Walker, 2023; Proferes & Summers, 2019; Maemura:2023?) need further development and refinement. Expertise in using WARC data needs to become more commonplace through curriculum development, training and community workshops.

References

Bruns, A. (2019). After the `APIcalypse’: social media platforms and their fight against critical scholarly research. Information, Communication & Society. Retrieved from https://www.tandfonline.com/doi/abs/10.1080/1369118X.2019.1637447

Freelon, D. (2018). Computational research in the post-API age. Political Communication. Retrieved from https://www.tandfonline.com/doi/abs/10.1080/10584609.2018.1477506

Ogden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. Convergence: The International Journal of Research into New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759

Proferes, N., & Summers, E. (2019). Algorithms and agenda-setting in Wikileaks’ #Podestaemails release. Information, Communication & Society, 22(11), 1630–1645. https://doi.org/10.1080/1369118X.2019.1626469

References

👋 Webmentions ✍️ 🤩