Providing Web Archive News Articles as Corpus Data

Bibliographic Details
Main Authors: Jon Carlstedt Tønnessen, Magnus Breder Birkenes
Format: Article
Language: English
Published: Ubiquity Press 2025-01-01
Series: Journal of Open Humanities Data
Online Access: https://account.openhumanitiesdata.metajnl.com/index.php/up-j-johd/article/view/281
Description
Summary: While the huge data repositories of web archives carry great potential for knowledge production in academia, researchers have described significant challenges when trying to access and make use of web archives in research. This article describes the creation of a “Web News Collection” where content from the National Library of Norway’s web archive has been made available for computational text analysis, in a manner that facilitates access for research and beyond, aligning with FAIR principles while also accounting for copyright restrictions. Developing the warc2corpus pipeline, we detail the processes for extracting natural language from WARC files, curating content, and enhancing metadata for analytical purposes. This structured collection, consisting of 1.5 million news articles accessible via a REST API, enables distant reading of news from the web, with tools for building corpora and computing word frequencies and collocations. To support usage, both programming interfaces and user-friendly web apps are offered, representing a significant step forward in making web archives usable and valuable for digital scholars.
ISSN: 2059-481X