Providing Web Archive News Articles as Corpus Data

Bibliographic Details
Main Authors: Jon Carlstedt Tønnessen, Magnus Breder Birkenes
Format: Article
Language: English
Published: Ubiquity Press 2025-01-01
Series: Journal of Open Humanities Data
Subjects: web archives; digital text analysis; warc; metadata enhancement; legal deposit
Online Access: https://account.openhumanitiesdata.metajnl.com/index.php/up-j-johd/article/view/281
collection DOAJ
description While the vast data repositories of web archives hold great potential for knowledge production in academia, researchers have described significant challenges in accessing and using web archives for research. This article describes the creation of a “Web News Collection”, in which content from the National Library of Norway’s web archive has been made available for computational text analysis in a manner that facilitates access for research and beyond, aligning with FAIR principles while also accounting for copyright restrictions. Developing the warc2corpus pipeline, we detail the processes for extracting natural language from WARC files, curating content, and enhancing metadata for analytical purposes. This structured collection, consisting of 1.5 million news articles accessible via a REST API, enables distant reading of news from the web, with tools for building corpora and for computing word frequencies and collocations. To support usage, both programming interfaces and user-friendly web apps are offered, a significant step towards making web archives usable and valuable for digital scholars.
id doaj-art-00ca624ee22d44b28c195071d5dd6e2a
institution Kabale University
issn 2059-481X
spelling doaj-art-00ca624ee22d44b28c195071d5dd6e2a
2025-02-11T05:37:28Z
eng
Ubiquity Press
Journal of Open Humanities Data
2059-481X
2025-01-01
11
22
10.5334/johd.281
281
Providing Web Archive News Articles as Corpus Data
Jon Carlstedt Tønnessen, https://orcid.org/0000-0001-8861-0994, DH-lab, National Library of Norway, Oslo
Magnus Breder Birkenes, https://orcid.org/0009-0000-3278-0388, DH-lab, National Library of Norway, Oslo
https://account.openhumanitiesdata.metajnl.com/index.php/up-j-johd/article/view/281
web archives
digital text analysis
warc
metadata enhancement
legal deposit
topic web archives
digital text analysis
warc
metadata enhancement
legal deposit