Wikimarks
We provide a methodology and tool-set for harvesting relevance benchmarks for a variety of tasks from Wikipedia. We call these benchmarks Wikimarks. This work extends the infrastructure developed while organizing the Complex Answer Retrieval track at NIST TREC and examines the use of Wikipedia to assess several tasks not previously considered in the TREC-CAR context.
We believe that Wikimarks can complement traditional information retrieval benchmarks because they build upon a readily available source of real-world text content. Furthermore, Wikipedia articles exhibit considerable machine-readable structure in the form of page structure, hyperlink structure, and complementary data sources such as Wikidata.
Authors
Laura Dietz, Shubham Chatterjee, Connor Lennox, Sumanta Kashyapi, Pooja Oza, Ben Gamari
University of New Hampshire, USA and Well-Typed LLP, UK
What are you looking for?
- Code and data pipeline to convert any Wikipedia into a machine-readable corpus or Wikimark
- Wikipedia dumps from 2022
- More information about provided Wikimarks
- Download Wikimarks derived from the 2022 Wikipedias
- Evaluation results of reference baselines
Table of Contents
In this online supplement, we provide:
- An approach for deriving Wikimark collections, i.e., Wikipedia-derived benchmarks, for a set of four common information retrieval tasks
- A set of tools for deriving Wikimarks from Wikipedia dumps
- A set of machine-readable conversions of the English, Simple English, and Japanese Wikipedias
- A set of Wikimarks derived from these three Wikipedias
- An evaluation of several baseline methods using these benchmarks
- An example for the article on Horseshoe crabs.
Data Model
The code and all provided downloads use the same data model for Wikipedia articles, paragraphs, and outlines. Both the CBOR and JSONL files follow the grammar below. Wikipedia-internal hyperlinks are preserved as ParaLinks.
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
RedirectNames -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds -> [$pageId]
CategoryNames -> [$pageName]
CategoryIds -> [$pageId]
InlinkIds -> [$pageId]
InlinkAnchors -> [$anchorText]
WikiDataQid -> [$qid]
SiteId -> [$siteId]
PageTags -> [$pageTags]
PageSkeleton -> Section | Para | Image | ListItem | Infobox
Section -> $sectionHeading [PageSkeleton]
Para -> Paragraph
Paragraph -> $paragraphId, [ParaBody]
ListItem -> $nestingLevel, Paragraph
Image -> $imageURL [PageSkeleton]
ParaBody -> ParaText | ParaLink
ParaText -> $text
ParaLink -> $targetPage $targetPageId $targetPageQid $linkSection $anchorText
Infobox -> $infoboxName [($key, [PageSkeleton])]
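To make the nesting in the grammar more concrete, the following is a minimal Python sketch of part of the data model. It is an illustration only, not the released tooling: the class and field names are our own assumptions and are not the keys used in the CBOR or JSONL files, and Image, ListItem, and Infobox are left out for brevity. It shows how a PageSkeleton could be walked to recover each paragraph's text together with the path of section headings above it.

```python
from dataclasses import dataclass, field
from typing import List, Union


# NOTE: class and field names are hypothetical; they mirror the grammar,
# not the actual keys of the CBOR/JSONL files.
@dataclass
class ParaText:
    text: str


@dataclass
class ParaLink:
    target_page: str
    target_page_id: str
    anchor_text: str
    link_section: str = ""
    target_page_qid: str = ""


ParaBody = Union[ParaText, ParaLink]


@dataclass
class Paragraph:
    paragraph_id: str
    bodies: List[ParaBody] = field(default_factory=list)


@dataclass
class Section:
    heading: str
    children: List["PageSkeleton"] = field(default_factory=list)


# Image, ListItem and Infobox are omitted to keep the sketch short.
PageSkeleton = Union[Section, Paragraph]


def paragraph_text(para):
    """Join plain text and link anchor text in document order."""
    parts = []
    for body in para.bodies:
        if isinstance(body, ParaText):
            parts.append(body.text)
        else:  # ParaLink: the visible text is its anchor
            parts.append(body.anchor_text)
    return "".join(parts)


def section_paths(skeleton, prefix=()):
    """Yield (heading path, paragraph) pairs, depth first."""
    for node in skeleton:
        if isinstance(node, Section):
            yield from section_paths(node.children, prefix + (node.heading,))
        else:
            yield prefix, node


# Made-up page content and identifiers, for illustration only.
example = [
    Paragraph("p1", [
        ParaText("Horseshoe crabs are "),
        ParaLink("Arthropod", "hypothetical-page-id", "arthropods"),
        ParaText("."),
    ]),
    Section("Anatomy", [Paragraph("p2", [ParaText("Example text.")])]),
]

for path, para in section_paths(example):
    print("/".join(path), "|", paragraph_text(para))
```

Keeping links as separate ParaLink nodes, rather than flattening them into text, means the same traversal can also collect link targets, for instance for entity-linking style tasks.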
Example: Horseshoe Crab
To illustrate the data model, we provide the converted version of the article on Horseshoe crabs from two Wikipedias:
- English Wikipedia article -> converted page in JSON
- Simple English Wikipedia article -> converted page in JSON
License
This data set is part of the Wikimarks dataset version v2.6.
The conversions and benchmarks described above are provided by Laura Dietz and Ben Gamari under a Creative Commons Attribution-ShareAlike 3.0 Unported License. The data is based on content extracted from https://dumps.wikipedia.org/, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.