Wikimarks

We provide a methodology and tool-set for harvesting relevance benchmarks for a variety of tasks from Wikipedia. We call these benchmarks Wikimarks. This work extends the infrastructure developed while organizing the Complex Answer Retrieval track at NIST TREC and examines the use of Wikipedia to assess several tasks not previously considered in the TREC-CAR context.

We believe that Wikimarks can complement traditional information retrieval benchmarks, as they build upon a readily available source of real-world text content. Furthermore, Wikipedia articles exhibit considerable machine-readable structure in the form of page structure, hyperlink structure, and complementary data sources such as Wikidata.

Authors

Laura Dietz, Shubham Chatterjee, Connor Lennox, Sumanta Kashyapi, Pooja Oza, Ben Gamari

University of New Hampshire, USA and Well-Typed LLP, UK

Publication at SIGIR 2022

What are you looking for?

In this online supplement, we provide:

An approach for deriving Wikimark collections, i.e., Wikipedia-derived benchmarks, for a set of four common information retrieval tasks
A set of tools for deriving Wikimarks from Wikipedia dumps
A set of machine-readable conversions of the English, Simple English, and Japanese Wikipedias
A set of Wikimarks derived from these three Wikipedias
An evaluation of several baseline methods using these benchmarks
An example for the article on Horseshoe crabs.

Data Model

The code and all provided downloads use the same data model for Wikipedia articles, paragraphs, and outlines. Both the CBOR and JSONL files follow the grammar below. Wikipedia-internal hyperlinks are preserved through ParaLinks.

     Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
     PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
     PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
     RedirectNames       -> [$pageName] 
     DisambiguationNames -> [$pageName] 
     DisambiguationIds   -> [$pageId] 
     CategoryNames       -> [$pageName] 
     CategoryIds         -> [$pageId] 
     InlinkIds           -> [$pageId] 
     InlinkAnchors       -> [$anchorText] 
     WikiDataQid         -> [$qid] 
     SiteId              -> [$siteId] 
     PageTags            -> [$pageTags] 
     
     PageSkeleton -> Section | Para | Image | ListItem | Infobox
     Section      -> $sectionHeading [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId, [ParaBody]
     ListItem     -> $nestingLevel, Paragraph
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $targetPageQid $linkSection $anchorText
     Infobox      -> $infoboxName [($key, [PageSkeleton])]
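
To make the nesting in the grammar concrete, the following is a minimal Python sketch of a subset of the data model. It is an illustration only and not part of the Wikimarks tool-set: the class names mirror the grammar, but the field names, the helper functions `paragraphs_with_paths` and `outlinks`, the sample text, and the page id `enwiki:Carapace` are all assumptions made for this example. The actual key names in the CBOR and JSONL files may differ. Image, ListItem, Infobox, and PageMetadata are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class ParaText:
    """ParaText -> $text"""
    text: str


@dataclass
class ParaLink:
    """ParaLink -> $targetPage $targetPageId $targetPageQid $linkSection $anchorText"""
    target_page: str
    target_page_id: str
    anchor_text: str
    target_page_qid: str = ""   # hypothetical default for a missing Wikidata QID
    link_section: str = ""      # hypothetical default for a link without a section


ParaBody = Union[ParaText, ParaLink]


@dataclass
class Paragraph:
    """Paragraph -> $paragraphId, [ParaBody]"""
    paragraph_id: str
    bodies: List[ParaBody] = field(default_factory=list)

    def text(self) -> str:
        # A link contributes its anchor text to the running paragraph text.
        return "".join(
            b.text if isinstance(b, ParaText) else b.anchor_text
            for b in self.bodies
        )


@dataclass
class Section:
    """Section -> $sectionHeading [PageSkeleton]"""
    heading: str
    children: List["Skeleton"] = field(default_factory=list)


# Simplified PageSkeleton: only Section and Para are modelled here.
Skeleton = Union[Section, Paragraph]


def paragraphs_with_paths(
    page_name: str, skeleton: List[Skeleton], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Paragraph]]:
    """Walk the skeleton depth-first, yielding (heading path, paragraph) pairs."""
    for node in skeleton:
        if isinstance(node, Section):
            yield from paragraphs_with_paths(
                page_name, node.children, path + (node.heading,)
            )
        else:
            yield (page_name,) + path, node


def outlinks(paragraph: Paragraph) -> List[str]:
    """Collect the target pages of all ParaLinks in a paragraph."""
    return [b.target_page for b in paragraph.bodies if isinstance(b, ParaLink)]


# Made-up toy page, not taken from the converted Wikipedia data.
skeleton = [
    Section("Anatomy", [
        Paragraph("p1", [
            ParaText("Horseshoe crabs have a hard "),
            ParaLink("Carapace", "enwiki:Carapace", "carapace"),
            ParaText("."),
        ]),
    ]),
]

for heading_path, para in paragraphs_with_paths("Horseshoe crab", skeleton):
    print(" / ".join(heading_path), "|", para.text(), "|", outlinks(para))
```

The recursive walk reflects that a Section contains a list of PageSkeleton elements, so headings can nest to arbitrary depth, while a Paragraph is a flat sequence of text spans and links.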

Example: Horseshoe Crab

To illustrate the data model, we provide the converted version of the article on Horseshoe crabs from two Wikipedias:

English Wikipedia article -> converted page in JSON
Simple English Wikipedia article -> converted page in JSON

License

This data set is part of the Wikimarks dataset version v2.6.

The conversions and benchmarks described above are provided by Laura Dietz and Ben Gamari under a Creative Commons Attribution-ShareAlike 3.0 Unported License. The data is based on content extracted from https://dumps.wikipedia.org/ that is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.