Wikimarks

Wikimark Collections

In addition to the raw conversions described above, we also provide several Wikimarks extracted from these conversions:

  • the benchmarks dataset provides Wikimarks for passage retrieval, entity retrieval, query-specific clustering, and entity linking, extracted from the page subsets described below.

  • the unprocessedAllButBenchmark dataset provides all pages except those included in benchmarks and is intended for training systems that are evaluated with benchmarks.

  • the paragraphCorpus dataset is a corpus of paragraphs from articles to be used for passage retrieval evaluation.

These datasets are provided in JSONL or CBOR formats.
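
For example, a gzipped JSONL export can be streamed record by record with Python’s standard library. A minimal sketch, in which the file name and field names are placeholders rather than a fixed schema:

    import gzip
    import json

    # Stream one JSON object per line from a gzipped JSONL file.
    with gzip.open("paragraphCorpus.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # "para_id" and "text" are illustrative field names only.
            print(record.get("para_id"), str(record.get("text", ""))[:80])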

Page Subsets

The benchmarks described above are constructed from the following subsets of Wikipedia pages, drawn from each of the provided Wikipedias:

Vital-articles:

A set of important articles that the Wikipedia community identified. The community strives to provide these articles for all languages. We obtain the set of “vital” articles via Wikidata, then filter the processed articles by Wikidata QID.

trec-car-filter Predicate: qid-set-from-file "./vital-articles.qids"

Good-articles:

A Wikipedia committee defines a set of good articles that are well-written, contain factually accurate and verifiable information, and are of broad importance. Such pages are marked with either the template “GA” or “good article”, which our pipeline is configured to expose as the page tag “Good article”.

trec-car-filter Predicate: has-page-tag ["Good article"]

US-history:

A set of pages in categories that contain the words “United”, “States”, and “history”, such as “History of the United States” or “United States history timelines”.

trec-car-filter Predicate: (category-contains "history" & category-contains "united" & category-contains "states")

Horseshoe-crab:

The single Wikipedia page on horseshoe crabs used in the example above. It is identified by its Wikidata QID.

trec-car-filter Predicate: qid-in-set ["Q1329239"]

Additionally, we provide subsets used in the TREC Complex Answer Retrieval Track for backwards compatibility.

Subset Statistics

Number of articles in each Wikimark subset.
                        en       simple   ja
vital-articles.test     521      461      503
vital-articles.train    528      471      539
good-articles.test      17,086   1        809
good-articles.train     17,361   2        838
US-history.test         4,232    9        --
US-history.train        4,284    13       --
horseshoe-crab.train    1        1        --
benchmarkY1.test        131      44       71
benchmarkY1.train       117      42       81
car-train-large.train   884,709  17,335   246,649
test200.test            --       1        42
test200.train           188      12       44

Wikimarks

We provide a methodology for deriving Wikimarks for four common information retrieval tasks:

  • passage retrieval: retrieval of relevant text passages for a keyword query
  • entity retrieval: retrieval of relevant entities (defined to be Wikipedia pages) for a keyword query
  • query-specific clustering: sub-topic clustering of passages for a keyword query
  • query-specific entity-linking: annotation of query-relevant entity links in relevant passages

Wikimarks are created from a subset of Wikipedia pages, defined for example through lists of Wikidata QIDs or category memberships. The page subset is separated into a test set and five train folds. For each of these, task-specific datasets, such as queries, candidate sets, and relevance ground truth, are exported for the Wikimarks. By default, the following information is provided for each dataset:

Articles:

Content of processed articles (JSONL or CBOR).

Titles/QIDs:

Page titles and Wikidata QIDs of pages in this subset.

Paragraphs:

Corpus of paragraphs from this article subset.

Provenance:

Information about the Wikipedia dump the subset originated from.

Additionally, task-specific Wikimark data is provided as described in the following.
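
For instance, a deterministic page-to-fold assignment could look as follows; the hashing scheme shown here is an assumption for illustration, not necessarily the scheme used by the export pipeline:

    import hashlib

    def train_fold(page_id: str, num_folds: int = 5) -> int:
        # Deterministically assign a page to one of num_folds train folds
        # by hashing its page ID. Illustrative only; the actual pipeline's
        # fold assignment may differ.
        digest = hashlib.md5(page_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_folds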

Figure: Wikimarks derived for article-level retrieval and clustering (left) from a given article (right). Paragraph IDs are indicated by numbers in black dots, entity IDs by letters in stick figures, together with ground truth cluster indexes.

Retrieval Wikimark

The retrieval Wikimark is designed to study the quality of retrieval models. For a query derived from a Wikipedia title, any paragraph originating from that Wikipedia article is counted as relevant. This Wikimark was referred to as the “automatic ground truth” in the TREC Complex Answer Retrieval Track.

Wikimarks for three kinds of retrieval scenarios are provided:

  • Article: The query is the page title, and the goal is to retrieve paragraphs that are relevant for this query. For the passage retrieval relevance data (i.e., qrels), any paragraph located anywhere on the original page is counted as relevant; all other paragraphs are non-relevant.

  • Toplevel: The query is a combination of the page title and the heading of a top-level section. The goal is to retrieve paragraphs that are located within this section or one of its subsections.

  • Hierarchical: The query is derived from any section on the page. The goal is to retrieve paragraphs that are located exactly in this section, not in one of its subsections.
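
For illustration, query strings for the three scenarios could be composed from the page title and heading path roughly as follows; the separator and composition scheme are assumptions, and the authoritative query IDs are shipped with the Topics files:

    def article_query(title: str) -> str:
        # Article scenario: the page title alone serves as the query.
        return title

    def toplevel_query(title: str, top_heading: str) -> str:
        # Toplevel scenario: page title combined with a top-level heading.
        return f"{title} / {top_heading}"

    def hierarchical_query(title: str, heading_path: list) -> str:
        # Hierarchical scenario: page title plus the path of headings down
        # to the section the query is derived from.
        return " / ".join([title] + list(heading_path))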

In addition to passage-level retrieval, we also provide a Wikimark for entity retrieval, where any entity (represented by its Wikipedia page) that is linked from a relevant paragraph is regarded as relevant.

As a corpus for retrieving passages from, we recommend using the paragraph corpus. As the legal set of entities, we recommend using an unprocessed dump of Wikipedia pages.

For the retrieval Wikimark, we provide the following information:

Outlines:

Title and section outlines of the articles from which query texts are derived. Page metadata is available.

Topics:

Query IDs for each section—these can also be obtained from the outlines.

Passage Qrels:

Trec-eval compatible qrels files of paragraph IDs for article-level retrieval, top-level section retrieval, and hierarchical section retrieval.

Entity Qrels:

Trec-eval compatible qrels files of entity IDs (same as page IDs) for article, top-level section, and hierarchical section retrieval.

Evaluation.

We recommend using the retrieval evaluation tool trec-eval with option -c together with the provided qrels files.
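
Each line of a qrels file follows the standard TREC format of query ID, an unused iteration field, document ID (here a paragraph or entity ID), and relevance grade, for example (IDs are placeholders):

    query-id 0 paragraph-id 1

A run file can then be scored in a single invocation (file names are placeholders):

    trec_eval -c article.passage.qrels my-system.run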

Query-specific Clustering Wikimark

The task of search result clustering is, given a search query and a ranking of search results, to identify query-specific clusters for presentation. We provide a Wikimark dataset for this clustering task, where the query is taken as the page title and each top-level section defines one ground truth cluster. The search results are taken from the article-level retrieval task. To train on this task in isolation from a retrieval system, we suggest using all passages that originate from this page.

The query-specific clustering Wikimark is provided as a gzipped JSONL file that contains the following information:

Query:

The query text is derived from the page name; the query ID from the page ID.

Elements:

List of paragraph IDs contained on the page.

True Cluster Labels:

List of true cluster labels for each element. The i’th cluster label is derived from the section ID of the top-level section where the i’th element is located.

True Cluster Index:

The true cluster labels projected onto integers 0, 1, ….

In this Wikimark, we remove instances with fewer than two clusters.
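
As an illustration, a clustering system can be scored against the JSONL records with the adjusted Rand index. A minimal sketch, assuming field names query, elements, and true_cluster_idx (placeholders for the exported fields):

    import gzip
    import json
    from sklearn.metrics import adjusted_rand_score

    def evaluate_clustering(path, predict):
        # predict(query, elements) must return one predicted integer
        # cluster label per element (paragraph ID).
        scores = []
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)  # field names are assumptions
                truth = rec["true_cluster_idx"]
                pred = predict(rec["query"], rec["elements"])
                scores.append(adjusted_rand_score(truth, pred))
        return sum(scores) / len(scores)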

Figure: Wikimarks derived for query-specific entity linking (bottom) from the second paragraph (top). The task is to annotate the plain text with entity links (for example with entities a, d, and e). True entities d and e are derived from hyperlinks contained in this paragraph (bold) with given character spans. Entity a was linked in a previous paragraph, so its annotation is accepted without penalty.

Query-specific Entity Linking Wikimark

Entity linking is typically discussed as an NLP task that ignores the context of a search query. However, when presenting relevant information for a search query, it may be preferable not to annotate all possible entity links, but instead to focus on linking entities that are relevant for the query. Wikimarks allow us to create a query-specific entity linking dataset, as Wikipedia’s editorial policies are to include hyperlinks to pages only when the information is relevant for the topic of the article.

The query-specific entity linking Wikimark is provided as a gzipped JSONL file that contains the following information:

Query:

The query text is derived from the page name; the query ID from the page ID.

Text-only Paragraph:

The text content of the paragraph (without entity links), to be annotated with entity links.

True Linked Paragraph:

The original paragraph (with links) for training and as ground truth.

True Entity Labels:

List of entity IDs that should be linked in this paragraph. These are provided as internal PageIDs as well as Wikidata QIDs.

Acceptable Entity Labels:

List of entity IDs that may be linked in this paragraph without penalty, i.e., all entities linked in this paragraph or in any previous paragraph of the article. These are provided as internal PageIDs as well as Wikidata QIDs.

We remove instances of paragraphs without any linked entities.

Wikipedia’s editorial policies mandate that entities are only linked once per article. Consequently, entities that are mentioned repeatedly are only linked on their first mention. Since the entity linking ground truth is derived from hyperlinks, entity linking predictions would be penalized for linking all mentions of these entities. To alleviate this without resorting to heuristics, we collect all entities linked in all preceding paragraphs of an article and expose them as acceptable entity labels. The entity linking evaluation should give credit for every entity in the true labels, but not penalize entities in the acceptable labels.
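
A minimal sketch of this lenient scoring for a single paragraph, given sets of predicted, true, and acceptable entity IDs (the function and argument names are illustrative):

    def lenient_precision_recall(predicted, true_labels, acceptable):
        # Credit predictions of true entities; ignore predictions that are
        # merely acceptable (neither credit nor penalty); penalize the rest.
        predicted, true_labels, acceptable = (
            set(predicted), set(true_labels), set(acceptable))
        hits = predicted & true_labels
        penalized = predicted - true_labels - acceptable
        precision = len(hits) / (len(hits) + len(penalized)) if (hits or penalized) else 1.0
        recall = len(hits) / len(true_labels) if true_labels else 1.0
        return precision, recall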

Wikimark Train/Test Instances

Number of train/test instances for each Wikimark.

                        Relevant Passages                Relevant Entities                 Clustering Instances       Entity Linking Instances
                        en         simple    ja          en          simple    ja          en        simple   ja      en           simple    ja
vital-articles.test     44,444     7,117     12,448      159,392     20,626    38,975      521       328      393     64,857       9,339     23,217
vital-articles.train    42,008     6,845     13,330      149,609     19,401    42,357      528       324      440     61,984       8,663     25,743
good-articles.test      408,454    7         23,869      1,429,087   47        65,031      17,088    1        626     777,081      8         27,903
good-articles.train     415,034    17        24,375      1,465,327   87        61,050      17,362    1        626     789,726      39        27,538
US-history.test         83,213     176       --          206,672     405       --          4,232     6        --      169,014      210       --
US-history.train        83,255     146       --          205,438     608       --          4,285     7        --      160,764      173       --
horseshoe-crab.train    21         11        --          69          40        --          1         1        --      44           13        --
benchmarkY1.test        6,554      434       1,160       15,698      1,117     3,018       131       23       56      8,536        454       1,978
benchmarkY1.train       5,588      449       1,396       14,744      1,273     3,440       117       25       60      7,258        513       2,152
car-train-large.train   9,254,925  113,444   1,496,289   19,764,159  249,369   3,462,123   885,014   6,918    87,012  25,423,934   185,203   3,824,333
test200.train           5,537      109       335         12,345      272       929         188       5        19      9,147        135       612