Wikimarks
Example: Horseshoe Crab
We provide an example of the outputs of our conversion pipeline and of how we create Wikimarks for several tasks.
The topic Horseshoe crab has articles on both the English and the Simple English Wikipedia. Below we provide the outputs for these two articles as generated by our conversion pipeline.
Produced Files
The example is contained in the download archive along with other Wikimark collections.
For simplicity, we also provide all files created for the horseshoe crab benchmark here:
These directories also contain *.json files which can be rendered by most web browsers. These files are not included by default (we only provide jsonl.gz). As multiple objects are provided for the paragraphs and entity-linking Wikimarks, we converted the jsonl format to an array of JSON objects.
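A jsonl.gz file can be streamed record by record with a few lines of Python. This is a minimal sketch; the file name below is an illustrative placeholder for any *.jsonl.gz file in the archive:

import gzip
import json

# Stream one JSON object per line from a gzipped JSONL file.
# The file name is an illustrative placeholder; substitute any
# *.jsonl.gz file from the horseshoe-crab archive.
with gzip.open("horseshoe-crab.pages.cbor-paragraphs.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("para_id"))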
With every subset, we provide information about the processed articles, provenance, etc., along with the Wikimark data.
*.pages.*
: The full article

*.pages.cbor-paragraphs.*
: Paragraphs contained in this subset (for retrieval, we recommend using the full paragraphCorpus)

*.pages.cbor-outlines.*
: Only the section outline without paragraph content. These were used in TREC CAR to represent the query.

titles
: Page titles in this subset

qids
: Wikidata QIDs of pages in this subset

topics
: Hierarchical section headings in this subset (used for TREC CAR)

*.pages.provenance.*
: Provenance of the Wikipedia versions the subset was created from.

*.pages.cbor-*.qrels
: Query relevance files for retrieval Wikimarks. These files are compatible with trec_eval.

*.pages.cbor-toplevel.cluster.*
: Queries and ground truth files for query-specific clustering. These files are compatible with the cluster evaluation package of scikit-learn.

*.pages.cbor-entity-linking.*
: Queries and ground truth for query-specific entity linking.

LICENSE and README.mkd
: License information. If you redistribute or provide derived data, please include this licensing information to be compliant with Wikipedia's CC BY-SA license.
Dump Conversion
The first phase converts the raw Wikipedia dump into an easily machine-readable format. It will
- download the Wikipedia and Wikidata dumps,
- parse the Wikitext format, and
- resolve redirects, disambiguation pages, and categories, exposing them as metadata for each article.
We call the outputs of this phase “unprocessed”. The name is not technically accurate (we came to regret it), but the data is unprocessed in the sense that all pieces of information from the original Wikipedia article are preserved.
Processing and Filtering
The next phase further processes the “unprocessed” dump by
- removing non-article pages,
- removing administrative section headings (like “References”, “See also”, etc.),
- removing infoboxes and images (although the pipeline can be configured to preserve those),
- optionally deduplicating near-duplicate paragraphs (this can be important, as many articles are written by copying, pasting, and modifying existing articles), and
- extracting a corpus of all paragraphs across all Wikipedia articles.
We call the outputs of this phase “processed”.
The two example articles then look as follows:
- English Wikipedia: page in JSON
- Simple English Wikipedia: page in JSON
Subset selection
During the next phase, different page subsets can be selected to provide a corpus of interest. This could be a random subset of all pages (to obtain a smaller collection), the pages within a set of categories, articles of a particular quality level (such as good or vital articles), a list of manually selected pages, or pages retrieved from Wikidata’s SPARQL endpoint.
See the main page for a list of example subsets we provide and for instructions on how to configure the pipeline to produce subsets of your choice.
We select just the article on horseshoe crabs via its Wikidata QID:
{ name = "horseshoe-crab";
predicate = "qid-in-set [\"Q1329239\"]";
}
Each subset is automatically divided into train and test splits, and each train split is provided in five folds to enable machine learning with cross-validation.
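A typical cross-validation loop then rotates over the five folds, training on four and validating on the held-out one. A minimal Python sketch; the fold file names are hypothetical placeholders for the actual names in your archive:

# Rotate over the five provided folds: train on four, validate on one.
# The fold paths below are hypothetical placeholders.
folds = [f"train.fold-{i}.pages.jsonl.gz" for i in range(5)]

for held_out in range(5):
    train_files = [f for i, f in enumerate(folds) if i != held_out]
    validation_file = folds[held_out]
    # Training and evaluation code stand in for your own method.
    print(f"fold {held_out}: train on {train_files}, validate on {validation_file}")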
Wikimark extraction
The final step is to extract different Wikimarks (i.e., Wikipedia-derived benchmarks) from each article. We offer Wikimarks for a set of common IR tasks. In all cases the title of the Wikipedia page constitutes the search query, which is interpreted as an information need of the form “Tell me about …”. In this case the information need is “Tell me about horseshoe crabs”.
Please see the Wikimarks page for further information on how Wikimarks are derived.
Passage retrieval
Task Given a query, retrieve relevant passages from the passageCorpus (produced in the last step of the processing phase). Every paragraph that appears on the original Wikipedia article (after processing) is regarded as relevant; all other paragraphs are regarded as non-relevant.
Example qrels for English:
enwiki:Horseshoe%20crab 0 04f754dba54b26bab09823bcc19bc31227ca6125 1
enwiki:Horseshoe%20crab 0 089fb8822e86094e88812f29a38fdb14124e9a06 1
enwiki:Horseshoe%20crab 0 08fa5fda300d14d27687bd00341c176a2aff2f39 1
enwiki:Horseshoe%20crab 0 0d39d8da4abef9b0e0250134f089efcd25ae8a4a 1
enwiki:Horseshoe%20crab 0 104ca33fe36d7862188f817b9caeadee08fc688b 1
enwiki:Horseshoe%20crab 0 23667f56d940635112bc94925b6407b60f891d5e 1
enwiki:Horseshoe%20crab 0 240858db3a4d55758738fea3f8a0df49c687e806 1
enwiki:Horseshoe%20crab 0 37b94b0fd1023585d97f184720147fe96737d7b8 1
enwiki:Horseshoe%20crab 0 5cf9bb85c2f46b2dc20014c57a094cace99639fb 1
enwiki:Horseshoe%20crab 0 612f8eccbb814c19d2b96244154e2c9839971a44 1
enwiki:Horseshoe%20crab 0 64cda7e5c2ec7f41b299b1847901a58be969871d 1
enwiki:Horseshoe%20crab 0 64d5a7bac6e28813a939f27d3bae35c2c8382012 1
enwiki:Horseshoe%20crab 0 741d58976a9853e5decd6a37e1a3ebddad35a582 1
enwiki:Horseshoe%20crab 0 81b3a8c494395607c38d9d8f305d01de78a8c3d1 1
enwiki:Horseshoe%20crab 0 8e0ebe03995b9d03f8c1f429dead60c56625f4ba 1
enwiki:Horseshoe%20crab 0 a05a8915d939546da069acc6b17f235dacbeb02f 1
enwiki:Horseshoe%20crab 0 a8a2b4be899206e32081b3888aa5101889da555e 1
enwiki:Horseshoe%20crab 0 b0ab9894c8ca05e76a709997ed24881d8ffe4e4f 1
enwiki:Horseshoe%20crab 0 b20cda148c26bdbc550da6535ea6319f9f2d4ef9 1
enwiki:Horseshoe%20crab 0 c6fd86ddaf732b6ad69e67e740a94e5da9143579 1
enwiki:Horseshoe%20crab 0 cc16e6f76af1e4394ddcf7bd3872ad1cb114c73c 1
enwiki:Horseshoe%20crab 0 def494df77e2249ff7133955b52f37c5e3e9a7b5 1
An example of the first relevant paragraph, 04f754dba54b26bab09823bcc19bc31227ca6125, from the paragraphs file:
{
"para_body": [
{
"text": "In December 2019, a report of the "
},
{
"target_page_id": "enwiki:United%20States%20Senate",
"text": "US Senate",
"target_page": "United States Senate"
},
{
"text": " which encouraged the "
},
{
"target_page_id": "enwiki:Food%20and%20Drug%20Administration",
"text": "Food and Drug Administration",
"target_page": "Food and Drug Administration"
},
{
"text": " to \"establish processes for evaluating alternative pyrogenicity tests and report back [to the Senate] on steps taken to increase their use\" was released; "
},
{
"target_page_id": "enwiki:People%20for%20the%20Ethical%20Treatment%20of%20Animals",
"text": "PETA",
"target_page": "People for the Ethical Treatment of Animals"
},
{
"text": " backed the report."
}
],
"para_id": "04f754dba54b26bab09823bcc19bc31227ca6125"
}
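The plain text of such a paragraph can be reconstructed by concatenating the text fields of its body elements, since link spans carry their anchor text in the same field. A minimal sketch:

def paragraph_text(paragraph: dict) -> str:
    # Concatenate the "text" fields of all body elements; link
    # elements carry their anchor text in the same "text" field.
    return "".join(span["text"] for span in paragraph["para_body"])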
We provide qrels for different derived information needs:
- article-level (query = “Horseshoe crab”)
- top-level (e.g. query = “Horseshoe crab / Threats”)
- hierarchical (e.g. query = “Horseshoe crab / Threats / Harvest for blood”)
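Because these qrels follow the standard TREC format, any trec_eval-compatible tool can score a run against them. Below is a minimal sketch using the pytrec_eval Python bindings; both file names are illustrative, and run.txt stands in for your own ranking in TREC run format:

import pytrec_eval

# Parse a TREC-style qrels file: "query_id 0 doc_id relevance".
# Identifiers are URL-encoded, so they never contain whitespace.
qrels = {}
with open("horseshoe-crab.pages.cbor-article.qrels") as f:
    for line in f:
        qid, _, doc_id, rel = line.split()
        qrels.setdefault(qid, {})[doc_id] = int(rel)

# Parse a TREC run file: "query_id Q0 doc_id rank score run_name".
run = {}
with open("run.txt") as f:
    for line in f:
        qid, _, doc_id, _, score, _ = line.split()
        run.setdefault(qid, {})[doc_id] = float(score)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg"})
for qid, measures in evaluator.evaluate(run).items():
    print(qid, measures)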
Entity retrieval
Task Given a query, retrieve relevant entities from all Wikipedia entities (as given in the unprocessedAll or unprocessedAllButBenchmark collections). All entities that are linked from the original article are regarded as relevant; all others are regarded as non-relevant.
Example qrels for English:
enwiki:Horseshoe%20crab 0 enwiki:Amebocyte 1
enwiki:Horseshoe%20crab 0 enwiki:Arachnid 1
enwiki:Horseshoe%20crab 0 enwiki:Arthropod 1
enwiki:Horseshoe%20crab 0 enwiki:Arthropod%20leg 1
enwiki:Horseshoe%20crab 0 enwiki:Artificial%20insemination 1
enwiki:Horseshoe%20crab 0 enwiki:Book%20lung 1
enwiki:Horseshoe%20crab 0 enwiki:Brackish%20water 1
enwiki:Horseshoe%20crab 0 enwiki:Bulkhead%20(barrier) 1
enwiki:Horseshoe%20crab 0 enwiki:COVID-19%20pandemic 1
enwiki:Horseshoe%20crab 0 enwiki:Carapace 1
[...]
We provide entity qrels for different derived information needs:
- article-level (query = “Horseshoe crab”)
- top-level (e.g. query = “Horseshoe crab / Threats”)
- hierarchical (e.g. query = “Horseshoe crab / Threats / Harvest for blood”)
Example for Hierarchical Entity Qrels:
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:Amebocyte 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:COVID-19%20pandemic 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:Eli%20Lilly%20and%20Company 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:Fishing%20bait 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:Food%20and%20Drug%20Administration 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:Hemocyanin 1
enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood 0 enwiki:History%20of%20COVID-19%20vaccine%20development 1
[...]
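Query identifiers at all levels are formed from the URL-encoded page title and section headings, joined by “/”. The sketch below reproduces the identifiers shown above, assuming Python's urllib escaping (which matches the %20-style encoding used here):

from urllib.parse import quote

def hierarchical_query_id(site: str, title: str, headings: list[str]) -> str:
    # Join the URL-encoded page title and section headings with "/".
    parts = [quote(title)] + [quote(h) for h in headings]
    return site + ":" + "/".join(parts)

print(hierarchical_query_id("enwiki", "Horseshoe crab", ["Threats", "Harvest for blood"]))
# -> enwiki:Horseshoe%20crab/Threats/Harvest%20for%20blood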
Query-specific Clustering
Task Given a query and a set of relevant paragraphs, produce clusters that represent relevant subtopics. As ground truth, all paragraphs in the same top-level section are regarded as one cluster. The set of paragraphs provided on the article is taken as the set of relevant paragraphs.
{
"query_text": "Horseshoe crab",
"query_id": "enwiki:Horseshoe%20crab",
"true_cluster_idx": [
3,
1,
3,
0,
3,
3,
3,
0,
1,
2,
1,
3,
2,
3,
2,
0,
2,
3,
3
],
"elements": [
"04f754dba54b26bab09823bcc19bc31227ca6125",
"089fb8822e86094e88812f29a38fdb14124e9a06",
"08fa5fda300d14d27687bd00341c176a2aff2f39",
"0d39d8da4abef9b0e0250134f089efcd25ae8a4a",
"104ca33fe36d7862188f817b9caeadee08fc688b",
"240858db3a4d55758738fea3f8a0df49c687e806",
"37b94b0fd1023585d97f184720147fe96737d7b8",
"5cf9bb85c2f46b2dc20014c57a094cace99639fb",
"612f8eccbb814c19d2b96244154e2c9839971a44",
"64cda7e5c2ec7f41b299b1847901a58be969871d",
"64d5a7bac6e28813a939f27d3bae35c2c8382012",
"741d58976a9853e5decd6a37e1a3ebddad35a582",
"81b3a8c494395607c38d9d8f305d01de78a8c3d1",
"8e0ebe03995b9d03f8c1f429dead60c56625f4ba",
"a05a8915d939546da069acc6b17f235dacbeb02f",
"a8a2b4be899206e32081b3888aa5101889da555e",
"b20cda148c26bdbc550da6535ea6319f9f2d4ef9",
"c6fd86ddaf732b6ad69e67e740a94e5da9143579",
"def494df77e2249ff7133955b52f37c5e3e9a7b5"
]
}
The true_cluster_idx values correspond to the following section headings of the article:
- 0: Anatomy and behavior
- 1: Breeding
- 2: Taxonomy
- 3: Threats
For example, the first paragraph 04f754dba54b26bab09823bcc19bc31227ca6125 is supposed to be clustered with paragraph 08fa5fda300d14d27687bd00341c176a2aff2f39, as both are in the section on “Threats” (idx 3). The contents of the paragraphs are available separately.
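A predicted clustering can then be scored against true_cluster_idx with scikit-learn, for instance via the adjusted Rand index. A minimal sketch; the file name is illustrative, and predicted_idx stands in for the output of your own clustering method:

import json
from sklearn.metrics import adjusted_rand_score

# Load one clustering ground-truth object (format as shown above;
# the file name is an illustrative placeholder).
with open("horseshoe-crab.pages.cbor-toplevel.cluster.json") as f:
    truth = json.load(f)

# predicted_idx stands in for your own clustering of truth["elements"].
# Cluster labels only need to be consistent, not identical to the truth.
predicted_idx = [0] * len(truth["elements"])

print(adjusted_rand_score(truth["true_cluster_idx"], predicted_idx))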
Query-specific Entity Linking
Task Given a query and the text of a paragraph, identify entity mentions in the paragraph and annotate them with their Wikipedia entity ids. These paragraphs are taken from the article, with hyperlinks removed. We take every span in the paragraph that was originally annotated with a hyperlink to another Wikipedia page as a true span, and the entity it links to as a true entity.
The article on horseshoe crabs contains the following example paragraph:
In December 2019, a report of the US Senate which encouraged the Food and Drug Administration to “establish processes for evaluating alternative pyrogenicity tests and report back [to the Senate] on steps taken to increase their use” was released;[57] PETA backed the report.[58]
We convert this paragraph into an entity-linking ground truth, where only the query and the text_only_paragraph are to be used for prediction; the remainder is held out for evaluation only.
Example Entity Linking Wikimark:
{
"query_text": "Horseshoe crab",
"query_id": "enwiki:Horseshoe%20crab",
"paragraph_id": "04f754dba54b26bab09823bcc19bc31227ca6125",
"text_only_paragraph": {
"para_body": [
{
"text": "In December 2019, a report of the US Senate which encouraged the Food and Drug Administration to \"establish processes for evaluating alternative pyrogenicity te
sts and report back [to the Senate] on steps taken to increase their use\" was released; PETA backed the report."
}
],
"para_id": "04f754dba54b26bab09823bcc19bc31227ca6125"
},
"true_label_page_ids": [
"enwiki:United%20States%20Senate",
"enwiki:Food%20and%20Drug%20Administration",
"enwiki:People%20for%20the%20Ethical%20Treatment%20of%20Animals"
],
"true_label_qids": [
"Q66096",
"Q204711",
"Q151888"
],
"true_linked_paragraph": {
"para_body": [
{
"text": "In December 2019, a report of the "
},
{
"target_page_id": "enwiki:United%20States%20Senate",
"target_qid": "Q66096",
"text": "US Senate",
"target_page": "United States Senate"
},
{
"text": " which encouraged the "
},
{
"target_page_id": "enwiki:Food%20and%20Drug%20Administration",
"target_qid": "Q204711",
"text": "Food and Drug Administration",
"target_page": "Food and Drug Administration"
},
{
"text": " to \"establish processes for evaluating alternative pyrogenicity tests and report back [to the Senate] on steps taken to increase their use\" was released; "
},
{
"target_page_id": "enwiki:People%20for%20the%20Ethical%20Treatment%20of%20Animals",
"target_qid": "Q151888",
"text": "PETA",
"target_page": "People for the Ethical Treatment of Animals"
},
{
"text": " backed the report."
}
],
"para_id": "04f754dba54b26bab09823bcc19bc31227ca6125"
},
"acceptable_label_qids": [
"Q30",
"Q833",
"Q1358",
[...]
],
"acceptable_label_page_ids": [
"enwiki:Albian",
"enwiki:Amebocyte",
"enwiki:Anisian",
[...]
]
}
Due to Wikipedia’s editorial policy, a hyperlink to an entity is only inserted at its first occurrence. To avoid wrongfully penalizing entity links because of this policy, we provide acceptable_label entities, which are entities linked in preceding paragraphs. During evaluation, a linker should only receive credit for true_label entities, but should not be penalized for entities contained in acceptable_label.
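A minimal scoring sketch under these rules, where predicted_qids stands in for the output of your linker on this paragraph; acceptable entities earn neither credit nor penalty:

def entity_linking_prf(predicted_qids, true_qids, acceptable_qids):
    # Drop predictions that are merely acceptable (linked in a
    # preceding paragraph): they earn no credit and no penalty.
    predicted = set(predicted_qids) - set(acceptable_qids)
    true = set(true_qids)
    tp = len(predicted & true)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with the paragraph above: one true link found, one
# acceptable entity (Q30, listed in acceptable_label_qids) ignored.
print(entity_linking_prf(
    predicted_qids=["Q66096", "Q30"],
    true_qids=["Q66096", "Q204711", "Q151888"],
    acceptable_qids=["Q30", "Q833", "Q1358"],
))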
Questions?
Please contact us if you have remaining questions about the Wikimarks data format or want to propose Wikimarks for additional tasks. Our pipeline software can be customized and extended to produce additional styles of Wikimarks.