Wikimarks

Tools, Installation, and Usage

Data Conversion Pipeline

Our pipeline for converting Wikipedias and generating Wikimarks is found in the TREMA-UNH/trec-car-release project. Please follow installation, configuration and usage instructions in the README.

Source Code for Conversion Tools

This pipeline builds upon the trec-car-create package, which provides utilities for converting, extracting, inspecting, and filtering Wikipedia dumps, and for generating benchmarks from them. Please follow the installation and compilation instructions in the README.

Language Bindings for CBOR

Language bindings for Java and Python that read the CBOR file formats are provided in the trec-car-tools packages.

Alternatively, the equivalent JSONL format can be read with any JSON parsing package. Because the data is highly redundant, we provide the JSONL files gzip-compressed and recommend opening them directly with a gzip-aware file handle. (See the data model on the main page.)
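As a minimal sketch of the JSONL route, the following Python snippet streams a gzip-compressed JSONL file using only the standard library. The helper name iter_jsonl_gz is hypothetical and not part of trec-car-tools; the snippet assumes one JSON value per line and makes no assumption about the fields inside each record (see the data model for those).

```python
import gzip
import json
from typing import Any, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a gzipped JSONL file.

    Hypothetical helper for illustration; not part of trec-car-tools.
    """
    # Mode "rt" decompresses on the fly and decodes to text, so the file
    # never has to be unpacked on disk or loaded into memory as a whole.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, iterating over iter_jsonl_gz(path) in a for loop processes one record at a time, which keeps memory use low even for large files.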