kgbench-loader

Dataset loader for the kgbench repository.

kgbench

A set of benchmark repositories for node classification on knowledge graphs. Paper: kgbench: A Collection of Datasets for Multimodal and Relational Learning on Heterogeneous Knowledge

We offer a set of node classification benchmark tasks on relational data, with the following aims:

Installation

Download or clone the repository. In the root directory (where setup.py is located), run

pip install . 

Please use only the kgbench-loader repository, not kgbench-data. The latter should only be used to study how the data was created, or to load the data in a non-Python environment.

Loading data in python

The following snippet loads the amplus dataset:

import kgbench as kg

data = kg.load('amplus') # Load with numpy arrays, and train/validation split

data = kg.load('amplus', torch=True) # Load with PyTorch tensors

data = kg.load('amplus', final=True) # Load with numpy arrays and train/test split

The data object holds all relevant information in the dataset, preprocessed for direct use in PyTorch or in any NumPy-based machine learning framework.

The following are the most important attributes of the data object:

These are all the attributes required to implement a classifier for the relational setting. That is, the setting where literals are treated as atomic nodes. In the multimodal setting, where the content of literals is also taken into account, the following attributes and methods can also be used.
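As a rough illustration of a classifier in the relational setting, the sketch below computes a majority-class baseline. It assumes that the labelled nodes are exposed as two arrays of (entity index, class index) rows, here called training and withheld; those names and that layout are assumptions of this sketch, so check them against the data object you actually load. Toy lists stand in for the real arrays so the example runs without downloading a dataset.

```python
from collections import Counter


def majority_baseline(training, withheld):
    """Predict the most frequent training class for every withheld node.

    Both arguments are assumed to be sequences of (entity_index, class_index)
    rows. This layout is an assumption of the sketch, not a documented API.
    """
    # Count how often each class occurs among the training labels.
    counts = Counter(int(row[1]) for row in training)
    majority_class = counts.most_common(1)[0][0]

    # Accuracy of always predicting the majority class on the withheld rows.
    correct = sum(1 for row in withheld if int(row[1]) == majority_class)
    return majority_class, correct / len(withheld)


# Toy stand-ins for the label arrays of a loaded dataset (hypothetical values).
toy_training = [(0, 1), (1, 1), (2, 0), (3, 1)]
toy_withheld = [(4, 1), (5, 0)]

majority_class, accuracy = majority_baseline(toy_training, toy_withheld)
print(majority_class, accuracy)
```

Because the function only iterates over rows and reads the second column, it should work equally well on lists of pairs and on two-column integer arrays. Any real model should beat this baseline, which makes it a useful sanity check.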

The scripts directory contains the scripts needed to convert any RDF knowledge graph to the format listed above, allowing it to be imported using the kgbench dataloader.

Experiments

Three example baselines are implemented in the directory experiments. These should give a fairly complete idea of the way the library can be used. See the paper for model details.

Loading data in other languages

If you aren’t working in python, you’ll have to load the data yourself. This can be done with any standard CSV loader.
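To give an idea of what such a loader has to do, here is a minimal sketch using Python's standard csv module; the same steps carry over to the CSV reader of any other language. It assumes a file of integer triples with three comma-separated columns (subject, predicate, object) and no header row. That layout, and the inline sample data, are assumptions made for illustration only; consult the actual files of a dataset before relying on them.

```python
import csv
import io


def read_int_triples(handle):
    """Read rows of three integers (subject, predicate, object) from CSV.

    Assumes no header row and exactly three integer columns per line;
    both are assumptions of this sketch.
    """
    return [tuple(int(value) for value in row)
            for row in csv.reader(handle)
            if row]  # skip empty lines


# Inline sample standing in for a triples file (hypothetical content).
sample = io.StringIO("0,0,1\n1,1,2\n2,0,0\n")
triples = read_int_triples(sample)
print(triples)
```

For a file on disk, the same function can be passed an open file handle instead of the in-memory sample.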

Each dataset is laid out in the following files:

RDF Data

For each dataset, the original RDF is available in a directory named raw, together with the scripts that extract the CSV form. Since the sources of the data differ per dataset, the contents of this directory differ per dataset as well.

IMPORTANT Notes for use

Make sure to follow these instructions to run a correct experiment that is comparable to other experiments on these datasets:

Datasets

The following benchmark datasets are available. See the paper for more extensive descriptions.

The following datasets are available for unit testing:

Datatypes

KGBench datasets contain byte-to-string encoded literals. These string literals encode byte-level data, potentially containing images, video or audio (although only images are currently used in the datasets).

We define the following datatypes:

In most cases this information is sufficient to correctly decode the byte-level information. To provide a fully unambiguous definition of how a literal should be decoded, it is also necessary to specify its MIME-type. This can be done by adding extra statements to the graph, but this is outside the scope of the kgbench project.

In our datasets, every media type uses a uniform choice of codec (that is, all images are either JPEG or PNG, but these are not mixed within one dataset). This choice is specified in the dataset metadata.
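The decoding step can be sketched as follows. The example assumes that the byte-to-string encoding is base64, which is an assumption of this sketch and should be verified against the datatype of the literal. It decodes a string literal to raw bytes and then guesses the image codec from the standard PNG and JPEG file signatures, which can serve as a fallback when the dataset metadata is not at hand. The function names are hypothetical helpers, not part of the kgbench library.

```python
import base64

# Standard file signatures ("magic numbers") for the two image codecs.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def decode_literal(text):
    """Decode a string literal to raw bytes (base64 is assumed here)."""
    return base64.b64decode(text)


def sniff_image_codec(raw):
    """Return a MIME type guessed from the leading bytes, or None."""
    if raw.startswith(PNG_MAGIC):
        return "image/png"
    if raw.startswith(JPEG_MAGIC):
        return "image/jpeg"
    return None


# Build a fake literal: a PNG signature plus dummy payload, base64-encoded.
fake_literal = base64.b64encode(PNG_MAGIC + b"payload").decode("ascii")

raw = decode_literal(fake_literal)
codec = sniff_image_codec(raw)
print(codec)
```

The raw bytes can then be handed to an image library of your choice. Since each dataset uses a single codec per media type, the sniffed type only needs to be checked once per dataset.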