YAML specification
This document describes the YAML specification of the metadata.yaml file for
OnDiskDataset. The metadata.yaml file specifies the dataset information,
including the graph structure, feature data and tasks.
dataset_name: <string>
graph:
  nodes:
    - type: <string>
      num: <int>
    - type: <string>
      num: <int>
  edges:
    - type: <string>
      format: <string>
      path: <string>
    - type: <string>
      format: <string>
      path: <string>
feature_data:
  - domain: node
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: node
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: edge
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: edge
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
tasks:
  - name: <string>
    num_classes: <int>
    train_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
    validation_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
    test_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
dataset_name
The dataset_name field is used to specify the name of the dataset. It is
user-defined.
graph
The graph field is used to specify the graph structure. It has two fields:
nodes and edges.
nodes: list
    The nodes field is used to specify the number of nodes for each node type. It is a list of node objects. Each node object has two fields: type and num.
    type: string, optional
        The type field is used to specify the node type. It is null for homogeneous graphs. For heterogeneous graphs, it names the node type.
    num: int
        The num field is used to specify the number of nodes for the node type. It is mandatory for both homogeneous graphs and heterogeneous graphs.
edges: list
    The edges field is used to specify the edges. It is a list of edge objects. Each edge object has three fields: type, format and path.
    type: string, optional
        The type field is used to specify the edge type. It is null for homogeneous graphs. For heterogeneous graphs, it names the edge type.
    format: string
        The format field is used to specify the format of the edge data. It can only be csv for now.
    path: string
        The path field is used to specify the path of the edge data. It is relative to the directory of the metadata.yaml file.
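As an illustration of the graph fields described above, here is a minimal sketch of a graph section for a homogeneous graph. The node count and the edge file path are hypothetical values chosen for the example; only the field names, the null type and the csv format come from the specification.

```yaml
# Sketch of a homogeneous graph section; all values are illustrative.
graph:
  nodes:
    - type: null             # homogeneous graph, so no node type
      num: 1000              # hypothetical node count
  edges:
    - type: null             # homogeneous graph, so no edge type
      format: csv            # the only edge format the specification lists
      path: edges/edges.csv  # hypothetical path, relative to metadata.yaml
```

For a heterogeneous graph, the same layout would repeat one list entry per node type and per edge type, each with a non-null type.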
feature_data
The feature_data field is used to specify the feature data. It is a list of
feature objects. Each feature object has six canonical fields: domain,
type, name, format, in_memory and path. Any other fields will be passed to
the Feature.metadata object.
domain: string
    The domain field is used to specify the domain of the feature data. It can be either node or edge.
type: string, optional
    The type field is used to specify the type of the feature data. It is null for homogeneous graphs. For heterogeneous graphs, it is the node or edge type.
name: string
    The name field is used to specify the name of the feature data. It is user-defined.
format: string
    The format field is used to specify the format of the feature data. It can be either numpy or torch.
in_memory: bool, optional
    The in_memory field is used to specify whether the feature data is loaded into memory. It can be either true or false. Default is true.
path: string
    The path field is used to specify the path of the feature data. It is relative to the directory of the metadata.yaml file.
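The following sketch shows how these fields might be combined for one node feature and one edge feature on a homogeneous graph. The feature names (feat, weight) and file paths are hypothetical; the second entry omits in_memory to rely on the documented default of true.

```yaml
# Sketch of a feature_data section; names and paths are illustrative.
feature_data:
  - domain: node
    type: null                  # homogeneous graph
    name: feat                  # hypothetical feature name
    format: numpy
    in_memory: false            # do not load this feature into memory
    path: data/node-feat.npy    # hypothetical path, relative to metadata.yaml
  - domain: edge
    type: null
    name: weight                # hypothetical feature name
    format: torch
    path: data/edge-weight.pt   # in_memory omitted, so it defaults to true
```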
tasks
The tasks field is used to specify the tasks. It is a list of task
objects. Each task object has at least three fields: train_set,
validation_set and test_set. You are free to add other fields, such as
num_classes; all of these extra fields will be passed to the
Task.metadata object.
name: string, optional
    The name field is used to specify the name of the task. It is user-defined.
num_classes: int, optional
    The num_classes field is used to specify the number of classes of the task.
train_set: list
    The train_set field is used to specify the training set. It is a list of set objects. Each set object has two fields: type and data.
    type: string, optional
        The type field is used to specify the node/edge type of the set. It is null for homogeneous graphs. For heterogeneous graphs, it is the node or edge type.
    data: list
        The data field is used to load train_set. It is a list of data objects. Each data object has four fields: name, format, in_memory and path.
        name: string
            The name field is used to specify the name of the data. It is mandatory and determines which MiniBatch data field the data fills during sampling. It can be one of seed_nodes, labels, node_pairs, negative_srcs or negative_dsts. If any other name is used, it will be added into the MiniBatch data fields.
        format: string
            The format field is used to specify the format of the data. It can be either numpy or torch.
        in_memory: bool, optional
            The in_memory field is used to specify whether the data is loaded into memory. It can be either true or false. Default is true.
        path: string
            The path field is used to specify the path of the data. It is relative to the directory of the metadata.yaml file.
validation_set: list
test_set: list
    The validation_set and test_set fields are used to specify the validation set and test set respectively. They are similar to the train_set field.
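To make the nesting of task, set and data objects concrete, here is a minimal sketch of a tasks section for a node classification task on a homogeneous graph. The task name, class count and file paths are hypothetical; the data names seed_nodes and labels are taken from the list of recognized names above.

```yaml
# Sketch of a tasks section; task name, class count and paths are illustrative.
tasks:
  - name: node_classification      # hypothetical task name
    num_classes: 10                # extra field, passed to Task.metadata
    train_set:
      - type: null                 # homogeneous graph
        data:
          - name: seed_nodes
            format: numpy
            in_memory: true
            path: set/train-seeds.npy    # hypothetical path
          - name: labels
            format: numpy
            path: set/train-labels.npy   # in_memory defaults to true
    validation_set:
      - type: null
        data:
          - name: seed_nodes
            format: numpy
            path: set/val-seeds.npy
          - name: labels
            format: numpy
            path: set/val-labels.npy
    test_set:
      - type: null
        data:
          - name: seed_nodes
            format: numpy
            path: set/test-seeds.npy
          - name: labels
            format: numpy
            path: set/test-labels.npy
```

Each set lists one data object per MiniBatch field it should populate, so a link prediction task would presumably swap seed_nodes for node_pairs and, where needed, add negative_srcs and negative_dsts.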