DataLoader
- class dgl.graphbolt.DataLoader(datapipe, num_workers=0, persistent_workers=True, overlap_feature_fetch=True, overlap_graph_fetch=False, num_gpu_cached_edges=0, gpu_cache_threshold=1, max_uva_threads=6144)
Bases: DataLoader
Multiprocessing DataLoader.
Iterates over the data pipeline, running everything before feature fetching (i.e. dgl.graphbolt.FeatureFetcher) in subprocesses and everything after feature fetching in the main process. The datapipe is modified in-place as a result.
When the copy_to operation is placed earlier in the data pipeline, the num_workers argument must be 0, as utilizing CUDA in multiple worker processes is not supported.
- Parameters:
datapipe (DataPipe) – The data pipeline.
num_workers (int, optional) – Number of worker processes. Default is 0.
persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once, keeping the worker instances alive. Default is True.
overlap_feature_fetch (bool, optional) – If True, the data loader overlaps the UVA feature fetch operations with the rest of the operations by using an alternative CUDA stream. Default is True.
overlap_graph_fetch (bool, optional) – If True, the data loader overlaps the UVA graph fetching operations with the rest of the operations by using an alternative CUDA stream. Default is False.
num_gpu_cached_edges (int, optional) – If positive and overlap_graph_fetch is True, the GPU caches frequently accessed vertex neighborhoods to reduce the PCI-e bandwidth demand caused by pinned graph accesses. Default is 0.
gpu_cache_threshold (int, optional) – Determines how many times a vertex needs to be accessed before its neighborhood is cached on the GPU. Default is 1.
max_uva_threads (int, optional) – Limits the number of CUDA threads used for UVA copies so that the rest of the computation can run simultaneously. Setting it too high limits the amount of overlap, while setting it too low may leave the PCI-e bandwidth underutilized. The manually tuned default is 6144, corresponding to roughly 3-4 Streaming Multiprocessors.
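The overlap_feature_fetch behavior described above can be illustrated with a toy sketch. This is not the GraphBolt API: it uses a plain Python worker thread and a bounded queue as a stand-in for the alternative CUDA stream, and the fetch_features and train_step functions are hypothetical placeholders for the UVA feature fetch and the downstream computation. The point is only the overlap principle: features for the next mini-batch are fetched while the current one is being consumed.

```python
import queue
import threading
import time

def fetch_features(batch):
    # Stand-in for the UVA feature fetch (a host-to-device copy);
    # GraphBolt runs this on an alternative CUDA stream instead.
    time.sleep(0.01)
    return [x * 2 for x in batch]

def train_step(feats):
    # Stand-in for the computation that overlaps with fetching.
    time.sleep(0.01)
    return sum(feats)

batches = [[i, i + 1] for i in range(0, 8, 2)]

# A bounded queue lets the producer stay one batch ahead: features for
# batch i+1 are fetched while the main thread computes on batch i.
q = queue.Queue(maxsize=1)

def producer():
    for b in batches:
        q.put(fetch_features(b))
    q.put(None)  # sentinel: no more batches

threading.Thread(target=producer, daemon=True).start()

results = []
while (feats := q.get()) is not None:
    results.append(train_step(feats))

print(results)  # one result per batch: [2, 10, 18, 26]
```

With the queue in place, the fetch latency of each batch hides behind the computation on the previous one; with serial execution, the two latencies would add up instead.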