DataLoader
- class dgl.graphbolt.DataLoader(datapipe, num_workers=0, persistent_workers=True, overlap_feature_fetch=True, overlap_graph_fetch=False, num_gpu_cached_edges=0, gpu_cache_threshold=1, max_uva_threads=6144)
Bases: DataLoader
Multiprocessing DataLoader.
Iterates over the data pipeline, running everything before feature fetching (i.e. dgl.graphbolt.FeatureFetcher) in subprocesses and everything after feature fetching in the main process. The datapipe is modified in-place as a result.
When the copy_to operation is placed earlier in the data pipeline, the num_workers argument must be 0, as utilizing CUDA in multiple worker processes is not supported.
- Parameters:
datapipe (DataPipe) – The data pipeline.
num_workers (int, optional) – Number of worker processes. Default is 0.
persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once, keeping the worker instances alive. Default is True.
overlap_feature_fetch (bool, optional) – If True, the data loader overlaps the UVA feature fetch operations with the rest of the operations by using an alternative CUDA stream. Default is True.
overlap_graph_fetch (bool, optional) – If True, the data loader overlaps the UVA graph fetching operations with the rest of the operations by using an alternative CUDA stream. Default is False.
num_gpu_cached_edges (int, optional) – If positive and overlap_graph_fetch is True, the GPU caches frequently accessed vertex neighborhoods to reduce the PCI-e bandwidth demand caused by pinned graph accesses. Default is 0.
gpu_cache_threshold (int, optional) – Determines how many times a vertex needs to be accessed before its neighborhood is cached on the GPU. Default is 1.
max_uva_threads (int, optional) – Limits the number of CUDA threads used for UVA copies so that the rest of the computation can run simultaneously. Setting it too high limits the amount of overlap, while setting it too low may leave the PCI-e bandwidth underutilized. The manually tuned default is 6144, corresponding to roughly 3-4 Streaming Multiprocessors.
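The overlap_feature_fetch behavior described above can be illustrated with a toy sketch. This is not the GraphBolt API: it uses a plain Python worker thread and a bounded queue as a stand-in for the alternative CUDA stream, and the fetch_features and train_step functions are hypothetical placeholders for the UVA feature fetch and the downstream computation. The point is only the overlap principle: features for the next mini-batch are fetched while the current one is being consumed.

```python
import queue
import threading
import time

def fetch_features(batch):
    # Stand-in for the UVA feature fetch (a host-to-device copy);
    # GraphBolt runs this on an alternative CUDA stream instead.
    time.sleep(0.01)
    return [x * 2 for x in batch]

def train_step(feats):
    # Stand-in for the computation that overlaps with fetching.
    time.sleep(0.01)
    return sum(feats)

batches = [[i, i + 1] for i in range(0, 8, 2)]

# A bounded queue lets the producer stay one batch ahead: features for
# batch i+1 are fetched while the main thread computes on batch i.
q = queue.Queue(maxsize=1)

def producer():
    for b in batches:
        q.put(fetch_features(b))
    q.put(None)  # sentinel: no more batches

threading.Thread(target=producer, daemon=True).start()

results = []
while (feats := q.get()) is not None:
    results.append(train_step(feats))

print(results)  # one result per batch: [2, 10, 18, 26]
```

With the queue in place, the fetch latency of each batch hides behind the computation on the previous one; with serial execution, the two latencies would add up instead.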