Benchmark the performance of KNN algorithms
In this doc, we benchmark the performance of multiple K-Nearest Neighbor (KNN) algorithms implemented in `dgl.knn_graph()`.
Given a dataset of N samples with D dimensions, the common use case of KNN algorithms in graph learning is to build a KNN graph by finding the K nearest neighbors of each of the N samples within the dataset.
Empirically, the three parameters N, D, and K all have an impact on the computation cost. To benchmark the algorithms, we pick a few representative datasets to cover the most common scenarios:
- A synthetic dataset with mixed Gaussian samples: N = 1000, D = 3.
- A point cloud sampled from ModelNet: N = 10000, D = 3.
- Subsets of MNIST:
  - A small subset: N = 1000, D = 784.
  - A medium subset: N = 10000, D = 784.
  - A large subset: N = 50000, D = 784.
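As a concrete reference, a mixed-Gaussian dataset of the first shape above can be generated as in the sketch below. The number of components, the cluster spread, and the seed are illustrative assumptions, not the benchmark's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_gaussian(n=1000, d=3, components=4):
    # Draw each sample from one of `components` isotropic Gaussians.
    # The means and the 0.5 standard deviation are assumed for illustration.
    means = rng.uniform(-5.0, 5.0, size=(components, d))
    labels = rng.integers(0, components, size=n)
    return means[labels] + rng.normal(scale=0.5, size=(n, d))

x = mixed_gaussian()  # shape (1000, 3), matching the first dataset above
```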
Some notes:
- `bruteforce-sharemem` is an optimized implementation of `bruteforce` on GPU.
- `kd-tree` is currently only implemented on CPU.
- `bruteforce-blas` conducts matrix multiplication, and is thus memory-inefficient.
- `nn-descent` is an approximate algorithm, so we also report the recall rate of its result.
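To make the memory note on `bruteforce-blas` concrete, the sketch below shows (in NumPy, as an assumption about the general technique rather than DGL's actual implementation) how pairwise squared distances can be computed with one matrix multiplication via ||a - b||² = ||a||² + ||b||² - 2·a·b. It materializes the full N×N distance matrix, which is exactly why this approach is memory-inefficient:

```python
import numpy as np

def knn_bruteforce_blas(x, k):
    # Squared norms of each row.
    sq = (x ** 2).sum(axis=1)
    # Full N x N squared-distance matrix via a single matmul (the BLAS call).
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Indices of the k smallest distances per row; each point's own index
    # comes first since its self-distance is zero.
    return np.argsort(d2, axis=1)[:, :k]

x = np.random.default_rng(0).normal(size=(100, 3))
nbrs = knn_bruteforce_blas(x, 8)  # shape (100, 8)
```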
Results
In this section, we show the runtime in seconds and, where applicable, the recall rate (shown as `R:` in parentheses) for the algorithms under various scenarios.
The experiments are run on an Amazon EC2 p3.2xlarge instance, which has 8 vCPUs with 61GB RAM and one Tesla V100 GPU with 16GB memory. We obtain the numbers with DGL==0.7.0 (64d0f3f), PyTorch==1.8.1, and CUDA==11.1 on Ubuntu 18.04.5 LTS.
Mixed Gaussian

| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.010 | 0.011 | 0.002 | 0.003 |
| kd-tree | 0.004 | 0.006 | n/a | n/a |
| bruteforce | 0.004 | 0.006 | 0.126 | 0.009 |
| bruteforce-sharemem | n/a | n/a | 0.002 | 0.003 |
| nn-descent | 0.014 (R: 0.985) | 0.148 (R: 1.000) | 0.016 (R: 0.973) | 0.077 (R: 1.000) |
Point Cloud
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.359 | 0.432 | 0.010 | 0.010 |
| kd-tree | 0.007 | 0.026 | n/a | n/a |
| bruteforce | 0.074 | 0.167 | 0.008 | 0.039 |
| bruteforce-sharemem | n/a | n/a | 0.004 | 0.017 |
| nn-descent | 0.161 (R: 0.977) | 1.345 (R: 0.999) | 0.086 (R: 0.966) | 0.445 (R: 0.999) |
Small MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.014 | 0.015 | 0.002 | 0.002 |
| kd-tree | 0.179 | 0.182 | n/a | n/a |
| bruteforce | 0.173 | 0.228 | 0.123 | 0.170 |
| bruteforce-sharemem | n/a | n/a | 0.045 | 0.054 |
| nn-descent | 0.060 (R: 0.878) | 1.077 (R: 0.999) | 0.030 (R: 0.952) | 0.457 (R: 0.999) |
Medium MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.897 | 0.970 | 0.019 | 0.023 |
| kd-tree | 18.902 | 18.928 | n/a | n/a |
| bruteforce | 14.495 | 17.652 | 2.058 | 2.588 |
| bruteforce-sharemem | n/a | n/a | 2.257 | 2.524 |
| nn-descent | 0.804 (R: 0.755) | 14.108 (R: 0.999) | 0.158 (R: 0.900) | 1.794 (R: 0.999) |
Large MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 21.829 | 22.135 | Out of Memory | Out of Memory |
| kd-tree | 542.688 | 573.379 | n/a | n/a |
| bruteforce | 373.823 | 432.963 | 10.317 | 12.639 |
| bruteforce-sharemem | n/a | n/a | 53.133 | 58.419 |
| nn-descent | 4.995 (R: 0.658) | 75.487 (R: 0.999) | 1.478 (R: 0.860) | 15.698 (R: 0.999) |
Conclusion
- As long as you have enough memory, `bruteforce-blas` is the default algorithm to go with.
- When `D` is small and the data is on CPU, `kd-tree` is the best algorithm.
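The guidance above can be sketched as a small selection helper. This is an illustrative heuristic, not part of DGL's API: the function name, the `D <= 10` cutoff for "small", and the out-of-memory fallbacks are all assumptions:

```python
def pick_knn_algorithm(device: str, dim: int, enough_memory: bool = True) -> str:
    """Illustrative heuristic for choosing a KNN algorithm name,
    following the conclusions of this benchmark.
    The dim <= 10 threshold for "small D" is an assumption."""
    if device == "cpu" and dim <= 10:
        return "kd-tree"          # low-dimensional data on CPU: kd-tree wins
    if enough_memory:
        return "bruteforce-blas"  # default whenever memory allows
    # Fallback when the N x N distance matrix does not fit (assumed choice):
    return "bruteforce-sharemem" if device == "gpu" else "bruteforce"

print(pick_knn_algorithm("cpu", 3))    # kd-tree
print(pick_knn_algorithm("gpu", 784))  # bruteforce-blas
```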