Benchmark the performance of KNN algorithms
In this doc, we benchmark the performance of multiple K-Nearest Neighbor (KNN) algorithms implemented in `dgl.knn_graph()`.
Given a dataset of N samples with D dimensions, the common use case of KNN algorithms in graph learning is to build a KNN graph by finding the K nearest neighbors of each of the N samples within the dataset.
Empirically, the three parameters N, D, and K all have an impact on the computation cost. To benchmark the algorithms, we pick a few representative datasets to cover the most common scenarios:
- A synthetic dataset with mixed Gaussian samples: N = 1000, D = 3.
- A point cloud sampled from ModelNet: N = 10000, D = 3.
- Subsets of MNIST:
  - A small subset: N = 1000, D = 784.
  - A medium subset: N = 10000, D = 784.
  - A large subset: N = 50000, D = 784.
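As a concrete reference, a mixed-Gaussian dataset of the first shape above can be generated as in the sketch below. The number of components, the cluster spread, and the seed are illustrative assumptions, not the benchmark's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_gaussian(n=1000, d=3, components=4):
    # Draw each sample from one of `components` isotropic Gaussians.
    # The means and the 0.5 standard deviation are assumed for illustration.
    means = rng.uniform(-5.0, 5.0, size=(components, d))
    labels = rng.integers(0, components, size=n)
    return means[labels] + rng.normal(scale=0.5, size=(n, d))

x = mixed_gaussian()  # shape (1000, 3), matching the first dataset above
```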
Some notes:
- `bruteforce-sharemem` is an optimized implementation of `bruteforce` on GPU.
- `kd-tree` is currently only implemented on CPU.
- `bruteforce-blas` conducts matrix multiplication, and is thus memory-inefficient.
- `nn-descent` is an approximate algorithm, so we also report the recall rate of its result.
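To make the memory note on `bruteforce-blas` concrete, the sketch below shows (in NumPy, as an assumption about the general technique rather than DGL's actual implementation) how pairwise squared distances can be computed with one matrix multiplication via ||a - b||² = ||a||² + ||b||² - 2·a·b. It materializes the full N×N distance matrix, which is exactly why this approach is memory-inefficient:

```python
import numpy as np

def knn_bruteforce_blas(x, k):
    # Squared norms of each row.
    sq = (x ** 2).sum(axis=1)
    # Full N x N squared-distance matrix via a single matmul (the BLAS call).
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Indices of the k smallest distances per row; each point's own index
    # comes first since its self-distance is zero.
    return np.argsort(d2, axis=1)[:, :k]

x = np.random.default_rng(0).normal(size=(100, 3))
nbrs = knn_bruteforce_blas(x, 8)  # shape (100, 8)
```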
Results
In this section, we show the runtime in seconds and, where applicable, the recall rate (shown as `R:` in parentheses) for the algorithms under various scenarios.
The experiments are run on an Amazon EC2 p3.2xlarge instance, which has 8 vCPUs with 61GB RAM and one Tesla V100 GPU with 16GB memory. We obtain the numbers with DGL==0.7.0 (64d0f3f), PyTorch==1.8.1, and CUDA==11.1 on Ubuntu 18.04.5 LTS.
Mixed Gaussian

| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.010 | 0.011 | 0.002 | 0.003 |
| kd-tree | 0.004 | 0.006 | n/a | n/a |
| bruteforce | 0.004 | 0.006 | 0.126 | 0.009 |
| bruteforce-sharemem | n/a | n/a | 0.002 | 0.003 |
| nn-descent | 0.014 (R: 0.985) | 0.148 (R: 1.000) | 0.016 (R: 0.973) | 0.077 (R: 1.000) |
Point Cloud
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.359 | 0.432 | 0.010 | 0.010 |
| kd-tree | 0.007 | 0.026 | n/a | n/a |
| bruteforce | 0.074 | 0.167 | 0.008 | 0.039 |
| bruteforce-sharemem | n/a | n/a | 0.004 | 0.017 |
| nn-descent | 0.161 (R: 0.977) | 1.345 (R: 0.999) | 0.086 (R: 0.966) | 0.445 (R: 0.999) |
Small MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.014 | 0.015 | 0.002 | 0.002 |
| kd-tree | 0.179 | 0.182 | n/a | n/a |
| bruteforce | 0.173 | 0.228 | 0.123 | 0.170 |
| bruteforce-sharemem | n/a | n/a | 0.045 | 0.054 |
| nn-descent | 0.060 (R: 0.878) | 1.077 (R: 0.999) | 0.030 (R: 0.952) | 0.457 (R: 0.999) |
Medium MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 0.897 | 0.970 | 0.019 | 0.023 |
| kd-tree | 18.902 | 18.928 | n/a | n/a |
| bruteforce | 14.495 | 17.652 | 2.058 | 2.588 |
| bruteforce-sharemem | n/a | n/a | 2.257 | 2.524 |
| nn-descent | 0.804 (R: 0.755) | 14.108 (R: 0.999) | 0.158 (R: 0.900) | 1.794 (R: 0.999) |
Large MNIST
| Model | CPU, K = 8 | CPU, K = 64 | GPU, K = 8 | GPU, K = 64 |
|---|---|---|---|---|
| bruteforce-blas | 21.829 | 22.135 | Out of Memory | Out of Memory |
| kd-tree | 542.688 | 573.379 | n/a | n/a |
| bruteforce | 373.823 | 432.963 | 10.317 | 12.639 |
| bruteforce-sharemem | n/a | n/a | 53.133 | 58.419 |
| nn-descent | 4.995 (R: 0.658) | 75.487 (R: 0.999) | 1.478 (R: 0.860) | 15.698 (R: 0.999) |
Conclusion
- As long as you have enough memory, `bruteforce-blas` is the default algorithm to go with.
- When `D` is small and the data is on CPU, `kd-tree` is the best algorithm.
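The guidance above can be sketched as a small selection helper. This is an illustrative heuristic, not part of DGL's API: the function name, the `D <= 10` cutoff for "small", and the out-of-memory fallbacks are all assumptions:

```python
def pick_knn_algorithm(device: str, dim: int, enough_memory: bool = True) -> str:
    """Illustrative heuristic for choosing a KNN algorithm name,
    following the conclusions of this benchmark.
    The dim <= 10 threshold for "small D" is an assumption."""
    if device == "cpu" and dim <= 10:
        return "kd-tree"          # low-dimensional data on CPU: kd-tree wins
    if enough_memory:
        return "bruteforce-blas"  # default whenever memory allows
    # Fallback when the N x N distance matrix does not fit (assumed choice):
    return "bruteforce-sharemem" if device == "gpu" else "bruteforce"

print(pick_knn_algorithm("cpu", 3))    # kd-tree
print(pick_knn_algorithm("gpu", 784))  # bruteforce-blas
```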