Getting started with kdtools

kdtools can be used in two ways: on a C++ vector of arrays (an arrayvec object) or directly on a data frame. Sorting arrayvec objects is fast. Passing a data frame is slower, but it automatically supports mixed column types.

Please see the pkgdown site for more articles on kdtools.

Data Frame Interface

Step 1. Sort data frame

When working with a data frame, you can specify which columns to use and the order of inclusion using the cols argument. Omitting the cols argument uses all columns in order.

library(kdtools)

# sort by weight, miles-per-gallon and displacement
mtcars_sorted <- kd_sort(mtcars, cols = ~ wt + mpg + disp)
head(mtcars_sorted, 3)
#>                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
tail(mtcars_sorted, 3)
#>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3

Step 2. Search sorted data frame

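# build the column specification once and reuse it for every query
cols <- colspec(mtcars, ~wt + mpg + disp)

# lower and upper corners of the query box, in the order wt, mpg, disp
lower <- c(2.5, 17, 120)
upper <- c(3.6, 22, 330)

# return all rows that fall inside the box
kd_range_query(mtcars_sorted, lower, upper, cols = cols)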
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
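
As a cross-check, the same box can be applied with an ordinary base R filter. This is a brute-force sketch, not how kdtools searches. The half-open bounds (lower <= x < upper) are an assumption about the query semantics; none of the mtcars values used here sit on a boundary, so the choice does not affect the result. The names box_cols, above, below and inside are illustrative only.

```r
# Brute-force range filter over mtcars using base R only.
# Assumption: bounds are treated as lower <= x < upper.
box_cols <- c("wt", "mpg", "disp")
lower <- c(2.5, 17, 120)
upper <- c(3.6, 22, 330)

vals <- as.matrix(mtcars[, box_cols])
above <- sweep(vals, 2, lower, ">=")   # TRUE where a value is at or above its lower bound
below <- sweep(vals, 2, upper, "<")    # TRUE where a value is below its upper bound
inside <- rowSums(above & below) == length(box_cols)

brute_force <- mtcars[inside, ]
sort(rownames(brute_force))
```

The filter selects the same eight cars listed in the output above, in the original row order of mtcars rather than the sorted order.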

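# weight each column by its inverse variance so that columns on different scales are comparable
weights <- 1 / diag(var(mtcars[, cols]))
# find the 2 rows nearest to the point given by `lower`
kd_nearest_neighbors(mtcars_sorted, lower, 2, cols = cols, w = weights)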
#>                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
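
The weights matter because wt, mpg and disp are measured on very different scales. A minimal base R sketch of the idea follows, assuming that each weight multiplies the squared difference in its column. The exact metric used inside kdtools is not stated here, so treat the formula as an assumption; the variable names are illustrative.

```r
# Weighted squared distance from a query point to every row, base R only.
# Assumption: d2 = sum(w * (x - q)^2), with w = 1 / column variance.
nn_cols <- c("wt", "mpg", "disp")
query <- c(2.5, 17, 120)

vals <- as.matrix(mtcars[, nn_cols])
w <- 1 / diag(var(vals))
diffs <- sweep(vals, 2, query)     # subtract the query from each row
d2 <- as.vector(diffs^2 %*% w)     # weighted sum of squared differences
names(d2) <- rownames(vals)

names(sort(d2))[1:2]
```

Under this assumption the two closest cars are Ferrari Dino and Toyota Corona, which agrees with the output above. Without the weights, disp would dominate the distance because its values are two orders of magnitude larger than those of wt.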

Arrayvec Interface

The kdtools package can be used to search for multidimensional points in a boxed region and to find nearest neighbors in 1 to 9 dimensions. The package uses binary search on a sorted sequence of values. The arrayvec interface is limited to matrices of real values. If you are interested in using string or mixed types in different dimensions, see the methods vignette.
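
To make the idea of a sorted sequence concrete, the following base R sketch orders the rows of a matrix by recursive median splits, cycling through the dimensions. The function name kd_order_sketch is hypothetical, and the code only illustrates the general kd-sort idea; the package's compiled implementation may differ in its details.

```r
# Hypothetical helper: returns a row permutation that places the median row
# (along the current axis) in the middle of its block, then recurses on each
# half using the next axis.
kd_order_sketch <- function(m, idx = seq_len(nrow(m)), axis = 1L) {
  n <- length(idx)
  if (n <= 1L) return(idx)
  idx <- idx[order(m[idx, axis])]       # order this block along `axis`
  mid <- n %/% 2L + 1L                  # position of the pivot row
  next_axis <- axis %% ncol(m) + 1L     # cycle 1, 2, ..., ncol, 1, ...
  c(kd_order_sketch(m, idx[seq_len(mid - 1L)], next_axis),
    idx[mid],
    kd_order_sketch(m, idx[mid + seq_len(n - mid)], next_axis))
}

set.seed(1)
m <- matrix(runif(30), ncol = 3)
ord <- kd_order_sketch(m)
m[ord, ]
```

Every block keeps its pivot in the middle, with smaller values along the split axis before it and larger values after it. A search can therefore often discard half of a block at each step, which is what makes a binary-search style lookup possible.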

Using kdtools is straightforward. There are four steps:

Step 1. Convert your matrix of values into an arrayvec object

library(kdtools)
x = matrix(runif(3e3), ncol = 3)
y = matrix_to_tuples(x)
y[1:3, c(1, 3)]
#>           [,1]      [,2]
#> [1,] 0.9063688 0.3226261
#> [2,] 0.4677785 0.6382602
#> [3,] 0.5900272 0.7796014

The arrayvec object can be manipulated as if it were a matrix.

Step 2. Sort the data

kd_sort(y, inplace = TRUE, parallel = TRUE)
#>            [,1]       [,2]       [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 995 more rows)

Step 3. Search the data

rq = kd_range_query(y, c(0, 0, 0), c(1/4, 1/4, 1/4)); rq
#>            [,1]       [,2]       [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 9 more rows)
i = kd_nearest_neighbor(y, c(0, 0, 0)); y[i, ]
#> [1] 0.02096311 0.07102602 0.09382227
nns = kd_nearest_neighbors(y, c(0, 0, 0), 100); nns
#>            [,1]       [,2]       [,3]
#> [1,] 0.02096311 0.07102602 0.09382227
#> [2,] 0.09239822 0.11579286 0.01748684
#> [3,] 0.04734198 0.12577159 0.12383529
#> [4,] 0.08320526 0.01440299 0.19734563
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 95 more rows)
nni = kd_nn_indices(y, c(0, 0, 0), 10); nni
#>  [1]  2  1  4  3  5 10 12  9 14 16

The kd_nearest_neighbor and kd_nn_indices functions return row indices. The other functions return arrayvec objects.
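
The difference between returning indices and returning rows can be illustrated on a plain matrix with a brute-force search in base R. No kdtools functions are called here, and the variable names are illustrative only.

```r
set.seed(42)
m <- matrix(runif(30), ncol = 3)
query <- c(0, 0, 0)

d2 <- rowSums(sweep(m, 2, query)^2)     # squared Euclidean distance per row
nn_idx <- order(d2)[1:3]                # row indices of the 3 nearest rows
nn_rows <- m[nn_idx, , drop = FALSE]    # the rows themselves

nn_idx
nn_rows
```

Indices are convenient when the matching rows are needed from another object that shares the same row order, such as a vector of labels.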

Step 4. Convert back to a matrix for use in R

head(tuples_to_matrix(rq))
#>            [,1]       [,2]       [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> [6,] 0.02192858 0.15486310 0.24297762
head(tuples_to_matrix(nns))
#>            [,1]       [,2]       [,3]
#> [1,] 0.02096311 0.07102602 0.09382227
#> [2,] 0.09239822 0.11579286 0.01748684
#> [3,] 0.04734198 0.12577159 0.12383529
#> [4,] 0.08320526 0.01440299 0.19734563
#> [5,] 0.06199959 0.20603226 0.01682205
#> [6,] 0.13167283 0.02859779 0.17847762

If you pass a matrix instead of an arrayvec object to any of the functions, it will be converted to an arrayvec object internally and results will be returned as matrices. This is slower and provided for convenience.
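
A sketch of the convenience path follows. It uses only functions that appear earlier in this document and assumes, as the paragraph above states, that they accept a plain matrix. The requireNamespace guard, the brute-force comparison and the half-open bounds in that comparison are additions for illustration.

```r
set.seed(7)
x <- matrix(runif(3e3), ncol = 3)
lower <- c(0, 0, 0)
upper <- c(1/4, 1/4, 1/4)

# Brute-force count of rows inside the box, for comparison.
# Assumption: bounds are treated as lower <= x < upper.
inside <- rowSums(sweep(x, 2, lower, ">=") & sweep(x, 2, upper, "<")) == ncol(x)
n_expected <- sum(inside)

if (requireNamespace("kdtools", quietly = TRUE)) {
  xs <- kdtools::kd_sort(x)                          # matrix in, matrix out
  hits <- kdtools::kd_range_query(xs, lower, upper)  # returned as a matrix
  print(c(kdtools = nrow(hits), brute_force = n_expected))
}
```

For repeated queries on the same data, converting once with matrix_to_tuples and sorting in place avoids paying the conversion cost on every call.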

Timothy H. Keitt

2022-09-25