kdtools can now be used in two ways: on a C++ vector of arrays (an arrayvec object) or natively on a data frame. Sorting arrayvec objects is fast. Passing a data frame is slower, but automatically supports mixed types.
Please see the pkgdown site for more articles on kdtools.
When working with a data frame, you can specify which columns to use and the order of inclusion using the cols argument. Omitting the cols argument uses all columns in order.
# sort by weight, miles-per-gallon and displacement
mtcars_sorted <- kd_sort(mtcars, cols = ~ wt + mpg + disp)
head(mtcars_sorted, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
tail(mtcars_sorted, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
cols <- colspec(mtcars, ~wt + mpg + disp)
lower <- c(2.5, 17, 120)
upper <- c(3.6, 22, 330)
kd_range_query(mtcars_sorted, lower, upper, cols = cols)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
weights <- 1 / diag(var(mtcars[, cols]))
kd_nearest_neighbors(mtcars_sorted, lower, 2, cols = cols, w = weights)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
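The two data frame queries above can be sanity-checked with a brute-force pass in base R. This is only a sketch and does not use kdtools: the half-open box `[lower, upper)` and the choice to apply the weights to squared differences are assumptions made for illustration, not documented behavior of `kd_range_query` or `kd_nearest_neighbors`.

```r
# Columns and query values copied from the example above
cols  <- c("wt", "mpg", "disp")
lower <- c(2.5, 17, 120)
upper <- c(3.6, 22, 330)
m <- as.matrix(mtcars[, cols])

# Brute-force box filter: keep rows where every column lies in [lower, upper)
# (the half-open interval is an assumption)
inside <- sweep(m, 2, lower, ">=") & sweep(m, 2, upper, "<")
in_box <- rowSums(inside) == length(cols)
box_names <- rownames(m)[in_box]

# Brute-force weighted neighbors of `lower` using inverse-variance weights
# (weighting the squared differences is an assumption)
w  <- 1 / diag(var(m))
d2 <- colSums(w * (t(m) - lower)^2)
nn_names <- names(sort(d2))[1:2]

box_names
nn_names
```

For this box, no car sits exactly on a boundary, so the eight names in `box_names` are the same eight rows returned by the range query above. The neighbor ranking depends on how the weights enter the distance, so `nn_names` may differ from the kdtools result.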
The kdtools package can be used to search for multidimensional points in a boxed region and find nearest neighbors in 1 to 9 dimensions. The package uses binary search on a sorted sequence of values. The current package is limited to matrices of real values. If you are interested in using string or mixed types in different dimensions, see the methods vignette.
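As a one-dimensional analogy for searching a sorted sequence, the sketch below sorts a vector once and then locates a half-open interval with base R's `findInterval`, which performs a binary search. It illustrates the general idea only and is not how kdtools is implemented; the names `first`, `last` and `hits` are made up for this example.

```r
set.seed(1)
v <- sort(runif(1000))  # sorting is the one-time preprocessing step

# With left.open = TRUE, findInterval(x, v) counts the elements strictly below x
first <- findInterval(0.25, v, left.open = TRUE) + 1  # first index with v >= 0.25
last  <- findInterval(0.50, v, left.open = TRUE)      # last index with v < 0.50

# All values in [0.25, 0.50) form one contiguous run in the sorted vector
hits <- if (first <= last) v[first:last] else numeric(0)
length(hits)
```

Because the matches are contiguous after sorting, each query costs two binary searches rather than a scan of all values.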
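Using kdtools is straightforward. There are four steps: convert the matrix to an arrayvec object, sort it, search it, and convert the results back to a matrix.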
library(kdtools)
x = matrix(runif(3e3), ncol = 3)
y = matrix_to_tuples(x)
y[1:3, c(1, 3)]
#> [,1] [,2]
#> [1,] 0.9063688 0.3226261
#> [2,] 0.4677785 0.6382602
#> [3,] 0.5900272 0.7796014
The arrayvec object can be manipulated as if it were a matrix.
kd_sort(y, inplace = TRUE, parallel = TRUE)
#> [,1] [,2] [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 995 more rows)
rq = kd_range_query(y, c(0, 0, 0), c(1/4, 1/4, 1/4)); rq
#> [,1] [,2] [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 9 more rows)
i = kd_nearest_neighbor(y, c(0, 0, 0)); y[i, ]
#> [1] 0.02096311 0.07102602 0.09382227
nns = kd_nearest_neighbors(y, c(0, 0, 0), 100); nns
#> [,1] [,2] [,3]
#> [1,] 0.02096311 0.07102602 0.09382227
#> [2,] 0.09239822 0.11579286 0.01748684
#> [3,] 0.04734198 0.12577159 0.12383529
#> [4,] 0.08320526 0.01440299 0.19734563
#> [5,] 0.06199959 0.20603226 0.01682205
#> (continues for 95 more rows)
nni = kd_nn_indices(y, c(0, 0, 0), 10); nni
#> [1] 2 1 4 3 5 10 12 9 14 16
The kd_nearest_neighbor and kd_nn_indices functions return row-indices. The other functions return arrayvec objects.
head(tuples_to_matrix(rq))
#> [,1] [,2] [,3]
#> [1,] 0.09239822 0.11579286 0.01748684
#> [2,] 0.02096311 0.07102602 0.09382227
#> [3,] 0.08320526 0.01440299 0.19734563
#> [4,] 0.04734198 0.12577159 0.12383529
#> [5,] 0.06199959 0.20603226 0.01682205
#> [6,] 0.02192858 0.15486310 0.24297762
head(tuples_to_matrix(nns))
#> [,1] [,2] [,3]
#> [1,] 0.02096311 0.07102602 0.09382227
#> [2,] 0.09239822 0.11579286 0.01748684
#> [3,] 0.04734198 0.12577159 0.12383529
#> [4,] 0.08320526 0.01440299 0.19734563
#> [5,] 0.06199959 0.20603226 0.01682205
#> [6,] 0.13167283 0.02859779 0.17847762
If you pass a matrix instead of an arrayvec object to any of the functions, it will be converted to an arrayvec object internally and results will be returned as matrices. This is slower and provided for convenience.
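A minimal sketch of the matrix convenience path, using only functions that appear above. It assumes kdtools is installed (the `requireNamespace` guard skips the example otherwise) and compares the range query against a brute-force filter in base R; the variable names are illustrative.

```r
if (requireNamespace("kdtools", quietly = TRUE)) {
  library(kdtools)
  set.seed(42)
  x <- matrix(runif(3e3), ncol = 3)

  # Matrix in, matrix out: the arrayvec conversion happens internally
  xs   <- kd_sort(x)
  hits <- kd_range_query(xs, c(0, 0, 0), c(1/4, 1/4, 1/4))

  # Brute-force reference: rows with all three coordinates below 1/4
  brute <- x[rowSums(x < 1/4) == 3, , drop = FALSE]

  print(is.matrix(hits))
  print(nrow(hits) == nrow(brute))
}
```

For repeated queries on the same data, converting once with matrix_to_tuples and reusing the sorted arrayvec object avoids paying the conversion cost on every call.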