Overview of set family
* set() is a loopable, low-overhead version of :=;
* You can use setnames() to set or change column names;
* setorder() and setcolorder() reorder the rows and columns of a data.table;
* setkey() sets the key of a data.table.
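As a quick illustration of the helpers listed above, here is a minimal sketch on a made-up table. The table dt and its columns are assumptions for this example only, not the data used later in this post. All four functions modify the table by reference, so no assignment is needed.

```r
library(data.table)

# Hypothetical toy table (not the dtm used in this post)
dt <- data.table(a = c(3, 1, 2), b = c("x", "y", "z"))

setnames(dt, old = "a", new = "id")  # rename column a to id
setorder(dt, id)                     # sort the rows by id, ascending
setcolorder(dt, c("b", "id"))        # move column b to the front
setkey(dt, id)                       # sort by id and mark id as the key

names(dt)  # "b" "id"
key(dt)    # "id"
```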
set()
To set every value in the column am to 1:
# j accepts a column name directly, so no index lookup is needed
set(dtm, j = 'am', value = 1)
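The "loopable" part is where set() is most useful: it skips the overhead of [.data.table, so calling it once per column inside a for loop stays cheap. The sketch below assumes a made-up table dt with a few missing values and replaces every NA with 0, column by column.

```r
library(data.table)

# Hypothetical toy table with some missing values
dt <- data.table(x = c(1, NA, 3), y = c(NA, 5, 6))

# Replace NA with 0 in each column; set() modifies dt in place
for (col in names(dt)) {
  na_rows <- which(is.na(dt[[col]]))   # row numbers to overwrite
  set(dt, i = na_rows, j = col, value = 0)
}

dt$x  # 1 0 3
dt$y  # 0 5 6
```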
setkey() makes the data.table behave like an indexed database table. According to the documentation:
setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always.
setkey reorders (or sorts) the rows of a data.table by the columns provided. In versions 1.9+, for integer columns… It is extremely fast, but is limited by the range of integer values being <= 1e5.
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: NA NA 8 NA NA NA NA NA NA NA NA NA Renault
It returns NA in all columns but the 2 keys.
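The code that built dtm is not shown in this section, so the sketch below assumes it is mtcars with the row names kept as model and brand taken as the first word of model. Under that assumption, it shows a keyed lookup that finds no match, and how nomatch = NULL (available in recent data.table versions) drops the NA row instead.

```r
library(data.table)

# Assumed reconstruction of dtm (the original construction is not shown)
dtm <- as.data.table(mtcars, keep.rownames = "model")
dtm[, brand := sub(" .*", "", model)]   # first word of the model name
setkey(dtm, brand, cyl)                 # sort by brand, cyl and mark them as key

key(dtm)                                # "brand" "cyl"

no_match <- dtm[.("Merc", 7)]           # one row: NA except the two key columns
dropped  <- dtm[.("Merc", 7), nomatch = NULL]  # zero rows
```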
select groups
By default, subsetting a keyed data.table returns all the matching records. This is controlled by the argument mult, whose default value is ‘all’. You can also specify ‘first’ or ‘last’.
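To make the effect of mult concrete, here is a small sketch on a made-up keyed table (dt, g and v are assumptions for this example only).

```r
library(data.table)

# Hypothetical toy table keyed by g
dt <- data.table(g = c("a", "a", "a", "b"), v = 1:4, key = "g")

all_a   <- dt["a"]                  # default mult = "all": three rows
first_a <- dt["a", mult = "first"]  # only the first matching row (v = 1)
last_a  <- dt["a", mult = "last"]   # only the last matching row (v = 3)
```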
You can also pass specific values in i. To get a separate result for each row of i, specify by = .EACHI.
When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in ‘DT’ that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.
dti[c('setosa', 'versicolor'), .SD[c(1, .N)], by = .EACHI]
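The difference that by = .EACHI makes is easiest to see with an aggregate. The sketch below assumes dti is iris converted to a data.table and keyed by Species (its construction is not shown here). Without by = .EACHI, j is evaluated once over all joined rows; with it, j is evaluated once per value in i.

```r
library(data.table)

# Assumed reconstruction of dti
dti <- as.data.table(iris)
setkey(dti, Species)

# One overall mean across both species
overall <- dti[c("setosa", "versicolor"), .(mean_sl = mean(Sepal.Length))]

# One mean per species listed in i
per_i <- dti[c("setosa", "versicolor"), .(mean_sl = mean(Sepal.Length)),
             by = .EACHI]
```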
Let’s check the records of brand ‘Merc’ and 5 cylinders.
dtm[.('Merc', 5)]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: NA NA 5 NA NA NA NA NA NA NA NA NA Merc
There is no such record. To return a matching record instead of NA, use the argument roll.
roll = TRUE: if i’s row matches all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward.
dtm[.(c('Merc', 'Toyota'), 5), roll = TRUE]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: Merc 230 22.8 5 140.8 95 3.92 3.150 22.9 1 0 4 2 Merc
## 2: Toyota Corolla 33.9 5 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota
roll = 'nearest': i’s row is joined to the nearest value in x instead, whether that value lies before or after it.
# compare the commands below
dtm[.(c('Merc', 'Toyota'), 3), roll = TRUE]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: NA NA 3 NA NA NA NA NA NA NA NA NA Merc
## 2: NA NA 3 NA NA NA NA NA NA NA NA NA Toyota
dtm[.(c('Merc', 'Toyota'), 3), roll = 'nearest']
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: Merc 240D 24.4 3 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc
## 2: Toyota Corona 21.5 3 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota
roll can also be a finite number, which limits how far a value may be rolled.
# compare the commands below
dtm[.('Merc', 6:10)]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 Merc
## 2: Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 Merc
## 3: NA NA 7 NA NA NA NA NA NA NA NA NA Merc
## 4: Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 Merc
## 5: Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 Merc
## 6: Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3 Merc
## 7: NA NA 9 NA NA NA NA NA NA NA NA NA Merc
## 8: NA NA 10 NA NA NA NA NA NA NA NA NA Merc
dtm[.('Merc', 6:10), roll = 1]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 Merc
## 2: Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 Merc
## 3: Merc 280C 17.8 7 167.6 123 3.92 3.44 18.9 1 0 4 4 Merc
## 4: Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 Merc
## 5: Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 Merc
## 6: Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3 Merc
## 7: Merc 450SLC 15.2 9 275.8 180 3.07 3.78 18.0 0 0 3 3 Merc
## 8: NA NA 10 NA NA NA NA NA NA NA NA NA Merc
There is no Mercedes with 7, 9 or 10 cylinders, but we allow values to roll forward over a distance of at most 1. So 7 and 9 cylinders are both joined, but 10 is not, because it is more than 1 away from the prevailing observation (8).
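The three forms of roll can be compared side by side on a tiny table. This is only a sketch: x, i and their columns are made up for the example, with observations at t = 1, 4 and 10 and lookups at t = 2, 6 and 9.

```r
library(data.table)

# Hypothetical toy tables; x is keyed by id and t
x <- data.table(id = "a", t = c(1, 4, 10), val = c("p", "q", "r"),
                key = c("id", "t"))
i <- data.table(id = "a", t = c(2, 6, 9))

r_true    <- x[i, roll = TRUE]$val       # "p" "q" "q": always roll forward
r_one     <- x[i, roll = 1]$val          # "p" NA  NA : roll forward at most 1
r_nearest <- x[i, roll = "nearest"]$val  # "p" "q" "r": closest value wins
```

Only t = 2 lies within a distance of 1 from an earlier observation, so roll = 1 fills that row alone, while roll = "nearest" picks t = 10 for the lookup at t = 9.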
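Now let’s take a look at the argument rollends. It is a vector of two logical values (a single logical is recycled): the first controls whether the first value of a group is rolled backward, and the second whether the last value is rolled forward.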
# compare the commands below
setkey(dtm, am, carb)
dtm[.(1, (-1):8)]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: NA NA NA NA NA NA NA NA NA 1 NA -1 NA
## 2: NA NA NA NA NA NA NA NA NA 1 NA 0 NA
## 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun
## 4: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat
## 5: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat
## 6: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota
## 7: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda
## 8: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus
## 9: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche
## 10: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 Volvo
## 11: NA NA NA NA NA NA NA NA NA 1 NA 3 NA
## 12: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford
## 13: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda
## 14: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda
## 15: NA NA NA NA NA NA NA NA NA 1 NA 5 NA
## 16: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari
## 17: NA NA NA NA NA NA NA NA NA 1 NA 7 NA
## 18: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati
dtm[.(1, (-1):8), roll = TRUE]
## model mpg cyl disp hp drat wt qsec vs am gear carb brand
## 1: NA NA NA NA NA NA NA NA NA 1 NA -1 NA
## 2: NA NA NA NA NA NA NA NA NA 1 NA 0 NA
## 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun
## 4: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat
## 5: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat
## 6: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota
## 7: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda
## 8: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus
## 9: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche
## 10: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 Volvo
## 11: Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 3 Volvo
## 12: Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford
## 13: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda
## 14: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda
## 15: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 5 Mazda
## 16: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari
## 17: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 7 Ferrari
## 18: Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati
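In the output above, carb values -1 and 0 stay NA because, with roll = TRUE, rollends defaults to c(FALSE, TRUE): the first value is not rolled backward, while the last value is rolled forward. The sketch below shows the effect of changing either element on a made-up table (x, i and their columns are assumptions for this example only).

```r
library(data.table)

# Hypothetical toy tables: observations at t = 2 and 5,
# lookups before, between and after them
x <- data.table(t = c(2, 5), val = c("a", "b"), key = "t")
i <- data.table(t = c(1, 3, 7))

r_default <- x[i, roll = TRUE]$val                              # NA  "a" "b"
r_both    <- x[i, roll = TRUE, rollends = c(TRUE, TRUE)]$val    # "a" "a" "b"
r_none    <- x[i, roll = TRUE, rollends = c(FALSE, FALSE)]$val  # NA  "a" NA
```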
Run the command library(help = data.table) and you will see many other small functions, such as frank and foverlaps. They aim to provide faster versions of functions with a similar effect in base R or other packages. Take a look at them when necessary.