Conforming to its' philosophy of freedom (of choice), reading data into R can be performed in various ways.
read.table()
and familyUsage: read.table(file, header = FALSE, sep = "", colClasses = NA, stringsAsFactors = TRUE, ...)
data.table()
package provides an alternative.fread()
fast file reader function is a fast and efficient way to read in data into RUsage: fread(input, ...)
where ...
takes in same arguments as that of read.table
.
read.table()
and fread()
with a 20MB .csv fileThe file flights.csv
can be downloaded from here
# install packages if not present
#install.packages(c("data.table", "rbenchmark"))
# load install packages
library(data.table); library(rbenchmark)
# file saved in windows default directory (~ = C:/Users/.../Documents)
read_base <- function(x) raw <- read.csv("~/flights.csv")
read_DT <- function(x) rawDT <<- fread("~/flights.csv")
# reading a 20MB .csv file
benchmark(read_base(), read_DT(), replications = 1,
columns = c("test", "elapsed"))
## test elapsed
## 1 read_base() 3.12
## 2 read_DT() 0.28
fread()
is almost 5x as fast as read.csv()
Having read the flights.csv
data into R using fread()
function, here are the first few rows
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013 1 1 517 2 830 11 UA N14228
## 2: 2013 1 1 533 4 850 20 UA N24211
## 3: 2013 1 1 542 2 923 33 AA N619AA
## 4: 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5: 2013 1 1 554 -6 812 -25 DL N668DN
## 6: 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1: 1545 EWR IAH 227 1400 5 17
## 2: 1714 LGA IAH 227 1416 5 33
## 3: 1141 JFK MIA 160 1089 5 42
## 4: 725 JFK BQN 183 1576 5 44
## 5: 461 LGA ATL 116 762 5 54
## 6: 1696 EWR ORD 150 719 5 54
The data contains flights information of all planes that departed NYC (i.e. JFK, LGA or EWR airports) in 2013.
Data Source: RITA, Bureau of transportation statistics
data.table()
data.table()
packagedata.table()
provides faster and efficient manipulation of data stored in RAM
Perform the following operations:
General form:
DT[i, j, by]
: Take DT, subset rows usingi
, calculatej
grouped byby
DT[i, j, by]
: Take DT, subset rows using i
, calculate j
, grouped by by
Rows can be filtered using column names satisfying conditions
head(rawDT[origin == "JFK" & dep_delay > 30])
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013 1 1 826 71 1136 51 AA N3GVAA
## 2: 2013 1 1 848 853 1001 851 MQ N942MQ
## 3: 2013 1 1 909 59 1331 16 AA N5EXAA
## 4: 2013 1 1 1337 77 1649 78 B6 N636JB
## 5: 2013 1 1 1428 59 1803 83 B6 N635JB
## 6: 2013 1 1 1515 38 1834 52 B6 N178JB
## flight origin dest air_time distance hour minute
## 1: 443 JFK MIA 160 1089 8 26
## 2: 3944 JFK BWI 41 184 8 48
## 3: 655 JFK STT 184 1623 9 9
## 4: 673 JFK LAX 352 2475 13 37
## 5: 355 JFK BUR 371 2465 14 28
## 6: 347 JFK SRQ 171 1041 15 15
DT[i, j, by]
: Take DT, subset rows using i
, calculate j
grouped by by
New variables can be created in the j
argument of data table operation
head(rawDT[, .(air_speed = distance/air_time)])
## air_speed
## 1: 6.167401
## 2: 6.237885
## 3: 6.806250
## 4: 8.612022
## 5: 6.568966
## 6: 4.793333
.()
is an alias for list()
to perform multiple operations separated by ','.()
is not used, the result is a vector, else the result is a data.table
data.table
To add the air_speed
variable in the rawDT data.table
, use :=
operator
head(rawDT[, air_speed := distance/air_time])
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013 1 1 517 2 830 11 UA N14228
## 2: 2013 1 1 533 4 850 20 UA N24211
## 3: 2013 1 1 542 2 923 33 AA N619AA
## 4: 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5: 2013 1 1 554 -6 812 -25 DL N668DN
## 6: 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute air_speed
## 1: 1545 EWR IAH 227 1400 5 17 6.167401
## 2: 1714 LGA IAH 227 1416 5 33 6.237885
## 3: 1141 JFK MIA 160 1089 5 42 6.806250
## 4: 725 JFK BQN 183 1576 5 44 8.612022
## 5: 461 LGA ATL 116 762 5 54 6.568966
## 6: 1696 EWR ORD 150 719 5 54 4.793333
data.table
air_speed
and total_delay (=dep_delay + arr_delay)
use a chained operation# in rawDT[, create var1][, create var2][print rows 1:3]
rawDT[, air_speed := distance/air_time][, total_delay := dep_delay + arr_delay][1:3]
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013 1 1 517 2 830 11 UA N14228
## 2: 2013 1 1 533 4 850 20 UA N24211
## 3: 2013 1 1 542 2 923 33 AA N619AA
## flight origin dest air_time distance hour minute air_speed total_delay
## 1: 1545 EWR IAH 227 1400 5 17 6.167401 13
## 2: 1714 LGA IAH 227 1416 5 33 6.237885 24
## 3: 1141 JFK MIA 160 1089 5 42 6.806250 35
DT[i, j, by]
: Take DT, subset rows using i
, calculate j
grouped by by
na.rm = TRUE
which removes the NA
(missing values) from the data when calculating the mean
# calculate average air speed by carrier and print rows 1 to 5
rawDT[, .(avg_air_speed = mean(distance/air_time, na.rm = TRUE)), by = carrier][1:5]
## carrier avg_air_speed
## 1: UA 7.014730
## 2: AA 6.957879
## 3: B6 6.666191
## 4: DL 6.974380
## 5: EV 6.049060
Summarize data using necessary arguments of i
, j
and by
of data.table
Examples:
origin == "JFK
)origin == "JFK
), find the carrier with maximum average departure delay head(rawDT[origin == "JFK", .N, by = day])
## day N
## 1: 1 3663
## 2: 2 3661
## 3: 3 3696
## 4: 4 3638
## 5: 5 3608
## 6: 6 3640
head(rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = carrier][order(-avg_dep_delay)])
## carrier avg_dep_delay
## 1: 9E 19.00152
## 2: EV 18.52036
## 3: VX 13.27944
## 4: MQ 13.19997
## 5: B6 12.75745
## 6: AA 10.30216
smryDT <- rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = month]
# setorder works much faster than base::order from previous example
setorder(smryDT, -avg_dep_delay)
head(smryDT)
## month avg_dep_delay
## 1: 7 23.76926
## 2: 6 20.49973
## 3: 12 14.78835
## 4: 8 12.91436
## 5: 5 12.51943
## 6: 4 12.24906
Joins in data.table
are performed using merge.data.table()
function. However, data.tables
need to be sorted by key(s)
which are established using setkey()
for an existing data.table
or key
argument while creating one
data.tables
with same key
value to joindt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"); head(dt1)
## A X
## 1: a 1
## 2: b 2
## 3: c 3
## 4: d 4
## 5: e 5
## 6: f 6
dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"); head(dt2)
## A Y
## 1: e 1
## 2: f 2
## 3: g 3
## 4: h 4
## 5: i 5
## 6: j 6
# left outer join
merge.data.frame(x = dt1, y = dt2, all.x=TRUE)
## A X Y
## 1 a 1 NA
## 2 b 2 NA
## 3 c 3 NA
## 4 d 4 NA
## 5 e 5 1
## 6 f 6 2
## 7 g 7 3
## 8 h 8 4
## 9 i 9 5
## 10 j 10 6
# right inner join
merge.data.frame(x = dt2, y = dt1, all.x=FALSE)
## A Y X
## 1 e 1 5
## 2 f 2 6
## 3 g 3 7
## 4 h 4 8
## 5 i 5 9
## 6 j 6 10
# full outer join
merge.data.frame(x = dt1, y = dt2, all = TRUE)
## A X Y
## 1 a 1 NA
## 2 b 2 NA
## 3 c 3 NA
## 4 d 4 NA
## 5 e 5 1
## 6 f 6 2
## 7 g 7 3
## 8 h 8 4
## 9 i 9 5
## 10 j 10 6
## 11 k NA 7
## 12 l NA 8
## 13 m NA 9
## 14 n NA 10
melt
data.tables
can be reshaped using the melt
and dcast
functions:
melt
: Wide-to-long (melting)Usage: melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", ...)
where,
data
A data.table
to melt
id.vars
vector of id variables; if missing, all non-id columns are assigned
measure.vars
vector of measure variables; if missing, all non-id columns are assigned
variable.name
name for the measured variable names column
value.name
name for the molten data values column
...
advanced argument for melt
functions
Create the data to melt
library(reshape2)
DT <- data.table(
i1 = c(1:3, NA),
i2 = c(5, 6, 7, 8),
f1 = c("A", "C", "D", "Q"),
c1 = c("XY", "FE", "AA", "GG"))
DT
## i1 i2 f1 c1
## 1: 1 5 A XY
## 2: 2 6 C FE
## 3: 3 7 D AA
## 4: NA 8 Q GG
(DT.melt1 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1")))
## i1 i2 variable value
## 1: 1 5 f1 A
## 2: 2 6 f1 C
## 3: 3 7 f1 D
## 4: NA 8 f1 Q
## 5: 1 5 c1 XY
## 6: 2 6 c1 FE
## 7: 3 7 c1 AA
## 8: NA 8 c1 GG
#rename variable and value columns
(DT.melt2 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1"), variable.name = "Factors", value.name = "data_value"))
## i1 i2 Factors data_value
## 1: 1 5 f1 A
## 2: 2 6 f1 C
## 3: 3 7 f1 D
## 4: NA 8 f1 Q
## 5: 1 5 c1 XY
## 6: 2 6 c1 FE
## 7: 3 7 c1 AA
## 8: NA 8 c1 GG
dcast
dcast.data.table
: Long-to-wide (casting)Usage: dcast.data.table(data, formula, fun.aggregate = NULL, ...)
where,
data
A molten data.table
formula
A formula of the form LHS ~ RHS to cast, eg: var1 + var2 ~ var3. The first varies slowest, and the last fastest. "..." represents all other variables not used in the formula and "." represents no variable
fun.aggregate
Aggregation function needed if variables do not identify a single observation for each output cell
...
other advanced arguments
dcast
the molten data.table
(DT.dcast <- dcast.data.table(DT.melt2, i1+i2~Factors))
## Using 'data_value' as value column. Use 'value.var' to override
## i1 i2 f1 c1
## 1: NA 8 Q GG
## 2: 1 5 A XY
## 3: 2 6 C FE
## 4: 3 7 D AA
/
#