Conforming to its' philosophy of freedom (of choice), reading data into R can be performed in various ways.
read.table() and familyUsage: read.table(file, header = FALSE, sep = "", colClasses = NA, stringsAsFactors = TRUE, ...)
data.table() package provides an alternative.fread() fast file reader function is a fast and efficient way to read in data into RUsage: fread(input, ...) where ... takes in same arguments as that of read.table.
read.table() and fread() with a 20MB .csv fileThe file flights.csv can be downloaded from here
# install packages if not present
#install.packages(c("data.table", "rbenchmark")) 
# load install packages
library(data.table); library(rbenchmark)
# file saved in windows default directory (~ = C:/Users/.../Documents)
read_base <- function(x) raw <- read.csv("~/flights.csv")
read_DT   <- function(x) rawDT <<- fread("~/flights.csv")
# reading a 20MB .csv file
benchmark(read_base(), read_DT(), replications = 1,
  columns = c("test", "elapsed")) 
##          test elapsed
## 1 read_base()    3.12
## 2   read_DT()    0.28
fread() is almost 5x as fast as read.csv()
Having read the flights.csv data into R using fread() function, here are the first few rows
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013     1   1      517         2      830        11      UA  N14228
## 2: 2013     1   1      533         4      850        20      UA  N24211
## 3: 2013     1   1      542         2      923        33      AA  N619AA
## 4: 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5: 2013     1   1      554        -6      812       -25      DL  N668DN
## 6: 2013     1   1      554        -4      740        12      UA  N39463
##    flight origin dest air_time distance hour minute
## 1:   1545    EWR  IAH      227     1400    5     17
## 2:   1714    LGA  IAH      227     1416    5     33
## 3:   1141    JFK  MIA      160     1089    5     42
## 4:    725    JFK  BQN      183     1576    5     44
## 5:    461    LGA  ATL      116      762    5     54
## 6:   1696    EWR  ORD      150      719    5     54
The data contains flights information of all planes that departed NYC (i.e. JFK, LGA or EWR airports) in 2013.
Data Source: RITA, Bureau of transportation statistics
data.table()data.table() packagedata.table() provides faster and efficient manipulation of data stored in RAM
Perform the following operations:
General form:
DT[i, j, by]: Take DT, subset rows usingi, calculatejgrouped byby
DT[i, j, by]: Take DT, subset rows using i, calculate j, grouped by by 
Rows can be filtered using column names satisfying conditions
head(rawDT[origin == "JFK" & dep_delay > 30])
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013     1   1      826        71     1136        51      AA  N3GVAA
## 2: 2013     1   1      848       853     1001       851      MQ  N942MQ
## 3: 2013     1   1      909        59     1331        16      AA  N5EXAA
## 4: 2013     1   1     1337        77     1649        78      B6  N636JB
## 5: 2013     1   1     1428        59     1803        83      B6  N635JB
## 6: 2013     1   1     1515        38     1834        52      B6  N178JB
##    flight origin dest air_time distance hour minute
## 1:    443    JFK  MIA      160     1089    8     26
## 2:   3944    JFK  BWI       41      184    8     48
## 3:    655    JFK  STT      184     1623    9      9
## 4:    673    JFK  LAX      352     2475   13     37
## 5:    355    JFK  BUR      371     2465   14     28
## 6:    347    JFK  SRQ      171     1041   15     15
DT[i, j, by]: Take DT,  subset rows using i,  calculate j  grouped by by 
New variables can be created in the j argument of data table operation
head(rawDT[, .(air_speed = distance/air_time)])
##    air_speed
## 1:  6.167401
## 2:  6.237885
## 3:  6.806250
## 4:  8.612022
## 5:  6.568966
## 6:  4.793333
.() is an alias for list() to perform multiple operations separated by ','.() is not used, the result is a vector, else the result is a data.tabledata.tableTo add the air_speed variable in the rawDT data.table, use := operator
head(rawDT[, air_speed := distance/air_time])
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013     1   1      517         2      830        11      UA  N14228
## 2: 2013     1   1      533         4      850        20      UA  N24211
## 3: 2013     1   1      542         2      923        33      AA  N619AA
## 4: 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5: 2013     1   1      554        -6      812       -25      DL  N668DN
## 6: 2013     1   1      554        -4      740        12      UA  N39463
##    flight origin dest air_time distance hour minute air_speed
## 1:   1545    EWR  IAH      227     1400    5     17  6.167401
## 2:   1714    LGA  IAH      227     1416    5     33  6.237885
## 3:   1141    JFK  MIA      160     1089    5     42  6.806250
## 4:    725    JFK  BQN      183     1576    5     44  8.612022
## 5:    461    LGA  ATL      116      762    5     54  6.568966
## 6:   1696    EWR  ORD      150      719    5     54  4.793333
data.tableair_speed and total_delay (=dep_delay + arr_delay) use a chained operation# in rawDT[, create var1][, create var2][print rows 1:3]
rawDT[, air_speed := distance/air_time][, total_delay := dep_delay + arr_delay][1:3]
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1: 2013     1   1      517         2      830        11      UA  N14228
## 2: 2013     1   1      533         4      850        20      UA  N24211
## 3: 2013     1   1      542         2      923        33      AA  N619AA
##    flight origin dest air_time distance hour minute air_speed total_delay
## 1:   1545    EWR  IAH      227     1400    5     17  6.167401          13
## 2:   1714    LGA  IAH      227     1416    5     33  6.237885          24
## 3:   1141    JFK  MIA      160     1089    5     42  6.806250          35
DT[i, j, by]: Take DT,  subset rows using i, calculate j  grouped by by
na.rm = TRUE which removes the NA (missing values) from the data when calculating the mean# calculate average air speed by carrier and print rows 1 to 5
rawDT[, .(avg_air_speed = mean(distance/air_time, na.rm = TRUE)), by = carrier][1:5]
##    carrier avg_air_speed
## 1:      UA      7.014730
## 2:      AA      6.957879
## 3:      B6      6.666191
## 4:      DL      6.974380
## 5:      EV      6.049060
Summarize data using necessary arguments of i, j and by of data.table
Examples:
origin == "JFK)origin == "JFK), find the carrier  with maximum average departure delay head(rawDT[origin == "JFK", .N, by = day])
##    day    N
## 1:   1 3663
## 2:   2 3661
## 3:   3 3696
## 4:   4 3638
## 5:   5 3608
## 6:   6 3640
head(rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = carrier][order(-avg_dep_delay)])
##    carrier avg_dep_delay
## 1:      9E      19.00152
## 2:      EV      18.52036
## 3:      VX      13.27944
## 4:      MQ      13.19997
## 5:      B6      12.75745
## 6:      AA      10.30216
smryDT <- rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = month]
# setorder works much faster than base::order from previous example
setorder(smryDT, -avg_dep_delay)
head(smryDT)
##    month avg_dep_delay
## 1:     7      23.76926
## 2:     6      20.49973
## 3:    12      14.78835
## 4:     8      12.91436
## 5:     5      12.51943
## 6:     4      12.24906
Joins in data.table are performed using merge.data.table() function. However, data.tables need to be sorted by key(s) which are established using setkey() for an existing data.table or key argument while creating one
data.tables with same key value to joindt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"); head(dt1)
##    A X
## 1: a 1
## 2: b 2
## 3: c 3
## 4: d 4
## 5: e 5
## 6: f 6
dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"); head(dt2)
##    A Y
## 1: e 1
## 2: f 2
## 3: g 3
## 4: h 4
## 5: i 5
## 6: j 6
# left outer join
merge.data.frame(x = dt1, y = dt2, all.x=TRUE)
##    A  X  Y
## 1  a  1 NA
## 2  b  2 NA
## 3  c  3 NA
## 4  d  4 NA
## 5  e  5  1
## 6  f  6  2
## 7  g  7  3
## 8  h  8  4
## 9  i  9  5
## 10 j 10  6
# right inner join
merge.data.frame(x = dt2, y = dt1, all.x=FALSE)
##   A Y  X
## 1 e 1  5
## 2 f 2  6
## 3 g 3  7
## 4 h 4  8
## 5 i 5  9
## 6 j 6 10
# full outer join
merge.data.frame(x = dt1, y = dt2, all = TRUE)
##    A  X  Y
## 1  a  1 NA
## 2  b  2 NA
## 3  c  3 NA
## 4  d  4 NA
## 5  e  5  1
## 6  f  6  2
## 7  g  7  3
## 8  h  8  4
## 9  i  9  5
## 10 j 10  6
## 11 k NA  7
## 12 l NA  8
## 13 m NA  9
## 14 n NA 10
meltdata.tables can be reshaped using the melt and dcast functions:
melt: Wide-to-long (melting)Usage: melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", ...) 
where,
data A data.table to melt
id.vars vector of id variables; if missing, all non-id columns are assigned
measure.vars vector of measure variables; if missing, all non-id columns are assigned
variable.name name for the measured variable names column
value.name name for the molten data values column
... advanced argument for melt functions
Create the data to melt
library(reshape2)
DT <- data.table(
      i1 = c(1:3, NA), 
      i2 = c(5, 6, 7, 8), 
      f1 = c("A", "C", "D", "Q"), 
      c1 = c("XY", "FE", "AA", "GG"))
DT
##    i1 i2 f1 c1
## 1:  1  5  A XY
## 2:  2  6  C FE
## 3:  3  7  D AA
## 4: NA  8  Q GG
(DT.melt1 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1")))
##    i1 i2 variable value
## 1:  1  5       f1     A
## 2:  2  6       f1     C
## 3:  3  7       f1     D
## 4: NA  8       f1     Q
## 5:  1  5       c1    XY
## 6:  2  6       c1    FE
## 7:  3  7       c1    AA
## 8: NA  8       c1    GG
#rename variable and value columns
(DT.melt2 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1"), variable.name = "Factors", value.name = "data_value"))
##    i1 i2 Factors data_value
## 1:  1  5      f1          A
## 2:  2  6      f1          C
## 3:  3  7      f1          D
## 4: NA  8      f1          Q
## 5:  1  5      c1         XY
## 6:  2  6      c1         FE
## 7:  3  7      c1         AA
## 8: NA  8      c1         GG
dcastdcast.data.table: Long-to-wide (casting)Usage: dcast.data.table(data, formula, fun.aggregate = NULL, ...) 
where,
data A molten data.table
formula A formula of the form LHS ~ RHS to cast, eg: var1 + var2 ~ var3. The first varies slowest, and the last fastest.  "..." represents all other variables not used in the formula and "." represents no variable
fun.aggregate Aggregation function needed if variables do not identify a single observation for each output cell
... other advanced arguments
dcast the molten data.table(DT.dcast <- dcast.data.table(DT.melt2, i1+i2~Factors))
## Using 'data_value' as value column. Use 'value.var' to override
##    i1 i2 f1 c1
## 1: NA  8  Q GG
## 2:  1  5  A XY
## 3:  2  6  C FE
## 4:  3  7  D AA
/
#