Not just a speedy way to import data into R, fread has useful options for importing subsets of data, setting data types for columns, running system commands, and more
Like all functions in the data.table R package, fread is fast. Very fast. But there's more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.
Note: If you'd like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.
Use fread's nrows option
Is your file large? Would you like to examine its structure before importing the whole thing, without having to open it in a text editor or Excel? Use fread's nrows option to import only a portion of a file for exploration.
The code below imports just the first 10 rows of the CSV.
library(data.table)
mydt10 <- fread("us-counties.csv", nrows = 10)
If you just want to see column names without any data at all, you can use nrows = 0.
Use fread's select option
Once you know the file structure, you can choose which columns to import. fread's select option lets you pick columns you want to keep. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:
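Here is a minimal sketch of both uses of nrows. To keep it self-contained, it assumes a tiny stand-in CSV written to a temporary file (the tmp path and its three rows are my own placeholders, not the full New York Times download).

```r
library(data.table)

# Hypothetical stand-in for a large file: a tiny CSV written to a temp path
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-22,Snohomish,Washington,53061,1,0",
  "2020-01-23,Snohomish,Washington,53061,1,0"
), tmp)

# nrows = 0 reads only the header, giving a zero-row data.table
header_only <- fread(tmp, nrows = 0)
col_names <- names(header_only)

# nrows = 2 reads the header plus the first two data rows
first_two <- fread(tmp, nrows = 2)

print(col_names)
print(first_two)
```

The zero-row result is handy because names(header_only) can be reused later, for example as input to select or col.names.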
mydt <- fread("us-counties.csv",
select = c("date", "county", "state", "cases"))
As always, numbers don't need quotation marks:
mydt <- fread("us-counties.csv", select = c(1,2,3,5))
You can also pass fread an R object that holds a vector of column names. In the code below, I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread.
my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)
The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:
mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))
As with select, drop takes a vector of column names or numerical positions.
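To see that select and drop are two routes to the same result, here is a small sketch. It assumes a made-up two-row CSV in a temporary file (tmp is a placeholder of mine) with the same six columns as the Covid-19 file.

```r
library(data.table)

# Hypothetical two-row sample with the same column layout as us-counties.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-22,Snohomish,Washington,53061,1,0"
), tmp)

# Keep four columns by name
kept_by_name <- fread(tmp, select = c("date", "county", "state", "cases"))

# Keep the same four columns by position (cases is the fifth column)
kept_by_position <- fread(tmp, select = c(1, 2, 3, 5))

# Drop the other two columns instead
after_drop <- fread(tmp, drop = c("fips", "deaths"))

print(names(kept_by_name))
print(names(kept_by_position))
print(names(after_drop))
```

Which one to use is mostly a matter of which list is shorter: the columns you want or the columns you don't.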
Use fread with grep
If you're familiar with Unix, you can execute command-line tools right from inside fread by passing the command to the cmd argument. For example, if I just wanted California data, I could use grep to only import lines that contain the text "California." Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
ca <- fread(cmd = "grep California us-counties.csv")
Unfortunately, grep doesn't understand the original file's column names, so you end up with default names.
head(ca)
V1 V2 V3 V4 V5 V6
1: 2020-01-25 Orange California 6059 1 0
2: 2020-01-26 Los Angeles California 6037 1 0
3: 2020-01-26 Orange California 6059 1 0
4: 2020-01-27 Los Angeles California 6037 1 0
5: 2020-01-27 Orange California 6059 1 0
6: 2020-01-28 Los Angeles California 6037 1 0
However, fread lets us specify column names with the col.names option. I can set the names based on names from mydt10 that I created above.
ca <- fread(cmd = "grep California us-counties.csv",
            col.names = names(mydt10))
> head(ca)
date county state fips cases deaths
1: 2020-01-25 Orange California 6059 1 0
2: 2020-01-26 Los Angeles California 6037 1 0
3: 2020-01-26 Orange California 6059 1 0
4: 2020-01-27 Los Angeles California 6037 1 0
5: 2020-01-27 Orange California 6059 1 0
6: 2020-01-28 Los Angeles California 6037 1 0
We can also use regular expressions, with grep's -E option, letting us do more complex searches, such as looking for four states at once.
states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv",
col.names = names(mydt10))
Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state), you'll see Oklahoma and Missouri in the state column along with Texas, Arizona, Florida, and South Carolina. That's because both Oklahoma and Missouri have counties named Texas.
So, grep during file import is a way to filter out a lot of data you don't want from a very large data set, but it doesn't guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn't get anything unexpected.
Use fread's colClasses option
You can set column classes during import, and you can do so for just a few columns rather than every one. For example, the date column in this data is coming in as character strings, even though it's in year-month-day format. (Recent versions of data.table, 1.13.0 and later, detect ISO-format dates automatically and read them as IDate.) We can set the column named date to the data type Date during import using the colClasses option.
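That follow-up filter might look like the sketch below. The four rows are made-up placeholders standing in for what the grep import could return (the case counts are not real figures), and wanted and states4_clean are names I've introduced for illustration.

```r
library(data.table)

# Made-up rows standing in for the result of the grep-based import;
# the values are placeholders, not real counts
states4 <- data.table(
  date   = c("2020-03-20", "2020-03-20", "2020-03-20", "2020-03-20"),
  county = c("Harris", "Texas", "Texas", "Maricopa"),
  state  = c("Texas", "Oklahoma", "Missouri", "Arizona"),
  cases  = c(1L, 1L, 1L, 1L)
)

# The states actually wanted
wanted <- c("Texas", "Arizona", "Florida", "South Carolina")

# Filter on the state column itself, which removes the rows that only
# matched because a county is named Texas
states4_clean <- states4[state %in% wanted]

print(unique(states4_clean$state))
```

The grep step cuts down how much data is read from disk; the column filter is what makes the result correct.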
mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))
Now, dates are Dates.
> str(mydt)
Classes 'data.table' and 'data.frame': 322651 obs. of 6 variables:
$ date : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
$ county: chr "Snohomish" "Snohomish" "Snohomish" "Cook" ...
$ state : chr "Washington" "Washington" "Washington" "Illinois" ...
$ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
$ cases : int 1 1 1 1 1 1 1 1 1 1 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
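Here is a short sketch for checking the result of colClasses, assuming a two-row sample CSV in a temporary file (tmp is my placeholder). The second override, reading fips as character, is my own addition for illustration: it keeps the leading zero that is lost when a code such as 06059 is read as an integer.

```r
library(data.table)

# Hypothetical two-row sample; fips is written with its leading zero
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-25,Orange,California,06059,1,0"
), tmp)

# Override the class of two columns by name; the rest are auto-detected
mydt <- fread(tmp, colClasses = c("date" = "Date", "fips" = "character"))

print(class(mydt$date))
print(mydt$fips)
```

Columns not named in colClasses keep whatever type fread detects, so cases and deaths still come in as integers.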
Use fread on zipped files
You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz"); this relies on the R.utils package being installed. If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').
For more R tips, head to InfoWorld's Do More With R page.
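If you'd like to try the gz route without hunting for a compressed file, this sketch writes a small one first. The file contents and the gz_path name are placeholders of mine. Direct gz reading depends on the R.utils package, so the sketch falls back to base R's gzfile() when that package is missing; the fallback is my own workaround, not part of fread.

```r
library(data.table)

# Write a tiny, made-up CSV straight into a gzip-compressed temp file
csv_lines <- c(
  "date,county,state,cases",
  "2020-01-21,Snohomish,Washington,1"
)
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(csv_lines, con)
close(con)

if (requireNamespace("R.utils", quietly = TRUE)) {
  # fread decompresses the file itself when R.utils is available
  from_gz <- fread(gz_path)
} else {
  # Fallback: decompress with base R, then let fread parse the text lines
  con <- gzfile(gz_path, "r")
  raw_lines <- readLines(con)
  close(con)
  from_gz <- fread(text = raw_lines)
}

print(from_gz)
```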


