by Sharon Machlis

Contributing Writer

5 handy options in R data.table’s fread

how-to

Jul 16, 20205 mins

Not just a speedy way to import data into R, fread has useful options for importing subsets of data, setting data types for columns, running system commands, and more

Do More With R [video teaser/video series] - R Programming Guide - Tips & Tricks

Like all functions in the data.table R package, fread is fast. Very fast. But there’s more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.

Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.

Use fread’s nrows option

Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread’s nrows option to import only a portion of a file for exploration.

The code below imports just the first 10 rows of the CSV.

mydt10 <- fread("us-counties.csv", nrows = 10)

If you just want to see column names without any data at all, you can use nrows = 0.

Use fread’s select option

Once you know the file structure, you can choose which columns to import. fread’s select option lets you pick columns you want to keep. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:

mydt <- fread("us-counties.csv", 
          select = c("date", "county", "state", "cases"))

As always, numbers don’t need quotation marks:

mydt <- fread("us-counties.csv", select = c(1,2,3,5))

You can use an R object with a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread.

my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)

The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:

mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))

Like with select, drop takes a vector of column names or numerical positions.

Use fread with grep

If you’re familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to only import lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.

ca <- fread("grep California us-counties.csv")

Unfortunately, grep doesn’t understand the original file’s column names, so you end up with default names.

head(ca)
           V1          V2         V3   V4 V5 V6
1: 2020-01-25      Orange California 6059  1  0
2: 2020-01-26 Los Angeles California 6037  1  0
3: 2020-01-26      Orange California 6059  1  0
4: 2020-01-27 Los Angeles California 6037  1  0
5: 2020-01-27      Orange California 6059  1  0
6: 2020-01-28 Los Angeles California 6037  1  0

However, fread lets us specify column names with the col.names option. I can set the names based on names from mydt10 that I created above.

ca <- fread("grep California us-counties.csv", 
             col.names = names(mydt10))
> head(ca)
         date      county      state fips cases deaths
1: 2020-01-25      Orange California 6059     1      0
2: 2020-01-26 Los Angeles California 6037     1      0
3: 2020-01-26      Orange California 6059     1      0
4: 2020-01-27 Los Angeles California 6037     1      0
5: 2020-01-27      Orange California 6059     1      0
6: 2020-01-28 Los Angeles California 6037     1      0

We can also use regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once.

states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv", 
           col.names = names(mydt10))

Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state), you’ll see Oklahoma and Missouri in the states column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas.

So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set; but it doesn’t guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected.

Use fread’s colClasses option

You can set column classes during import – for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option.

mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))

Now, dates are Dates.

> str(mydt)
Classes ‘data.table’ and 'data.frame':	322651 obs. of  6 variables:
 $ date  : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
 $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
 $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
 $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
 $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
 $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...

Use fread on zipped files

You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz"). If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').

For more R tips, head to InfoWorld’s Do More With R page.

by Sharon Machlis

Contributing Writer

Follow Sharon Machlis on LinkedIn

Sharon Machlis was a longtime writer and editor at Computerworld and later the Director of Editorial Data & Analytics at parent company Foundry. She is also the author of Practical R for Mass Communications and Journalism.

Sharon's Do more with R video tutorials won a Jesse H. Neal award for Best Instructional Content.

Recently retired, Sharon is still passionate about R and generative AI, and also blogs about the retirement life. You can find her on Bluesky at @smachlis.bsky.social, Mastodon at @smach@masto.machlis.com, and LinkedIn.

Show me more

Topics

About

Policies

Our Network

More

5 handy options in R data.table’s fread

Not just a speedy way to import data into R, fread has useful options for importing subsets of data, setting data types for columns, running system commands, and more

Use fread’s nrows option

Use fread’s select option

Use fread with grep

Use fread’s colClasses option

Use fread on zipped files

More from this author

How to create your own RAG applications in R

GenAI tools for R: New tools to make R programming easier

3 of the best LLM integration tools for R

FAQ: Getting started with Bluesky

Create searchable Bluesky bookmarks with R

Shiny for Python adds chat component for generative AI chatbots

Maker of RStudio launches new R and Python IDE

5 easy ways to run an LLM locally

Show me more

Databricks adds Data Science Agent to automate analytics tasks

Rust Innovation Lab launched, sponsors first project

PostgreSQL 18 to boost OLTP performance, but misses AI readiness

Getting encryption wrong (and getting it right, too)

How to build a native desktop app vs. a web UI app

PyApp: Build click-to-run Python apps with Rust