Discover quick and easy ways to count by groups in R, including reports as data frames, graphics, and ggplot graphs
Counting by multiple groups, sometimes called crosstab reports, can be a useful way to look at data ranging from public opinion surveys to medical tests. For example, how did people vote by gender and age group? How many software developers who use both R and Python are men vs. women?
There are a lot of ways to do this kind of counting by categories in R. Here, I'd like to share some of my favorites.
For the demos in this article, I'll use a subset of the Stack Overflow Developers survey, which surveys developers on dozens of topics ranging from salaries to technologies used. I'll whittle it down to columns for languages used, gender, and whether they code as a hobby. I also added my own LanguageGroup column for whether a developer reported using R, Python, both, or neither.
If you'd like to follow along, the "Get the data" section at the end of this article has instructions on how to download and wrangle the data to get the same data set I'm using.
The data has one row for each survey response, and the four columns are all characters.
str(mydata)
'data.frame':	83379 obs. of  4 variables:
 $ Gender            : chr  "Man" "Man" "Man" "Man" ...
 $ LanguageWorkedWith: chr  "HTML/CSS;Java;JavaScript;Python" "C++;HTML/CSS;Python" "HTML/CSS" "C;C++;C#;Python;SQL" ...
 $ Hobbyist          : chr  "Yes" "No" "Yes" "No" ...
 $ LanguageGroup     : chr  "Python" "Python" "Neither" "Python" ...
I filtered the raw data to make the crosstabs more manageable, including removing missing values and keeping only the two largest gender categories, Man and Woman.
The janitor package
So, what's the gender breakdown within each language group? For this type of reporting in a data frame, one of my go-to tools is the janitor package's tabyl() function.
The basic tabyl() function returns a data frame with counts. The first column name you pass to tabyl() becomes the rows, and the second one the columns.
library(janitor)
tabyl(mydata, Gender, LanguageGroup)
 Gender Both Neither Python   R
    Man 3264   43908  29044 969
  Woman  374    3705   1940 175
What's nice about tabyl() is it's very easy to generate percents, too. If you want to see percents for each column instead of raw totals, add adorn_percentages("col"). You can then pipe those results into a formatting function such as adorn_pct_formatting().
tabyl(mydata, Gender, LanguageGroup) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1)
 Gender  Both Neither Python     R
    Man 89.7%   92.2%  93.7% 84.7%
  Woman 10.3%    7.8%   6.3% 15.3%
To see percents by row, add adorn_percentages("row").
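For comparison, base R can produce the same kind of column and row proportions with table() and prop.table(). This is only a rough sketch on a small made-up data frame (toy is hypothetical, not the survey data), so it runs on its own:

```r
# Hypothetical toy data standing in for mydata
toy <- data.frame(
  Gender = c("Man", "Man", "Man", "Woman", "Woman", "Man"),
  LanguageGroup = c("R", "Python", "Python", "R", "Python", "Both"),
  stringsAsFactors = FALSE
)

counts <- table(toy$Gender, toy$LanguageGroup)

# margin = 2: each column sums to 1, similar to adorn_percentages("col")
col_pct <- prop.table(counts, margin = 2)

# margin = 1: each row sums to 1, similar to adorn_percentages("row")
row_pct <- prop.table(counts, margin = 1)

round(100 * col_pct, 1)
round(100 * row_pct, 1)
```

The margin argument is the part that is easy to mix up: 1 means "within rows" and 2 means "within columns."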
If you want to add a third variable, such as Hobbyist, that's easy too.
tabyl(mydata, Gender, LanguageGroup, Hobbyist) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1)
However, it gets a little harder to visually compare results in more than two levels this way. This code returns a list with one data frame for each third-level choice:
$No
 Gender  Both Neither Python     R
    Man 79.6%   86.7%  86.4% 74.6%
  Woman 20.4%   13.3%  13.6% 25.4%

$Yes
 Gender  Both Neither Python     R
    Man 91.6%   93.9%  95.0% 88.0%
  Woman  8.4%    6.1%   5.0% 12.0%
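Because the three-variable result is a named list, you can pull out a single table by name. Here is a minimal sketch, assuming a small made-up data frame (toy and three_way are hypothetical names, not part of the survey data):

```r
library(janitor)

# Hypothetical toy data standing in for mydata
toy <- data.frame(
  Gender = c("Man", "Woman", "Man", "Woman", "Man", "Man"),
  LanguageGroup = c("R", "R", "Python", "Python", "R", "Python"),
  Hobbyist = c("Yes", "Yes", "No", "No", "No", "Yes")
)

three_way <- tabyl(toy, Gender, LanguageGroup, Hobbyist)

names(three_way)  # one list element per value of the third variable
three_way$Yes     # the Gender by LanguageGroup table for hobbyists only
```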
The CGPfunctions package
The CGPfunctions package is worth a look for some quick and easy ways to visualize crosstab data. Install it from CRAN with the usual install.packages("CGPfunctions").
The package has two functions of interest for examining crosstabs: PlotXTabs() and PlotXTabs2(). This code returns bar graphs of the data (first graph below):
library(CGPfunctions)
PlotXTabs(mydata, LanguageGroup, Gender)
Screen shot by Sharon Machlis, IDG
Result of PlotXTabs(mydata, LanguageGroup, Gender).
PlotXTabs2(mydata, LanguageGroup, Gender) creates a graph with a different look, and some statistical summaries (second graph below).
If you don't need or want those summaries, you can remove them with results.subtitle = FALSE, such as PlotXTabs2(mydata, LanguageGroup, Gender, results.subtitle = FALSE).
Screen shot by Sharon Machlis, IDG
Result of PlotXTabs2(mydata, LanguageGroup, Gender).
PlotXTabs2() has a couple of dozen argument options, including title, caption, legends, color scheme, and one of four plot types: side, stack, mosaic, or percent. There are also options familiar to ggplot2 users, such as ggtheme and palette. You can see more details in the function's help file.
The vtree package
The vtree package generates tree diagrams for crosstabs rather than graphs. Running the main vtree() function on one variable, such as
library(vtree)
vtree(mydata, "LanguageGroup")
gets you this basic response:
Sharon Machlis, IDG
Basic vtree() function on one variable.
I'm not keen on the color defaults here, but you can swap in an RColorBrewer palette. vtree's palette argument uses palette numbers, not names; you can see how they're numbered in the vtree package documentation. I could choose 3 for Greens and 5 for Purples, for example. Unfortunately, those defaults give you a more intense color for lower count numbers, which doesn't always make sense (and doesn't work well for me in this example). I can change that default behavior with sortfill = TRUE to use the more intense color for the higher value.
vtree(mydata, "LanguageGroup", palette = 3, sortfill = TRUE)
Sharon Machlis, IDG
vtree() after changing to a new palette.
If you find the dark color makes it hard to read text, there are some options. One option is to use the plain argument, such as vtree(mydata, "LanguageGroup", plain = TRUE). Another option is to set a single fill color instead of a palette, using the fillcolor argument, such as vtree(mydata, "LanguageGroup", fillcolor = "#99d8c9").
To look at two variables in a crosstab report, simply add a second column name and palette or color if you don't want the default. You can use the plain option or specify two palettes or two colors. Below I chose specific colors instead of palettes, and I also rotated the graph to read vertically.
vtree(mydata, c("LanguageGroup", "Gender"),
      fillcolor = c(LanguageGroup = "#e7d4e8", Gender = "#99d8c9"),
      horiz = FALSE)
Sharon Machlis, IDG
vtree() for two variables.
You can add more than two categories, although it gets a bit harder to read and follow as the tree grows. If you're only interested in some of the branches, you can specify which to display with the keep argument. Below, I set vtree() to show only people who use R without Python or who use both R and Python.
vtree(mydata, c("Gender", "LanguageGroup", "Hobbyist"),
      horiz = FALSE,
      fillcolor = c(LanguageGroup = "#e7d4e8",
                    Gender = "#99d8c9", Hobbyist = "#9ecae1"),
      keep = list(LanguageGroup = c("R", "Both")), showcount = FALSE)
With the tree getting so busy, I think it helps to have either the count or the percent as node labels, not both. So that last argument in the code above, showcount = FALSE, sets the graph to display only percents and not counts.
Sharon Machlis, IDG
Three-level vtree graphic with a subset of nodes, displaying percents only.
More count by group options
There are other useful ways to group and count in R, including base R, dplyr, and data.table. Base R has the xtabs() function specifically for this task. Note the formula syntax below: a tilde and then one variable plus another variable.
xtabs(~ LanguageGroup + Gender, data = mydata)
             Gender
LanguageGroup   Man Woman
      Both     3264   374
      Neither 43908  3705
      Python  29044  1940
      R         969   175
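Two base R helpers pair nicely with xtabs(): addmargins() appends totals, and ftable() flattens a table of three or more variables into a single printable grid. A minimal sketch on made-up data (toy is a hypothetical stand-in for the survey data):

```r
# Hypothetical toy data standing in for mydata
toy <- data.frame(
  LanguageGroup = c("R", "Python", "Python", "Both", "R", "Python"),
  Gender = c("Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist = c("Yes", "Yes", "No", "Yes", "No", "No")
)

two_way <- xtabs(~ LanguageGroup + Gender, data = toy)
addmargins(two_way)   # adds a "Sum" row and a "Sum" column

three_way <- xtabs(~ LanguageGroup + Gender + Hobbyist, data = toy)
ftable(three_way)     # one flat table instead of one slice per Hobbyist value
```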
dplyr's count() function combines "group by" and "count rows in each group" into a single function.
library(dplyr)
my_summary <- mydata %>%
  count(LanguageGroup, Gender, Hobbyist, sort = TRUE)
my_summary
   LanguageGroup Gender Hobbyist     n
1        Neither    Man      Yes 34419
2         Python    Man      Yes 25093
3        Neither    Man       No  9489
4         Python    Man       No  3951
5           Both    Man      Yes  2807
6        Neither  Woman      Yes  2250
7        Neither  Woman       No  1455
8         Python  Woman      Yes  1317
9              R    Man      Yes   757
10        Python  Woman       No   623
11          Both    Man       No   457
12          Both  Woman      Yes   257
13             R    Man       No   212
14          Both  Woman       No   117
15             R  Woman      Yes   103
16             R  Woman       No    72
In the three lines of code below, I load the data.table package, create a data.table from my data, and then use the special .N data.table symbol that stands for number of rows in a group.
library(data.table)
mydt <- setDT(mydata)
mydt[, .N, by = .(LanguageGroup, Gender, Hobbyist)]
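One thing to be aware of: setDT() converts its argument in place, so mydata itself becomes a data.table as well. If you'd rather leave the original data frame untouched, as.data.table() makes a copy instead. The sketch below uses made-up data (toy and toy_dt are hypothetical names) and chains an order() call so the result is sorted largest group first, much like sort = TRUE in dplyr's count():

```r
library(data.table)

# Hypothetical toy data standing in for mydata
toy <- data.frame(
  LanguageGroup = c("R", "Python", "Python", "Both", "Python", "R"),
  Gender = c("Man", "Man", "Man", "Woman", "Man", "Woman")
)

toy_dt <- as.data.table(toy)  # a copy; toy stays a plain data frame

# Count rows per group, then sort descending by the count column N
counts <- toy_dt[, .N, by = .(LanguageGroup, Gender)][order(-N)]
counts
```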
Visualizing with ggplot2
As with most data, ggplot2 is a good choice to visualize summarized results. The first ggplot graph below plots LanguageGroup on the X axis and the count for each on the Y axis. Fill color represents whether someone says they code as a hobby. And, facet_wrap says: Make a separate graph for each value in the Gender column.
library(ggplot2)
ggplot(my_summary, aes(LanguageGroup, n, fill = Hobbyist)) +
  geom_bar(stat = "identity") +
  facet_wrap(facets = vars(Gender))
Sharon Machlis, IDG
Using ggplot2 to compare language use by gender.
Because there are relatively few women in the sample, it's difficult to compare percentages across genders when both graphs use the same Y-axis scale. I can change that, though, so each graph uses a separate scale, by adding the argument scales = "free_y" to the facet_wrap() function:
ggplot(my_summary, aes(LanguageGroup, n, fill = Hobbyist)) +
  geom_bar(stat = "identity") +
  facet_wrap(facets = vars(Gender), scales = "free_y")
Now it's easier to compare multiple variables by gender.
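Another way to approach the same problem is to plot each group's share within its gender instead of raw counts, so both facets sit on the same 0-to-1 scale. This is a rough sketch on made-up numbers (toy_summary, toy_pct, and the pct column are hypothetical, not from the survey data); geom_col() is shorthand for geom_bar(stat = "identity"):

```r
library(dplyr)
library(ggplot2)

# Hypothetical summary table shaped like my_summary
toy_summary <- data.frame(
  LanguageGroup = c("Python", "Python", "R", "R"),
  Gender = c("Man", "Woman", "Man", "Woman"),
  n = c(300, 20, 10, 2)
)

# Share of each language group within its gender
toy_pct <- toy_summary %>%
  group_by(Gender) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

ggplot(toy_pct, aes(LanguageGroup, pct)) +
  geom_col() +
  facet_wrap(facets = vars(Gender))
```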
The next section explains how to download and wrangle the data used in this demo.
Get the data
You can see more about the survey and download the complete data at https://insights.stackoverflow.com/survey. I used the 2019 Stack Overflow Developers Survey data for this article because I've worked with that data set before, but the 2020 data will likely work as well.
Here is the code I used to get the data subset I worked with in this article:
library(dplyr)
library(janitor)
library(stringr)
# Also need rio, tidyr installed on your system
raw_data <- rio::import("data/survey_results_public.csv")
mydata <- raw_data %>%
  select(Gender, LanguageWorkedWith, Hobbyist) %>%
  mutate(
    # \\b marks a word boundary, so "R" does not match inside "Ruby" or "Rust"
    LanguageGroup = case_when(
      str_detect(LanguageWorkedWith, "\\bR\\b") & str_detect(LanguageWorkedWith, "\\bPython\\b") ~ "Both",
      str_detect(LanguageWorkedWith, "\\bR\\b") & !str_detect(LanguageWorkedWith, "\\bPython\\b") ~ "R",
      !str_detect(LanguageWorkedWith, "\\bR\\b") & str_detect(LanguageWorkedWith, "\\bPython\\b") ~ "Python",
      !str_detect(LanguageWorkedWith, "\\bR\\b") & !str_detect(LanguageWorkedWith, "\\bPython\\b") ~ "Neither"
    )
  ) %>%
  filter(Gender %in% c("Man", "Woman")) %>%
  tidyr::drop_na()
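The word-boundary markers in those patterns matter because a single-letter language name like R shows up inside other language names. A quick check on a made-up vector (langs is hypothetical, formatted like the LanguageWorkedWith column) shows the difference:

```r
library(stringr)

# Hypothetical semicolon-separated language strings
langs <- c("Python;R", "Ruby;Rust", "R", "HTML/CSS;JavaScript")

with_boundary <- str_detect(langs, "\\bR\\b")  # R as a whole word only
no_boundary   <- str_detect(langs, "R")        # any capital R at all

with_boundary  # TRUE FALSE TRUE FALSE
no_boundary    # TRUE TRUE TRUE FALSE
```

Without the boundaries, a developer who listed only Ruby and Rust would be counted as an R user.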


