Annotate ggplot with text labels using built-in functions and create non-overlapping labels with the ggrepel package.
Labeling all or some of your data with text can help tell a story, even when your graph is using other cues like color and size. ggplot has a couple of built-in ways of doing this, and the ggrepel package adds some more functionality to those options.
For this demo, I'll start with a scatter plot looking at percentage of adults with at least a four-year college degree vs. known Covid-19 cases per capita in Massachusetts counties. (The theory: A college education might mean you're more likely to have a job that lets you work safely from home. Of course there are plenty of exceptions, and many other factors affect infection rates.)
If you want to follow along, you can get the code to re-create my sample data at the end of this article.
Creating a scatter plot with ggplot
To start, the code below loads several libraries and sets scipen = 999 so I don't get scientific notation in my graphs:
library(ggplot2)
library(ggrepel)
library(dplyr)
options(scipen = 999)
Here is the data structure for the ma_data data frame:
head(ma_data)
Place AdultPop Bachelors PctBachelors CovidPer100K Positivity Region
1 Barnstable 165336 70795 0.4281887 7.0 0.0188 Southeast
2 Berkshire 92946 31034 0.3338928 9.0 0.0095 West
3 Bristol 390230 109080 0.2795275 30.8 0.0457 Southeast
4 Dukes and Nantucket 20756 9769 0.4706591 25.3 0.0294 Southeast
5 Essex 538981 212106 0.3935315 29.5 0.0406 Northeast
6 Franklin 53210 19786 0.3718474 4.7 0.0052 West
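As a quick sanity check on that structure, the sketch below assumes PctBachelors is simply Bachelors divided by AdultPop. That relationship is an assumption rather than something stated in this article, though it reproduces the first two printed values. The name first_rows is made up for illustration.

```r
# Rebuild the first two rows shown above, leaving out the derived column
first_rows <- data.frame(
  Place = c("Barnstable", "Berkshire"),
  AdultPop = c(165336L, 92946L),
  Bachelors = c(70795L, 31034L),
  stringsAsFactors = FALSE
)

# Assumption: PctBachelors is the share of adults holding a bachelor's degree
first_rows$PctBachelors <- first_rows$Bachelors / first_rows$AdultPop

# Compare with the PctBachelors values printed in the head() output
print(round(first_rows$PctBachelors, 7))
```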
The next group of code creates a ggplot scatter plot with that data, including sizing points by total county population and coloring them by region. geom_smooth() adds a linear regression line, and I also tweak a couple of ggplot design defaults. The graph is stored in a variable called ma_graph.
ma_graph <- ggplot(ma_data, aes(x = PctBachelors, y = CovidPer100K,
size = AdultPop, color = Region)) +
geom_point() +
scale_x_continuous(labels = scales::percent) +
geom_smooth(method='lm', se = FALSE, color = "#0072B2", linetype = "dotted") +
theme_minimal() +
guides(size = "none")  # "none" replaces the deprecated FALSE in recent ggplot2 versions
That creates a basic scatter plot:
Sharon Machlis, IDG
Basic scatter plot with ggplot2.
However, it's currently impossible to know which points represent which counties. ggplot's geom_text() function adds labels to all the points:
ma_graph +
geom_text(aes(label = Place))
ggplot scatter plot with default text labels.
geom_text() uses the same color and size aesthetics as the graph by default. But sizing the text based on point size makes the small points' labels hard to read. I can stop that behavior by setting size = NULL inside aes().
It can also be a bit difficult to read labels when they're right on top of the points. geom_text() lets you "nudge" them a bit higher with the nudge_y argument.
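Here is a minimal sketch of those two tweaks applied to geom_text(). To keep it self-contained, it uses a three-row stand-in data frame built from values in the sample data; the names demo_data and demo_graph are hypothetical and not part of the original code.

```r
library(ggplot2)

# Hypothetical three-row stand-in for ma_data, using values from the sample data
demo_data <- data.frame(
  Place = c("Barnstable", "Berkshire", "Bristol"),
  AdultPop = c(165336L, 92946L, 390230L),
  PctBachelors = c(0.428188658, 0.333892798, 0.279527458),
  CovidPer100K = c(7, 9, 30.8),
  Region = c("Southeast", "West", "Southeast"),
  stringsAsFactors = FALSE
)

demo_graph <- ggplot(demo_data, aes(x = PctBachelors, y = CovidPer100K,
                                    size = AdultPop, color = Region)) +
  geom_point() +
  # size = NULL inside aes() stops the labels from inheriting the point-size mapping;
  # nudge_y = 0.7 shifts every label up by 0.7 units on the y axis
  geom_text(aes(label = Place, size = NULL), nudge_y = 0.7)

demo_graph
```

Because color is still inherited here, each label takes the color of its region; adding color = NULL inside aes() would drop that mapping as well.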
There's another built-in ggplot labeling function called geom_label(), which is similar to geom_text() but adds a box around the text. The following code using geom_label() produces the graph shown below.
ma_graph +
geom_label(aes(label = Place, size = NULL), nudge_y = 0.7)
ggplot scatter plot with geom_label().
These functions work well when points are spaced out. But if data points are closer together, labels can end up on top of each other, especially in a smaller graph. I added a fake data point close to Middlesex County in the Massachusetts data. If I re-run the code with the new data, Fake blocks part of the Middlesex label.
ma_graph2 <- ggplot(ma_data_fake, aes(x = PctBachelors, y = CovidPer100K, size = AdultPop, color = Region)) +
geom_point() +
scale_x_continuous(labels = scales::percent) +
geom_smooth(method='lm', se = FALSE, color = "#0072B2", linetype = "dotted") +
theme_minimal() +
guides(size = "none")  # "none" replaces the deprecated FALSE in recent ggplot2 versions
ma_graph2
ma_graph2 +
geom_label(aes(label = Place, size = NULL, color = NULL), nudge_y = 0.75)
ggplot2 scatter plot with default geom_label() labels on top of each other
Enter ggrepel.
Creating non-overlapping labels with ggrepel
The ggrepel package has its own versions of ggplot's text and label geom functions: geom_text_repel() and geom_label_repel(). Using those functions' defaults will automatically move one of the labels below its point so it doesn't overlap with the other one.
As with ggplot's geom_text() and geom_label(), the ggrepel functions allow you to set color to NULL and size to NULL. You can also use the same nudge_y argument to create more space between the labels and the points.
ma_graph2 +
geom_label_repel(data = subset(ma_data_fake, Region == "MetroBoston"),
aes(label = Place, size = NULL, color = NULL), nudge_y = 0.75)
Scatter plot with geom_label_repel().
The graph above has the Middlesex label above the point and the Fake label below, so there's no risk of overlap.
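For comparison, here is a rough sketch of the plain-text variant, geom_text_repel(), which repels labels the same way but draws no box around them. The small data frame close_points is a hypothetical stand-in that reuses the Middlesex, Fake, and Norfolk values from the sample data.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in: three nearby points taken from the sample data
close_points <- data.frame(
  Place = c("Middlesex", "Fake", "Norfolk"),
  PctBachelors = c(0.551913131, 0.5394678, 0.529598127),
  CovidPer100K = c(16.7, 16.3, 13.9),
  stringsAsFactors = FALSE
)

repel_text_graph <- ggplot(close_points, aes(x = PctBachelors, y = CovidPer100K)) +
  geom_point() +
  # geom_text_repel() draws unboxed text and moves labels apart when they would collide
  geom_text_repel(aes(label = Place), nudge_y = 0.75)

repel_text_graph
```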
Focusing attention on subsets of data with ggrepel
Sometimes you may want to label only a few points of special interest and not all of your data. You can do so by specifying a subset of data in the data argument of geom_label_repel():
ma_graph2 + geom_label_repel(data = subset(ma_data_fake, Region == "MetroBoston"),
aes(label = Place, size = NULL, color = NULL),
nudge_y = 2,
segment.size = 0.2,
segment.color = "grey50",
direction = "x"
)
Scatter plot with only some points labeled.
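Another way to label only some points, described in the ggrepel documentation, is to keep the full data set and set the unwanted labels to an empty string. The sketch below assumes a hypothetical four-row stand-in, subset_demo, with a helper column named Label; neither name appears in the original code.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical four-row stand-in built from values in the sample data
subset_demo <- data.frame(
  Place = c("Middlesex", "Norfolk", "Essex", "Franklin"),
  PctBachelors = c(0.551913131, 0.529598127, 0.393531497, 0.371847397),
  CovidPer100K = c(16.7, 13.9, 29.5, 4.7),
  Region = c("MetroBoston", "MetroBoston", "Northeast", "West"),
  stringsAsFactors = FALSE
)

# Keep the county name for MetroBoston rows and use "" everywhere else
subset_demo$Label <- ifelse(subset_demo$Region == "MetroBoston", subset_demo$Place, "")

subset_graph <- ggplot(subset_demo, aes(x = PctBachelors, y = CovidPer100K, color = Region)) +
  geom_point() +
  # Empty-string labels are not drawn, but their points are still avoided by the other labels
  geom_text_repel(aes(label = Label), color = "black", nudge_y = 2)

subset_graph
```

The trade-off: filtering with the data argument ignores the unlabeled points entirely, while the empty-string approach keeps labels from landing on top of them.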
Customizing labels and lines with ggrepel
There is more customization you can do with ggrepel. For example, you can set the width and color of labels' pointer lines with segment.size and segment.color.
You can even turn label lines into arrows with the arrow argument:
ma_graph2 + geom_label_repel(aes(label = Place, size = NULL),
arrow = arrow(length = unit(0.03, "npc"),
type = "closed", ends = "last"),
nudge_y = 3,
segment.size = 0.3
)
Scatter plot with ggrepel labels and arrows.
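A few other geom_label_repel() arguments can be worth trying. The sketch below is illustrative only: the data frame tuning_demo is a hypothetical stand-in built from sample-data values, and the specific numbers are arbitrary starting points rather than recommendations.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in: three nearby points taken from the sample data
tuning_demo <- data.frame(
  Place = c("Middlesex", "Fake", "Norfolk"),
  PctBachelors = c(0.551913131, 0.5394678, 0.529598127),
  CovidPer100K = c(16.7, 16.3, 13.9),
  stringsAsFactors = FALSE
)

tuned_graph <- ggplot(tuning_demo, aes(x = PctBachelors, y = CovidPer100K)) +
  geom_point() +
  geom_label_repel(
    aes(label = Place),
    box.padding = 0.5,        # extra space around each label box
    point.padding = 0.3,      # extra space around each labeled point
    min.segment.length = 0,   # draw a pointer line even when the label sits close to its point
    max.overlaps = Inf,       # keep every label instead of dropping crowded ones
    seed = 42                 # fix the random starting positions so the layout is repeatable
  )

tuned_graph
```

Setting seed is handy when you need the same label layout each time you re-run the code, since ggrepel's placement otherwise involves some randomness.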
And you can use ggrepel to label lines in a multi-series line graph as well as points in a scatter plot.
For this demo, I'll use another data frame, mydf, which has some quarterly unemployment data for four US states. The code for that data frame is also at the end of this article. mydf has three columns: Rate, State, and Quarter.
In the graph below, I find it a little hard to see which line goes with what state, because I have to look back and forth between the lines and the legend.
graph2 <- ggplot(mydf, aes(x = Quarter, y = Rate, color = State, group = State)) +
geom_line() +
theme_minimal() +
scale_y_continuous(expand = c(0, 0), limits = c(0, NA))
graph2
ggplot line graph.
In the next code block, I'll add a label for each line in the series, and I'll have geom_label_repel() point to the second-to-last quarter and not the last quarter. The code calculates what the second-to-last quarter is and then tells geom_label_repel() to use filtered data for only that quarter. The code uses the State column as the label, "nudges" the labels .75 horizontally, silently drops any missing values with na.rm = TRUE, and gets rid of the graph's default legend.
second_to_last_quarter <- max(mydf$Quarter[mydf$Quarter != max(mydf$Quarter)])
graph2 +
geom_label_repel(data = filter(mydf, Quarter == second_to_last_quarter),
aes(label = State),
nudge_x = .75,
na.rm = TRUE) +
theme(legend.position = "none")
Line graph with ggrepel labels.
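The second-to-last-quarter calculation can be checked on its own. The sketch below uses a short hypothetical vector, quarters, in the same ISO date-string format as mydf$Quarter, and compares the approach from the code above with a sort-based alternative. Both rely on ISO date strings sorting in chronological order.

```r
# Hypothetical vector of quarter strings, with repeats as in a multi-state data frame
quarters <- c("2019-10-01", "2020-01-01", "2020-04-01", "2020-07-01", "2020-04-01")

# Approach used above: drop the latest quarter, then take the latest of what is left
second_to_last <- max(quarters[quarters != max(quarters)])

# Alternative: sort the unique values in descending order and take the second one
second_to_last_alt <- sort(unique(quarters), decreasing = TRUE)[2]

print(second_to_last)                                 # "2020-04-01"
print(identical(second_to_last, second_to_last_alt))  # TRUE
```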
Why not label the last quarter instead of the second-to-last one? I tried that first, and the pointer lines ended up looking like a continuation of the graph's data:
Line graph with confusing label pointing lines.
The top two lines should not be starting to trend downward at the end!
If you want to find out more about ggrepel, check out the ggrepel vignette with:
vignette("ggrepel", "ggrepel")
Code to create data used in this demo:
ma_data <- data.frame(
stringsAsFactors = FALSE,
Place = c("Barnstable","Berkshire","Bristol",
"Dukes and Nantucket","Essex","Franklin",
"Hampden","Hampshire","Middlesex",
"Norfolk","Plymouth","Suffolk",
"Worcester"),
AdultPop = c(165336L,92946L,390230L,20756L,
538981L,53210L,316312L,99377L,1116442L,
488612L,355335L,546850L,564408L),
Bachelors = c(70795L,31034L,109080L,9769L,212106L,
19786L,85913L,46210L,616179L,258768L,
130354L,244827L,202881L),
PctBachelors = c(0.428188658,0.333892798,0.279527458,
0.470659087,0.393531497,0.371847397,
0.271608412,0.464996931,0.551913131,
0.529598127,0.366848186,0.447704124,
0.359458052),
CovidPer100K = c(7,9,30.8,25.3,29.5,4.7,28.1,10.4,
16.7,13.9,14.5,27.4,20),
Positivity = c(0.0188,0.0095,0.0457,0.0294,0.0406,
0.0052,0.0446,0.0063,0.0165,0.0184,
0.0288,0.0171,0.0251),
Region = c("Southeast","West","Southeast",
"Southeast","Northeast","West","West",
"West","MetroBoston","MetroBoston",
"Southeast","MetroBoston","Central")
)
ma_data_fake <- data.frame(
stringsAsFactors = FALSE,
Place = c("Barnstable","Berkshire","Bristol",
"Dukes and Nantucket","Essex","Franklin",
"Hampden","Hampshire","Middlesex",
"Norfolk","Plymouth","Suffolk",
"Worcester","Fake"),
AdultPop = c(165336L,92946L,390230L,20756L,
538981L,53210L,316312L,99377L,1116442L,
488612L,355335L,546850L,564408L,1106400L),
Bachelors = c(70795L,31034L,109080L,9769L,212106L,
19786L,85913L,46210L,616179L,258768L,
130354L,244827L,202881L,610100L),
PctBachelors = c(0.428188658,0.333892798,0.279527458,
0.470659087,0.393531497,0.371847397,
0.271608412,0.464996931,0.551913131,
0.529598127,0.366848186,0.447704124,
0.359458052,0.5394678),
CovidPer100K = c(7,9,30.8,25.3,29.5,4.7,28.1,10.4,
16.7,13.9,14.5,27.4,20,16.3),
Positivity = c(0.0188,0.0095,0.0457,0.0294,0.0406,
0.0052,0.0446,0.0063,0.0165,0.0184,
0.0288,0.0171,0.0251,0.0155),
Region = c("Southeast","West","Southeast",
"Southeast","Northeast","West","West",
"West","MetroBoston","MetroBoston",
"Southeast","MetroBoston","Central",
"MetroBoston")
)
mydf <- data.frame(
  stringsAsFactors = FALSE,
  # Quarterly unemployment rates: 11 quarters per state, in the order CA, NY, MA, NE
  Rate = c(4.6, 4.5, 4.2, 4.2, 4.3, 4.1, 4.1, 4.1, 4.1, 7, 8.9,
           4.7, 4.6, 4.3, 4.1, 4, 3.9, 4, 4, 3.8, 6.6, 8.7,
           3.8, 3.6, 3.5, 3.4, 3.2, 3.1, 3, 2.9, 3, 6.1, 8.1,
           2.8, 2.9, 3, 2.8, 3.1, 3.1, 3.2, 3.3, 3.2, 4, 4.3),
  State = rep(c("CA", "NY", "MA", "NE"), each = 11),
  Quarter = rep(c("2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01",
                  "2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01",
                  "2020-01-01", "2020-04-01", "2020-07-01"), times = 4)
)