Sharon Machlis
Contributing Writer

How to create your own RSS reader with R

how-to
Dec 28, 2022 | 15 mins
JavaScript, R Language, Web Development

Sure, you could use one of the commercial or open-source RSS readers. But isn't it more fun to code your own?


RSS feeds have been around since the late '90s, and they remain a handy way to keep up with multiple news sources. Choose your feeds wisely, and your RSS reader will let you easily scan headlines from multiple sources and stay up to date on fast-moving topics. And while there are several capable commercial and open-source RSS readers available, it's a lot more satisfying to code your own.

It's surprisingly easy to create your own RSS feed reader in R. Just follow these eight steps.

Create a Quarto document or R script file

You can use a plain R script, but Quarto adds some useful, out-of-the-box styling. Quarto also makes it easier to use JavaScript for the final display, if you so choose. But the tutorial code works fine in an R file, too.

Unlike an R script, though, my Quarto document needs a YAML header to start. I'll add a few settings in the YAML to generate a single HTML file (embed-resources: true) and to suppress display of my code (echo: false) and any code messages or warnings:

---
title: "Sharon's RSS Feed"
format: html
embed-resources: true
editor: source
execute: 
  echo: false
  warning: false
  message: false
---

Load needed packages

Next, I'll add some R code inside an R code block (```{r} and ``` enclose a block of executable code in Quarto; you don't need those if you're using a plain R script) and load the packages I'll need. As you might guess from its name, tidyRSS is a package for reading RSS feeds into R.

```{r}
library(tidyRSS)
library(dplyr)
library(DT)
library(purrr)
library(stringr)
library(lubridate)
```

Add RSS feeds

Selecting relevant feeds is a key part of a useful RSS reader experience. I find mine by starting with sources I like, then checking their websites or searching to see if RSS feeds exist. (As an optional exercise, you can use the rvest package to read sitemaps and wrangle them into RSS format, but that's beyond the scope of this tutorial. Maybe in a future article!)

You may want to store your feeds in a separate CSV or Excel file and have your app import them. This way, you don't have to touch the app code each time you update your feed list. For the sake of demo simplicity here, though, I'll create a data frame in my script file with the feeds I want and my titles for each.
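The feeds-in-a-file approach can be sketched with base R's read.csv(). In this sketch the file name, column names, and tempfile stand-in are all illustrative, not from the tutorial:

```r
# Hypothetical example: keep feeds in a CSV with feed_title and feed_url
# columns, then import them so the app code never needs editing.
feeds_file <- tempfile(fileext = ".csv")  # stand-in for e.g. "my_feeds.csv"
writeLines(c(
  "feed_title,feed_url",
  "R Weekly,https://rweekly.org/atom.xml",
  "All InfoWorld,https://www.infoworld.com/index.rss"
), feeds_file)

myfeeds <- read.csv(feeds_file)
myfeeds <- myfeeds[order(myfeeds$feed_title), ]  # base-R version of arrange()
```

With this in place, adding a feed means editing the CSV, not the script.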

Since I write for both InfoWorld and Computerworld, I'll add both of those feeds. In addition, I'll pull in a few R-specific RSS feeds, including R-Bloggers, R Weekly, and Mastodon's #rstats and #QuartoPub RSS feeds at fosstodon.org, the Mastodon instance I use. In the code below, I save the feed info to a data frame called myfeeds, with both feed URLs and my desired title for each feed. I then arrange the rows by feed title:

```{r}
myfeeds <- data.frame(feed_title = c("All InfoWorld", 
                                     "All Computerworld", 
                                     "Mastodon rstats", 
                                     "Mastodon QuartoPub",
                                     "R Bloggers",
                                     "R Weekly"),
                      feed_url = c("https://www.infoworld.com/index.rss",
                                   "https://www.computerworld.com/index.rss",
                                   "http://fosstodon.org/tags/rstats.rss",
                                   "http://fosstodon.org/tags/QuartoPub.rss",
                                   "https://feeds.feedburner.com/Rbloggers",
                                   "https://rweekly.org/atom.xml")
) |>
  arrange(feed_title)
```

Note: From here on, I won't be including the ```{r} and ``` Quarto code "fences" around the R code. All the rest of the R code still needs to be "fenced" in a Quarto doc, though.

Get all the feeds into the same format

This is the most manual part of the process. Ideally, all feeds would be structured exactly the same way, be in the format I want, and never have missing data. In the real world, of course, RSS data can be as messy as any other data set. So, I want to check my feeds and see if/how they need to be cleaned.

In addition, I want to be able to import atom feeds like R-Bloggers' as well as RSS feeds, which means I need to account for both formats.

To keep things simple, my reader will display only each item's title, date/time updated, description, and a way to click through to the original item (the URL).

I'll start by importing each of my feeds into R using tidyRSS, but as a list, with one list entry per feed, and then examine each to see what problems may arise.

feed_test <- map(myfeeds$feed_url, tidyfeed)

I'm not including the code above in my final RSS reader file; it's for development only.

Create a feed wrangling function

My wrangling function starts simply enough:

wrangle_feed <- function(the_feed_url, the_feed_dataframe = myfeeds) {
  my_feed_data <- tidyRSS::tidyfeed(the_feed_url)
  return(my_feed_data)
}

I'd like the feed title to be what I call it in my feeds data frame, not what the feed creator titled it. So, I'll use my feed data frame to look up the title and replace the existing feed title with this code:

my_feed_data$feed_title <- the_feed_dataframe$feed_title[the_feed_dataframe$feed_url == the_feed_url][1]

I want to select item_title, item_date, item_description, and item_link. But if it's an atom feed, those columns will be named differently: entry_title, entry_last_updated, entry_content, and entry_url. Before I select the columns I want, I'll check whether it's an atom feed and, if so, rename the atom columns with

  if("entry_url" %in% names(my_feed_data)) {
    my_feed_data <- my_feed_data |>
      rename(item_title = entry_title, item_pub_date = entry_last_updated, item_link = entry_url, item_description = entry_content) 
  }

Mastodon RSS feeds don't have titles for the posts. I could add the same default title to each post, such as a generic "Mastodon Post," but I'd prefer a title like "Mastodon Post by {username}." Most Mastodon post URLs include the author handle starting with @, although occasionally one won't. I can extract the user name from the Mastodon URL and add a custom title with the code below, defaulting to "Mastodon Post" if there is no obvious author in the link.

if(str_detect(my_feed_data$feed_title[1], "Mastodon")) {
  my_feed_data <- my_feed_data |>
    mutate(
      item_author = str_replace_all(item_link, "^.*?/(@.*?)/.*?$", "\\1"),
      item_title = if_else(str_detect(item_author, "@"), paste0("Mastodon Post by ", item_author), "Mastodon Post")
    )
}

It's easy for me to find all the Mastodon feeds because I included "Mastodon" in those feed titles.

The str_replace_all() code uses a regular expression to find the author in the URL. The pattern "^.*?/(@.*?)/.*?$" matches everything from the start of the string to the / before an @, captures everything from the @ until just before the next /, and then matches everything else. The replacement "\\1" keeps only that captured author handle and drops the rest.
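The extraction can be checked on a sample (made-up) post URL; here using base R's sub() with the "\\1" backreference, which is equivalent to the stringr call:

```r
# Extract the author handle from a Mastodon-style post URL.
sample_url <- "https://fosstodon.org/@someuser/109551234567890"
author <- sub("^.*?/(@.*?)/.*?$", "\\1", sample_url, perl = TRUE)
author  # "@someuser"
```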

Next I'll do some additional data wrangling, including selecting and renaming the columns I want and making each item clickable back to the original source.

The code below selects and renames columns and also creates a clickable headline column.

my_feed_data <- my_feed_data |>
  select(Headline = item_title, Date = item_pub_date, URL = item_link, 
         Description = item_description, Feed = feed_title) |>
  mutate(
    Headline = str_glue("<a target='_blank' title='{Headline}' href='{URL}' rel='noopener'>{Headline}</a>")
  )

Many people like clickable headlines. However, I prefer a clickable >> at the end of the description instead of a clickable headline. The code below is one way to do that.

my_feed_data <- my_feed_data |>
  select(Headline = item_title, Date = item_pub_date, URL = item_link, Description = item_description, Feed = feed_title) |>
  mutate(
    Description = str_glue("{Description}  <a target='_blank' href='{URL}' rel='noopener'> >></a>")
  ) 

Add some optional data tweaks

The code so far is enough to generate data for a basic feed reader, but the app will look better with some optional tweaks.

For example, the R Bloggers atom feed includes full blog content, but I don't want to download full content into my RSS reader because that makes quick scanning more difficult. Other descriptions may be longer than I'd like as well.

Below is a function that trims the description after max_chars number of characters, but at the nearest complete word, so as not to cut off in the middle of a word. It then adds ellipses. The function first checks that there's a description at all, so the code won't break if the description is missing.

trim_if_too_long <- function(item_description, max_chars = 600) {
  if(!is.na(item_description)) {
    if(nchar(item_description) > max_chars) {
      item_description <- stringr::str_sub(item_description, 1, max_chars)
      item_description <- str_replace_all(item_description, "\\s[^\\s]+$", ". . . ")
    }
    return(item_description)
  } else {
    return("")
  }
}

The function only makes changes if the item description is longer than max_chars (default 600). If the description is in fact longer, the first line of code trims the text to max_chars characters. The second line uses a regular expression to replace a space followed by one or more non-space characters at the end of the text string with an ellipsis. In other words, the regex removes any incomplete word at the end of the description and then adds three dots.
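The same trimming logic can be sketched with base R's substr() and sub() standing in for the stringr calls, with a short string and an 18-character limit just for illustration:

```r
# Trim to 18 characters, then drop the trailing partial word and add dots.
txt <- "word word word word word"        # 24 characters
trimmed <- substr(txt, 1, 18)            # "word word word wor"
trimmed <- sub("\\s[^\\s]+$", ". . . ", trimmed, perl = TRUE)
trimmed  # "word word word. . . "
```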

If you want to use this function in your RSS reader, make sure to place it above the wrangle_feed function definition in your Quarto doc or R script.

To apply the function to each feed's description, I'll use purrr's map_chr() function:

map_chr(Description, trim_if_too_long)

and add that to my feed wrangling before I add my clickable >> arrows:

my_feed_data <- my_feed_data |>
  select(Headline = item_title, Date = item_pub_date, URL = item_link, Description = item_description, Feed = feed_title) |>
  mutate(
    Description = purrr::map_chr(Description, trim_if_too_long),
    Description = str_glue("{Description}  <a target='_blank' href='{URL}' rel='noopener'> >></a>")
  )  

A few of the feeds I've chosen include "To read this article in full, please click here" text at the end, but that text isn't clickable. It's easy to remove text like that with str_remove_all().

If you want to use this code in your app, make sure to add that code before you do any other description wrangling:

my_feed_data <- my_feed_data |>
  select(Headline = item_title, Date = item_pub_date, URL = item_link, Description = item_description, Feed = feed_title) |>
  mutate(
    Description = str_remove_all(Description, "To read this article in full, please click here"),
    Description = purrr::map_chr(Description, trim_if_too_long),
    Description = str_glue("{Description}  <a target='_blank' href='{URL}' rel='noopener'> >></a>")
  )  

One more small nit: I don't like my date/time displaying like 2022-11-16T08:00:00Z. The lubridate package's format_ISO8601() function makes it easy to set the desired precision; in this case, I want ymdhm but not seconds. After that, I'll replace the "T" with a space so the date column will appear in a format such as 2022-12-21 18:44.

Date = format_ISO8601(Date, precision = "ymdhm"),
Date = str_replace_all(Date, "T", " ")
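If you'd rather skip lubridate for this step, base R's as.POSIXct() and format() produce the same result; the timestamp below is a sample value:

```r
# Parse an ISO-8601 timestamp and reformat it without the "T" or seconds.
raw_date <- "2022-11-16T08:00:00Z"
parsed <- as.POSIXct(raw_date, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
pretty_date <- format(parsed, "%Y-%m-%d %H:%M")
pretty_date  # "2022-11-16 08:00"
```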

Below is my full wrangle_feed() function (not showing the separate trim_if_too_long() function above it).

wrangle_feed <- function(the_feed_url, the_feed_dataframe = myfeeds) {
  my_feed_data <- tidyfeed(the_feed_url)
  my_feed_data$feed_title <- the_feed_dataframe$feed_title[the_feed_dataframe$feed_url == the_feed_url][1]
  if("entry_url" %in% names(my_feed_data)) {
    my_feed_data <- my_feed_data |>
      rename(item_title = entry_title, item_pub_date = entry_last_updated, item_link = entry_url, item_description = entry_content) 
  }
  if(str_detect(my_feed_data$feed_title[1], "Mastodon")) {
    my_feed_data <- my_feed_data |>
      mutate(
        item_author = str_replace_all(item_link, "^.*?/(@.*?)/.*?$", "\\1"),
        item_title = if_else(str_detect(item_author, "@"), paste0("Mastodon Post by ", item_author), "Mastodon Post")
      )
  }  
  my_feed_data <- my_feed_data |>
    select(Headline = item_title, Date = item_pub_date, URL = item_link, Description = item_description, Feed = feed_title) |>
    mutate(
      Description = str_remove_all(Description, "To read this article in full, please click here"),
      Description = purrr::map_chr(Description, trim_if_too_long),
      Description = str_glue("{Description}  <a target='_blank' href='{URL}' rel='noopener'> >></a>"),
      Date = format_ISO8601(Date, precision = "ymdhm"),
      Date = str_replace_all(Date, "T", " ")
    )    
  return(my_feed_data)  
}

Handle a missing or broken feed

I want to make sure this code doesn't blow up and stop on a single error if one of the feeds is unavailable. I can do that by making a "safe," error-handling version of the function with purrr's possibly():

wrangle_feed_safely <- possibly(wrangle_feed, otherwise = NULL)

The wrangle_feed_safely() version of the function returns NULL if there's an error instead of stopping. Now I can run the function on all my feed URLs and get a single data frame returned with purrr's map_df(). The code below also arranges results by descending date so the newest entries appear first, regardless of source.

mydata <- map_df(myfeeds$feed_url, wrangle_feed_safely) |>
  arrange(desc(Date))
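Under the hood, possibly() behaves much like wrapping a call in base R's tryCatch(). Here's a toy sketch of that pattern; shaky() is a made-up stand-in for any function that can fail:

```r
# A function that errors on "bad" input, and a safe wrapper that
# returns NULL instead of halting.
shaky <- function(x) if (x < 0) stop("bad feed") else x * 2
safe_shaky <- function(x) tryCatch(shaky(x), error = function(e) NULL)

safe_shaky(3)   # 6
safe_shaky(-1)  # NULL, instead of stopping with an error
```

Because map_df() drops NULL results, one broken feed simply vanishes from the combined data frame rather than crashing the whole run.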

Display the results

The hard part is done; we have our data! Now it's time to display the results.

I'll make a copy of the data frame without the URL field for use in my display table, since I don't want to show the URL column (I've got the clickable >> in my description). I wouldn't make a copy for a huge data set, but this one is small, and the copy is a bit of a backup in case I decide later that I still want the URL field.

mytabledata <- select(mydata, -URL)

One of the easiest ways to display this data is with a table. I'll use the DT package because I like its support for regular-expression searching. Regex searching is especially handy when searching for something like "R," because a regex lets you search for R as a separate word rather than matching every capitalized word that starts with R.

In my Quarto document, I'll enclose the table code chunk in a "column-page" CSS style class, with :::{.column-page} at the top and ::: at the end, as you can see in the code below. That tells my Quarto document to make the table wider than usual: a full page width. column-page is a built-in Quarto CSS class that increases the content width, so you don't have to know how to code HTML and CSS to make this modification.

If this option still isn't wide enough (sometimes the table still scrolls because of, say, a ridiculously long URL in a post that won't line break), you can use {.column-screen} instead of {.column-page} to remove the page margins altogether.

The code below also makes some tweaks to the default DT datatable. filter = 'top' adds search filters above each column. escape = FALSE displays HTML as HTML instead of showing the underlying code. I add regex = TRUE and caseInsensitive = TRUE to the search options to turn on regex and ignore-case searching. I also tweak the page length and page-length menu options, and set my third column (Description) to be 80% of the table width. (If you're wondering why the target column is 2 when I want the third column, it's because DT is a wrapper for a JavaScript library, and the underlying library uses the JS convention of starting to count at 0.)

:::{.column-page}

```{r}
DT::datatable(mytabledata, filter = 'top', escape = FALSE, rownames = FALSE,
  options = list(
    search = list(regex = TRUE, caseInsensitive = TRUE),  
    pageLength = 25,
    lengthMenu = c(25, 50, 100, 200),
    autoWidth = TRUE,
    columnDefs = list(list(width = '80%', targets = list(2)))
  )
)
```

:::

Example of the RSS feed reader table, searching for JavaScript entries.

Thanks to regex searching, you can search for R as a separate word with the regular expression \bR\b. The \b indicates a "word boundary" such as a space, punctuation mark, or the beginning or end of a line.
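The word-boundary pattern works the same way in R's own regex engine as in the JavaScript engine DT uses for table searching, so you can test it right in your console:

```r
# \bR\b matches R as a whole word, but not R starting another word.
grepl("\\bR\\b", "New features in R 4.2", perl = TRUE)   # TRUE
grepl("\\bR\\b", "RStudio and React news", perl = TRUE)  # FALSE
```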

And there you have it, a simple RSS reader! There are more modifications you could make, including caching results and further tweaking the display. For example, adding

Available feeds: `r knitr::combine_words(sort(unique(mydata$Feed)))`

to the Quarto document after parsing the RSS feeds will show a list of all the available feeds.

For more on Quarto and how you might use JavaScript with R in a Quarto document, see "A beginner's guide to using Observable JavaScript, R, and Python with Quarto." And for more R tips, head to InfoWorld's Do More With R page.