Returning to R
This week, I had a wonderful experience returning to R. After several years of writing Python as a Machine Learning Engineer (and occasionally dabbling in Julia), I was astounded by how short and expressive the functional style of the Tidyverse ecosystem in R is for analysing and visualising data.
I first used R in an undergraduate statistics course (which was not a subject I enjoyed at the time). In the 11 years that passed, I went through mathematical neuroscience and ended up in the field of machine learning and data science — where Python reigns supreme, especially for neural networks.
I was curious about returning to R to see how my perspective has changed, and I picked up a copy of “R for Data Science”. Before I knew it, I had fallen in love with the Tidyverse — and I absolutely understand why many data scientists swear by it.
One of the best mantras I’ve come across for good software design comes from the title of a classic book on web design and usability by Steve Krug:
Don’t make me think
Great software is pleasant to use, because it gets out of your way and works intuitively. The principles that the Tidyverse is built on align perfectly with this philosophy:
- Design for humans
- Reuse existing data structures
- Design for the pipe and functional programming
The end result: an ecosystem of packages that gives non-programmers a powerful system to handle any data science problem using short and clear code. Adopting a functional style from the get-go makes code significantly shorter and easier to compose, debug, and understand.
I experienced this first-hand with the latest dataset from Tidy Tuesday:
library(tidyverse)
library(tidytuesdayR)
# Get data
tuesdata <- tidytuesdayR::tt_load('2026-02-10')
df <- tuesdata$schedule
First, I wanted to explore the schema of the dataset. Here’s the two-step pipeline I used:
# Build a two-column tibble: each column's name and its first class
get_schema <- function(df) {
  tibble(
    col = names(df),
    dtype = map_chr(df, ~ class(.x)[1])
  )
}
# Count the columns of each type and list their names
summarise_schema <- function(df_schema) {
  df_schema |>
    group_by(dtype) |>
    summarise(count = n(), cols = paste(col, collapse = ", ")) |>
    arrange(desc(count))
}
df |> get_schema() |> summarise_schema()
Very little cleaning was needed for this dataset. Here’s how I coded a couple of the plots I used to investigate some of the posed questions:
# How are medal events distributed across the days of the Olympics?
plot_bar_medal_events_by_date <- function(df) {
  df |>
    # Derive the calendar date from the UTC start timestamp
    mutate(date = as.Date(start_datetime_utc)) |>
    ggplot(aes(x = date, fill = discipline_name)) +
    geom_bar()
}
df |> plot_bar_medal_events_by_date()
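# What is the typical duration of different types of events?
plot_boxplot_duration_by_discipline <- function(df) {
  df |>
    # Fix the unit to minutes so the axis does not depend on difftime's automatic unit choice
    mutate(duration_mins = as.numeric(difftime(end_datetime_utc, start_datetime_utc, units = "mins"))) |>
    ggplot(aes(x = duration_mins, y = discipline_name)) +
    geom_boxplot()
}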
df |> plot_boxplot_duration_by_discipline()
Compared with how I would write this in Python with pandas, the R version is considerably shorter: Python's verbosity would take significantly more lines of code to express the same outcome. The resulting Python code would be more of a chore to write and harder to read, and both of those work against good exploratory data analysis.
Writing code, whether for software or data science, is an exercise in communication. Code that is easy for others (and your future self) to read is something we should all strive for. R and the Tidyverse are absolutely worth celebrating, because they make this significantly easier for those without a background in computer science.
One fascinating blog post I read on this topic was written by Rasmus Bååth, who had the opposite experience as an R programmer entering the world of Python. I highly recommend reading their post for more concrete discussions and examples on R versus Python for data analysis.