# load packages
library(tidyverse)
# load dataset
library(palmerpenguins)
data(penguins)ggplot2::geom_errorbar()
Function of the Week
1 geom_errorbar()
In this document, I will introduce the geom_errorbar() function and show what it’s for.
The tidyverse package automatically includes ggplot2
The Palmer penguins dataset contains data on 344 penguins from the Palmer Archipelago
1.1 What is it for?
Estimates from a dataset often have some degree of uncertainty
Error bars are a graphical representation of this uncertainty or variability
- They can show variability via standard deviation, standard error, or confidence intervals
geom_errorbar()is a function that lets us add error bars to existing plots to understand the level of variability around measurements- Critical inputs into this function are the min and max values for the bars for each measurement (in our case,
yminandymax)
- Critical inputs into this function are the min and max values for the bars for each measurement (in our case,
# make a summary table with aggregated stats
penguins_summary <- penguins %>%
filter(!is.na(body_mass_g)) %>%
group_by(species) %>%
summarise(
mean_mass = mean(body_mass_g),
sd_mass = sd(body_mass_g),
se_mass = sd_mass / sqrt(n()),
ci_lower = mean_mass - qt(0.975, df = n() - 1) * se_mass,
ci_upper = mean_mass + qt(0.975, df = n() - 1) * se_mass
)1.1.1 95% Confidence intervals
Confidence intervals are good for inference and show us the range where the true population mean is likely to fall for a given group and give us an idea of the uncertainty in a measurement.
ggplot(penguins_summary, aes(x = species, y = mean_mass)) +
geom_col(fill = "thistle") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 0.2) +
ggtitle("Mean Mass by Species") +
labs(x = "Species", y="Mean Body Mass (g)") +
theme_minimal()From the above, we can conclude that there is a statistically significant difference in body mass between Gentoo and Adelie/Chinstrap penguins, but not between Adelie and Chinstrap. We can additionally conduct a t-test to confirm this significance.
1.1.2 Changing colors & width
We can change the colors of the data as well as the error bars, along with the width of error bars.
ggplot(penguins_summary, aes(x = species, y = mean_mass)) +
geom_col(fill = "beige") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 1, color = "purple") +
ggtitle("Mean Mass by Species") +
labs(x = "Species", y="Mean Body Mass (g)") +
theme_minimal()1.1.3 Other changes
Size (thickness) of the error bars with the
size = XargumentType of line with the
linetype = "X"argument (“solid”, “dashed”, “dotted”, etc.)Transparency (opacity) with the
alpha = Xargument
1.1.4 Other measurements
The inputs for min and max for error bars can be anything such as standard deviation:
ggplot(penguins_summary, aes(x = species, y = mean_mass)) +
geom_col(fill = "thistle") +
geom_errorbar(aes(ymin = mean_mass - sd_mass, ymax = mean_mass + sd_mass),
width = 0.2) +
ggtitle("Mean Mass by Species") +
labs(x = "Species", y="Mean Body Mass (g)") +
theme_minimal()Standard deviation doesn’t provide statistical significance, but we can see the spread of data within each species. For example, Gentoo penguins have the larger standard deviation (taller error bars) meaning more variation.
1.1.5 Going horizontal
We can also add error bars for horizontal plots! Just change the function to be geom_errorbarh and the inputs are now xmin and xmax, and height instead of width.
ggplot(penguins_summary, aes(y = species, x = mean_mass)) +
geom_col(fill = "thistle") +
geom_errorbarh(aes(xmin = ci_lower, xmax = ci_upper),
height = 0.2) +
ggtitle("Mean Mass by Species") +
labs(x = "Mean Body Mass (g)", y="Species") +
theme_minimal()1.1.6 Other use cases
It doesn’t have to be just bar plots. We can add error bars most plot types!
# econ summary table
econ_summary <- economics %>%
summarise(
mean_unemploy = mean(unemploy, na.rm = TRUE),
sd_unemploy = sd(unemploy, na.rm = TRUE),
se_unemploy = sd_unemploy / sqrt(n()),
ci_lower = unemploy - qt(0.975, df = n() - 1) * se_unemploy,
ci_upper = unemploy + qt(0.975, df = n() - 1) * se_unemploy
)
# plot with 95% CI
ggplot(economics, aes(x = date, y = unemploy)) +
geom_line(color = "red") +
geom_errorbar(aes(ymin = econ_summary$ci_lower, ymax = econ_summary$ci_upper), width = 1) +
ggtitle("US Unemployment Over Time") +
labs(x = "Date", y="# Unemployed (000s)") +
theme_minimal()1.2 Is it helpful?
Yes! It’s a great addition to any data visualization and allows us to:
Understand variability of measurements
Compare differences between groups and assess their significance
Provides context to raw data and shows uncertainty