• Steven Ponce
  • About
  • Data Visualizations
  • Projects
  • Resume
  • Email

On this page

  • Steps to Create this Graphic

Gender Representation in the International Mathematical Olympiad

Left: Total number of male and female contestants by country
Proportion of total contestants who were male and female each year

TidyTuesday
Data Visualization
R Programming
2024
Author

Steven Ponce

Published

September 21, 2024

Figure 1: Gender Representation in the International Mathematical Olympiad. On the left, a bar chart shows the total number of male and female contestants by country (raw counts), with the UK, Russia, and Romania having the highest totals. Male contestants dominate in nearly every country. On the right, a line chart shows the proportion of male and female contestants each year from 1959 to 2024, with males consistently comprising around 87% of contestants and females around 13%. Annotations highlight that despite an increase in overall contestants, the gender gap has remained the same.

Steps to Create this Graphic

1. Load Packages & Setup

Code
```{r}
#| label: load

pacman::p_load(
  tidyverse,         # Easily Install and Load the 'Tidyverse'
  ggtext,            # Improved Text Rendering Support for 'ggplot2'
  showtext,          # Using Fonts More Easily in R Graphs
  janitor,           # Simple Tools for Examining and Cleaning Dirty Data
  skimr,             # Compact and Flexible Summaries of Data
  scales,            # Scale Functions for Visualization
  lubridate,         # Make Dealing with Dates a Little Easier
  MetBrewer,         # Color Palettes Inspired by Works at the Metropolitan Museum of Art
  MoMAColors,        # Color Palettes Inspired by Artwork at the Museum of Modern Art in New York City
  glue,              # Interpreted String Literals
  patchwork,         # The Composer of Plots
  geomtextpath       # Curved Text in 'ggplot2' 
 )  

# ### |- figure size ----
camcorder::gg_record(
  dir    = here::here("temp_plots"),
  device = "png",
  width  =  11,
  height =  10,
  units  = "in",
  dpi    = 320
)

### |- resolution ----
showtext_opts(dpi = 320, regular.wt = 300, bold.wt = 800)
```

2. Read in the Data

Code
```{r}
#| label: read

tt <-tidytuesdayR::tt_load(2024, week = 39) 

country_results <- tt$country_results_df |> clean_names() |> glimpse()
individual_results <- tt$individual_results_df |> clean_names() |> glimpse()
timeline <- tt$timeline_df |> clean_names() |> glimpse()

tidytuesdayR::readme(tt)
rm(tt)
```

3. Examing the Data

Code
```{r}
#| label: examine

glimpse(country_results)
glimpse(individual_results)
glimpse(timeline)
```

4. Tidy Data

Code
```{r}
#| label: tidy

# first plot data (pyramid style chart) ----

# Calculate the total number of male and female contestants per country
gender_by_country_summary <- timeline |>
  group_by(country) |>
  summarize(
    male = sum(male_contestant, na.rm = TRUE),
    female = sum(female_contestant, na.rm = TRUE),
    total_contestants = male + female
  ) |>
  ungroup() |>
  arrange(desc(total_contestants))

# Prepare the data for a pyramid chart
gender_by_country <- gender_by_country_summary |>
  pivot_longer(
    cols = c(male, female),
    names_to = "gender",
    values_to = "count"
  ) |>
  mutate(count = ifelse(gender == "female", -count, count)) # Negative for female

# Modify the country labels to shorten or reformat names
gender_by_country <- gender_by_country |>
  mutate(
    country = case_when(
      country == "United States of America" ~ "USA",
      country == "United Kingdom" ~ "UK",
      country == "People's Republic of China" ~ "China",
      country == "Union of Soviet Socialist Republics" ~ "USSR",
      country == "Republic of Korea" ~ "South Korea",
      country == "Russian Federation" ~ "Russia",
      country == "German Democratic Republic" ~ "East Germany",
      TRUE ~ country # Keep all other country names as they are
    )
  )


# second plot (line chart) ----

# Data prep: Normalize by total contestants and calculate the gap
gender_representation_normalized <- timeline |>
    filter(!is.na(female_contestant) & !is.na(male_contestant)) |>
    mutate(
        total_contestants = female_contestant + male_contestant,
        female_percentage = (female_contestant / total_contestants) * 100,
        male_percentage = (male_contestant / total_contestants) * 100
        ) |>
    select(year, female_percentage, male_percentage)

# Pivot longer
gender_representation_normalized_long <- gender_representation_normalized |>
    pivot_longer(
      cols = c(female_percentage, male_percentage), 
      names_to = "gender", 
      values_to = "percentage"
      ) |>
    mutate(gender = ifelse(gender == "female_percentage", "Female", "Male"))

# Split the data into two separate datasets for ribbon use
male_data <- gender_representation_normalized_long |> filter(gender == "Male")
female_data <- gender_representation_normalized_long |> filter(gender == "Female")
```

5. Visualization Parameters

Code
```{r}
#| label: params

### |- plot aesthetics ----
bkg_col      <- colorspace::lighten('#f7f5e9', 0.05)    
title_col    <- "gray20"           
subtitle_col <- "gray20"     
caption_col  <- "gray30"   
text_col     <- "gray20"    
col_palette  <- MoMAColors::moma.colors(palette_name = 'Klein', type = "discrete")[c(1,2)]

### |-  titles and caption ----
# icons
tt <- str_glue("#TidyTuesday: { 2024 } Week { 39 } &bull; Source: IMO Team and Individual Results<br>")
li <- str_glue("<span style='font-family:fa6-brands'>&#xf08c;</span>")
gh <- str_glue("<span style='font-family:fa6-brands'>&#xf09b;</span>")
mn <- str_glue("<span style='font-family:fa6-brands'>&#xf4f6;</span>")

# text
male   <- str_glue("<span style='color:{ col_palette[2] }'>**male**</span>")
female <- str_glue("<span style='color:{ col_palette[1] }'>**female**</span>")

title_text    <- str_glue("Gender Representation in the International Mathematical Olympiad")
subtitle_text <- str_glue("__Left:__ Total number of  { male } and { female } contestants by country (_raw counts_).<br>
                          __Right:__ Proportion of total contestants who were { male } and { female }  each year (_% of total contestants per year_)")
caption_text  <- str_glue("{tt} {li} stevenponce &bull; {mn} @sponce1(graphic.social) {gh} poncest &bull; #rstats #ggplot2")

### |-  fonts ----
font_add("fa6-brands", "fonts/6.4.2/Font Awesome 6 Brands-Regular-400.otf")
font_add_google("Oswald", regular.wt = 400, family = "title")
font_add_google("Merriweather Sans", regular.wt = 400, family = "subtitle")
font_add_google("Merriweather Sans", regular.wt = 400, family = "text")
font_add_google("Noto Sans", regular.wt = 400, family = "caption")
showtext_auto(enable = TRUE)

### |-  plot theme ----
theme_set(theme_minimal(base_size = 14, base_family = "text"))                

theme_update(
    plot.title.position   = "plot",
    plot.caption.position = "plot",
    legend.position       = 'plot',
    plot.background       = element_rect(fill = bkg_col, color = bkg_col),
    panel.background      = element_rect(fill = bkg_col, color = bkg_col),
    plot.margin           = margin(t = 10, r = 20, b = 10, l = 20),
    axis.title.x          = element_text(margin = margin(10, 0, 0, 0), size = rel(1.1), 
                                         color = text_col, family = "text", face = "bold", hjust = 0.5),
    axis.title.y          = element_text(margin = margin(0, 10, 0, 0), size = rel(1.1), 
                                         color = text_col, family = "text", face = "bold", hjust = 0.5),
    axis.text             = element_text(size = rel(0.8), color = text_col, family = "text"),
)  
```

6. Plot

Code
```{r}
#| label: plot

### |-  first plot ----  

# Pyramid style chart
p1 <- ggplot(gender_by_country, aes(x = reorder(country, total_contestants), y = count, fill = gender)) +
  geom_bar(stat = "identity", width = 0.75, alpha = 0.85) +

  # Geoms
  # Adding labels outside the bars
  geom_text(aes(label = comma(abs(count))),
    position = position_nudge(y = ifelse(gender_by_country$gender == "female", -50, 50)),
    size = 3.6, hjust = ifelse(gender_by_country$gender == "female", 1, 0), color = text_col
  ) +

  # Adding a single country label next to the bars
  geom_text(aes(y = -900, label = country), # Position countries next to the bars
    size = 3.6, hjust = 0.5, vjust = 0, color = text_col
  ) +

  # Scales
  scale_y_continuous(
    breaks = seq(-1000, 1000, by = 500),
    labels = scales::comma_format(),
    limits = c(-1200, 1600)
  ) +
  scale_fill_manual(values = col_palette) +
  coord_flip(clip = "off") +

  # labs
  labs(
    x = NULL,
    y = "Number of Contestants",
    fill = "Gender"
  ) +

  # Theme
  theme(
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major.y = element_blank(),
  )


### |-  second plot ----  

# Create the plot, including the ribbon and the textlines
p2 <- ggplot() +

  # Geoms
  # Add ribbon to fill the area between male and female percentages
  geom_ribbon(aes(x = male_data$year, ymin = female_data$percentage, ymax = male_data$percentage),
    fill = "lightblue", alpha = 0.5
  ) +

  # Add the geom_textline for male and female percentages
  geom_textline(aes(x = year, y = percentage, color = gender, label = gender),
    data = gender_representation_normalized_long,
    linewidth = 1,
    family = "text",
    size = 5,
    fontface = "bold",
    hjust = 0.5, # move labels to the right
    offset = unit(0.3, "cm"), # move labels up
    text_smoothing = 30 # smooth text (more legible)
  ) +

  # Adding geom_point and geom_text for the start and end percentages for male and female
  geom_point(
    data = filter(gender_representation_normalized_long, year == min(year) | year == max(year)),
    aes(x = year, y = percentage, color = gender), size = 4
  ) +

  # Female
  geom_text(
    data = filter(
      gender_representation_normalized_long,
      (year == min(year) | year == max(year)) & gender == "Female"
    ),
    aes(
      x = year, y = percentage, label = scales::percent(percentage / 100, accuracy = 1),
      color = gender
    ), size = 5, nudge_x = -0.005, vjust = -1.3, fontface = "bold", family = "text"
  ) +

  # Male
  geom_text(
    data = filter(
      gender_representation_normalized_long,
      (year == min(year) | year == max(year)) & gender == "Male"
    ),
    aes(
      x = year, y = percentage, label = scales::percent(percentage / 100, accuracy = 1),
      color = gender
    ), size = 5, nudge_x = 0.005, vjust = 1.9, fontface = "bold", family = "text"
  ) +

  # Labs
  labs(
    x = "Year",
    y = "Percentage of Contestants",
    color = "Gender"
  ) +

  # Scales
  scale_x_continuous() +
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +
  scale_color_manual(values = col_palette) +
  coord_cartesian(clip = "off")


#### |-  combined plot ----  
# Annotation and aspect ratio of p2
p2 <- p2 + 
    annotate(
        "text", 
        x = 1962, 
        y = 50, 
        label = "Even though the overall contestant count has increased,\n the gender gap has remained the same.", 
        size = 4, 
        fontface = "italic", 
        family = "text", 
        color = 'gray40',
        hjust = 0
    ) +
    theme(aspect.ratio = 0.85)  

# Combine plots
combined_plot <- (p1 | p2) +
    patchwork::plot_layout(
        ncol = 2, 
        widths = c(1, 1.25),  # Adjusting relative widths
        guides = 'collect'    # Collect legends
    ) +
    
    # Labs
    plot_annotation(
        title = title_text,
        subtitle = subtitle_text,
        caption = caption_text
    ) &
    
    # Theme 
    theme(
        plot.margin = margin(10, 20, 10, 20),  
        
        plot.title = element_markdown(
            size = rel(1.7),   
            family = "title",
            face = "bold",
            color = title_col,
            lineheight = 1.1,
            margin = margin(t = 5, b = 5)
        ),
        plot.subtitle = element_markdown(
            size = rel(0.88), 
            family = 'subtitle',
            color = subtitle_col,
            lineheight = 1.1, 
            margin = margin(t = 5, b = 5)
        ),
        plot.caption = element_markdown(
            size = rel(0.65),
            family = "caption",
            color = caption_col,
            lineheight = 1.1,
            hjust = 0.5,
            halign = 1,
            margin = margin(t = 5, b = 5)
        )
    )

# Show the combined plot
combined_plot  
```

7. Save

Code
```{r}
#| label: save

### |-  plot image ----  

library(ggplotify)
# Convert patchwork plot to grob 
# There was some issues between patchwork and ggsave
plot_grob <- as.grob(combined_plot)

# Save the plot again
ggsave(
    filename = here::here("data_visualizations/TidyTuesday/2024/tt_2024_39.png"),
    plot = plot_grob,
    width  = 11,
    height = 10,
    units  = "in",
    dpi    = 320
)

### |-  plot thumbnail----  
magick::image_read(here::here("data_visualizations/TidyTuesday/2024/tt_2024_39.png")) |> 
  magick::image_resize(geometry = "400") |> 
  magick::image_write(here::here("data_visualizations/TidyTuesday/2024/thumbnails/tt_2024_39.png"))
```

8. Session Info

Code
```{r, eval=TRUE}
info <- capture.output(sessioninfo::session_info())
# Remove lines that contain "[1]" and "[2]" (the file paths)
filtered_info <- grep("\\[1\\]|\\[2\\]", info, value = TRUE, invert = TRUE)
cat(filtered_info, sep = "\n")
```
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14 ucrt)
 os       Windows 11 x64 (build 22631)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2025-05-22
 pandoc   3.4 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package     * version date (UTC) lib source
 P digest        0.6.37  2024-08-19 [?] RSPM (R 4.4.0)
 P evaluate      1.0.1   2024-10-10 [?] RSPM (R 4.4.0)
 P fastmap       1.2.0   2024-05-15 [?] RSPM (R 4.4.0)
 P htmltools     0.5.8.1 2024-04-04 [?] RSPM (R 4.4.0)
 P htmlwidgets   1.6.4   2023-12-06 [?] CRAN (R 4.4.0)
 P jsonlite      1.8.9   2024-09-20 [?] RSPM (R 4.4.0)
 P knitr         1.49    2024-11-08 [?] RSPM (R 4.4.0)
 P rmarkdown     2.29    2024-11-04 [?] RSPM (R 4.4.0)
 P rstudioapi    0.17.1  2024-10-22 [?] RSPM (R 4.4.0)
 P sessioninfo   1.2.2   2021-12-06 [?] RSPM (R 4.4.0)
 P xfun          0.49    2024-10-31 [?] RSPM (R 4.4.0)
 P yaml          2.3.10  2024-07-26 [?] RSPM (R 4.4.0)


 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────
Back to top

© 2024 Steven Ponce

Source Issues