Bonus seminar 1: Graphical displays with ggplot2

The grammar of graphics: ggplot2

Further reading

  • The chapter "Data visualization" in R for Data Science

  • Data visualization with ggplot2 (https://rstudio.github.io/cheatsheets/data-visualization.pdf)

Preparation

R packages

ggplot2
# library(tidyverse)
library(ggplot2)

Instead of loading the {ggplot2} package on its own, we could also load the entire {tidyverse}, which includes it.

palmerpenguins
library(palmerpenguins)

The {palmerpenguins} package contains the penguins data set.
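Before plotting, it can help to check how large the data set is and where values are missing. The following is a minimal sketch using only base R functions; the name na_counts is illustrative. Two penguins have no recorded flipper length or body mass, which is why ggplot2 reports removing 2 rows in the plots below.

```r
library(palmerpenguins)

# Dimensions of the data set: rows (penguins) and columns (variables)
dim(penguins)

# Number of missing values per variable
na_counts <- colSums(is.na(penguins))
na_counts
```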

How ggplot2 works

  1. Create a plot object
ggplot(data = penguins)

  2. Define the aesthetics
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g))

  3. Display the data with a geom
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

  4. Add further layers
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

  5. Optional: add labels, themes, etc.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Flipper length and body mass",
       x = "Flipper length (mm)",
       y = "Body mass (g)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

  6. Optional: modify any detail you like!
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(shape = island)) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "Flipper length and body mass",
       subtitle = "for penguins from different islands",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       shape = "Island") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
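Because every step simply adds something to the plot object, the object can also be stored in a variable and extended piece by piece. The sketch below illustrates this; the names p, p_points and p_smooth are illustrative, and na.rm = TRUE and formula = y ~ x are added here only to silence the warnings and the message shown above.

```r
library(ggplot2)
library(palmerpenguins)

# Steps 1 and 2: plot object with data and aesthetics (no layer yet)
p <- ggplot(data = penguins,
            mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Step 3: add a geom
p_points <- p + geom_point(na.rm = TRUE)

# Step 4: add a second layer on top of the points
p_smooth <- p_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

# Printing the object draws the plot
p_smooth
```

Storing intermediate objects makes it easy to try out different layers or themes without retyping the whole call.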

Our standard plots

Histograms

The basic building block for histograms is geom_histogram().

ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
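The message above suggests choosing the bin width explicitly instead of relying on the default of 30 bins. A minimal sketch, assuming a bin width of 5 mm (an arbitrary choice for illustration, as is the name p_hist):

```r
library(ggplot2)
library(palmerpenguins)

# Histogram with an explicit bin width of 5 mm;
# na.rm = TRUE drops the two missing values without a warning
p_hist <- ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, na.rm = TRUE)

p_hist
```

Smaller bin widths show more detail but also more noise, so it is worth trying a few values.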

Boxplots

The basic building block for boxplots is geom_boxplot().

ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
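Boxplots are especially useful for comparing groups. The following sketch puts a grouping variable (here species, chosen for illustration) on the x axis, so that one box is drawn per species; the name p_box is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# One box per species: categorical variable on x, numeric variable on y
p_box <- ggplot(data = penguins,
                aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(na.rm = TRUE)

p_box
```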

Scatter plots

As we have already seen, the basic building block for scatter plots is geom_point().

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
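A common source of confusion is the difference between mapping an aesthetic to a variable and setting it to a fixed value. The sketch below contrasts the two for the point colour; the names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: the colour depends on a variable, so it goes inside aes()
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), na.rm = TRUE)

# Setting: one fixed colour for all points, so it goes outside aes()
p_fixed <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "purple", na.rm = TRUE)

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.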

Exercises

  1. Create the three standard plots with a different variable (or combination of variables) from the penguins data set.

    Examples:

    ggplot(data = penguins, aes(x = bill_length_mm)) +
      geom_histogram()
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_bin()`).

    ggplot(data = penguins, aes(x = bill_length_mm)) +
      geom_boxplot()
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_boxplot()`).

    ggplot(data = penguins,
           mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point()
    Warning: Removed 2 rows containing missing values or values outside the scale range
    (`geom_point()`).

  2. Suppose we are interested in the (linear) relationship between bill length and bill depth of the penguins in the penguins data set:

    cor(penguins$bill_length_mm, penguins$bill_depth_mm, use = "complete.obs")
    [1] -0.2350529

    Use plots to find out why this correlation could be misleading.

    If we look at the corresponding scatter plot, we may notice some odd clusters of points:

    ggplot(data = penguins,
           aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "Penguin bill dimensions",
        subtitle = "Palmer Station LTER",
        x = "Bill length (mm)",
        y = "Bill depth (mm)") +
      theme(plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")
    `geom_smooth()` using formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_smooth()`).
    Warning: Removed 2 rows containing missing values or values outside the scale range
    (`geom_point()`).

    In fact, the negative relationship arises only because the different penguin species are pooled. Within each species the relationship is positive:

    ggplot(data = penguins,
           aes(x = bill_length_mm, y = bill_depth_mm, group = species)) +
      geom_point(aes(color = species, 
        shape = species),
        size = 3,
        alpha = 0.8) +
      geom_smooth(method = "lm", se = FALSE, aes(color = species)) +
      scale_color_manual(values = c("darkorange","purple","cyan4")) +
      labs(title = "Penguin bill dimensions",
        subtitle = "Bill length and depth for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
        x = "Bill length (mm)",
        y = "Bill depth (mm)",
        color = "Penguin species",
        shape = "Penguin species") +
      theme(legend.position = "inside",
        legend.position.inside = c(0.85, 0.15),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")
    `geom_smooth()` using formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_smooth()`).
    Warning: Removed 2 rows containing missing values or values outside the scale range
    (`geom_point()`).

    This phenomenon is known as Simpson's paradox; it is closely related to what the literature sometimes calls the ecological fallacy.
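
    The sign reversal can also be checked numerically. The following is a minimal sketch using base R; the names overall and by_species are illustrative. It computes the correlation once for all penguins and once within each species.

    ```r
    library(palmerpenguins)

    # Correlation across all penguins (negative, as computed above)
    overall <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                   use = "complete.obs")

    # Correlation within each species: split the data, then apply cor()
    by_species <- sapply(
      split(penguins, penguins$species),
      function(d) cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
    )

    overall
    by_species
    ```

    All three within-species correlations are positive, while the pooled correlation is negative.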

  3. Pick one of the standard plots and think of an interesting extension (e.g. using colours, shapes, grouping variables, labels, …). Draw a rough sketch on a sheet of paper (or a tablet). Then try to implement the plot with the help of Google and/or ChatGPT.

    Examples:

    ggplot(data = penguins, aes(x = flipper_length_mm)) +
      geom_histogram(aes(fill = species), 
        alpha = 0.5, 
        position = "identity") +
      scale_fill_manual(values = c("darkorange","purple","cyan4")) +
      labs(x = "Flipper length (mm)",
        y = "Frequency",
        title = "Penguin flipper lengths")
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_bin()`).

    ggplot(data = penguins, 
           aes(x = species, y = flipper_length_mm, color = species)) +
      geom_violin(alpha = 0.7) +
      geom_boxplot(width = 0.3, show.legend = FALSE) +
      geom_jitter(alpha = 0.5, show.legend = FALSE, position = position_jitter(width = 0.2, seed = 0)) +
      scale_color_manual(values = c("darkorange","purple","cyan4")) +
      labs(x = "Species",
        y = "Flipper length (mm)")
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_ydensity()`).
    Warning: Removed 2 rows containing non-finite outside the scale range
    (`stat_boxplot()`).
    Warning: Removed 2 rows containing missing values or values outside the scale range
    (`geom_point()`).