Beginner
The Tidyverse
Understand the tidyverse ecosystem, its core packages, tidy data principles, tibbles vs data frames, and the pipe operator.
What is the Tidyverse?
The tidyverse is an opinionated collection of R packages designed for data science. All packages share a common design philosophy, grammar, and data structures. Install and load everything at once:
```r
# Install (only once)
install.packages("tidyverse")

# Load (every session)
library(tidyverse)
# Attaches: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
# (tidyverse 2.0.0 and later also attach lubridate)
```
Core Packages
| Package | Purpose | Key Functions |
|---|---|---|
| ggplot2 | Data visualization | ggplot(), aes(), geom_*() |
| dplyr | Data manipulation | filter(), select(), mutate(), summarise() |
| tidyr | Data tidying | pivot_longer(), pivot_wider(), separate() |
| readr | Data import | read_csv(), read_tsv(), write_csv() |
| purrr | Functional programming | map(), map_dbl(), walk() |
| tibble | Modern data frames | tibble(), as_tibble(), tribble() |
| stringr | String manipulation | str_detect(), str_replace(), str_extract() |
| forcats | Factor handling | fct_reorder(), fct_lump(), fct_recode() |
| lubridate (core since tidyverse 2.0.0) | Dates and times | ymd(), year(), floor_date() |
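To show how functions from different core packages combine, here is a minimal sketch that uses dplyr, stringr, and purrr on a small made-up table. The `people` data and the `"li"` search pattern are assumptions chosen only for illustration.

```r
library(dplyr)
library(stringr)
library(purrr)

# Small made-up data set, used only for illustration
people <- tibble::tibble(
  name  = c("Alice", "Bob", "Charlie"),
  score = c(92.5, 87.3, 95.1)
)

# dplyr + stringr: keep rows whose name contains "li", then add a column
result <- people |>
  filter(str_detect(name, "li")) |>
  mutate(score_rounded = round(score))

# purrr: apply a function to each element and return a numeric vector
name_lengths <- map_dbl(people$name, str_length)
```

Because the packages share the same conventions (data first, consistent naming), the functions can be mixed freely in one pipeline.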
Tidy Data Principles
Data is "tidy" when it follows three rules:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
```r
# NOT tidy (wide format)
# country  2020  2021  2022
# USA       100   110   120
# UK         80    85    90

# TIDY (long format)
# country  year  value
# USA      2020    100
# USA      2021    110
# USA      2022    120
# UK       2020     80
# UK       2021     85
# UK       2022     90
```
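The wide-to-long conversion shown in the comments can be done with `pivot_longer()` from tidyr. The sketch below rebuilds the same small table; the object names `wide` and `tidy` are assumptions for illustration, and converting `year` to an integer is one possible choice rather than a requirement.

```r
library(tibble)
library(tidyr)

# The wide (not tidy) table: one column per year
wide <- tribble(
  ~country, ~`2020`, ~`2021`, ~`2022`,
  "USA",    100,     110,     120,
  "UK",      80,      85,      90
)

# Gather the year columns into a year/value pair of columns
tidy <- wide |>
  pivot_longer(
    cols = -country,                          # every column except country
    names_to = "year",                        # old column names go here
    values_to = "value",                      # cell values go here
    names_transform = list(year = as.integer) # store years as integers
  )
```

After the reshape, each row is one country-year observation, which matches the three tidy data rules.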
Tibbles vs Data Frames
Tibbles are the tidyverse's enhanced data frames. The example creates tibbles in three ways; the comments at the end list the key differences from base data frames:
```r
library(tibble)

# Create a tibble column by column
tb <- tibble(
  name  = c("Alice", "Bob", "Charlie"),
  age   = c(25, 30, 35),
  score = c(92.5, 87.3, 95.1)
)

# Create row-by-row with tribble
tb2 <- tribble(
  ~name,     ~age, ~score,
  "Alice",   25,   92.5,
  "Bob",     30,   87.3,
  "Charlie", 35,   95.1
)

# Convert data frame to tibble
as_tibble(mtcars)

# Tibble advantages:
# - Never converts strings to factors
#   (data.frame() did so by default before R 4.0.0)
# - Prints only first 10 rows (no console flooding)
# - Shows column types in output
# - Stricter subsetting (no partial name matching)
```
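The stricter subsetting can be seen by comparing a base data frame with a tibble built from the same data. This is a minimal sketch; the names `df`, `tb_strict`, and the deliberately incomplete column name `na` are assumptions for illustration.

```r
library(tibble)

df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
tb_strict <- as_tibble(df)

# Partial name matching with $:
# a data frame silently matches "na" to the "name" column
df_partial <- df$na
# a tibble returns NULL and warns about an unknown column
tb_partial <- suppressWarnings(tb_strict$na)

# Single-column subsetting with [ , ]:
# a data frame drops to a plain vector, a tibble stays a tibble
df_col <- df[, "age"]
tb_col <- tb_strict[, "age"]
```

The tibble behaviour is more predictable in scripts and functions, since a typo in a column name is not silently resolved to a different column.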
The Pipe Operator
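The pipe is central to tidyverse code. It passes the result of one step as the first argument to the next. The example uses the native pipe `|>`, available since R 4.1; the older magrittr pipe `%>%` behaves the same way in this example: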
```r
library(dplyr)

# Without pipe (nested, hard to read)
arrange(filter(select(mtcars, mpg, cyl, hp), cyl == 6), desc(mpg))

# With pipe (reads top-to-bottom, left-to-right)
mtcars |>
  select(mpg, cyl, hp) |>
  filter(cyl == 6) |>
  arrange(desc(mpg))

# Read as: "Take mtcars, THEN select columns,
# THEN filter rows, THEN arrange."
```
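The "first argument" rule means `x |> f(y)` is evaluated as `f(x, y)`. The sketch below checks this on `mtcars` by comparing a piped call with its nested equivalent; the variable names are assumptions for illustration, and the native pipe requires R 4.1 or later.

```r
library(dplyr)

# Piped form: mtcars becomes the first argument of filter(),
# and the filtered result becomes the first argument of nrow()
piped <- mtcars |> filter(cyl == 6) |> nrow()

# Equivalent nested form
nested <- nrow(filter(mtcars, cyl == 6))

# The magrittr pipe (re-exported by dplyr) gives the same result here
magrittr_piped <- mtcars %>% filter(cyl == 6) %>% nrow()

identical(piped, nested)
```

The two pipes differ in some advanced features, such as placeholder syntax, but for simple first-argument chaining like this they are interchangeable.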
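Pipe shortcut: In RStudio, press Ctrl+Shift+M (Windows/Linux) or Cmd+Shift+M (macOS) to insert a pipe. By default the shortcut inserts the magrittr pipe `%>%`; to insert the native pipe `|>` instead, enable "Use native pipe operator" under Tools > Global Options > Code.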
Lilly Tech Systems