Beginner

Introduction to R for Data Science

Understand why R is a top choice for data science, discover the tidyverse ecosystem, and set up your data science environment.

R for Data Science Overview

R is one of the two dominant languages in data science, alongside Python. What sets R apart is that it was designed for statistical computing, so statistical methods are built into the language and its package ecosystem, and that ggplot2 gives it a consistent, layered system for data visualization.

This course focuses on the tidyverse — a coherent collection of packages that share a common philosophy for data science in R. You will learn to import, tidy, transform, visualize, and communicate data effectively.

The Tidyverse Ecosystem

The tidyverse is a collection of R packages designed by Hadley Wickham and the team at Posit (formerly RStudio). The core packages include:

  • ggplot2 — Data visualization using the grammar of graphics
  • dplyr — Data manipulation (filter, select, mutate, summarise)
  • tidyr — Data tidying (reshaping and cleaning)
  • readr — Fast data import (CSV, TSV)
  • purrr — Functional programming with lists and vectors
  • tibble — Modern data frames
  • stringr — String manipulation
  • forcats — Factor (categorical data) handling
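
To make the dplyr verbs listed above concrete, here is a minimal sketch that chains filter, select, mutate, and summarise on the built-in mtcars dataset. The column name kpl and the 100 hp cut-off are illustrative choices, not part of the course material.

```r
library(dplyr)

# mtcars ships with base R; move the row names into a regular column
cars <- tibble::rownames_to_column(mtcars, var = "model")

summary_by_cyl <- cars %>%
  filter(hp > 100) %>%                # keep rows with more than 100 hp
  select(model, cyl, mpg, wt) %>%     # keep only the columns we need
  mutate(kpl = mpg * 0.425144) %>%    # miles per US gallon -> km per litre
  group_by(cyl) %>%
  summarise(n = n(), mean_kpl = mean(kpl), .groups = "drop")

print(summary_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained with the pipe.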

Why R for Data Science?

  • ggplot2: A flexible, layered visualization system based on the grammar of graphics, widely regarded as one of the strongest tools for static plots.
  • dplyr: Intuitive, readable syntax for data manipulation that reads like English.
  • Shiny: Build interactive web dashboards directly from R without knowing HTML/CSS/JS.
  • R Markdown: Combine code, results, and narrative in reproducible documents.
  • Statistical depth: New statistical methods are often published as R packages before they become available in other languages.
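
As a rough illustration of the ggplot2 point above, the following sketch builds a plot layer by layer from the built-in mtcars dataset. The choice of variables, labels, and theme is only an example.

```r
library(ggplot2)

# Start with data and aesthetic mappings, then add layers with `+`
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                    # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: linear fits per group
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# In an interactive session, print(p) draws the plot
```

The plot is an ordinary R object, so it can be stored, modified with further layers, and printed later.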

R vs Python for Data Science

Aspect          | R                                   | Python
Visualization   | ggplot2 (superior for static plots) | matplotlib, seaborn, plotly
Data wrangling  | dplyr + tidyr (very readable)       | pandas (powerful but verbose)
Statistics      | Unmatched depth and breadth         | scipy, statsmodels
ML engineering  | tidymodels, caret                   | scikit-learn, TensorFlow (larger ecosystem)
Dashboards      | Shiny (easy, R-native)              | Streamlit, Dash
Reporting       | R Markdown, Quarto                  | Jupyter notebooks

Setting Up Your Data Science Environment

R
# Install the entire tidyverse
install.packages("tidyverse")

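# Additional useful data science packages
# (readxl, lubridate, and scales already come with the tidyverse install;
# listing them again is harmless)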
install.packages(c(
  "readxl",       # Excel files
  "janitor",      # Data cleaning helpers
  "skimr",        # Quick data summaries
  "lubridate",    # Dates and times
  "scales",       # Formatting for ggplot2
  "plotly",       # Interactive plots
  "DT"            # Interactive tables
))

# Load the core tidyverse packages (non-core ones such as readxl
# still need their own library() call)
library(tidyverse)
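
After installation, it can help to confirm that everything is actually available. The following is a minimal sketch using only base R; the variable names pkgs, installed, and missing_pkgs are illustrative.

```r
# Packages named in the setup step above
pkgs <- c("tidyverse", "readxl", "janitor", "skimr",
          "lubridate", "scales", "plotly", "DT")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Any package reported as missing can be installed again with install.packages().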

Hadley Wickham and the Tidy Data Philosophy

Hadley Wickham is the Chief Scientist at Posit and the architect of the tidyverse. His 2014 paper "Tidy Data" established the principles that guide modern R data science:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

When data is in "tidy" format, it becomes dramatically easier to visualize, model, and transform. The entire tidyverse is built around this principle.
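
As a small illustration of these principles, the sketch below reshapes a hypothetical "wide" table, where the variable year is hidden in the column names, into tidy form with tidyr's pivot_longer(). The data values and the column names country, year, and cases are made up for the example.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per year, so "year" is not a column
wide <- tribble(
  ~country, ~`2019`, ~`2020`,
  "A",      10,      12,
  "B",      20,      25
)

# Move the year columns into two variables: year and cases
tidy <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

# Column names are character strings, so convert year to a number
tidy$year <- as.integer(tidy$year)

print(tidy)
```

In the tidy version each row is one country-year observation and each variable (country, year, cases) has its own column, which is the shape that dplyr and ggplot2 expect.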

📚
Prerequisites: This course assumes you have completed the Basics of R course or have equivalent experience with R fundamentals (variables, functions, data structures).