Beginner

Databricks Lakehouse Platform

Get started with the Databricks platform — workspaces, notebooks, compute clusters, Unity Catalog, and the integrated ML environment.

What is Databricks?

Databricks is a unified data analytics platform built on Apache Spark. Founded by the creators of Spark, Delta Lake, and MLflow, it provides a managed lakehouse environment on AWS, Azure, and GCP.

Workspace Setup

SQL — Creating Your First Catalog and Schema
-- Unity Catalog: three-level namespace
-- catalog.schema.table

-- Create a catalog for your ML project
CREATE CATALOG IF NOT EXISTS ml_project;

-- Create schemas for different data layers
CREATE SCHEMA IF NOT EXISTS ml_project.bronze;
CREATE SCHEMA IF NOT EXISTS ml_project.silver;
CREATE SCHEMA IF NOT EXISTS ml_project.gold;
CREATE SCHEMA IF NOT EXISTS ml_project.ml_features;
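
When the same catalog layout is needed in several environments, the statements above can be generated instead of typed by hand. The following is a minimal sketch and not part of the original setup: `quote_ident` and `bootstrap_statements` are hypothetical helpers, and the backtick quoting rule is an assumption about how Databricks SQL delimits identifiers. In a notebook, each generated string could be passed to `spark.sql`.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def bootstrap_statements(catalog: str, schemas: list) -> list:
    """Return idempotent CREATE statements for a catalog and its schemas."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {quote_ident(catalog)}"]
    for schema in schemas:
        statements.append(
            f"CREATE SCHEMA IF NOT EXISTS {quote_ident(catalog)}.{quote_ident(schema)}"
        )
    return statements


# Same layers as the SQL above
LAYERS = ["bronze", "silver", "gold", "ml_features"]
statements = bootstrap_statements("ml_project", LAYERS)

for stmt in statements:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Because every statement uses `IF NOT EXISTS`, running the generated list more than once should be harmless.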

Notebooks

Python — Databricks Notebook Basics
# Databricks notebooks support Python, SQL, Scala, and R
# Magic commands (%python, %sql, %scala, %r) switch the language of a single cell

# Read data from Unity Catalog
df = spark.read.table("ml_project.bronze.raw_events")
df.display()  # Databricks-specific rich display

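# Transform, then write results back
# (dropDuplicates() is a stand-in for your own cleaning logic)
df_processed = df.dropDuplicates()
df_processed.write \
    .mode("overwrite") \
    .saveAsTable("ml_project.silver.clean_events")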

# Use dbutils for Databricks-specific utilities
display(dbutils.fs.ls("/mnt/data/"))  # list files; display() renders the result as a table
dbutils.widgets.text("start_date", "2026-01-01")
start_date = dbutils.widgets.get("start_date")
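
Widget values come back as strings, so a date parameter usually needs to be parsed and validated before it is used in a filter. The sketch below shows one way to do that in plain Python. `parse_start_date` is a hypothetical helper, and the literal string stands in for the value that `dbutils.widgets.get("start_date")` would return inside a notebook.

```python
from datetime import date
from typing import Optional


def parse_start_date(raw: str, default: Optional[date] = None) -> date:
    """Convert a widget string such as "2026-01-01" into a date object."""
    text = raw.strip()
    if not text:
        # An empty widget falls back to the default, if one was supplied
        if default is None:
            raise ValueError("start_date is empty and no default was given")
        return default
    try:
        return date.fromisoformat(text)
    except ValueError as exc:
        raise ValueError(f"start_date is not a valid ISO date: {raw!r}") from exc


# In a notebook, the raw value would come from dbutils.widgets.get("start_date")
start = parse_start_date("2026-01-01")
print(start.year, start.month, start.day)
```

Failing early with a clear message tends to be easier to debug than a filter that silently compares a date column against a malformed string.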

Compute Clusters

| Cluster Type | Use Case | Auto-scaling |
|---|---|---|
| All-Purpose | Interactive development, notebooks | Yes |
| Job Clusters | Scheduled jobs, CI/CD pipelines | Yes |
| SQL Warehouses | SQL analytics, BI dashboards | Yes |
| GPU Clusters | Deep learning, LLM training | Yes |
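
To make the auto-scaling column concrete, here is a rough sketch of a cluster definition built as a Python dictionary. The field names are intended to follow the Databricks Clusters REST API (`autoscale.min_workers`, `autoscale.max_workers`, `autotermination_minutes`), but confirm them against the API reference before use. The helper `autoscaling_cluster_spec`, the cluster name, and the placeholder runtime and node type values are all assumptions made for illustration.

```python
import json


def autoscaling_cluster_spec(name, min_workers, max_workers, idle_minutes=30):
    """Build a cluster definition with an auto-scaling worker range."""
    # Sanity check for this sketch only; the platform applies its own limits
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder: pick a real runtime
        "node_type_id": "<node-type>",         # placeholder: cloud-specific
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,
    }


spec = autoscaling_cluster_spec("ml-dev", min_workers=2, max_workers=8)
print(json.dumps(spec, indent=2))
```

The resulting JSON is the kind of payload a cluster-creation request or a job definition would carry. Setting an idle timeout alongside the worker range helps keep interactive clusters from running unattended.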

Unity Catalog

Unity Catalog is Databricks' unified governance layer for all data and AI assets.

  • Three-level namespace: catalog.schema.table for clear data organization.
  • Fine-grained access control: Column-level, row-level, and table-level permissions.
  • Data lineage: Automatic tracking of how data flows between tables.
  • ML assets: Models, feature tables, and experiments are governed alongside data.
SQL — Unity Catalog Permissions
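-- Grant access to the data science team
-- (Unity Catalog privileges are USE CATALOG / USE SCHEMA; USAGE is the legacy name)
GRANT USE CATALOG ON CATALOG ml_project TO `data-science-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA ml_project.gold TO `data-science-team`;
GRANT ALL PRIVILEGES ON SCHEMA ml_project.ml_features TO `data-science-team`;

-- Grant model access (assumes an ml_project.ml_models schema already exists)
GRANT USE CATALOG ON CATALOG ml_project TO `ml-engineers`;
GRANT USE SCHEMA ON SCHEMA ml_project.ml_models TO `ml-engineers`;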
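
Grants like these are often kept in version control as data and rendered into SQL. The sketch below illustrates that idea under a few assumptions: `grant_statements` is a hypothetical helper, it uses the Unity Catalog privilege names `USE CATALOG` and `USE SCHEMA`, and it adds `USE SCHEMA` automatically because a principal generally needs it before table-level privileges in that schema take effect.

```python
def grant_statements(catalog, principal, schema_privileges):
    """Build GRANT statements for one principal.

    schema_privileges maps a schema name to the list of privileges to grant on it.
    """
    target = f"`{principal}`"
    statements = [f"GRANT USE CATALOG ON CATALOG {catalog} TO {target}"]
    for schema, privileges in schema_privileges.items():
        wanted = list(privileges)
        # Make sure the principal can reach the schema at all
        if "ALL PRIVILEGES" not in wanted and "USE SCHEMA" not in wanted:
            wanted.insert(0, "USE SCHEMA")
        statements.append(
            f"GRANT {', '.join(wanted)} ON SCHEMA {catalog}.{schema} TO {target}"
        )
    return statements


grants = grant_statements(
    "ml_project",
    "data-science-team",
    {"gold": ["SELECT"], "ml_features": ["ALL PRIVILEGES"]},
)
for stmt in grants:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Keeping the mapping as plain data makes it straightforward to review permission changes in a pull request before they are applied.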

Databricks ML Runtime

The ML Runtime comes pre-installed with popular ML libraries (the exact set and versions depend on the runtime release):

  • PyTorch, TensorFlow, scikit-learn, XGBoost
  • MLflow (integrated with workspace)
  • Hugging Face Transformers
  • RAPIDS for GPU-accelerated ML
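
Since the bundled libraries vary by runtime release, it can be useful to check what is actually installed on the attached compute. The snippet below is a small sketch using only the Python standard library. `installed_versions` is a hypothetical helper, and the package list uses PyPI distribution names chosen to mirror the bullets above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI; adjust the list to your needs
ML_PACKAGES = ["torch", "tensorflow", "scikit-learn", "xgboost", "mlflow", "transformers"]

report = installed_versions(ML_PACKAGES)
for name, version in report.items():
    print(f"{name:15s} {version or 'not installed'}")
```

Running this in the first cell of a notebook gives a quick record of the environment a result was produced in.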

Use Serverless Compute: Databricks serverless compute provisions and scales resources automatically, so there are no clusters to configure or manage. It is well suited to ad-hoc data exploration and lightweight ML experiments.