Beginner

Databricks Lakehouse Platform

Get started with the Databricks platform — workspaces, notebooks, compute clusters, Unity Catalog, and the integrated ML environment.

What is Databricks?

Databricks is a unified data analytics platform built on Apache Spark. The company was founded by the original creators of Spark and went on to develop Delta Lake and MLflow. It provides a managed lakehouse environment on AWS, Azure, and GCP.

Workspace Setup

SQL — Creating Your First Catalog and Schema

-- Unity Catalog: three-level namespace
-- catalog.schema.table

-- Create a catalog for your ML project
CREATE CATALOG IF NOT EXISTS ml_project;

-- Create schemas for different data layers
CREATE SCHEMA IF NOT EXISTS ml_project.bronze;
CREATE SCHEMA IF NOT EXISTS ml_project.silver;
CREATE SCHEMA IF NOT EXISTS ml_project.gold;
CREATE SCHEMA IF NOT EXISTS ml_project.ml_features;
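
The same layout can also be scripted from a Python cell. The following is a minimal sketch, assuming a hypothetical helper named `build_namespace_ddl` (not part of any Databricks API). It only builds the SQL strings; running them through `spark.sql` is left as a comment because `spark` exists only inside a Databricks session.

```python
from typing import List


def build_namespace_ddl(catalog: str, schemas: List[str]) -> List[str]:
    """Return idempotent CREATE statements for a catalog and its schemas.

    Hypothetical helper for illustration; not part of any Databricks API.
    """
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    for schema in schemas:
        # Schemas are addressed as catalog.schema in the three-level namespace
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements


ddl = build_namespace_ddl("ml_project", ["bronze", "silver", "gold", "ml_features"])
for statement in ddl:
    print(statement)
    # In a Databricks notebook, each statement could be executed with:
    # spark.sql(statement)
```

Keeping the schema list in one place makes it easier to add a layer later without editing several statements by hand.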

Notebooks

Python — Databricks Notebook Basics

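# Databricks notebooks support Python, SQL, Scala, and R
# Magic commands (%python, %sql, %scala, %r) on a cell's first line switch languages

# Read data from Unity Catalog
df = spark.read.table("ml_project.bronze.raw_events")
df.display()  # Databricks-specific rich display

# Example transformation so there is something to write back
df_processed = df.dropDuplicates()

# Write results back
df_processed.write \
    .mode("overwrite") \
    .saveAsTable("ml_project.silver.clean_events")

# Use dbutils for Databricks-specific utilities
display(dbutils.fs.ls("/mnt/data/"))  # list files under an example mount path
dbutils.widgets.text("start_date", "2026-01-01")  # create a text widget with a default
start_date = dbutils.widgets.get("start_date")  # widget values are returned as strings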
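
Because widget values come back as strings, it can help to validate them before using them in a filter. The sketch below assumes a hypothetical helper, `parse_start_date`, and uses a literal string as a stand-in for `dbutils.widgets.get("start_date")` so that it runs outside Databricks.

```python
from datetime import date


def parse_start_date(raw: str) -> date:
    """Convert a widget string such as "2026-01-01" into a date.

    Hypothetical helper; re-raises bad input with a clearer message.
    """
    try:
        return date.fromisoformat(raw.strip())
    except ValueError as exc:
        raise ValueError(f"start_date must be YYYY-MM-DD, got {raw!r}") from exc


# Stand-in for dbutils.widgets.get("start_date") outside Databricks
start_date = parse_start_date("2026-01-01")
print(start_date.isoformat())
```

Failing early on a malformed date is usually easier to debug than a query that silently returns no rows.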

Compute Clusters

Compute Type	Use Case	Auto-scaling
All-Purpose	Interactive development, notebooks	Yes
Job Clusters	Scheduled jobs, CI/CD pipelines	Yes
SQL Warehouses	SQL analytics, BI dashboards	Yes
GPU Clusters	Deep learning, LLM training	Yes
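
To make the auto-scaling column more concrete, the sketch below builds a cluster specification as a plain dictionary. The field names (`spark_version`, `node_type_id`, `autoscale`, `min_workers`, `max_workers`) follow the Databricks Clusters API; the helper name, the validation rule, and the example runtime and node type values are assumptions for illustration and should be replaced with values available in your workspace.

```python
from typing import Any, Dict


def autoscaling_cluster_spec(
    spark_version: str, node_type_id: str, min_workers: int, max_workers: int
) -> Dict[str, Any]:
    """Build a cluster spec dict with an autoscale range (hypothetical helper)."""
    # Sanity rule chosen for this sketch, not a documented platform limit
    if min_workers < 1 or max_workers < min_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


# Placeholder values: pick a runtime and node type available in your workspace
spec = autoscaling_cluster_spec("15.4.x-scala2.12", "i3.xlarge", 2, 8)
print(spec["autoscale"])
```

A dictionary like this could be passed as the cluster definition of a job or sent to the Clusters API, subject to the fields your workspace requires.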

Unity Catalog

Unity Catalog is Databricks' unified governance layer for all data and AI assets.

Three-level namespace: catalog.schema.table for clear data organization.
Fine-grained access control: Column-level, row-level, and table-level permissions.
Data lineage: Automatic tracking of how data flows between tables.
ML assets: Registered models and feature tables are governed alongside data.

SQL — Unity Catalog Permissions

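-- Grant access to the data science team
-- Unity Catalog uses USE CATALOG / USE SCHEMA (USAGE is the legacy privilege name)
GRANT USE CATALOG ON CATALOG ml_project TO `data-science-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA ml_project.gold TO `data-science-team`;
GRANT ALL PRIVILEGES ON SCHEMA ml_project.ml_features TO `data-science-team`;

-- Grant model access (assumes the ml_models schema has already been created)
GRANT USE CATALOG ON CATALOG ml_project TO `ml-engineers`;
GRANT USE SCHEMA, EXECUTE ON SCHEMA ml_project.ml_models TO `ml-engineers`;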
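
When several teams need similar access, the statements can be generated rather than typed by hand. This is a minimal sketch built around a hypothetical helper, `render_grants`; it only produces SQL text using Unity Catalog privilege names (`USE CATALOG`, `USE SCHEMA`, `SELECT`) and does not execute anything.

```python
from typing import List, Tuple

# One grant = (privileges, securable type, securable name)
Grant = Tuple[str, str, str]


def render_grants(principal: str, grants: List[Grant]) -> List[str]:
    """Render GRANT statements for one principal (hypothetical helper)."""
    if "`" in principal:
        # Keep the sketch simple: refuse names that would need escaping
        raise ValueError("principal must not contain a backtick")
    quoted = f"`{principal}`"  # backticks allow hyphens in group names
    return [
        f"GRANT {privileges} ON {kind} {name} TO {quoted}"
        for privileges, kind, name in grants
    ]


statements = render_grants(
    "data-science-team",
    [
        ("USE CATALOG", "CATALOG", "ml_project"),
        ("USE SCHEMA, SELECT", "SCHEMA", "ml_project.gold"),
    ],
)
for statement in statements:
    print(statement)
```

Reviewing the printed statements before running them keeps a human check in the loop for permission changes.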

Databricks ML Runtime

The ML Runtime comes pre-installed with popular ML libraries (the exact set and versions vary by runtime release):

PyTorch, TensorFlow, scikit-learn, XGBoost
MLflow (integrated with workspace)
Hugging Face Transformers
RAPIDS for GPU-accelerated ML
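
Since the installed set differs between runtime releases, a quick check from a notebook cell can confirm what is actually available. The sketch below uses only the Python standard library; the package list and the helper name `installed_versions` are assumptions, so adjust them to the libraries you rely on.

```python
from importlib import metadata
from typing import Dict, List, Optional

# Distribution names as published on PyPI; adjust to the libraries you rely on
PACKAGES = ["torch", "tensorflow", "scikit-learn", "xgboost", "mlflow", "transformers"]


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each package to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this at the top of a notebook gives a record of the library versions a result was produced with.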

Use Serverless Compute: Databricks Serverless automatically provisions and scales compute. No cluster management needed. Ideal for ad-hoc data exploration and lightweight ML experiments.
