Beginner
Databricks Lakehouse Platform
Get started with the Databricks platform — workspaces, notebooks, compute clusters, Unity Catalog, and the integrated ML environment.
What is Databricks?
Databricks is a unified data and AI platform built around Apache Spark. The company was founded by the original creators of Spark and went on to develop Delta Lake and MLflow. It provides a managed lakehouse environment on AWS, Azure, and GCP.
Workspace Setup
SQL — Creating Your First Catalog and Schema
-- Unity Catalog: three-level namespace
-- catalog.schema.table
-- Create a catalog for your ML project
CREATE CATALOG IF NOT EXISTS ml_project;
-- Create schemas for different data layers
CREATE SCHEMA IF NOT EXISTS ml_project.bronze;
CREATE SCHEMA IF NOT EXISTS ml_project.silver;
CREATE SCHEMA IF NOT EXISTS ml_project.gold;
CREATE SCHEMA IF NOT EXISTS ml_project.ml_features;
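When the same layout has to be created in more than one environment, the statements above can be generated instead of typed by hand. The following is a minimal sketch, assuming a hypothetical helper named `layer_ddl` that is not part of any Databricks API. In a notebook each string could be passed to `spark.sql(...)`; that call is left as a comment so the snippet runs anywhere.

```python
from typing import List, Sequence

# Schema names taken from the SQL above.
LAYERS = ("bronze", "silver", "gold", "ml_features")


def layer_ddl(catalog: str, layers: Sequence[str] = LAYERS) -> List[str]:
    """Return idempotent DDL statements for a catalog and its schemas.

    Hypothetical helper for illustration only.
    """
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    statements += [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer}" for layer in layers
    ]
    return statements


ddl = layer_ddl("ml_project")
for statement in ddl:
    print(statement)
    # In a Databricks notebook: spark.sql(statement)
```

Because every statement uses `IF NOT EXISTS`, running the list a second time should leave existing objects untouched.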
Notebooks
Python — Databricks Notebook Basics
# Databricks notebooks support Python, SQL, Scala, and R
# Magic commands (%python, %sql, %scala, %r) on a cell's first line switch its language
# Read data from Unity Catalog
df = spark.read.table("ml_project.bronze.raw_events")
df.display() # Databricks-specific rich display
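# Write results back. df_processed stands in for your transformed DataFrame;
# dropping duplicate rows is only an illustrative transformation.
df_processed = df.dropDuplicates()
df_processed.write \
    .mode("overwrite") \
    .saveAsTable("ml_project.silver.clean_events")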
# Use dbutils for Databricks-specific utilities
dbutils.fs.ls("/mnt/data/")
dbutils.widgets.text("start_date", "2026-01-01")
start_date = dbutils.widgets.get("start_date")
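Widget values come back as strings, so a date widget usually needs parsing before it can drive a filter. Here is a minimal sketch, assuming a hypothetical helper `parse_widget_date` and the `YYYY-MM-DD` format used above. The literal stands in for `dbutils.widgets.get("start_date")` so the snippet also runs outside a notebook.

```python
from datetime import date, datetime


def parse_widget_date(raw: str) -> date:
    """Parse a YYYY-MM-DD widget value; raises ValueError on malformed input."""
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date()


# Stand-in for: raw_value = dbutils.widgets.get("start_date")
raw_value = "2026-01-01"
start = parse_widget_date(raw_value)
print(start.isoformat())
```

Failing early on a malformed value tends to be easier to debug than a query that silently returns no rows.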
Compute Clusters
| Cluster Type | Use Case | Auto-scaling |
|---|---|---|
| All-Purpose | Interactive development, notebooks | Yes |
| Job Clusters | Scheduled jobs, CI/CD pipelines | Yes |
| SQL Warehouses | SQL analytics, BI dashboards | Yes |
| GPU Clusters | Deep learning, LLM training | Yes |
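Auto-scaling is expressed as a worker range rather than a fixed size. The sketch below builds a cluster specification as a plain dictionary, using field names from the Databricks Clusters REST API (`autoscale.min_workers`, `autoscale.max_workers`, `autotermination_minutes`). The helper `autoscaling_cluster_spec`, the cluster name, the runtime version, and the node type are placeholder assumptions; valid values depend on your cloud and workspace.

```python
import json


def autoscaling_cluster_spec(name: str, min_workers: int, max_workers: int) -> dict:
    """Build a cluster spec dictionary with an autoscaling worker range.

    Hypothetical helper; runtime version and node type below are placeholders.
    """
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder (AWS-style) node type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 30,         # shut down after 30 idle minutes
    }


spec = autoscaling_cluster_spec("ml-dev", min_workers=2, max_workers=8)
print(json.dumps(spec, indent=2))
```

A dictionary like this could be sent as the JSON body of a cluster-creation request or adapted into a job definition; neither step is shown here.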
Unity Catalog
Unity Catalog is Databricks' unified governance layer for all data and AI assets.
- Three-level namespace: `catalog.schema.table` for clear data organization.
- Fine-grained access control: Column-level, row-level, and table-level permissions.
- Data lineage: Automatic tracking of how data flows between tables.
- ML assets: Models, feature tables, and experiments are governed alongside data.
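The three-level namespace can be made concrete with a small parser that splits a fully qualified name into its parts. This is a minimal sketch, assuming hypothetical names `TableName` and `parse_table_name`, and assuming plain identifiers (backtick-quoted names that contain dots are not handled).

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three parts of a Unity Catalog table name."""

    catalog: str
    schema: str
    table: str

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table' into parts; reject anything else."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


events = parse_table_name("ml_project.bronze.raw_events")
print(events.catalog, events.schema, events.table)
```

Rejecting two-part names up front avoids depending on whichever catalog happens to be the session default.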
SQL — Unity Catalog Permissions
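-- Grant access to data science team
-- Unity Catalog uses USE CATALOG and USE SCHEMA; USAGE is the legacy Hive metastore privilege
GRANT USE CATALOG ON CATALOG ml_project TO `data-science-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA ml_project.gold TO `data-science-team`;
GRANT ALL PRIVILEGES ON SCHEMA ml_project.ml_features TO `data-science-team`;
-- Grant model access (assumes the ml_project.ml_models schema already exists;
-- the group also needs USE CATALOG on ml_project)
GRANT USE SCHEMA ON SCHEMA ml_project.ml_models TO `ml-engineers`;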
Databricks ML Runtime
The ML Runtime comes pre-installed with popular ML libraries:
- PyTorch, TensorFlow, scikit-learn, XGBoost
- MLflow (integrated with workspace)
- Hugging Face Transformers
- RAPIDS for GPU-accelerated ML
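To see which of these libraries a given environment actually provides, the import system can be queried without loading the packages themselves. This is a minimal sketch, assuming the usual import names (`sklearn` for scikit-learn, `cudf` as one representative RAPIDS package); `available_libraries` is a hypothetical helper, and the result depends entirely on where it runs.

```python
from importlib.util import find_spec
from typing import Dict, Mapping

# Display name -> top-level import name (assumed, see lead-in).
ML_MODULES = {
    "PyTorch": "torch",
    "TensorFlow": "tensorflow",
    "scikit-learn": "sklearn",
    "XGBoost": "xgboost",
    "MLflow": "mlflow",
    "Hugging Face Transformers": "transformers",
    "RAPIDS (cuDF)": "cudf",
}


def available_libraries(modules: Mapping[str, str] = ML_MODULES) -> Dict[str, bool]:
    """Report which modules can be found, without importing them."""
    status = {}
    for label, module in modules.items():
        try:
            status[label] = find_spec(module) is not None
        except (ImportError, ValueError):
            # Treat lookup errors as "not available".
            status[label] = False
    return status


for label, present in available_libraries().items():
    print(f"{label}: {'installed' if present else 'missing'}")
```

On a standard (non-ML) runtime or a local machine, several entries may be reported as missing.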
Use Serverless Compute: Databricks Serverless automatically provisions and scales compute. No cluster management needed. Ideal for ad-hoc data exploration and lightweight ML experiments.
Lilly Tech Systems