Beginner

Setup & Configuration

Install DVC, initialize it in your Git repository, and configure remote storage for your data and models.

Installation

Bash — Install DVC
# Install with pip (core)
pip install dvc

# Install with cloud storage support
pip install "dvc[s3]"      # Amazon S3
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[azure]"   # Azure Blob Storage
pip install "dvc[ssh]"     # SSH/SFTP
pip install "dvc[all]"     # All backends

# Or install via conda
conda install -c conda-forge dvc

# Verify
dvc version
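
Before moving on, a quick pre-flight check can confirm that the tools this guide relies on are on your PATH. The following is a minimal sketch; the helper name require_tool is hypothetical and not part of DVC.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the tools used in this guide; keep going even if one is missing.
for tool in git dvc; do
    require_tool "$tool" || true
done
```

If dvc is reported as missing right after installation, the usual cause is that pip installed it into a different Python environment than the one on your PATH.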

Initialize DVC

Bash — Initialize in a Git repo
# Start with a Git repository
git init my-ml-project
cd my-ml-project

# Initialize DVC
dvc init

# This creates:
# .dvc/            - DVC internal directory
# .dvc/config      - DVC configuration
# .dvc/.gitignore  - Keeps DVC's cache and temp files out of Git
# .dvcignore       - Files to ignore (like .gitignore)

# Commit the DVC initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"

Configure Remote Storage

Bash — Configure a remote
# Choose ONE of the options below; each registers a default remote named "myremote".

# Amazon S3
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc remote modify myremote region us-east-1

# Google Cloud Storage
dvc remote add -d myremote gs://my-bucket/dvc-store

# Azure Blob Storage
dvc remote add -d myremote azure://my-container/dvc-store
dvc remote modify myremote account_name myaccount

# Local directory (for testing)
dvc remote add -d myremote /tmp/dvc-storage

# SSH
dvc remote add -d myremote ssh://user@server/path/to/storage

# Commit the remote configuration
git add .dvc/config
git commit -m "Configure DVC remote storage"
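
To see what the remote commands recorded, you can read .dvc/config directly. The sketch below assumes the file layout that results from the S3 example above and uses a temporary sample file so it runs anywhere; the helper name default_remote is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the default remote name recorded in a DVC config file.
default_remote() {
    awk '
        /^\[core\]/ { in_core = 1; next }   # entering the [core] section
        /^\[/       { in_core = 0 }         # any other section header ends it
        in_core && $1 == "remote" { print $3 }
    ' "$1"
}

# Sample file for illustration only; in a real repository pass .dvc/config instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-bucket/dvc-store
    region = us-east-1
EOF

default_remote "$sample"
```

Inside a real repository, dvc remote default prints the same name and dvc remote list shows every configured remote with its URL, so the helper is only useful where DVC itself is not available.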

Project Structure

Recommended — ML project with DVC
my-ml-project/
├── .git/                    # Git repository
├── .dvc/                    # DVC configuration
│   └── config               # Remote storage config
├── data/
│   ├── raw/                 # Raw data (tracked by DVC)
│   │   └── dataset.csv.dvc  # DVC pointer file (Git)
│   └── processed/           # Processed data (tracked by DVC)
├── models/                  # Trained models (tracked by DVC)
├── src/
│   ├── preprocess.py        # Data preprocessing (Git)
│   ├── train.py             # Model training (Git)
│   └── evaluate.py          # Evaluation (Git)
├── params.yaml              # Parameters (Git)
├── dvc.yaml                 # Pipeline definition (Git)
├── dvc.lock                 # Pipeline state (Git)
├── metrics/
│   └── scores.json          # Metrics output (Git)
└── requirements.txt         # Dependencies (Git)
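
The layout above can be scaffolded with a few commands. This is a sketch rather than a required procedure: PROJECT_ROOT is a hypothetical variable, and the default target is a scratch directory so the script is safe to run as-is.

```shell
#!/bin/sh
# Create the directory skeleton from the tree above.
# PROJECT_ROOT is a hypothetical override; by default a scratch directory is used.
root="${PROJECT_ROOT:-$(mktemp -d)/my-ml-project}"

mkdir -p "$root/data/raw" "$root/data/processed" \
         "$root/models" "$root/src" "$root/metrics"

# Empty placeholders for the Git-tracked source and config files.
touch "$root/src/preprocess.py" "$root/src/train.py" "$root/src/evaluate.py"
touch "$root/params.yaml" "$root/requirements.txt"

# dvc.lock is generated by DVC when a pipeline runs, so it is not created here.
echo "created skeleton in $root"
```

Run git init and dvc init inside the resulting directory to add the .git/ and .dvc/ entries shown in the tree.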

Use the -d flag: Passing -d (--default) to dvc remote add makes the new remote the default, so you can run dvc push and dvc pull without naming a remote with -r.
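
The difference a default remote makes can be sketched with a small wrapper. The function name push_data is hypothetical; it only chooses between the two forms of dvc push and does nothing else.

```shell
#!/bin/sh
# Hypothetical wrapper: push to a named remote, or to the default when no name is given.
push_data() {
    if [ -n "${1:-}" ]; then
        dvc push -r "$1"   # explicit remote, works even without a default
    else
        dvc push           # relies on the default set with `dvc remote add -d`
    fi
}
```

Calling push_data with no argument behaves like plain dvc push, while push_data myremote targets that remote explicitly. To change the default later, dvc remote default followed by a remote name updates it.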

Credentials: DVC picks up your existing cloud credentials. For S3, it reads the AWS CLI configuration (~/.aws/credentials); for GCS, it uses your gcloud authentication. If those are already set up, no DVC-specific credential configuration is needed.
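
A quick way to see whether such credentials are present is to look in the usual places. The sketch below is an assumption-laden check, not something DVC provides: the helper name check_cloud_credentials is hypothetical, the paths are the common defaults on Linux and macOS, and other sources such as instance roles or SSO sessions are not detected.

```shell
#!/bin/sh
# Hypothetical helper: report which common credential sources are present.
check_cloud_credentials() {
    # AWS: environment variable or the shared credentials file.
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] || [ -f "$HOME/.aws/credentials" ]; then
        echo "aws: found"
    else
        echo "aws: not found"
    fi
    # GCS: service-account key path or gcloud application default credentials.
    if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] || \
       [ -f "$HOME/.config/gcloud/application_default_credentials.json" ]; then
        echo "gcs: found"
    else
        echo "gcs: not found"
    fi
}

check_cloud_credentials
```

A "not found" result does not necessarily mean dvc push will fail; it only means none of the locations checked here holds credentials.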