Beginner
Setup & Configuration
Install DVC, initialize it in your Git repository, and configure remote storage for your data and models.
Installation
Bash — Install DVC
# Install with pip (core)
pip install dvc
# Install with cloud storage support
pip install "dvc[s3]" # Amazon S3
pip install "dvc[gs]" # Google Cloud Storage
pip install "dvc[azure]" # Azure Blob Storage
pip install "dvc[ssh]" # SSH/SFTP
pip install "dvc[all]" # All backends
# Or install via conda
conda install -c conda-forge dvc
# Verify
dvc version
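Before moving on, it can help to confirm that both Git and DVC are reachable from your shell. The following is a minimal sketch; `require_cmd` is a hypothetical helper name introduced here for illustration, not a DVC command.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper (not part of DVC):
# it succeeds only if the named program is found on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the two required tools are available.
for tool in git dvc; do
    if require_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing (install it before continuing)"
    fi
done
```

If `dvc` is reported as missing right after a pip install, the usual suspect is that pip's script directory is not on your PATH.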
Initialize DVC
Bash — Initialize in a Git repo
# Start with a Git repository
git init my-ml-project
cd my-ml-project
# Initialize DVC
dvc init
# This creates:
# .dvc/ - DVC internal directory
# .dvc/config - DVC configuration
# .dvcignore - Files to ignore (like .gitignore)
# Commit the DVC initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"
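Running dvc init a second time in the same repository stops with an error unless you force it, so setup scripts often guard the call. One possible sketch, assuming the presence of a .dvc/ directory is a good enough signal; `init_dvc_once` is a hypothetical wrapper name:

```shell
#!/bin/sh
# init_dvc_once is a hypothetical wrapper: it runs `dvc init` only when
# the target directory has no .dvc/ directory yet.
init_dvc_once() {
    repo_dir="$1"
    if [ -d "$repo_dir/.dvc" ]; then
        echo "DVC already initialized in $repo_dir"
    else
        # The subshell keeps the caller's working directory unchanged.
        (cd "$repo_dir" && dvc init)
    fi
}
```

Called as `init_dvc_once my-ml-project`, this is safe to rerun from a bootstrap script.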
Configure Remote Storage
Bash — Configure S3 remote
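# Choose ONE backend below. Each command registers the same name (myremote),
# so running a second one fails unless you pass -f to overwrite.
# Amazon S3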
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc remote modify myremote region us-east-1
# Google Cloud Storage
dvc remote add -d myremote gs://my-bucket/dvc-store
# Azure Blob Storage
dvc remote add -d myremote azure://my-container/dvc-store
dvc remote modify myremote account_name myaccount
# Local directory (for testing)
dvc remote add -d myremote /tmp/dvc-storage
# SSH
dvc remote add -d myremote ssh://user@server/path/to/storage
# Commit the remote configuration
git add .dvc/config
git commit -m "Configure DVC remote storage"
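The commands above write their settings to .dvc/config, with the default remote stored as a `remote` key in the `[core]` section. As a rough sketch of what that looks like on disk, the hypothetical helper `default_remote` below reads the value back with awk. In practice, `dvc remote default` prints the same information; this version only shows the file layout and assumes the simple `key = value` form that DVC writes.

```shell
#!/bin/sh
# default_remote is a hypothetical helper: it prints the value of the
# "remote" key inside the [core] section of a DVC config file.
default_remote() {
    awk '
        # Track whether the current section header is exactly [core].
        /^\[/ { in_core = ($0 == "[core]"); next }
        # Inside [core], a line like "remote = myremote" names the default.
        in_core && $1 == "remote" && $2 == "=" { print $3; exit }
    ' "$1"
}

# Print the default remote if this directory has a DVC config.
if [ -f .dvc/config ]; then
    default_remote .dvc/config
fi
```

If the helper prints nothing, no default remote is set, and dvc push will need an explicit remote name.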
Project Structure
Recommended — ML project with DVC
my-ml-project/
├── .git/ # Git repository
├── .dvc/ # DVC configuration
│ └── config # Remote storage config
├── data/
│ ├── raw/ # Raw data (tracked by DVC)
│ │ └── dataset.csv.dvc # DVC pointer file (tracked by Git)
│ └── processed/ # Processed data (tracked by DVC)
├── models/ # Trained models (tracked by DVC)
├── src/
│ ├── preprocess.py # Data preprocessing (Git)
│ ├── train.py # Model training (Git)
│ └── evaluate.py # Evaluation (Git)
├── params.yaml # Parameters (Git)
├── dvc.yaml # Pipeline definition (Git)
├── dvc.lock # Pipeline state (Git)
├── metrics/
│ └── scores.json # Metrics output (Git)
└── requirements.txt # Dependencies (Git)
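The layout above can be created in one step. This is a sketch only: `scaffold_project` is a hypothetical helper, and it deliberately leaves out dvc.yaml and dvc.lock, on the assumption that those appear later when you define and run a pipeline.

```shell
#!/bin/sh
# scaffold_project is a hypothetical helper that creates the directory
# skeleton from the tree above. It does not run git or dvc, and it does
# not create dvc.yaml or dvc.lock.
scaffold_project() {
    root="$1"
    mkdir -p "$root/data/raw" "$root/data/processed" \
             "$root/models" "$root/src" "$root/metrics"
    # Create empty placeholders only if the files are not there yet.
    for f in params.yaml requirements.txt; do
        [ -e "$root/$f" ] || : > "$root/$f"
    done
}
```

Running `scaffold_project my-ml-project` before `git init` and `dvc init` gives you the directories without touching any existing files.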
Use the -d flag: When you add a remote with dvc remote add -d, the -d flag sets it as the default remote, so dvc push and dvc pull work without naming the remote (otherwise you would pass -r myremote).
Credentials: DVC reuses your existing cloud credentials. For S3, it reads the standard AWS configuration (for example ~/.aws/credentials). For GCS, it uses your gcloud authentication. If those are already set up, no DVC-specific credential configuration is needed.
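Because DVC relies on credentials that already exist, a failed first dvc push is often a credentials problem rather than a DVC one. The sketch below is a hypothetical pre-flight check for S3: `aws_credentials_visible` looks at only two common sources, the environment variables and the shared credentials file. AWS tooling can find credentials in other places too, so a negative result here is a hint, not proof.

```shell
#!/bin/sh
# aws_credentials_visible is a hypothetical pre-flight check. It inspects
# only two common sources (environment variables and the shared
# credentials file), not every source AWS tooling can use.
aws_credentials_visible() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        return 0
    fi
    [ -f "${HOME}/.aws/credentials" ]
}

if aws_credentials_visible; then
    echo "AWS credentials found"
else
    echo "No AWS credentials found in the environment or ~/.aws/credentials"
fi
```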
Lilly Tech Systems