Skip to contents

Introduction

Data Version Control (DVC) is essential for managing data science workflows, particularly when dealing with large files that shouldn’t be stored in Git. thoth provides a seamless integration between R and DVC, offering tidyverse-style functions that make data versioning intuitive for R users.

Prerequisites

Before using the DVC functionality in thoth, ensure you have:

  1. DVC installed (visit DVC Installation Guide)
  2. Git initialized in your project
  3. thoth package installed

You can verify your setup using:

Core DVC Functions

Basic Data Tracking

Track data files and R objects with simple, pipe-friendly functions:

# Track CSV files
processed_data |> 
  write_csv_dvc(
    "data/processed/results.csv",
    message = "Added processed analysis results"
  )

# Track R objects
model |> 
  write_rds_dvc(
    "models/final_model.rds",
    message = "Saved trained model"
  )

Pipeline Management

Create reproducible pipelines by tracking dependencies, outputs, and parameters:

# Data preprocessing stage
raw_data |> 
  write_csv_dvc(
    "data/processed/clean_data.csv",
    stage_name = "preprocess",
    deps = "data/raw/input.csv",
    params = list(
      remove_na = TRUE,
      normalize = TRUE
    )
  )

# Model training stage
model |> 
  write_rds_dvc(
    "models/model.rds",
    stage_name = "train",
    deps = "data/processed/clean_data.csv",
    params = list(
      n_trees = 500,
      learning_rate = 0.01
    )
  )

Metrics and Plots

Track model performance metrics and visualizations:

# Track evaluation metrics
metrics |> 
  write_csv_dvc(
    "metrics/evaluation.csv",
    stage_name = "evaluate",
    deps = c("models/model.rds", "data/test.csv"),
    metrics = TRUE
  )

# Track visualization data
plot_data |> 
  write_csv_dvc(
    "plots/performance.csv",
    stage_name = "visualize",
    deps = "metrics/evaluation.csv",
    plots = TRUE
  )

Best Practices

  1. Organized Data Structure
    • Keep raw data in data/raw/
    • Store processed data in data/processed/
    • Save models in models/
    • Track metrics in metrics/
    • Store plots in plots/
  2. Meaningful Messages
    • Include descriptive commit messages
    • Document parameter choices
    • Note important data transformations
  3. Pipeline Design
    • Create modular pipeline stages
    • Track all dependencies explicitly
    • Version control parameters
    • Include evaluation metrics

Common Operations

Pulling and Pushing Data

# Pull data from remote storage
dvc_pull()

# Push data to remote storage
dvc_push()

Managing Stages

# Add a new stage
dvc_stage(
  name = "feature_engineering",
  deps = "data/processed/clean_data.csv",
  outputs = "data/processed/features.csv",
  params = list(n_features = 10)
)

# Commit changes
dvc_commit()

Next Steps