Data Version Control with thoth

Introduction

Data Version Control (DVC) helps manage data science workflows, particularly those involving large files that shouldn’t be stored in Git. thoth integrates DVC with R through tidyverse-style, pipe-friendly functions, so data and models can be versioned from within an ordinary R workflow.

Prerequisites

Before using the DVC functionality in thoth, ensure you have:

- DVC installed (see the DVC Installation Guide)
- Git initialized in your project
- The thoth package installed

You can verify your setup using:

library(thoth)
check_system_requirements()
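
If you want to check the command-line tools independently of thoth, the following base R sketch looks for `dvc` and `git` on the PATH. The helper name `check_cli_tools()` is hypothetical and is not part of thoth; it is not meant to describe what `check_system_requirements()` does internally.

```r
# Hypothetical helper (not part of thoth): report which CLI tools are on the PATH.
check_cli_tools <- function(tools = c("dvc", "git")) {
  # Sys.which() returns "" for tools it cannot find
  found <- nzchar(Sys.which(tools))
  names(found) <- tools
  found
}

status <- check_cli_tools()
missing_tools <- names(status)[!status]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```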

Core DVC Functions

Basic Data Tracking

Track data files and R objects with simple, pipe-friendly functions:

# Track CSV files
processed_data |> 
  write_csv_dvc(
    "data/processed/results.csv",
    message = "Added processed analysis results"
  )

# Track R objects
model |> 
  write_rds_dvc(
    "models/final_model.rds",
    message = "Saved trained model"
  )
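
For orientation, here is a rough manual counterpart in base R: write the file, then hand it to `dvc add`. This is only a sketch under assumptions. The helper `write_and_track_csv()` is hypothetical, and thoth's own functions may do more (for example, handle the `message` argument) or work differently. By default the helper only builds the DVC arguments, so it can be tried outside a DVC project.

```r
# Hypothetical helper (not thoth's implementation): write a CSV and
# prepare the matching `dvc add` call.
write_and_track_csv <- function(data, path, run = FALSE) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  write.csv(data, path, row.names = FALSE)

  args <- c("add", path)
  if (run) {
    # Only attempted on request; requires DVC and an initialized DVC project
    if (!nzchar(Sys.which("dvc"))) stop("dvc not found on PATH")
    status <- system2("dvc", shQuote(args))
    if (status != 0) stop("`dvc add` failed with exit status ", status)
  }
  invisible(args)
}

# Dry run in a temporary directory using a built-in dataset
demo_path <- file.path(tempdir(), "data", "processed", "results.csv")
dvc_args <- write_and_track_csv(head(mtcars), demo_path)
```

Setting `run = TRUE` inside a DVC project would execute `dvc add` on the written file.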

Pipeline Management

Create reproducible pipelines by tracking dependencies, outputs, and parameters:

# Data preprocessing stage
raw_data |> 
  write_csv_dvc(
    "data/processed/clean_data.csv",
    stage_name = "preprocess",
    deps = "data/raw/input.csv",
    params = list(
      remove_na = TRUE,
      normalize = TRUE
    )
  )

# Model training stage
model |> 
  write_rds_dvc(
    "models/model.rds",
    stage_name = "train",
    deps = "data/processed/clean_data.csv",
    params = list(
      n_trees = 500,
      learning_rate = 0.01
    )
  )
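
How thoth records the `params` list is not shown here. For reference, DVC itself reads stage parameters from a `params.yaml` file by default. The sketch below renders a flat parameter list as YAML lines using base R only; the helpers `yaml_scalar()` and `params_yaml_lines()` are hypothetical, and nesting the values under the stage name is an assumption made for illustration.

```r
# Hypothetical helpers (not part of thoth): render a flat list of
# parameters as YAML lines nested under a stage name.
yaml_scalar <- function(x) {
  if (is.logical(x)) return(tolower(as.character(x)))              # TRUE -> true
  if (is.numeric(x)) return(format(x, scientific = FALSE, trim = TRUE))
  paste0('"', x, '"')                                              # quote strings
}

params_yaml_lines <- function(stage, params) {
  values <- vapply(params, yaml_scalar, character(1))
  c(paste0(stage, ":"), paste0("  ", names(params), ": ", values))
}

lines <- params_yaml_lines("train", list(n_trees = 500, learning_rate = 0.01))

# Write to a temporary file rather than the project root
params_file <- file.path(tempdir(), "params.yaml")
writeLines(lines, params_file)
```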

Metrics and Plots

Track model performance metrics and visualizations:

# Track evaluation metrics
metrics |> 
  write_csv_dvc(
    "metrics/evaluation.csv",
    stage_name = "evaluate",
    deps = c("models/model.rds", "data/test.csv"),
    metrics = TRUE
  )

# Track visualization data
plot_data |> 
  write_csv_dvc(
    "plots/performance.csv",
    stage_name = "visualize",
    deps = "metrics/evaluation.csv",
    plots = TRUE
  )
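
The `metrics` object above is assumed to be a data frame. As a minimal sketch of how such a table might be assembled, the following computes two common regression metrics in base R. The `actual` and `predicted` vectors are made-up toy values, and the column names `metric` and `value` are an illustrative choice, not something thoth requires.

```r
# Toy values for illustration only
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 9)

errors <- actual - predicted

# One row per metric keeps the CSV easy to diff between versions
metrics <- data.frame(
  metric = c("rmse", "mae"),
  value  = c(sqrt(mean(errors^2)), mean(abs(errors)))
)
```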

Best Practices

Organized Data Structure

- Keep raw data in data/raw/
- Store processed data in data/processed/
- Save models in models/
- Track metrics in metrics/
- Store plots in plots/

Meaningful Messages

- Include descriptive commit messages
- Document parameter choices
- Note important data transformations

Pipeline Design

- Create modular pipeline stages
- Track all dependencies explicitly
- Version control parameters
- Include evaluation metrics
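
The directory layout recommended above can be created in one step. This is a small base R sketch; the helper name `create_project_dirs()` is hypothetical, and the example runs against a temporary directory so it does not touch your project.

```r
# Directory layout from the recommendations above
project_dirs <- c("data/raw", "data/processed", "models", "metrics", "plots")

# Hypothetical helper: create the layout under a given project root
create_project_dirs <- function(root = ".") {
  paths <- file.path(root, project_dirs)
  for (p in paths) {
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Try it in a throwaway location
demo_root <- tempfile("demo_project_")
created <- create_project_dirs(demo_root)
```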

Common Operations

Pulling and Pushing Data

# Pull data from remote storage
dvc_pull()

# Push data to remote storage
dvc_push()
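
Pulling and pushing presumably require a DVC remote to be configured. As a hedged sketch, the base R helper below lists the remotes declared in the project-level `.dvc/config` file by matching section headers of the form `['remote "name"']`. The helper `dvc_remote_names()` is hypothetical, and it does not look at local or global DVC configuration, which may also define remotes.

```r
# Hypothetical helper (not part of thoth): list remotes named in .dvc/config
dvc_remote_names <- function(config_path = file.path(".dvc", "config")) {
  if (!file.exists(config_path)) return(character(0))
  lines <- readLines(config_path, warn = FALSE)
  # Section headers look like: ['remote "myremote"']
  pattern <- "^\\s*\\['remote \"([^\"]+)\"'\\]\\s*$"
  hits <- grepl(pattern, lines)
  sub(pattern, "\\1", lines[hits])
}

remotes <- dvc_remote_names()
if (length(remotes) == 0) {
  message("No remote found in .dvc/config; configure one before pushing.")
}
```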

Managing Stages

# Add a new stage
dvc_stage(
  name = "feature_engineering",
  deps = "data/processed/clean_data.csv",
  outputs = "data/processed/features.csv",
  params = list(n_features = 10)
)

# Commit changes
dvc_commit()
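
For comparison, a stage like the one above could be declared with the DVC command line via `dvc stage add`, which takes `-n` for the name, `-d` for dependencies, `-o` for outputs, and `-p` for parameter names. The sketch below only builds the argument vector. The helper names and the script path `scripts/feature_engineering.R` are hypothetical, and whether `dvc_stage()` maps onto this command in the same way is an assumption.

```r
# Repeat a flag before each value, e.g. c("-d", "a", "-d", "b")
flag_each <- function(flag, values) {
  if (length(values) == 0) return(character())
  as.vector(rbind(flag, values))
}

# Hypothetical helper (not part of thoth): arguments for `dvc stage add`
dvc_stage_add_args <- function(name, cmd, deps = character(),
                               outs = character(), params = character()) {
  c(
    "stage", "add",
    "-n", name,
    flag_each("-d", deps),
    flag_each("-o", outs),
    flag_each("-p", params),
    cmd  # the command DVC runs to reproduce the stage
  )
}

stage_args <- dvc_stage_add_args(
  name   = "feature_engineering",
  cmd    = "Rscript scripts/feature_engineering.R",  # hypothetical script
  deps   = "data/processed/clean_data.csv",
  outs   = "data/processed/features.csv",
  params = "n_features"
)
```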

Next Steps

- Explore the DVC documentation for advanced features
- Check out the end-to-end example: vignette("end-to-end-example")
- Learn about Git integration: vignette("git-integration")