Introduction
Data Version Control (DVC) helps manage data science workflows,
particularly those involving large files that shouldn’t be stored in Git.
The thoth package integrates R with DVC through tidyverse-style,
pipe-friendly functions, so R users can version data from within their
usual workflow.
Prerequisites
Before using the DVC functionality in thoth, ensure you have:
- DVC installed (see the DVC Installation Guide)
- Git initialized in your project
- The thoth package installed
You can verify your setup using:
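The package's own verification call is not reproduced here. As a stand-in, the prerequisites listed above can be checked with a minimal base-R sketch. The helper name `check_dvc_setup()` is hypothetical and is not part of thoth; it uses only `Sys.which()`, `file.exists()`, and `requireNamespace()`.

```r
# Hypothetical helper (not part of thoth): report which prerequisites are met.
check_dvc_setup <- function(project_dir = ".") {
  c(
    # TRUE if a `dvc` executable is found on the PATH
    dvc_on_path = nzchar(Sys.which("dvc")[[1]]),
    # TRUE if a `git` executable is found on the PATH
    git_on_path = nzchar(Sys.which("git")[[1]]),
    # TRUE if the project directory contains a .git entry
    git_repo = file.exists(file.path(project_dir, ".git")),
    # TRUE if the thoth package can be loaded
    thoth_installed = requireNamespace("thoth", quietly = TRUE)
  )
}

status <- check_dvc_setup()
print(status)
```

Any `FALSE` entry in the printed vector points to a prerequisite that still needs attention.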
Core DVC Functions
Basic Data Tracking
Track data files and R objects with simple, pipe-friendly functions:
# Track CSV files
processed_data |>
  write_csv_dvc(
    "data/processed/results.csv",
    message = "Added processed analysis results"
  )

# Track R objects
model |>
  write_rds_dvc(
    "models/final_model.rds",
    message = "Saved trained model"
  )
Pipeline Management
Create reproducible pipelines by tracking dependencies, outputs, and parameters:
# Data preprocessing stage
raw_data |>
  write_csv_dvc(
    "data/processed/clean_data.csv",
    stage_name = "preprocess",
    deps = "data/raw/input.csv",
    params = list(
      remove_na = TRUE,
      normalize = TRUE
    )
  )

# Model training stage
model |>
  write_rds_dvc(
    "models/model.rds",
    stage_name = "train",
    deps = "data/processed/clean_data.csv",
    params = list(
      n_trees = 500,
      learning_rate = 0.01
    )
  )
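For readers who know the DVC command line, it may help to see roughly what a stage definition like `preprocess` corresponds to in plain DVC. How thoth records stages internally is not shown in this vignette, so the sketch below is only an assumption-laden illustration. It builds the arguments for DVC's `dvc stage add` command (`-n` name, `-d` dependency, `-o` output, `-p` parameter keys) without running it. The helper `dvc_stage_args()` and the script path `R/preprocess.R` are hypothetical.

```r
# Hypothetical helper (not part of thoth): build arguments for `dvc stage add`.
dvc_stage_args <- function(name, deps, outs, params = character(), cmd) {
  # Repeat a flag once per value, e.g. c("-d", "a", "-d", "b")
  flag_each <- function(flag, values) {
    unlist(lapply(values, function(v) c(flag, v)))
  }
  c(
    "stage", "add",
    "-n", name,
    flag_each("-d", deps),
    flag_each("-o", outs),
    # DVC accepts a comma-separated list of parameter keys after -p
    if (length(params) > 0) c("-p", paste(params, collapse = ",")),
    cmd
  )
}

args <- dvc_stage_args(
  name   = "preprocess",
  deps   = "data/raw/input.csv",
  outs   = "data/processed/clean_data.csv",
  params = c("remove_na", "normalize"),
  cmd    = c("Rscript", "R/preprocess.R")  # assumed script location
)

# Show the command line instead of executing it
cat("dvc", paste(args, collapse = " "), "\n")
```

The vector could be passed to `system2("dvc", args)` in a project where DVC is initialized and the parameter keys exist in `params.yaml`; printing it first makes it easy to inspect before anything is written to `dvc.yaml`.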
Metrics and Plots
Track model performance metrics and visualizations:
# Track evaluation metrics
metrics |>
  write_csv_dvc(
    "metrics/evaluation.csv",
    stage_name = "evaluate",
    deps = c("models/model.rds", "data/test.csv"),
    metrics = TRUE
  )

# Track visualization data
plot_data |>
  write_csv_dvc(
    "plots/performance.csv",
    stage_name = "visualize",
    deps = "metrics/evaluation.csv",
    plots = TRUE
  )
Best Practices
- Organized Data Structure
  - Keep raw data in data/raw/
  - Store processed data in data/processed/
  - Save models in models/
  - Track metrics in metrics/
  - Store plots in plots/
- Meaningful Messages
  - Include descriptive commit messages
  - Document parameter choices
  - Note important data transformations
- Pipeline Design
  - Create modular pipeline stages
  - Track all dependencies explicitly
  - Version control parameters
  - Include evaluation metrics
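The directory layout recommended above can be set up in one step. The following is a minimal base-R sketch; the helper name `create_project_dirs()` is hypothetical and is not part of thoth. The demonstration runs in a temporary directory so that it does not touch an existing project.

```r
# Directories named in the recommendations above, relative to the project root.
project_dirs <- c("data/raw", "data/processed", "models", "metrics", "plots")

# Hypothetical helper (not part of thoth): create any missing directories.
create_project_dirs <- function(root = ".") {
  paths <- file.path(root, project_dirs)
  for (p in paths[!dir.exists(paths)]) {
    dir.create(p, recursive = TRUE)  # also creates missing parent directories
  }
  invisible(paths)
}

# Demonstrate in a throwaway location
root <- tempfile("dvc-demo-")
created <- create_project_dirs(root)
all(dir.exists(created))
```

In a real project, calling `create_project_dirs()` with the default `root = "."` from the project root would apply the same layout; directories that already exist are left untouched.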
Common Operations
Managing Stages
# Add a new stage
dvc_stage(
  name = "feature_engineering",
  deps = "data/processed/clean_data.csv",
  outputs = "data/processed/features.csv",
  params = list(n_features = 10)
)

# Commit changes
dvc_commit()
Next Steps
- Explore the DVC documentation for advanced features
- Check out the end-to-end example: vignette("end-to-end-example")
- Learn about Git integration: vignette("git-integration")