
Introduction

This guide walks you through a complete data science project using thoth. We’ll use the classic iris dataset to demonstrate how all the package’s features work together. By the end, you’ll understand how to:

  • Set up a reproducible project structure
  • Track data and code changes
  • Document analytical decisions
  • Create reproducible workflows
  • Share your work with others

1. Project Setup

Let’s start by creating a new project with all the necessary components:

# Load required packages
library(thoth)
library(tidyverse)
library(tidymodels)

# Create a new project with all features enabled
create_analytics_project(
  "iris_analysis",      # Project name
  use_dvc = TRUE,       # Enable data version control
  use_docker = TRUE,    # Enable containerization
  git_init = TRUE       # Initialize Git repository
)

# Move into the project directory
setwd("iris_analysis")

This creates a structured project with:

  • Version control (Git) for code
  • Data version control (DVC) for large files
  • Docker for reproducible environments
  • Organized directories for data, code, and results
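
If you want to confirm what was created, you can print the top of the directory tree. Here we use the fs package, which isn't loaded above, so install it first if needed:

# Optional: inspect the generated project layout (assumes the fs package)
fs::dir_tree(".", recurse = 1)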

2. Data Management

2.1 Save Raw Data

First, let’s save and track our raw data:

# Save the iris dataset as our raw data
# The write_csv_dvc function combines saving and tracking in one step
iris %>% 
  as_tibble() %>%
  write_csv_dvc(
    "data/raw/iris.csv",
    message = "Add original iris dataset",  # Descriptive message for tracking
    stage_name = "save_raw_data"           # Name this step in our pipeline
  )
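
As a quick sanity check, you can confirm the file landed on disk and ask DVC for the pipeline status (dvc_status() is covered in section 5):

# Verify the raw data was written and is known to DVC
file.exists("data/raw/iris.csv")
dvc_status()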

2.2 Process Data

Now we’ll clean and prepare our data for analysis:

# Read the raw data and process it
processed_data <- read_csv("data/raw/iris.csv") %>%
  janitor::clean_names() %>%
  mutate(species = as.factor(species))

# Create training (80%) and testing (20%) sets
set.seed(123)  # For reproducibility
split <- initial_split(processed_data, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
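# Note: initial_split() samples rows at random, so the set.seed(123) call
# above is what makes this 120/30 row split reproducible across runs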

# Create the preprocessing recipe and estimate it on the training data
iris_recipe <- recipe(species ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

iris_prep <- prep(iris_recipe)

# Apply the estimated recipe to both training and testing data
train_data <- bake(iris_prep, new_data = NULL)
test_data  <- bake(iris_prep, new_data = test_data)

# Save and track the processed datasets
train_data %>%
  write_csv_dvc(
    "data/processed/train.csv",
    message = "Add processed training data",
    stage_name = "prepare_data",           # Name this pipeline stage
    deps = "data/raw/iris.csv",           # Declare dependencies
    params = list(                        # Track important parameters
      train_prop = 0.8,
      seed = 123
    )
  )

# Note: the parameters are stored in params.yaml and the stage script in R/pipeline/prepare_data.R
test_data %>%
  write_csv_dvc(
    "data/processed/test.csv",
    message = "Add processed test data"
  )
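
Before moving on to modeling, it's worth a quick look at whether the random split kept the species reasonably balanced. This check is just for our own eyes and isn't part of the tracked pipeline:

# Class counts should be roughly proportional across the two splits
train_data %>% count(species)
test_data %>% count(species)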

3. Model Development

3.1 Train Model

Let’s train a random forest classifier:

# Define the model specification
rf_spec <- rand_forest(
  trees = 500,    # Number of trees
  mtry = 3        # Number of variables to consider at each split
) %>%
  set_engine("ranger") %>%           # Use ranger implementation
  set_mode("classification")         # Set for classification task

# Train the model
rf_fit <- rf_spec %>%
  fit(
    species ~ .,    # Predict species using all other variables
    data = train_data
  )

# Save and track the trained model
rf_fit %>%
  write_rds_dvc(
    "models/rf_model.rds",
    message = "Save trained random forest model",
    stage_name = "train_model",
    deps = "data/processed/train.csv",    # Model depends on training data
    params = list(                        # Track model parameters
      n_trees = 500,
      mtry = 3
    )
  )
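
If you want to peek at the fit before evaluating it, parsnip's extract_fit_engine() returns the underlying ranger object, which prints a summary including the out-of-bag prediction error:

# Inspect the underlying ranger fit (OOB error, number of trees, etc.)
rf_fit %>% extract_fit_engine()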

3.2 Evaluate Model

Let’s assess how well our model performs:

# Make predictions on test data
predictions <- rf_fit %>%
  predict(test_data) %>%
  bind_cols(test_data)

# Calculate performance metrics (named model_metrics so the object
# doesn't shadow yardstick's metrics() function)
model_metrics <- predictions %>%
  metrics(truth = species, estimate = .pred_class)

# Create visualization
conf_matrix_plot <- predictions %>%
  conf_mat(truth = species, estimate = .pred_class) %>%
  autoplot()

# Save confusion matrix plot
ggsave(
  "plots/confusion_matrix.png",
  plot = conf_matrix_plot,
  width = 8,
  height = 6
)

# Track metrics with DVC
model_metrics %>%
  write_csv_dvc(
    "metrics/model_metrics.csv",
    message = "Add model evaluation metrics",
    stage_name = "evaluate_model",
    deps = c(                              # This stage depends on:
      "models/rf_model.rds",              # - The trained model
      "data/processed/test.csv"           # - The test data
    ),
    metrics = TRUE                        # Mark as metrics for tracking
  )
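
The metrics above are based on hard class predictions. If you also want threshold-free measures such as AUC, predict with type = "prob", which adds one .pred_<level> probability column per species; this sketch isn't part of the tracked pipeline:

# Class probabilities enable ROC-based metrics
prob_predictions <- rf_fit %>%
  predict(test_data, type = "prob") %>%
  bind_cols(test_data)

# Multiclass AUC (yardstick uses the Hand-Till estimator by default)
prob_predictions %>%
  roc_auc(truth = species, .pred_setosa, .pred_versicolor, .pred_virginica)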

4. Decision Tracking

Let’s document our analytical decisions:

# Start tracking decisions
tree <- initialize_decision_tree(
  analysis_id = "iris_classification",
  analyst = "Data Scientist",
  description = "Classification of iris species using random forest"
)

# Record data processing decisions
record_decision(
  tree,
  check = "Feature scaling",
  observation = "Features are on different scales (e.g., length vs width)",
  decision = "Scale all numeric features to mean=0, sd=1",
  reasoning = "Standardized features ensure equal importance in the model",
  evidence = "data/processed/train.csv"
)

record_decision(
  tree,
  check = "Train/Test Split",
  observation = "Need to assess model performance on unseen data",
  decision = "Use 80/20 split with random sampling",
  reasoning = "Standard split ratio, enough data in both sets",
  evidence = "scripts/prepare_data.R"
)

# Record modeling decisions
record_decision(
  tree,
  check = "Model Selection",
  observation = "Relationships between features may be non-linear",
  decision = "Use random forest classifier",
  reasoning = "Can capture complex patterns, handles feature interactions well",
  evidence = "models/rf_model.rds"
)

# Export decisions for documentation
export_decision_tree(
  tree,
  format = "html",
  output_path = "reports/analysis_decisions.html"
)

5. Reproducible Pipeline

Now that we’ve set up our DVC pipeline with permanent scripts in R/pipeline/, we can easily reproduce our entire workflow:

# View the pipeline status
dvc_status()

# Reproduce the entire pipeline
dvc_repro()

# Or reproduce a specific stage
dvc_repro("train_model")

Each stage in our pipeline is backed by a permanent R script in the R/pipeline/ directory:

  • save_raw_data.R: Saves and tracks the raw iris dataset
  • prepare_data.R: Processes and splits the data
  • train_model.R: Trains the random forest model
  • evaluate_model.R: Computes and saves performance metrics

Parameters for each stage are stored in params.yaml, making it easy to track and modify parameters across the entire pipeline.
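
If you need those values back in R, say for a report, you can parse the file directly. This assumes the yaml package is installed; the exact nesting of the resulting list mirrors the stage names shown above:

# Read the pipeline parameters back into R (assumes the yaml package)
params <- yaml::read_yaml("params.yaml")
str(params)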

6. Sharing Your Work

6.1 Share Code

Push your code to the remote repository:

# Stage all changes
git_add(".")

# Commit changes
git_commit("Complete iris species classification analysis")

# Push to remote repository
git_push("origin", "main")

6.2 Share Data and Models

Push data and models to your DVC remote:

# Push all tracked data and models
dvc_push()

6.3 Share Environment

Others can reproduce your environment using these steps:

# 1. Clone the repository (one-time setup)
git_clone("<repository-url>", "iris_analysis")

# 2. Change to project directory
setwd("iris_analysis")

# 3. Pull the latest code
git_pull()

# 4. Pull data and models
dvc_pull()

# 5. Reproduce the analysis
dvc_repro()

Note: For Docker setup, use the functions from thoth's Docker integration:

# Start the Docker environment
setup_docker()

Next Steps