
End-to-End Example: Iris Classification Project
Source:vignettes/end-to-end-example.Rmd
end-to-end-example.Rmd
Introduction
This guide walks you through a complete data science project using
thoth
. We’ll use the classic iris dataset to demonstrate
how all the package’s features work together. By the end, you’ll
understand how to:
- Set up a reproducible project structure
- Track data and code changes
- Document analytical decisions
- Create reproducible workflows
- Share your work with others
1. Project Setup
Let’s start by creating a new project with all the necessary components:
# Load required packages
library(thoth)
library(tidyverse)
library(tidymodels)
# Create a new project with all features enabled
create_analytics_project(
"iris_analysis", # Project name
use_dvc = TRUE, # Enable data version control
use_docker = TRUE, # Enable containerization
git_init = TRUE # Initialize Git repository
)
# Move into the project directory
setwd("iris_analysis")
This creates a structured project with:
- Version control (Git) for code
- Data version control (DVC) for large files
- Docker for reproducible environments
- Organized directories for data, code, and results
2. Data Management
2.1 Save Raw Data
First, let’s save and track our raw data:
# Save the iris dataset as our raw data
# The write_csv_dvc function combines saving and tracking in one step
iris %>%
as_tibble() %>%
write_csv_dvc(
"data/raw/iris.csv",
message = "Add original iris dataset", # Descriptive message for tracking
stage_name = "save_raw_data" # Name this step in our pipeline
)
2.2 Process Data
Now we’ll clean and prepare our data for analysis:
# Read the raw data and process it
processed_data <- read_csv("data/raw/iris.csv") %>%
janitor::clean_names() %>%
mutate(species = as.factor(species))
# Create training (80%) and testing (20%) sets
set.seed(123) # For reproducibility
split <- initial_split(processed_data, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
# Create and apply the recipe for preprocessing
iris_recipe <- recipe(species ~ ., data = train_data) %>%
step_normalize(all_numeric_predictors())
# Apply the recipe to both training and testing data
train_data <- iris_recipe %>%
prep() %>%
bake(new_data = NULL)
test_data <- iris_recipe %>%
prep() %>%
bake(new_data = test_data)
# Save and track the processed datasets
train_data %>%
write_csv_dvc(
"data/processed/train.csv",
message = "Add processed training data",
stage_name = "prepare_data", # Name this pipeline stage
deps = "data/raw/iris.csv", # Declare dependencies
params = list( # Track important parameters
train_prop = 0.8,
seed = 123
)
)
# Note: Parameters will be stored in params.yaml and the script in R/pipeline/prepare_data.R
test_data %>%
write_csv_dvc(
"data/processed/test.csv",
message = "Add processed test data"
)
3. Model Development
3.1 Train Model
Let’s train a random forest classifier:
# Define the model specification
rf_spec <- rand_forest(
trees = 500, # Number of trees
mtry = 3 # Number of variables to consider at each split
) %>%
set_engine("ranger") %>% # Use ranger implementation
set_mode("classification") # Set for classification task
# Train the model
rf_fit <- rf_spec %>%
fit(
species ~ ., # Predict species using all other variables
data = train_data
)
# Save and track the trained model
rf_fit %>%
write_rds_dvc(
"models/rf_model.rds",
message = "Save trained random forest model",
stage_name = "train_model",
deps = "data/processed/train.csv", # Model depends on training data
params = list( # Track model parameters
n_trees = 500,
mtry = 3
)
)
3.2 Evaluate Model
Let’s assess how well our model performs:
# Make predictions on test data
predictions <- rf_fit %>%
predict(test_data) %>%
bind_cols(test_data)
# Calculate performance metrics
metrics <- predictions %>%
metrics(truth = species, estimate = .pred_class)
# Create visualization
conf_matrix_plot <- predictions %>%
conf_mat(truth = species, estimate = .pred_class) %>%
autoplot()
# Save confusion matrix plot
ggsave(
"plots/confusion_matrix.png",
plot = conf_matrix_plot,
width = 8,
height = 6
)
# Track metrics with DVC
metrics %>%
write_csv_dvc(
"metrics/model_metrics.csv",
message = "Add model evaluation metrics",
stage_name = "evaluate_model",
deps = c( # This stage depends on:
"models/rf_model.rds", # - The trained model
"data/processed/test.csv" # - The test data
),
metrics = TRUE # Mark as metrics for tracking
)
4. Decision Tracking
Let’s document our analytical decisions:
# Start tracking decisions
tree <- initialize_decision_tree(
analysis_id = "iris_classification",
analyst = "Data Scientist",
description = "Classification of iris species using random forest"
)
# Record data processing decisions
record_decision(
tree,
check = "Feature scaling",
observation = "Features are on different scales (e.g., length vs width)",
decision = "Scale all numeric features to mean=0, sd=1",
reasoning = "Standardized features ensure equal importance in the model",
evidence = "data/processed/train.csv"
)
record_decision(
tree,
check = "Train/Test Split",
observation = "Need to assess model performance on unseen data",
decision = "Use 80/20 split with random sampling",
reasoning = "Standard split ratio, enough data in both sets",
evidence = "scripts/prepare_data.R"
)
# Record modeling decisions
record_decision(
tree,
check = "Model Selection",
observation = "Relationships between features may be non-linear",
decision = "Use random forest classifier",
reasoning = "Can capture complex patterns, handles feature interactions well",
evidence = "models/rf_model.rds"
)
# Export decisions for documentation
export_decision_tree(
tree,
format = "html",
output_path = "reports/analysis_decisions.html"
)
5. Reproducible Pipeline
Now that we’ve set up our DVC pipeline with permanent scripts in
R/pipeline/
, we can easily reproduce our entire
workflow:
# View the pipeline status
dvc_status()
# Reproduce the entire pipeline
dvc_repro()
# Or reproduce a specific stage
dvc_repro("train_model")
Each stage in our pipeline is backed by a permanent R script in the
R/pipeline/
directory: - save_raw_data.R
:
Saves and tracks the raw iris dataset - prepare_data.R
:
Processes and splits the data - train_model.R
: Trains the
random forest model - evaluate_model.R
: Computes and saves
performance metrics
Parameters for each stage are stored in params.yaml
,
making it easy to track and modify parameters across the entire
pipeline.
6. Sharing Your Work
6.1 Share Code
Push your code to the remote repository:
# Stage all changes
git_add(".")
# Commit changes
git_commit("Complete iris species classification analysis")
# Push to remote repository
git_push("origin", "main")
6.2 Share Data and Models
Push data and models to your DVC remote:
# Push all tracked data and models
dvc_push()
6.3 Share Environment
Others can reproduce your environment using these steps:
# 1. Clone the repository (one-time setup)
git_clone("<repository-url>", "iris_analysis")
# 2. Change to project directory
setwd("iris_analysis")
# 3. Pull the latest code
git_pull()
# 4. Pull data and models
dvc_pull()
# 5. Reproduce the analysis
dvc_repro()
Note: For Docker setup, use the functions from the Docker integration:
# Start the Docker environment
setup_docker()
Next Steps
- Try adapting this workflow to your own data
- Explore more advanced features in
vignette("dvc-tracking")
- Learn about custom templates in
vignette("custom-templates")