Example ldmppr Workflow on Simulated Data

set.seed(90210)
library(ldmppr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Data

To illustrate the use of the ldmppr package, we use the included small_example_data dataset, which consists of locations and sizes for points in a 25m x 25m square domain.

data("small_example_data")
nrow(small_example_data)
#> [1] 121
head(small_example_data)
#>           x          y     size
#> 1 10.000000 14.0000000 900.7306
#> 2  5.636737 20.3068181 523.1122
#> 3 24.086291  6.2358462 519.3012
#> 4 21.191344 14.0034707 490.1228
#> 5 14.742128  0.3113149 400.2289
#> 6 20.160199  3.6875049 395.8505

We also include a collection of example raster files obtained from an ESS-DIVE repository (see the help file for small_example_data for details).

Introduction

The ldmppr package provides tools for:

  1. estimating a mechanistic spatio-temporal point process model for a reference marked point pattern (by mapping marks to event times),
  2. training a mark model that relates covariates (and optional competition indices) to marks,
  3. checking model fit using non-parametric summaries and global envelope tests, and
  4. simulating from the fitted model.

As of version 1.1.0, the main workflow functions return S3 objects with standard methods (print(), plot(), coef(), as.data.frame(), etc.). In most cases you should no longer need helper plotting functions like plot_mpp().


Example workflow

Step 1: Estimate the point process parameters

We start by estimating the parameters of a self-correcting spatio-temporal point process using the estimate_process_parameters function. This function performs estimation by optimizing the log-likelihood with user-specified optimization algorithms from nloptr. Three optimization strategies are available ("local", "global_local", "multires_global_local"): "local" performs a single local optimization; "global_local" first runs a global optimization and then uses its solution as the starting point for a local optimization; and "multires_global_local" behaves like "global_local" but iterates further at the local level, increasing the resolution of the spatio-temporal grid and repeating the local optimization with increasing degrees of polish. The function requires the observed locations and observed arrival times, where the arrival times are generated from the observed sizes using a power law mapping function; the data can therefore be supplied either with arrival times directly or as locations and sizes (the latter form is used below).

The power law mapping function depends on the hyperparameter delta, which controls the shape of the mapping between sizes and arrival times. The function also requires the grid values for the optimization process, the initial parameter values, the upper bounds for the parameters, and the optimization algorithm(s) to use (global and local, or local only, depending on the strategy). It returns the optimal parameter estimates for the self-correcting point process.
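To build intuition for how sizes become arrival times, here is a toy power-law mapping in base R. This is an illustrative formula only (the largest mark arrives at time 0, smaller marks arrive later, and delta bends the curve); it is not necessarily the exact mapping ldmppr uses internally.

```r
# Toy power-law mapping from sizes to arrival times (illustrative only; the
# exact formula used internally by ldmppr may differ). Larger marks are
# treated as earlier arrivals, and delta controls the curvature.
toy_size_to_time <- function(sizes, delta = 1) {
  1 - (sizes / max(sizes))^delta
}

# Sizes from the first few rows of small_example_data
sizes <- c(900.73, 523.11, 519.30, 490.12, 400.23, 395.85)
toy_size_to_time(sizes, delta = 1)
```

With delta = 1 the mapping is linear in relative size; values of delta away from 1 stretch or compress the early part of the time axis.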

Below we show the common case where you start from (x, y, size) with delta = 1.

# Use the (x, y, size) form and let estimate_process_parameters() construct time via delta
parameter_estimation_data <- small_example_data

# Upper bounds for (t, x, y)
ub <- c(1, 25, 25)

# Define the integration / grid schedule used by the likelihood approximation
grids <- ldmppr_grids(
  upper_bounds = ub,
  levels = list(
    c(20, 20, 20)
    )
)

# Define optimizer budgets/options for the global + local stages
budgets <- ldmppr_budgets(
  global_options = list(
    maxeval = 150
    ),
  local_budget_first_level = list(
    maxeval = 300,
    xtol_rel = 1e-5
    ),
  local_budget_refinement_levels = list(
    maxeval = 300,
    xtol_rel = 1e-5
    )
)

# Parameter initialization values: (alpha1, beta1, gamma1, alpha2, beta2, alpha3, beta3, gamma3)
parameter_inits <- c(1.5, 8.5, 0.015, 1.5, 3.2, 0.75, 3, 0.08)

fit_sc <- estimate_process_parameters(
  data = parameter_estimation_data,
  grids = grids,
  budgets = budgets,
  parameter_inits = parameter_inits,
  delta = 1,
  rescore_control = list(
    enabled = TRUE,
    top = 5L,
    objective_tol = 1e-6,
    param_tol = 0.10
  ),
  parallel = FALSE,
  strategy = "global_local",
  global_algorithm = "NLOPT_GN_CRS2_LM",
  local_algorithm = "NLOPT_LN_BOBYQA",
  verbose = TRUE
)
#> [ldmppr::estimate_process_parameters]
#> Estimating self-correcting process parameters
#>   Strategy: global_local
#>   Delta: 1
#>   Grids: 1 level(s)
#>   Local optimizer: NLOPT_LN_BOBYQA
#>   Global optimizer: NLOPT_GN_CRS2_LM
#>   Rescore control: enabled=TRUE, top=5, objective_tol=1e-06, param_tol=0.1
#>   Starts: global=1, local=1, jitter_sd=0.35, seed=1
#>   Parallel: off
#> Step 1/2: Preparing data and objective function...
#>   Prepared 121 points.
#>   Done in 0.0s.
#> Step 2/2: Optimizing parameters...
#> Single level (grid 20x20x20)
#>   Global search: 1 restart(s), then local refinement.
#>   Completed in 0.1s.
#>   Best objective: 164.71228
#> Finished. Total time: 0.2s.
# Print method for ldmppr_fit objects
fit_sc
#> ldmppr Fit
#>   process:         self_correcting
#>   engine:          nloptr
#>   strategy:        global_local
#>   global_alg:      NLOPT_GN_CRS2_LM
#>   local_alg:       NLOPT_LN_BOBYQA
#>   starts:          global=1, local=1, jitter_sd=0.35, seed=1
#>   n_obs:           121
#>   selected_delta:  1
#>   objective:       164.7123
#>   final_status:    5
#>   final_outcome:   NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval
#>                    (above) was reached.
#>   elapsed_sec:     0.151
#>   coefficients:
#> [1] -2.0628500 11.2504000  0.0251256  1.7797700  2.4252300  0.7583720  2.4725400
#> [8]  0.1083000

estimated_parameters <- coef(fit_sc)
estimated_parameters
#> [1] -2.06284913 11.25040097  0.02512557  1.77977004  2.42522901  0.75837162
#> [7]  2.47254405  0.10829965


Step 2: Train a mark model

Next, we use the train_mark_model function to train a suitably flexible mark model that predicts sizes from location-specific covariates (rasters), optional distance-dependent competition indices, and the generated arrival times. The function fits either a Gradient Boosting Machine (GBM) or a Random Forest (RF) model and may be run in parallel if desired. The user has control over the model type, the bounds of the spatial domain, the inclusion of competition indices, the competition radius, the edge correction method, the final model selection metric, the number of cross-validation folds, and the size of the parameter tuning grid for the model.
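As a rough sketch of what a distance-dependent competition index can look like, the base-R function below sums neighbor sizes weighted by inverse distance within a fixed radius. This is a hypothetical formula for illustration; ldmppr's own definition may differ.

```r
# Hypothetical competition index (illustration only; ldmppr's exact
# definition may differ): for each point, sum neighbor sizes weighted by
# inverse distance, over neighbors within a fixed competition radius.
competition_index <- function(x, y, size, radius) {
  sapply(seq_along(x), function(i) {
    d <- sqrt((x - x[i])^2 + (y - y[i])^2)
    nb <- d > 0 & d <= radius   # neighbors within radius, excluding self
    sum(size[nb] / d[nb])       # empty neighbor set contributes 0
  })
}

set.seed(1)
pts <- data.frame(x = runif(10, 0, 25), y = runif(10, 0, 25),
                  size = runif(10, 100, 900))
competition_index(pts$x, pts$y, pts$size, radius = 10)
```

Indices of this flavor give each point a scalar summary of local crowding, which the mark model can then use as an additional feature.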

# Load raster covariates shipped with the package
raster_paths <- list.files(system.file("extdata", package = "ldmppr"),
  pattern = "\\.tif$", full.names = TRUE
)
raster_paths <- raster_paths[!grepl("_med\\.tif$", raster_paths)]
rasters <- lapply(raster_paths, terra::rast)

# Scale rasters once up front
scaled_rasters <- scale_rasters(rasters)

# Train the mark model, passing the estimated self-correcting model fit from Step 1
mark_model <- train_mark_model(
  data = fit_sc,
  raster_list = scaled_rasters,
  scaled_rasters = TRUE,
  model_type = "xgboost",
  parallel = FALSE,
  include_comp_inds = FALSE,
  competition_radius = 10,
  edge_correction = "none",
  selection_metric = "rmse",
  cv_folds = 5,
  tuning_grid_size = 20,
  verbose = TRUE
)
#> [ldmppr::train_mark_model]
#> Training mark model
#>   Model type: xgboost
#>   Selection metric: rmse
#>   CV folds: 5
#>   Tuning grid size: 20
#>   Include competition indices: no
#>   Edge correction: none
#> Step 1/6: Preparing training data...
#>   Rows: 121
#>   Seed: 0
#>   Done in 0.0s.
#> Step 2/6: Configuring parallel backend...
#>   Parallel: off
#>   Model engine threads: 1
#>   Done in 0.0s.
#> Step 3/6: Extracting raster covariates...
#>   Using pre-scaled rasters (scaled_rasters = TRUE).
#>   Extracted 4 raster feature(s).
#>   Done in 0.0s.
#> Step 4/6: Building model matrix (and competition indices if requested)...
#>   Final training rows: 121
#>   Final feature columns (incl x,y,time): 7
#>   Done in 0.0s.
#> Step 5/6: Fitting model (with optional CV tuning)...
#>   foreach backend: doSEQ | workers=1
#>   Done in 9.6s.
#> Step 6/6: Finalizing output object...
#>   Residual bootstrap stored (source=oos, transform=sqrt, bins=6, min/bin=8).
#>   Done in 0.5s.
#> Training complete. Total time: 10.1s.

# Print method for ldmppr_mark_model objects
print(mark_model)
#> ldmppr Mark Model
#>   engine:           xgboost
#>   has_fit_engine:   TRUE
#>   has_xgb_raw:      FALSE
#>   n_features:       7
#>   n_rasters:        4
#>   raster_names:     Snodgrass_DEM_1m, Snodgrass_aspect_southness_1m,
#>                     Snodgrass_slope_1m, Snodgrass_wetness_index_1m
#>   scaled_rasters:   TRUE
#>   comp_indices:     FALSE

# Summary method for ldmppr_mark_model objects
summary(mark_model)
#> Summary: ldmppr Mark Model
#>   engine:           xgboost
#>   n_features:       7
#>   n_rasters:        4
#>   scaled_rasters:   TRUE
#>   comp_indices:     FALSE
#>   comp_radius:      10
#>   edge_correction:  none
#>   cv_folds:         5
#>   tuning_grid_size: 20
#>   selection_metric: rmse

predict_marks() is retained as a deprecated compatibility wrapper. For new workflows, use S3 prediction directly via predict(mark_model, ...).

If the user wants to save the trained mark model (and reload it at a later time), the package provides the save_mark_model and load_mark_model functions (which handle model-engine specifics appropriately):

save_path <- tempfile(fileext = ".rds")
save_mark_model(mark_model, save_path)

mark_model2 <- load_mark_model(save_path)
mark_model2

Step 3: Check model fit

Now that we have (i) an estimated process model and (ii) a trained mark model, we can check the model fit using the check_model_fit function. This function provides a variety of non-parametric summaries for point processes and global envelope tests to assess the goodness of fit of the model. The package provides individual envelope tests for the L, F, G, J, E, and V statistics, as well as a combined envelope test for all statistics simultaneously, building on the functionality of the spatstat and GET packages. Plotting the combined envelope test lets us visually assess the goodness of fit and obtain a \(p\)-value measuring how well the estimated model captures the relationships observed in the reference data. Typically a \(p\)-value less than 0.05 indicates a poor fit of the model to the data, and the authors of the GET package recommend a minimum of 2500 simulations to obtain a valid \(p\)-value at the 0.05 level.
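One practical reason the simulation count matters: a Monte Carlo test based on n_sim simulations cannot produce a \(p\)-value smaller than 1 / (n_sim + 1), so a small simulation budget limits the resolution of the test.

```r
# Smallest attainable Monte Carlo p-value with n_sim simulations.
min_p <- function(n_sim) 1 / (n_sim + 1)

min_p(199)   # resolution with the n_sim used below for speed
#> [1] 0.005
min_p(2499)
#> [1] 4e-04
```

This is why a small n_sim (as used below for speed) is fine for a quick look, while a much larger budget is needed for a reliable decision at the 0.05 level.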

check_model_fit() returns an S3 object with a plot() method (by default it plots the combined global envelope test).

# Check the model fit by passing the estimated process and trained mark models from Steps 1 and 2
model_check <- check_model_fit(
  t_min = 0,
  t_max = 1,
  process = "self_correcting",
  process_fit = fit_sc,
  mark_model = mark_model,
  include_comp_inds = FALSE,
  thinning = TRUE,
  edge_correction = "none",
  competition_radius = 10,
  n_sim = 199,
  mark_mode = "mark_model",
  fg_correction = "km",
  save_sims = FALSE,
  verbose = TRUE,
  seed = 90210
)
#> Registered S3 method overwritten by 'spatstat.geom':
#>   method       from     
#>   print.metric yardstick
#> [ldmppr::check_model_fit]
#> Checking model fit
#>   mark_mode=mark_model, n_sim=199, parallel=FALSE
#>   include_comp_inds=FALSE, edge_correction=none
#> Step 1/4: Preparing simulation setup
#> Using FGJ r-grid from reference: 1:198 (max r=2.405), correction=km
#>   Done in 0.1s.
#> Step 2/4: Generating accepted simulations
#>   Done in 3.9s.
#> Step 3/4: Computing envelope tests
#>   Done in 0.4s.
#> Step 4/4: Finalizing output object
#>   Done in 4.4s.
#> Model check complete.

# Plot method for ldmppr_model_check objects (defaults to combined global envelope test)
plot(model_check)

You can still access the underlying GET objects if you want (e.g., model_check$combined_env), but for most workflows plot(model_check) is sufficient.

Step 4: Simulate from the fitted model

Finally, we can simulate a new marked point process realization from the fitted model.

simulate_mpp() returns an object of class ldmppr_sim with plot() and as.data.frame() methods.

# Simulate a new marked point pattern by passing the estimated process and trained mark models from Steps 1 and 2
simulated <- simulate_mpp(
  process = "self_correcting",
  process_fit = fit_sc,
  t_min = 0,
  t_max = 1,
  mark_model = mark_model,
  mark_mode = "mark_model",
  include_comp_inds = TRUE,
  competition_radius = 10,
  edge_correction = "none",
  thinning = TRUE,
  seed = 90210
)

# Plot method for ldmppr_sim
plot(simulated)


# Data-frame view of the simulated realization
head(as.data.frame(simulated))
#>        time         x         y    marks
#> 1 0.0000000 10.000000 14.000000 900.3945
#> 2 0.3418117 13.990384 22.844918 656.8889
#> 3 0.4720850  8.711604 16.324267 544.4919
#> 4 0.4880539 20.399572  4.442644 446.0978
#> 5 0.4892544  7.679190 24.608903 586.1369
#> 6 0.5038109 14.727204  8.969249 668.0840

Summary

This vignette demonstrates the core ldmppr workflow using the updated S3-based interfaces: estimate_process_parameters() returns an ldmppr_fit, train_mark_model() an ldmppr_mark_model, check_model_fit() an ldmppr_model_check, and simulate_mpp() an ldmppr_sim.

These objects are designed to be composable and to behave like standard R model objects through S3 methods.