Step-by-step guide to the MultiBlock Coxmos pipeline

Pedro Salguero García

Universitat Politècnica de València and Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
pedrosalguerog[at]gmail.com

Last updated: 02 junio, 2025

Introduction

The Coxmos R package includes the following functionalities for Multi-Block datasets:

  1. Model fitting and optimization for Multi-Block Data: The user can either select the parameters (such as the penalty value, number of components in the latent space…) for survival models in multi-block datasets or use the package tools for estimating the best values for these parameters.

  2. Comparing Survival Models: After obtaining multiple survival models, the user can compare them to determine which one gives the best results. The package includes several functions for comparing the models.

  3. Interpreting Results: After selecting the best model or models, the user can interpret the results. The package includes several functions for understanding the impact of the original variables on survival prediction.

  4. Predicting Survival for New Patients: Finally, if a new dataset of patients is available, the user can use the estimated model to make predictions for the new patients. In addition, Coxmos allows for the comparison of variables to identify which ones contribute most to the prediction.

Installation

Coxmos stable version can be installed from CRAN repository:

install.packages("Coxmos")

C:1BfP43083d9d5ab9-MO-pipeline.R

The development version can be installed from GitHub:

install.packages("devtools")
devtools::install_github("BiostatOmics/Coxmos", build_vignettes = TRUE)

C:1BfP43083d9d5ab9-MO-pipeline.R

Getting ready

To run the analyses in this vignette, first load Coxmos:

library(Coxmos)

C:1BfP43083d9d5ab9-MO-pipeline.R

To generate plots, we will make use of the RColorConesa R package. After installing it:

# install.packages("RColorConesa")
library(RColorConesa)

C:1BfP43083d9d5ab9-MO-pipeline.R

Input data

The Coxmos pipeline requires two objects as input. One should be a list with as many elements as block of predictors X, and the other a matrix with two columns called time and event for survival information Y.

The data example contains miRNA expression data and protein levels as the two block of predictors. The data can be loaded as follows:

# load dataset
data("X_multiomic", package = "Coxmos")
data("Y_multiomic", package = "Coxmos")

X <- X_multiomic
Y <- Y_multiomic

rm(X_multiomic, Y_multiomic)

C:1BfP43083d9d5ab9-MO-pipeline.R

These files contain a list of two blocks and a data.frame object. Our toy example has a total of 150 observations, 642 miRNAs and 369 proteins:

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")
hsa-let-7a-2-3p hsa-let-7a-3p hsa-let-7a-5p hsa-let-7b-3p hsa-let-7b-5p
TCGA-A2-A0SV-01A 9.344533 228.00660 56661.51 155.1192 57564.19
TCGA-A2-A0YT-01A 25.031847 134.54618 56991.26 103.2564 64478.91
TCGA-BH-A1F0-01A 34.725664 122.98673 68110.05 195.3319 37871.23
TCGA-B6-A0IK-01A 44.962897 163.66495 89585.87 183.4486 121455.58
TCGA-E2-A1LE-01A 15.392269 87.54353 188649.57 109.6699 54892.68
#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")
7529 7531 7534 1978 7158
TCGA-A2-A0SV-01A 0.025273 -0.030661 0.458350 1.369000 -0.40781
TCGA-A2-A0YT-01A 0.348210 0.347870 0.891570 -0.188540 -0.13896
TCGA-BH-A1F0-01A 0.132340 -0.348210 0.078923 0.227890 -0.49063
TCGA-B6-A0IK-01A 0.352670 0.376670 -0.285750 -0.380980 -1.43080
TCGA-E2-A1LE-01A 0.110140 0.094990 -0.159960 -0.098148 0.40160
#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")
time event
TCGA-A2-A0SV-01A 825 TRUE
TCGA-A2-A0YT-01A 723 TRUE
TCGA-BH-A1F0-01A 785 TRUE
TCGA-B6-A0IK-01A 571 TRUE
TCGA-E2-A1LE-01A 879 TRUE

C:1BfP43083d9d5ab9-MO-pipeline.R

Exploratory Analysis and Data Processing

Event Distribution

Before performing any survival analysis, it is relevant to understand the structure of the time-to-event data. A key preliminary step is examining how events (e.g., deaths, failures) and censored observations (e.g., lost to follow-up, study termination) are distributed across the dataset, as this exploration helps identify potential biases, assess data quality, and guide appropriate modeling choices.

The plot_events() function in this package provides a straightforward way to visualize this distribution. It displays event occurrences and censoring times, allowing users to:

  • Detect imbalances between events and censored cases that may affect statistical power.
  • Identify unusual patterns (e.g., clustered censoring) that could indicate informative dropout.
  • Compare distributions across groups to reveal differential follow-up rates.
  • Validate whether the observed event timing aligns with expected biological or clinical processes.

This visualization serves as an exploratory tool, ensuring that subsequent survival analysis rests on reliable data patterns. The function’s output can highlight the need for extended follow-up or careful interpretation of results in contexts with heavy censoring.

The plot_events() function requires the Y matrix and optionally accepts the following arguments among others:

  • categories: the names of each category (a character vector of length two).
  • y.text: the label for the y-axis.
  • roundTo: the smallest numeric unit to which break times will be rounded. For example, if set to 0.25, a value of 1.32 will be rounded to 1.25.
  • max.breaks: the number of breaks for the histogram.
ggp_density.event <- plot_events(Y = Y, 
                                 categories = c("Censored","Event"),
                                 y.text = "Number of observations", 
                                 roundTo = 0.5, 
                                 max.breaks = 15, 
                                 txt.x.angle = 90)

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_density.event$plot

C:1BfP43083d9d5ab9-MO-pipeline.R

Event Per Variable (EVP)

Before fitting survival models, it is important to ensure that there is sufficient statistical power to avoid overfitting. The Events Per Variable (EPV) metric supports this evaluation by indicating how many outcome events are available per predictor in the model.

According to literature, the higher values the better. As a recommendation, values greater than 10 should be used to obtain models with good predictions, but this value should not be lower than 5.

A low EPV value (typically <10) increases the risk of:

  • Overfitting: the model may capture noise rather than true associations
  • Unstable estimates: the variability of coefficient estimates increases, reducing reproducibility
  • Optimistic performance metrics: apparent model accuracy may not generalize to new data

The getEPV.mb() function calculates and reports the EPV for a given dataset. This helps to:

  • Assess whether the sample size is adequate for multivariable modeling
  • Guide variable selection to maintain an EPV ≥10 or support justification for deviations
  • Compare EPV values across subgroups when conducting stratified analyses

Additionally, working with low EPV values often indicates the presence of high-dimensional datasets, where the number of predictors is large relative to the number of events. In such cases, it is advisable to use survival models specifically designed to handle high-dimensional settings, such as Coxmos methods.

EPV <- getEPV.mb(X, Y)

C:1BfP43083d9d5ab9-MO-pipeline.R

EPV
#> $mirna
#> [1] 0.1121495
#> 
#> $proteomic
#> [1] 0.195122

C:1BfP43083d9d5ab9-MO-pipeline.R

In this case, the EPV has a value of 0.1121495 and 0.195122 for each block, which indicates that the number of predictors is high relative to the number of observed events. This makes it unreliable to apply classical survival algorithms, as they are prone to overfitting under these conditions.

Preprocessing Categorical Variables for Dimensionality Reduction

When working with datasets (i.e., X matrices) that include qualitative or categorical variables, proper encoding is essential prior to applying dimensionality reduction techniques such as Principal Component Analysis (PCA) or Partial Least Squares (PLS). Most of these algorithms require purely numerical input, so categorical variables (e.g., sex = {Male, Female}, tumor_stage = {I, II, III}) must be transformed into a suitable format that retains their informational content without introducing artificial ordinality.

The factorToBinary() function automates the transformation of categorical variables into dummy variables using one-hot encoding. This process addresses several important issues:

  • Avoiding misinterpretation: Prevents algorithms from mistakenly interpreting categorical values as ordinal or numeric (e.g., assigning sex Male = 1, Female = 2, etc.).

  • Compatibility: Ensures that categorical features are represented numerically, which is required by linear algebra-based techniques such as PCA or PLS.

  • Dimensionality control: Allows flexibility in the encoding format:

    • K dummy variables (one for each category) for full representation.
    • K - 1 dummy variables to avoid perfect multicollinearity in regression-based models.
  • Easy integration: The resulting matrix preserves original column names with appended suffixes (e.g., sex_Male, tumor_stage_II), enabling straightforward traceability in downstream processes.

It accepts the following arguments:

  • X: numeric matrix or data frame. Only qualitative variables (class factor) will be transformed into binary variables.

  • all: logical. If TRUE, one binary variable will be generated for each level of the factor (k dummies). If FALSE, only k - 1 variables will be returned, using the first level as the reference category (default: TRUE).

  • sep: character. Symbol used to concatenate the original variable name and the level name in the new column names. For example, if the variable is "sex" and sep = "_", the resulting dummy variables will be named "sex_male", "sex_female", etc.

This function ensures compatibility with numerical algorithms while keeping variable names interpretable and traceable.

X$mirna <- factorToBinary(X = X$mirna, all = TRUE, sep = "_")

C:1BfP43083d9d5ab9-MO-pipeline.R

Training/Test Split for Survival Model Validation

Proper validation of survival models requires thoughtful partitioning of the dataset into training and test subsets. The getTrainTest() function provides a consistent and event-aware approach to splitting survival data, ensuring that the characteristics of time-to-event outcomes are preserved during validation.

This function performs event-balanced splits by default, maintaining a similar proportion of events and censored cases across both training and test sets. This is essential to avoid biased performance estimates and to ensure that the test set contains a sufficient number of events for evaluation.

In datasets composed of multiple data blocks (e.g., omics and clinical data), getTrainTest() synchronizes the partitioning across all blocks, preserving sample alignment and ensuring integrity during downstream analysis. Whether working with simple matrices or multiblock structures, the function guarantees that each block is split consistently.

It also supports flexible validation schemes. By setting the times argument, users can perform a single data split (times = 1) or repeated random splits for cross-validation (times > 1), allowing for evaluation of model stability and robustness across different partitions. A fixed random seed ensures reproducibility of the results.

split_data <- getTrainTest(X, Y, p = 0.7)

X_train <- split_data$X_train #106x642 and 106x369
Y_train <- split_data$Y_train

X_test <- split_data$X_test #44x642 and 44x369
Y_test <- split_data$Y_test

C:1BfP43083d9d5ab9-MO-pipeline.R

This splitting strategy is especially relevant in survival analysis, where event sparsity and censoring can lead to unreliable validation if not properly addressed. By incorporating stratified sampling and support for complex data structures, getTrainTest() serves as a reliable bridge between preprocessing and model development. It enables performance assessment that reflects real-world behavior while preserving the temporal dynamics of time-to-event data.

Survival Models for Multi-Block datasets

After loading the data, it may be of interest for the user to perform a survival analysis in order to examine the relationship between explanatory variables and the outcome. However, traditional methods are only applicable for low-dimensional datasets. To address this issue, we have developed a set of functions that make use of (s)PLS techniques in combination with Cox analysis for the analysis of high-dimensional datasets. The methods are divided into two categories:

  1. Single-Block approach: Each omic dataset is analyzed individually to reduce its dimensionality for explaining survival. Subsequently, all omics are integrated to construct the final model.

    1.1. Single-Block: In this variant, all omic blocks are processed using the same number of components and a common penalty value. The cross-validation (CV) is performed using the integrated dataset.

    1.2. Iterative Single-Block: Unlike the standard version, each omic is allowed to have its own optimal number of components and penalty value, tailored to its specific characteristics. The CV is conducted individually per omic and could differ from the standard version.

  2. Multi-Omic-Block approach: All omic datasets are jointly analyzed using multi-block sparse Partial Least Squares (sPLS) functions. After dimensionality reduction, the resulting latent variables are used to construct the final model.

Coxmos provides the following methodologies for multi-block approaches:

More information for each approach could be found in the help section for each function. The function name for each methodology are:

However, an easier and unified way to run all models is provided via the main function mb.coxmos(), which allows execution of any method by specifying the desired algorithm through the method parameter.

To perform a survival analysis with our example, we will use methodologies that can work with high-dimensional data. These are the set of methodologies that use sPLS techniques.

MB methods for Survival Analyses

In order to perform survival analysis with our multi-block high-dimensional data, we have implemented a series of methods that utilize techniques such as partial least squares (PLS) methodology in order to reduce the dimensionality of the input data.

To evaluate the performance of these methods, we have implemented cross-validation, which allows us to estimate the optimal parameters for future predictions based on prediction metrics such as: AIC, C-INDEX, I.BRIER and Area Under the Curve (AUC). By default AUC metric is used with the "cenROC" evaluator as it has provided the best results in our tests. However, multiple AUC evaluators could be used: "risksetROC", "survivalROC", "cenROC", "nsROC", "smoothROCtime_C" and "smoothROCtime_I".

Furthermore, a mix of multiple metrics could be used to obtain the optimal model. The user has to establish different weights for each metric and all of them will be considered to compute the optimal model (the total weight must sum 1).

In addition, we have included options for normalizing data, filtering variables, and setting the minimum EPV, as well as specific parameters for each method, such as the alpha value for Cox Elastic Net and the number of components for sPLS models. Overall, our cross-validation methodology allows us to effectively analyze high-dimensional survival data and optimize our model parameters.

SB.sPLS-ICOX

The first method is SB.sPLS-ICOX, the single-block version of the sPLS-ICOX method. SB.sPLS-ICOX applies the sPLS-ICOX procedure individually to each omic dataset, resulting in dimensionality reduction within each individual dataset. By constructing weights based on univariate Cox models, survival-related information is captured during the reduction process. The reduced omic datasets are then integrated to create a comprehensive survival model based on their respective PLS components. Cross-validation is utilized to determine the optimal number of components and the penalty for variable selection, to ensure robust model optimization.

cv.sb.splsicox_res <- cv.mb.coxmos(method = "sb.splsicox",
                                   X = X_train, Y = Y_train,
                                   max.ncomp = 2, penalty.list = c(0.5,0.9),
                                   n_run = 1, k_folds = 5)

C:1BfP43083d9d5ab9-MO-pipeline.R

cv.sb.splsicox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

cv.sb.splsicox_res$plot_AUC

C:1BfP43083d9d5ab9-MO-pipeline.R

We will generate a SB.sPLS-ICOX model with optimal number of principal components and its penalty based on the results obtained from the cross validation.

sb.splsicox_model <- mb.coxmos(method = "sb.splsicox",
                               X = X_train, Y = Y_train,
                               n.comp = 2, #cv.sb.splsicox_res$opt.comp
                               penalty = 0.9) #cv.sb.splsicox_res$opt.penalty
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.

sb.splsicox_model
#> The method used is SB.sPLS-ICOX.
#> Survival model:
#>                        coef exp(coef)    se(coef)  robust se         z
#> comp_1_mirna     0.08136291  1.084765 0.023122035 0.02795461 2.9105362
#> comp_2_mirna     0.02298943  1.023256 0.008282689 0.01137970 2.0202130
#> comp_1_proteomic 0.04192283  1.042814 0.282275072 0.33317483 0.1258283
#> comp_2_proteomic 0.72828143  2.071517 0.381966688 0.37589925 1.9374378
#>                     Pr(>|z|)
#> comp_1_mirna     0.003608091
#> comp_2_mirna     0.043361296
#> comp_1_proteomic 0.899867825
#> comp_2_proteomic 0.052691844

C:1BfP43083d9d5ab9-MO-pipeline.R

After obtaining the model, if we print it we could see the cox model with all the components that have been selected. However, some of the selected components have not been significant. If we wish to calculate a model where all selected variables are significant, we should update the remove_non_significant parameter to TRUE.

sb.splsicox_model <- mb.coxmos(method = "sb.splsicox",
                               X = X_train, Y = Y_train,
                               n.comp = 2, #cv.sb.splsicox_res$opt.comp
                               penalty = 0.9, #cv.sb.splsicox_res$opt.penalty
                               remove_non_significant = TRUE)
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.

sb.splsicox_model
#> The method used is SB.sPLS-ICOX.
#> A total of 1 variables have been removed due to non-significance filter inside cox model.
#> Survival model:
#>                        coef exp(coef)    se(coef)  robust se        z
#> comp_1_mirna     0.08401369  1.087644 0.014683553 0.01827697 4.596696
#> comp_2_mirna     0.02253909  1.022795 0.007657962 0.01017298 2.215583
#> comp_2_proteomic 0.72600739  2.066812 0.381060467 0.36841663 1.970615
#>                      Pr(>|z|)
#> comp_1_mirna     4.292437e-06
#> comp_2_mirna     2.672004e-02
#> comp_2_proteomic 4.876791e-02

C:1BfP43083d9d5ab9-MO-pipeline.R

In this case, each omic/block was optimized to use the same number of components. However, the Iterative Single-block approach (iSB) allows the selection of a different number of components and penalty values for each block.

cv.isb.splsicox_res <- cv.mb.coxmos(method = "isb.splsicox", 
                                    X = X_train, Y = Y_train,
                                    max.ncomp = 2, penalty.list = c(0.5,0.9),
                                    n_run = 1, k_folds = 5,
                                    remove_non_significant = TRUE)

cv.isb.splsicox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

cv.isb.splsicox_res$list_cv_spls_models$mirna$plot_AUC
cv.isb.splsicox_res$list_cv_spls_models$proteomic$plot_AUC

C:1BfP43083d9d5ab9-MO-pipeline.R When running an iSB model, instead of specifying n.comp and variable selection parameters, the cross-validated model must be provided.

isb.splsicox_model <- mb.coxmos(method = "isb.splsicox",
                                X = X_train, Y = Y_train, cv.isb = cv.isb.splsicox_res,
                                remove_non_significant = FALSE)

isb.splsicox_model

C:1BfP43083d9d5ab9-MO-pipeline.R

Since each block is cross-validated independently, without considering the other blocks, the results differ from those of the standard SB approach.

SB.sPLS-DRCOX

SB.sPLS-DRCOX, derived from its high-dimensional (HD) counterpart, has also been created based on the same principle. It analyzes each omic independently by studying the deviance residuals (DR), integrates the resulting components, and constructs a unified survival model. As with SB.sPLS-ICOX, cross-validation is employed to identify the optimal number of components and the number of variables, ensuring robust model optimization.

cv.sb.splsdrcox_res <- cv.mb.coxmos(method = "sb.splsdrcox", 
                                    X = X_train, Y = Y_train,
                                    max.ncomp = 2, vector = NULL,
                                    n_run = 1, k_folds = 5)

cv.sb.splsdrcox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

An SB.sPLS-DRCOX model will be generated using the optimal number of principal components and corresponding penalty values, based on the results obtained from cross-validation.

sb.splsdrcox_model <- mb.coxmos(method = "sb.splsdrcox", 
                                X = X_train, Y = Y_train, 
                                n.comp = 2, #cv.sb.splsdrcox_res$opt.comp, 
                                vector = list("mirna" = 161, "proteomic" = 185), #cv.sb.splsdrcox_res$opt.nvar,
                                remove_non_significant = T)
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.

sb.splsdrcox_model
#> The method used is SB.sPLS-DRCOX-Dynamic.
#> Survival model:
#>                          coef exp(coef)     se(coef)    robust se        z
#> comp_2_mirna     1.026886e-06  1.000001 5.316507e-07 4.398692e-07 2.334525
#> comp_1_proteomic 7.072923e-01  2.028491 1.031619e-01 9.865372e-02 7.169444
#> comp_2_proteomic 2.752298e-01  1.316833 6.880221e-02 6.227886e-02 4.419313
#>                      Pr(>|z|)
#> comp_2_mirna     1.956824e-02
#> comp_1_proteomic 7.530320e-13
#> comp_2_proteomic 9.901509e-06

C:1BfP43083d9d5ab9-MO-pipeline.R

As in previous approaches, each omic/block was optimized to use the same number of components. However, another methodology, cv.isb.splsdrcox, allows the selection of a different number of components and/or variables for each block.

cv.isb.splsdrcox_res <- cv.mb.coxmos(method = "isb.splsdrcox", 
                                     X = X_train, Y = Y_train,
                                     max.ncomp = 2, vector = NULL,
                                     n_run = 1, k_folds = 5)

cv.isb.splsdrcox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

cv.isb.splsdrcox_res$list_cv_spls_models$mirna$plot_AUC
cv.isb.splsdrcox_res$list_cv_spls_models$proteomic$plot_AUC

C:1BfP43083d9d5ab9-MO-pipeline.R

isb.splsdrcox_model <- mb.coxmos(method = "isb.splsdrcox", 
                                 X = X_train, Y = Y_train, cv.isb = cv.isb.splsdrcox_res,
                                 remove_non_significant = TRUE)

isb.splsdrcox_model

C:1BfP43083d9d5ab9-MO-pipeline.R

SB.sPLS-DACOX

SB.sPLS-DACOX has also been created based on the same principle of sPLS-DACOX. The cross-validation is employed to identify the optimal number of components and the number of variables, ensuring robust model optimization.

cv.sb.splsdacox_res <- cv.mb.coxmos(method = "sb.splsdacox", 
                                    X = X_train, Y = Y_train,
                                    max.ncomp = 2, vector = NULL,
                                    n_run = 1, k_folds = 5)

cv.sb.splsdacox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

An SB.sPLS-DACOX model will be generated using the optimal number of principal components and corresponding penalty values, based on the results obtained from cross-validation.

sb.splsdacox_model <- mb.coxmos(method = "sb.splsdacox", 
                                X = X_train, Y = Y_train, 
                                n.comp = 1, #cv.sb.splsdacox_res$opt.comp, 
                                vector = list("mirna" = 321, "proteomic" = 369), #cv.sb.splsdacox_res$opt.nvar,
                                remove_non_significant = T)
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.

sb.splsdacox_model
#> The method used is SB.sPLS-DACOX-Dynamic.
#> A total of 1 variables have been removed due to non-significance filter inside cox model.
#> Survival model:
#>                       coef exp(coef)  se(coef)  robust se        z     Pr(>|z|)
#> comp_1_proteomic 0.4428919  1.557204 0.1065671 0.09343293 4.740212 2.134948e-06

C:1BfP43083d9d5ab9-MO-pipeline.R

As in previous approaches, each omic/block was optimized to use the same number of components. However, another methodology, cv.isb.splsdacox, allows the selection of a different number of components and/or variables for each block.

cv.isb.splsdacox_res <- cv.mb.coxmos(method = "isb.splsdacox", 
                                     X = X_train, Y = Y_train,
                                     max.ncomp = 2, vector = NULL,
                                     n_run = 1, k_folds = 5)

cv.isb.splsdacox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

cv.isb.splsdacox_res$list_cv_spls_models$mirna$plot_AUC
cv.isb.splsdacox_res$list_cv_spls_models$proteomic$plot_AUC

C:1BfP43083d9d5ab9-MO-pipeline.R

isb.splsdacox_model <- mb.coxmos(method = "isb.splsdacox", 
                                 X = X_train, Y = Y_train, cv.isb = cv.isb.splsdacox_res,
                                 remove_non_significant = TRUE)

isb.splsdacox_model

C:1BfP43083d9d5ab9-MO-pipeline.R

MB.sPLS-DRCOX

To enhance the versatility and performance of our survival analysis methods, we have developed a full multi-block survival model by combining the sPLS-DRCOX methodology with the multi-block sPLS functions from the mixOmics R package.

In the creation of these methods, we utilized an heuristic variable selection approach along with the mixOmics algorithms for hyperparameter selection. The penalty is determined based on a vector of the number of variables to test. Users have the flexibility to specify a specific value for selecting a fixed number of variables. Alternatively, by setting the value to NULL, the heuristic process automatically selects the optimal number of variables.

Our method empowers users to define the minimum and maximum number of variables to consider, as well as the number of cutpoints to test between these limits. Through an iterative process, the algorithm identifies the optimal number of variables and further explores the performance of existing cutpoints compared to the selected optimal value.

This integration of sPLS-DRCOX and multi-block sPLS provides researchers with a powerful tool for conducting comprehensive multivariate survival analysis.

cv.mb.splsdrcox_res <- cv.mb.coxmos(method = "mb.splsdrcox", 
                                    X = X_train, Y = Y_train, 
                                    max.ncomp = 2, vector = NULL,
                                    MIN_NVAR = 10, MAX_NVAR = NULL, n.cut_points = 10, EVAL_METHOD = "AUC",
                                    n_run = 1, k_folds = 5)

cv.mb.splsdrcox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

After getting the cross-validation, the full model could be obtained by providing the optimized number of components and the list of the number of variables per omic.

mb.splsdrcox_model <- mb.coxmos(method = "mb.splsdrcox", 
                                X = X_train, Y = Y_train, 
                                n.comp = 1, #cv.mb.splsdrcox_res$opt.comp,
                                vector = list("mirna" = 10, "proteomic" = 369)) #cv.mb.splsdrcox_res$opt.nvar
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.
mb.splsdrcox_model
#> The method used is MB.sPLS-DRCOX-Dynamic.
#> Survival model:
#>                           coef exp(coef)     se(coef)    robust se          z
#> comp_1_mirna      2.336609e-07 1.0000002 3.622335e-07 3.462225e-07  0.6748866
#> comp_1_proteomic -5.074908e-02 0.9505171 5.972778e-02 6.766230e-02 -0.7500348
#>                   Pr(>|z|)
#> comp_1_mirna     0.4997478
#> comp_1_proteomic 0.4532338

C:1BfP43083d9d5ab9-MO-pipeline.R

In some cases, if any component is significant, the model will keep at least the one with the lower P-Value.

MB.sPLS-DACOX

We also have extended our methodological repertoire with the development of MB.sPLS-DACOX. Building upon the foundation of sPLS-DACOX, this approach incorporates the multi-block functions from the mixOmics R package, further enhancing its capabilities and performance.

cv.mb.splsdacox_res <- cv.mb.coxmos(method = "mb.splsdacox", 
                                    X = X_train, Y = Y_train, 
                                    max.ncomp = 2, vector = NULL,
                                    n_run = 1, k_folds = 5)

cv.mb.splsdacox_res

C:1BfP43083d9d5ab9-MO-pipeline.R

After getting the cross-validation, the full model could be obtain by passing the optimized number of components and the list of the number of variables per omic.

mb.splsdacox_model <- mb.coxmos(method = "mb.splsdacox", 
                                X = X_train, Y = Y_train, 
                                n.comp = 2, #cv.mb.splsdacox_res$opt.comp,
                                vector = list("mirna" = 1, "proteomic" = 93)) #cv.mb.splsdacox_res$opt.nvar
#> As we are working with a multiblock approach with 2 blocks, a maximum of 5 components can be used.
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.

mb.splsdacox_model
#> The method used is MB.sPLS-DACOX-Dynamic.
#> Survival model:
#>                           coef exp(coef)     se(coef)    robust se          z
#> comp_1_mirna      2.219494e-07 1.0000002 3.652928e-07 3.802056e-07  0.5837614
#> comp_2_mirna      3.117680e-07 1.0000003 3.664796e-07 3.303379e-07  0.9437851
#> comp_1_proteomic -5.026128e-02 0.9509809 6.575603e-02 7.353506e-02 -0.6835009
#> comp_2_proteomic  2.382137e-02 1.0241074 5.125087e-02 4.843720e-02  0.4917991
#>                   Pr(>|z|)
#> comp_1_mirna     0.5593808
#> comp_2_mirna     0.3452795
#> comp_1_proteomic 0.4942904
#> comp_2_proteomic 0.6228614

C:1BfP43083d9d5ab9-MO-pipeline.R

Model Validation

Survival Models’ Comparison

Next, we will analyze the results obtained from the multiple models to see which one provides the best predictions based on our data. To do this, we will use the test set that has not been used to train any model. However, in case that test data are not available, the train data can also be used.

Initially, we will compare the results for each of the methods according to the evaluator we desire. Additionaly, we must provide a list of the different models as well as the X and Y evaluation set.

When evaluating survival model results, we must indicate at which temporal points we want to perform the evaluation. As we already specified a NULL value for the times parameter in the cross-validation, we are going to let the algorithm compute them again.

lst_models <- list("SB.sPLS-ICOX" = sb.splsicox_model,
                   #"iSB.sPLS-ICOX" = isb.splsicox_model,
                   "SB.sPLS-DRCOX-Dynamic" = sb.splsdrcox_model,
                   #"iSB.sPLS-DRCOX-Dynamic" = isb.splsdrcox_model,
                   "SB.sPLS-DACOX-Dynamic" = sb.splsdacox_model,
                   #"iSB.sPLS-DACOX-Dynamic" = isb.splsdacox_model,
                   "MB.sPLS-DRCOX" = mb.splsdrcox_model,
                   "MB.sPLS-DACOX" = mb.splsdacox_model)

eval_results <- eval_Coxmos_models(lst_models = lst_models,
                                  X_test = X_test, Y_test = Y_test, 
                                  times = NULL)

C:1BfP43083d9d5ab9-MO-pipeline.R

We can print the results obtained in the console where we can see, for each of the selected methods, the training time and the time it took to be evaluated, as well as their AIC, C-Index and AUC metrics and at which time points it was evaluated.

eval_results
#> Evaluation performed for methods: SB.sPLS-ICOX, SB.sPLS-DRCOX-Dynamic, SB.sPLS-DACOX-Dynamic, MB.sPLS-DRCOX, MB.sPLS-DACOX.
#> SB.sPLS-ICOX:
#>  training.time: 0.1713
#>  evaluating.time: 0.0088
#>  AIC: 301.8495
#>  C.Index: 0.8538
#>  time: 377, 554.786, 732.572, 910.358, 1088.143, 1265.929, 1443.715, 1621.5, 1799.286, 1977.072, 2154.858, 2332.643, 2510.429, 2688.215, 2866
#>  AUC: 0.62239
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I. Brier: 0.2003
#> 
#> SB.sPLS-DRCOX-Dynamic:
#>  training.time: 0.0361
#>  evaluating.time: 0.0086
#>  AIC: 318.2304
#>  C.Index: 0.817
#>  time: 377, 554.786, 732.572, 910.358, 1088.143, 1265.929, 1443.715, 1621.5, 1799.286, 1977.072, 2154.858, 2332.643, 2510.429, 2688.215, 2866
#>  AUC: 0.57529
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I. Brier: 0.19103
#> 
#> SB.sPLS-DACOX-Dynamic:
#>  training.time: 0.0368
#>  evaluating.time: 0.0087
#>  AIC: 351.6075
#>  C.Index: 0.6749
#>  time: 377, 554.786, 732.572, 910.358, 1088.143, 1265.929, 1443.715, 1621.5, 1799.286, 1977.072, 2154.858, 2332.643, 2510.429, 2688.215, 2866
#>  AUC: 0.5214
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I. Brier: 0.20821
#> 
#> MB.sPLS-DRCOX:
#>  training.time: 0.0262
#>  evaluating.time: 0.0078
#>  AIC: 371.5634
#>  C.Index: 0.5335
#>  time: 377, 554.786, 732.572, 910.358, 1088.143, 1265.929, 1443.715, 1621.5, 1799.286, 1977.072, 2154.858, 2332.643, 2510.429, 2688.215, 2866
#>  AUC: 0.63332
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I. Brier: 0.17193
#> 
#> MB.sPLS-DACOX:
#>  training.time: 0.0337
#>  evaluating.time: 0.0138
#>  AIC: 373.6803
#>  C.Index: 0.5287
#>  time: 377, 554.786, 732.572, 910.358, 1088.143, 1265.929, 1443.715, 1621.5, 1799.286, 1977.072, 2154.858, 2332.643, 2510.429, 2688.215, 2866
#>  AUC: 0.61615
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I. Brier: 0.17102
#> 

C:1BfP43083d9d5ab9-MO-pipeline.R

However, we can also obtain graphical results where we can compare each method over time, as well as their average scores using the function plot_evaluation() or plot_evaluation.list() if multiple evaluators have been tested. The user can choose to plot the AUC for time prediction points or the Brier Score.

lst_eval_results_auc <- plot_evaluation(eval_results, evaluation = "AUC", pred.attr = "mean")
lst_eval_results_brier <- plot_evaluation(eval_results, evaluation = "IBS", pred.attr = "mean")

C:1BfP43083d9d5ab9-MO-pipeline.R

After generating the evaluation plot with Coxmos, an R object is returned containing two sublists. The first sublist contains time-dependent evaluation results for each method, along with a variant that includes the average performance values. The second sublist enables comparison of the mean performance across methods using statistical tests such as the T-test, Wilcoxon test, ANOVA, or Kruskal–Wallis test.

lst_eval_results_auc$lst_plots$lineplot.mean

lst_eval_results_auc$lst_plot_comparisons$anova

C:1BfP43083d9d5ab9-MO-pipeline.R

Computing Time Comparison (Cross-Validation Model’s time)

Another possible comparison is related to the computation times for cross-validation, model creation, and evaluation. In this case, cross-validations and methods are loaded.

lst_models_time <- list(#cv.sb.splsicox_res,
                        sb.splsicox_model,
                        #isb.splsicox_model,
                        #cv.sb.splsdrcox_res,
                        sb.splsdrcox_model,
                        #isb.splsdrcox_model,
                        #cv.sb.splsdacox_res,
                        sb.splsdacox_model,
                        #isb.splsdacox_model,
                        #cv.mb.splsdrcox_res,
                        mb.splsdrcox_model,
                        #cv.mb.splsdrcox_res,
                        mb.splsdacox_model,
                        eval_results)

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_time <- plot_time.list(lst_models_time, txt.x.angle = 90)

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_time

C:1BfP43083d9d5ab9-MO-pipeline.R

Evaluating the Proportional Hazards Assumption

A critical diagnostic step when using Cox models is evaluating whether the Proportional Hazards (PH) assumption holds for each predictor or component. This assumption underpins the validity of Cox regression results, and violations can lead to misleading inferences.

So the function plot_proportionalHazard() tries to answer the question: “Does my Cox model violate the fundamental proportional hazards assumption, and if so, which specific variables or components show time-dependent effects that could invalidate my results?”

The plot_proportionalHazard() function provides a visual and statistical assessment of this assumption. It leverages Schoenfeld residuals to test whether the hazard ratio associated with each variable remains constant over time. If patterns in the Schoenfeld residuals are systematic or show statistically significant deviations, it indicates that the effect of a variable changes over time during the follow-up period.

This is particularly important when working with dimension-reduced components, such as those generated from sPLS models. Because these components aggregate the effects of multiple variables, subtle time-dependent behavior from individual features may be masked, and only detectable through residual diagnostics.

The function can be applied to a single model using plot_proportionalHazard(), or to a list of models with plot_proportionalHazard.list(). The output plot shows, for each component or variable, the residual pattern and associated p-value from the proportional hazards test.

lst_ph_ggplot <- plot_proportionalHazard(lst_models$`SB.sPLS-ICOX`)

C:1BfP43083d9d5ab9-MO-pipeline.R

Variables or components with a significant p-value (typically p < 0.05) indicate a violation of the PH assumption.

lst_ph_ggplot

C:1BfP43083d9d5ab9-MO-pipeline.R

This visualization helps determine whether model refinement is needed. For instance, if the assumption is violated, one might:

  • Introduce time-dependent covariates.
  • Apply stratified models.
  • Use alternative modeling strategies better suited to non-proportional effects.

Model Interpretation

Based on the cross-validation results, we selected the sPLS-ICOX methodology as the most suitable and representative model for our data, since the Multi-Block approaches did not yield any significant components. We now proceed to study and interpret the results in terms of the selected study variables or the latent components derived from the model.

Forest Plots for Cox Model Interpretation

The plot_forest() function provides a concise and interpretable summary of the effects estimated by a Cox model. It visualizes Hazard Ratios (HRs) and their confidence intervals for each predictor or component, allowing for rapid identification of statistically significant associations and the directionality of their effects.

Forest plots are especially useful in models derived from Cox, but plot_forest() function was updated to work with all Coxmos approaches, where each component represents a latent combination of variables. The function automatically handles these structures, labeling components appropriately and ensuring that effect sizes can be meaningfully compared across components or even between models.

For each variable or component, the plot displays:

  • The hazard ratio (HR), quantifying the strength and direction of association with the event.
  • Confidence intervals, which convey the precision of the estimate.
  • P-values, to assess statistical significance (typically, p < 0.05 indicates a significant effect).

Interpretation is straightforward: components with HR > 1 are associated with increased risk (right-sided bars), while those with HR < 1 are associated with reduced risk (left-sided bars). If the confidence interval crosses HR = 1, the effect is considered non-significant.

This visualization bridges statistical output and clinical interpretation, helping to prioritize features or components that have both statistical and biological relevance.

lst_forest_plot <- plot_forest(lst_models$`SB.sPLS-ICOX`)

C:1BfP43083d9d5ab9-MO-pipeline.R

lst_forest_plot

C:1BfP43083d9d5ab9-MO-pipeline.R

Predicted Value Distribution by Event Status

Another type of visualization implemented for all supported models, whether classical or sPLS-based, is the distribution of predicted values by event status. These plots are useful for assessing how well the model separates patients who experienced the event from those who did not, based on the model’s output.

The function plot_cox.event() (or plot_cox.event.list() for multiple models) enables the generation of density and histogram plots using predictions derived from the fitted Cox model. Internally, the function leverages the prediction types provided by the coxph function in the survival package. The main types include:

  • lp (linear predictors): the log-risk scores for each observation based on the model’s coefficients. These reflect the relative risk associated with an individual’s covariate profile.
  • risk: the relative risk of experiencing the event, obtained by exponentiating the linear predictor.
  • expected: the expected number of events for each subject over time, given their covariate profile.
  • terms: the contribution of each variable to the linear predictor, shown separately.
  • survival: the estimated probability of survival at a given time point.

By selecting the appropriate prediction type, the function produces plots that show the distribution of predicted values stratified by event status. These plots allow for visual evaluation of the model’s discriminative ability, indicate that the model is effectively distinguishing between censored and event observations.

density.plots.lp <- plot_cox.event(lst_models$`SB.sPLS-ICOX`, type = "lp")

C:1BfP43083d9d5ab9-MO-pipeline.R

density.plots.lp$plot.density

density.plots.lp$plot.histogram

C:1BfP43083d9d5ab9-MO-pipeline.R

Variable or Component Contributions in Coxmos Models

How much do my components (SB/MB.sPLS-COX) contribute to the full model?

Understanding the internal structure of sPLS-COX models is essential for translating their predictive power into biological insight. The eval_Coxmos_model_per_variable() function enables an evaluation of how individual variables or latent components contribute to model performance.

A central goal of this analysis is to assess component-level predictive performance, both in training and test datasets. The function decomposes the model’s AUC by component, allowing users to determine which components carry the most prognostic information. Comparing individual component AUCs against the full model’s linear predictor helps identify whether some components are particularly dominant or redundant.

From a validation standpoint, the function supports the identification of overfitting risk. Components that perform well in training data but drop in test performance may lack generalizability. Conversely, components that consistently perform across datasets represent robust predictors worthy of further investigation.

The eval_Coxmos_model_per_variable() function provides:

  • AUC decomposition by component, for both training and test sets
  • Visual comparison between individual component performance and the full linear predictor
  • Quantitative indicators of overfitting through cross-dataset validation

In terms of interpretation, components with high AUC values can be considered the primary drivers of the model. If an individual component outperforms the full model (linear predictor), it may represent potential overfitting, and should be considered in model refinement.

variable_auc_results <- eval_Coxmos_model_per_variable(model = lst_models$`SB.sPLS-ICOX`, 
                                                      X_test = lst_models$`SB.sPLS-ICOX`$X_input, 
                                                      Y_test = lst_models$`SB.sPLS-ICOX`$Y_input)

variable_auc_plot_train <- plot_evaluation(variable_auc_results, evaluation = "AUC")

C:1BfP43083d9d5ab9-MO-pipeline.R

variable_auc_plot_train$lst_plots$lineplot.mean

C:1BfP43083d9d5ab9-MO-pipeline.R

Visualizing sPLS Models for Survival Analysis

The plot_sPLS_Coxmos() function provides a visualization framework for exploring sPLS components in the context of survival analysis. One of its capabilities is the visual detection of outliers, which can significantly impact model interpretation and performance.

Through score plots, users can visually inspect how samples are distributed across the latent component space. Samples that appear distant from the main cluster may indicate potential outliers — observations with unique profiles that could disproportionately influence the model. Identifying such samples early allows for more informed decisions about data quality, potential stratification, or exclusion.

In addition to outlier detection, the score plots also highlight sample clustering patterns. They help determine whether patients group naturally according to event status and whether component axes capture meaningful clinical variation. A clear separation between event groups supports the relevance of the extracted components.

Loading plots complement this view by showing the variables that contribute most to each component. They can reveal which features are driving sample positioning in the score space and whether any variables have excessive influence.

The biplot integrates both sample and variable information, allowing users to examine relationships between extreme samples and specific variables.

The plot_sPLS_Coxmos() function generates graphical representations of sPLS components from any Coxmos sPLS-based survival model. It supports three types of plots depending on the mode selected:

  • mode = "scores": displays sample projections onto the latent components (score plot).
  • mode = "loadings": shows variable contributions to the selected components (loading plot).
  • mode = "biplot": combines both sample and variable information in a single view.
ggp_scores <- plot_sPLS_Coxmos(model = lst_models$`SB.sPLS-ICOX`,
                               comp = c(1,2), mode = "scores")

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_scores
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_loadings <- plot_sPLS_Coxmos(model = lst_models$`SB.sPLS-ICOX`, 
                                 comp = c(1,2), mode = "loadings",
                                 top = 10)

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_loadings
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_biplot <- plot_sPLS_Coxmos(model = lst_models$`SB.sPLS-ICOX`, 
                               comp = c(1,2), mode = "biplot",
                               top = 15,
                               only_top = T,
                               overlaps = 20)

C:1BfP43083d9d5ab9-MO-pipeline.R

ggp_biplot
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

Interpreting original variables in sPLS Components: Pseudobeta Coefficients

Which original variables drive my sPLS model’s survival predictions, and what direction of effect (protective or risk) do they contribute to each component?

The plot_pseudobeta() function allows for a detailed inspection of how individual variables influence the components of a sPLS-based survival model. This visualization provides a transparent and biologically interpretable summary of variable-level contributions, capturing both the magnitude and direction of effect for each feature.

By decomposing each component into pseudobeta coefficients, it becomes possible to rank variables based on their impact on the linear predictor (LP) and to assess whether their effects are consistent across components. The directionality—whether a variable acts as a protective or risk factor—is visualized through the orientation of the bars, even when components aggregate multiple opposing effects.

This type of analysis is particularly valuable for biological interpretation, as it links abstract components back to measurable variables. It also reveals whether known biomarkers align with expected effects or whether new candidate variables emerge as strong contributors to survival risk.

The function includes several customizable options:

  • onlySig = TRUE filters components based on statistical significance (by default at alpha = 0.05), reducing variable noise contribution and emphasizing robust effects.
  • top defines the number of top variables to display (e.g., top = 20 for overview; top = NULL for full detail).
  • The percentage of total LP explained by the displayed variables is included, providing quantitative context for their contribution to the model.
ggp.simulated_beta <- plot_pseudobeta(model = lst_models$`SB.sPLS-ICOX`, 
                                      error.bar = T, onlySig = F, alpha = 0.05, 
                                      zero.rm = T, auto.limits = T, top = 20,
                                      show_percentage = T, size_percentage = 2)
#> For iSB.sPLS-ICOX and SB.sPLS-ICOX model, pseudobetas are an approximation as predictions are obtained through a deflation process.

C:1BfP43083d9d5ab9-MO-pipeline.R

The iSB.sPLS-DRCOX model was computed using a total of 2 components. Although these components were classified as dangerous for the observations (based on coefficients greater than one), certain variables within the components may still have a protective effect, depending on their individual weights.

The following plot illustrates the pseudo-beta coefficient for the original variables in the iSB.sPLS-DRCOX model. As only the top 20 variables are shown, in some case a percentage of the total linear predictor total value could be shown. To view the complete model, all variables would need to be included in the plot by assigning the value NULL to the “top” parameter.

ggp.simulated_beta$plot
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

In addition, for MB datasets, all omics are combined into a single plot. In this case, the relative pseudobeta per variable is shown. The user must take into account that each omic has a different survival beta, which influences the relative importance of each original variable. In this case, only proteomic variables are shown.

ggp.simulated_beta$mb_plot$plot

C:1BfP43083d9d5ab9-MO-pipeline.R

Kaplan-Meier Curves for Visualizing Survival Differences

Kaplan-Meier (KM) curves are a fundamental tool in survival analysis. They estimate the probability of survival over time and allow visual comparison between different patient groups. When used in conjunction with a risk prediction model, such as the Cox model, KM curves help assess whether the predicted risk groups truly differ in their survival outcomes. Typically, the survival probability is plotted against time, and differences between curves indicate the effectiveness of the model in stratifying risk.

In this context, the getAutoKM() and getAutoKM.list() functions provide an automated way to generate KM plots from models. These functions can be used to analyze the survival curves based on:

  • The model’s linear predictor (type = "LP"),
  • Specific sPLS components (type = "COMP"), or
  • Original variables (type = "VAR").

Full Model-Based Stratification

To generate Kaplan-Meier curves using the full linear predictor, set the type parameter to "LP". This uses the value of the linear predictor (a combination of component scores and Cox coefficients) to divide patients into high- and low-risk groups. The optimal cut-point is computed using the surv_cutpoint() function from the survminer R package. Other parameters, such as comp or top, are ignored in this configuration.

LST_KM_RES_LP <- getAutoKM(type = "LP",
                           model = lst_models$`SB.sPLS-ICOX`)

C:1BfP43083d9d5ab9-MO-pipeline.R

This function returns a list containing both numerical information (cut-points, significance levels) and the corresponding KM plots.

To display the Kaplan-Meier curve associated with the full linear predictor:

LST_KM_RES_LP$LST_PLOTS$LP

C:1BfP43083d9d5ab9-MO-pipeline.R

The curve illustrates how effectively the model separates patients into groups with statistically different survival probabilities. The log-rank test p-value indicates whether the observed differences are significant.

This approach provides a straightforward and interpretable method to validate the prognostic power of a model’s output. By transforming risk predictions into time-dependent survival probabilities, it bridges the gap between statistical modeling and clinical decision-making.

Component-Level Kaplan-Meier Model-Based Stratification

In sPLS-based models, each latent component contributes to the model’s predictive power. Setting type = "COMP" allows the user to evaluate the prognostic value of each component separately through KM curves. This helps identify whether a single component is sufficient to stratify patients effectively, or if multiple components are necessary.

The comp parameter selects which components to include in the analysis. If multiple components are specified, a separate KM plot will be generated for each.

LST_KM_RES_COMP <- getAutoKM(type = "COMP",
                             model = lst_models$`SB.sPLS-ICOX`,
                             comp = 1:2)

C:1BfP43083d9d5ab9-MO-pipeline.R

LST_KM_RES_COMP$LST_PLOTS$mirna$comp_1

LST_KM_RES_COMP$LST_PLOTS$proteomic$comp_2

C:1BfP43083d9d5ab9-MO-pipeline.R

Variable-Level Kaplan-Meier Curves

Finally, by setting type = "VAR", Kaplan-Meier curves are generated for individual variables from the original dataset. This approach is valuable when:

You want to examine how specific variables behave independently of the sPLS model.

You need to validate the clinical relevance of original features in survival stratification.

This method uses ori_data = TRUE to determine cutpoints from raw (unscaled) variable values, ensuring interpretability in clinical settings. The top parameter controls how many top-ranked variables to include, and the function will automatically compute log-rank p-values to filter only significant results if only_sig = TRUE.

LST_KM_RES_VAR <- getAutoKM(type = "VAR",
                            model = lst_models$`SB.sPLS-ICOX`,
                            top = 10,
                            ori_data = T, #original data selected
                            only_sig = T, alpha = 0.05)

C:1BfP43083d9d5ab9-MO-pipeline.R

LST_KM_RES_VAR$LST_PLOTS$mirna$hsa.minus.miR.minus.491.minus.3p

LST_KM_RES_VAR$LST_PLOTS$proteomic$var_5080

C:1BfP43083d9d5ab9-MO-pipeline.R

Prediction and Evaluation on New Observations

The framework also supports prediction and interpretation for new patients, offering tools to visualize how a new observation compares to the training data in terms of risk and variable importance.

Single Patient Analysis

To illustrate this functionality, we select a censored observation from the test dataset:

new_pat <- X_test
new_pat$mirna <- new_pat$mirna[1,,drop=F]
new_pat$proteomic <- new_pat$proteomic[1,,drop=F]

C:1BfP43083d9d5ab9-MO-pipeline.R

This observation corresponds to a censored patient with the last follow-up recorded at time :

knitr::kable(Y_test[rownames(new_pat$mirna),])
#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")
time event
TCGA-A2-A0SV-01A 825 TRUE

C:1BfP43083d9d5ab9-MO-pipeline.R

Predicted Value Distribution for new observations

Additionally, the new observation can be added to the density and histogram plots that the model has computed. While this function can be useful in visualizing the results, it is important to note that no definitive conclusions can be drawn from these types of plots alone. They serve as an additional means of viewing the predictive values for the new patient (“lp”, “risk”, “expected”, “terms”, “survival”) in relation to the training dataset and its histogram or density plot, along with the events or censored patients.

pat_density <- plot_observation.eventDensity(observation = new_pat,
                                             model = lst_models$`SB.sPLS-ICOX`,
                                             type = "lp")

C:1BfP43083d9d5ab9-MO-pipeline.R

pat_density

C:1BfP43083d9d5ab9-MO-pipeline.R

pat_histogram <- plot_observation.eventHistogram(observation = new_pat, 
                                                 model = lst_models$`SB.sPLS-ICOX`, 
                                                 type = "lp")

C:1BfP43083d9d5ab9-MO-pipeline.R

pat_histogram

C:1BfP43083d9d5ab9-MO-pipeline.R

Pseudobeta’s Consistenc for new observation

The function plot_observation.pseudobeta() allows comparison between the new patient’s variable profile and the pseudo-beta coefficients derived from a trained model. This comparison highlights variables contributing to increased or decreased predicted risk, relative to the average values in the training set.

ggp.simulated_beta_newObs <- plot_observation.pseudobeta(model = lst_models$`SB.sPLS-ICOX`,
                                                         observation = new_pat,
                                                         error.bar = T, onlySig = T, alpha = 0.05,
                                                         zero.rm = T, auto.limits = T, show.betas = T, top = 20,
                                                         txt.x.angle = 90)
#> For iSB.sPLS-ICOX and SB.sPLS-ICOX model, pseudobetas are an approximation as predictions are obtained through a deflation process.

C:1BfP43083d9d5ab9-MO-pipeline.R

The resulting figure shows:

  • On the left: the linear predictor contribution from each variable in the selected patient.
  • On the right: the model’s pseudo-beta values for the same variables.

This helps identify whether variables behave in a risk-aligned or protective direction. For example, a model variable with a negative pseudo-beta (protective) whose patient value is in the opposite direction (above the mean) contributes to a risk profile.

ggp.simulated_beta_newObs$plot
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

Multiple new observation comparison

Variable or Component Contributions for Test Data

The function eval_Coxmos_model_per_variable() enables the evaluation of how individual variables or latent components contribute to model performance. This evaluation can also be applied to new observations, such as a test dataset, by specifying the and arguments.

variable_auc_results_test <- eval_Coxmos_model_per_variable(model = lst_models$`SB.sPLS-ICOX`, 
                                                       X_test = X_test, 
                                                       Y_test = Y_test)

variable_auc_plot_test <- plot_evaluation(variable_auc_results_test, evaluation = "AUC")

C:1BfP43083d9d5ab9-MO-pipeline.R

variable_auc_plot_test$lst_plots$lineplot.mean

C:1BfP43083d9d5ab9-MO-pipeline.R

In this example, component 2 achieves better performance than the full linear predictor (LP) model, which could indicate possible overfitting. However, it also exhibits a larger standard deviation.

Note: This vignette uses a toy dataset, and the results should not be interpreted in a biological context.

Multiple Linear Predictors Study

Furthermore, LP plots, similar to those for a single observations, can be generated for multiple patients at the same time. For example, by selecting 5 patients.

lst_observations <- list()
for(b in names(X_test)){
  lst_observations[[b]] <- X_test[[b]][1:5,]
}

C:1BfP43083d9d5ab9-MO-pipeline.R

knitr::kable(Y_test[rownames(lst_observations$mirna),])
#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")

#> Warning: 'xfun::attr()' is deprecated.
#> Use 'xfun::attr2()' instead.
#> See help("Deprecated")
time event
TCGA-A2-A0SV-01A 825 TRUE
TCGA-A2-A0YT-01A 723 TRUE
TCGA-BH-A1FB-01A 3669 TRUE
TCGA-D8-A1XC-01A 377 TRUE
TCGA-BH-A1EX-01A 1508 TRUE

C:1BfP43083d9d5ab9-MO-pipeline.R

This section allows the visualization of the individual linear predictor (LP) values for each variable, along with the total LP calculated using all variables included in the survival model.

A potential application of this function is to analyze the same patient at different time points, for example, after receiving treatment, and track its evolution over time with respect to the original variables.

To generate the figure, use the plot_multipleObservations.LP.list() or plot_multipleObservations.LP() function.

lst_cox.comparison <- plot_multipleObservations.LP(model = lst_models$`SB.sPLS-ICOX`,
                                                   observations = lst_observations,
                                                   top = 10)
#> For iSB.sPLS-ICOX and SB.sPLS-ICOX model, pseudobetas are an approximation as predictions are obtained through a deflation process.

C:1BfP43083d9d5ab9-MO-pipeline.R

lst_cox.comparison$plot
#> $mirna

#> 
#> $proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

In the resulting plot, the LP contribution of each variable is displayed for each observation, as well as the total LP. This visualization helps identify which observations carry higher predicted risk compared to others.

Kaplan-Meier Curves on Test Set

To evaluate the model’s performance on unseen data, the cutoffs derived from getAutoKM() can be used to stratify the test dataset using Kaplan-Meier analysis. This approach helps determine how well the model separates patients into groups with different survival probabilities.

Test KM Based on Linear Predictor (LP)

This modality uses the final linear predictor (LP), which combines all variables or components weighted by their estimated coefficients, to classify patients into high- and low-risk groups. The cutoff value, usually determined using the training data, is applied to the test data to stratify patients.

Once the cutoff is retrieved using getCutoffAutoKM(), the getTestKM() function applies it to the test set and generates the survival curves. A log-rank test is performed to assess whether the survival distributions between the groups are significantly different, indicating that the model has learned meaningful risk stratification.

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_LP)
LST_KM_TEST_LP <- getTestKM(model = lst_models$`SB.sPLS-ICOX`, 
                            X_test = X_test, Y_test = Y_test, 
                            type = "LP",
                            BREAKTIME = NULL, n.breaks = 20,
                            cutoff = lst_cutoff)

C:1BfP43083d9d5ab9-MO-pipeline.R

LST_KM_TEST_LP

C:1BfP43083d9d5ab9-MO-pipeline.R

Test KM Based on Components (COMP)

This option evaluates the contribution of individual latent components from the model. Each component represents a compressed combination of input variables designed to maximize the model’s predictive power.

By applying a cutoff to each component separately, you can assess whether one or more components independently capture significant survival-related structure. This is particularly useful in multivariate models, where understanding the relevance of each component can guide dimensionality reduction or biological interpretation.

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_COMP)
LST_KM_TEST_COMP <- getTestKM(model = lst_models$`SB.sPLS-ICOX`, 
                              X_test = X_test, Y_test = Y_test, 
                              type = "COMP",
                              BREAKTIME = NULL, n.breaks = 20,
                              cutoff = lst_cutoff)

C:1BfP43083d9d5ab9-MO-pipeline.R

LST_KM_TEST_COMP$comp_1_mirna

LST_KM_TEST_COMP$comp_2_proteomic

C:1BfP43083d9d5ab9-MO-pipeline.R

Test KM Based on Original Variables (VAR)

This modality focuses on the original input variables, rather than derived components or LP. It enables a more granular analysis by evaluating whether each variable, considered individually, can stratify the test data into groups with different survival outcomes.

This is helpful when interpreting the biological or clinical relevance of variables, as it highlights those that retain predictive value outside the multivariate context. The ori_data = TRUE flag ensures that the model works directly with the original variable scales.

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_VAR)
LST_KM_TEST_VAR <- getTestKM(model = lst_models$`SB.sPLS-ICOX`, 
                             X_test = X_test, Y_test = Y_test, 
                             type = "VAR", ori_data = T,
                             BREAKTIME = NULL, n.breaks = 20,
                             cutoff = lst_cutoff)

C:1BfP43083d9d5ab9-MO-pipeline.R

LST_KM_TEST_VAR$mirna$hsa.minus.miR.minus.491.minus.3p

LST_KM_TEST_VAR$proteomic$var_5080

C:1BfP43083d9d5ab9-MO-pipeline.R

In this case, although the variables were significant in the training data, they shows similar behavior across both groups in the test data.

Note: This vignette uses a toy dataset, and the results should not be interpreted in a biological context.