Greybox main vignette
Ivan Svetunkov
2025-06-30
There are three well-known notions of “boxes” in modelling:
1. White box - the model that is completely transparent and does not
have any randomness. One can see how the inputs are transformed into
the specific outputs.
2. Black box - the model which does not have an apparent structure. One
can only observe inputs and outputs but does not know what happens
inside.
3. Grey box - the model that is in between the first two. We observe
inputs and outputs and have some information about the structure of the
model, but part of it is still unknown.
The white boxes are usually used in optimisations (e.g. linear
programming), while black boxes are popular in machine learning. As for
the grey box models, they are more often used in analysis and
forecasting. So the package greybox contains models that
are used for these purposes.
At the moment the package contains the augmented linear model function
and several basic functions that implement model selection and
combinations using information criteria (IC). You won’t find statistical
tests in this package - there are plenty of them in other packages.
Here we try to use modern techniques and methods that do not rely on
hypothesis testing. This is the main philosophical point of
greybox.
Main functions
The package includes the following functions for model
construction:
- alm() - Augmented Linear Model. This is
something similar to GLM, but with a focus on forecasting and the
use of information criteria for time series. It also supports mixture
distribution models for intermittent data and allows adding a trend to
the data via the formula;
- stepwise() - selects the linear model with
the lowest IC from all the possible ones for the provided data. Uses
partial correlations. Works fast;
- lmCombine() - combines the linear models
into one using IC weights;
- lmDynamic() - produces a model with dynamic
weights and time varying parameters based on point IC weights.
See discussion of some of these functions in this vignette below.
 
Models evaluation functions
- ro() - produce forecasts with a specified
function using rolling origin.
- measures() - a function returning a set of error
measures for the provided forecast and the holdout sample.
- rmcb() - regression on ranks of forecasting methods.
This is a fast alternative to the classical Nemenyi / MCB test.
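To make the rolling origin idea behind ro() more concrete, here is a minimal base-R sketch. It does not call ro(): the helper name rollingOriginSketch and the naive (last value) forecast are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of greybox): evaluate a naive forecast
# from several consecutive forecast origins.
rollingOriginSketch <- function(y, h = 5, origins = 10) {
  obs <- length(y)
  # Each row holds the h forecast errors produced from one origin
  errors <- matrix(NA_real_, nrow = origins, ncol = h,
                   dimnames = list(paste0("origin", seq_len(origins)),
                                   paste0("h", seq_len(h))))
  for (i in seq_len(origins)) {
    # Index of the last in-sample observation for this origin
    origin <- obs - h - origins + i
    insample <- y[seq_len(origin)]
    holdout <- y[origin + seq_len(h)]
    # Naive forecast: repeat the last in-sample value h times
    forecasts <- rep(insample[origin], h)
    errors[i, ] <- holdout - forecasts
  }
  errors
}

roErrors <- rollingOriginSketch(as.numeric(BJsales), h = 5, origins = 10)
# Mean absolute error for each horizon, averaged over the origins
colMeans(abs(roErrors))
```

Each origin moves the end of the in-sample one observation forward, so the same method is evaluated on several overlapping holdouts rather than on a single one.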
 
Methods
The following methods can be applied to the models, produced by
alm(), stepwise(), lmCombine()
and lmDynamic():
- logLik() - extracts the log-likelihood.
- AIC(), AICc(), BIC(), BICc() - calculate the respective information
criteria.
- pointLik() - extracts the point likelihood.
- pAIC(), pAICc(), pBIC(), pBICc() - calculate the respective point
information criteria, based on pointLik.
- actuals() - extracts the actual values of the response
variable.
- coefbootstrap() - produces bootstrapped values of
parameters, taking nsim samples of the size size from the data and
reapplying the model.
- coef(), coefficients() - extract the
parameters of the model.
- confint() - extracts the confidence intervals for the
parameters.
- vcov() - extracts the variance-covariance matrix of the
parameters.
- sigma() - extracts the standard deviation of the
residuals.
- nobs() - the number of the in-sample observations of
the model.
- nparam() - the number of all the estimated parameters
in the model.
- nvariate() - the number of variates (columns /
dimensions) of the response variable.
- summary() - produces the summary of the model.
- predict() - produces the predictions based on the model
and the provided newdata. If newdata is
not provided, then it uses the already available data in the model. Can
also produce confidence and prediction intervals.
- forecast() - acts similarly to predict() with a few differences. It
has a parameter h - the forecast horizon - which is NULL by default and
is set to be equal to the number of rows in newdata. However, if
newdata is not provided, then it will produce forecasts of
the explanatory variables to the horizon h and use them as newdata.
Finally, if both h and newdata are provided, then the number of rows to
use will be regulated by h.
- plot() - produces several plots for the analysis of the
residuals. This includes: Fitted over time, Standardised residuals vs
Fitted, Absolute residuals vs Fitted, Q-Q plot with the specified
distribution, Squared residuals vs Fitted, ACF of the residuals and PACF
of the residuals, which is regulated by the which parameter.
See documentation for more info: ?plot.greybox.
- detectdst() and detectleap() - methods
that return the ids of the hour / date for the DST / leap year
change.
- extract() - a method needed in order to produce printable
regression outputs using the texreg() function from the texreg package.
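As an illustration of how several of these methods fit together, here is a short sketch on the mtcars data that ships with R. The choice of dataset and variables is an assumption made for the example, and the interval argument of predict() is written as I understand it from the package documentation, so check ?predict.alm if it does not behave as expected.

```r
library(greybox)

# A model with the Normal distribution on a dataset that ships with R
ourExample <- alm(mpg ~ wt + hp, data = mtcars, distribution = "dnorm")

logLik(ourExample)   # log-likelihood
AICc(ourExample)     # corrected AIC
nobs(ourExample)     # number of in-sample observations
nparam(ourExample)   # number of estimated parameters
coef(ourExample)     # estimated parameters of the model

# Point predictions with prediction intervals for the first five rows
ourPrediction <- predict(ourExample, mtcars[1:5, ], interval = "prediction")
ourPrediction
```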
 
Distribution functions
- qlaplace(), dlaplace(), rlaplace(), plaplace() - functions for the
Laplace distribution.
- qalaplace(), dalaplace(), ralaplace(), palaplace() - functions for
the Asymmetric Laplace distribution.
- qs(), ds(), rs(), ps() - functions for the S distribution.
- qgnorm(), dgnorm(), rgnorm(), pgnorm() - functions for the
Generalised normal distribution.
- qfnorm(), dfnorm(), rfnorm(), pfnorm() - functions for the folded
normal distribution.
- qtplnorm(), dtplnorm(), rtplnorm(), ptplnorm() - functions for the
three parameter log normal distribution.
- qbcnorm(), dbcnorm(), rbcnorm(), pbcnorm() - functions for the
Box-Cox normal distribution.
- qlogitnorm(), dlogitnorm(), rlogitnorm(), plogitnorm() - functions
for the Logit-normal distribution.
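The q / d / r / p prefixes follow the usual R convention: quantile function, density, random generator and cumulative distribution function. As a minimal sketch of what such a family looks like, here is a base-R version for the Laplace distribution with location mu and scale. The function names with the Sketch suffix are hypothetical, and the parametrisation is an assumption, so the package's own functions may differ in details.

```r
# Density: exp(-|q - mu| / scale) / (2 * scale)
dlaplaceSketch <- function(q, mu = 0, scale = 1) {
  exp(-abs(q - mu) / scale) / (2 * scale)
}

# Cumulative distribution function, defined piecewise around mu
plaplaceSketch <- function(q, mu = 0, scale = 1) {
  ifelse(q < mu,
         0.5 * exp((q - mu) / scale),
         1 - 0.5 * exp(-(q - mu) / scale))
}

# Quantile function: the inverse of the CDF above
qlaplaceSketch <- function(p, mu = 0, scale = 1) {
  mu - scale * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
}

# Random generator via the inverse transform method
rlaplaceSketch <- function(n, mu = 0, scale = 1) {
  qlaplaceSketch(runif(n), mu, scale)
}

# The quantile function and the CDF should undo each other
plaplaceSketch(qlaplaceSketch(c(0.1, 0.5, 0.9)))
```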
 
Additional functions
- graphmaker() - produces linear plots for the variable,
its forecasts and fitted values.
 
xregExpander
The function xregExpander() is useful in cases when the
exogenous variable may influence the response variable either via some
lags or leads. As an example, consider the BJsales.lead series
from the datasets package. Let’s assume that the
BJsales variable is driven by today’s value of the
indicator and by its values five and ten days ago. This means that we
need to produce lags of BJsales.lead. This can be done using
xregExpander():
BJxreg <- xregExpander(BJsales.lead,lags=c(-5,-10))
BJxreg is a matrix, which contains the original
data, the data with lag 5 and the data with lag 10. However, if
we just move the original data several observations ahead or backwards,
we will have missing values at the beginning / end of the series, so
xregExpander() fills in those values with forecasts
produced by the es() and iss() functions from the
smooth package (depending on the type of variable we are
dealing with). This also means that in cases of binary variables you may
get odd averaged values as forecasts (e.g. 0.7812), so beware and
look at the produced matrix. Maybe in your case it makes sense to just
substitute these numbers with zeroes…
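To see where the missing values come from, here is a minimal base-R sketch that only shifts the series and leaves the gaps as NA. The helper lagSeries is hypothetical and is not how xregExpander() is implemented. It just shows the gaps that the function then fills in.

```r
# Hypothetical helper (not part of greybox): lag a series by k observations,
# so that the value at time t is the original value at time t - k
lagSeries <- function(x, k) {
  c(rep(NA, k), x[seq_len(length(x) - k)])
}

x <- as.numeric(BJsales.lead)
xLagged <- cbind(x = x, xLag5 = lagSeries(x, 5), xLag10 = lagSeries(x, 10))

# The first rows of the lagged columns are missing
head(xLagged, 12)
# 0, 5 and 10 missing values in the three columns respectively
colSums(is.na(xLagged))
```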
You may also need leads instead of lags. This is regulated with the
same lags parameter but with positive values:
BJxreg <- xregExpander(BJsales.lead,lags=c(7,-5,-10))
Once again, the values are shifted, and now the first 7 values are
backcasted. In order to simplify things we can produce all the values
from 10 lags to 10 leads, which returns a matrix with 21
variables:
BJxreg <- xregExpander(BJsales.lead,lags=c(-10:10))
 
stepwise
The function stepwise() does the selection based on an information
criterion (specified by the user) and partial correlations. In order to
run this function the response variable needs to be in the first column
of the provided matrix. The idea of the function is simple. It works
iteratively in the following way:
1. The basic model of the first variable (the response) and the
constant is constructed (this corresponds to the simple mean). An
information criterion is calculated;
2. The correlations of the residuals of the model with all the original
exogenous variables are calculated;
3. The regression model of the response variable and all the variables
in the previous model plus the new most correlated variable from (2) is
constructed using the lm() function;
4. An information criterion is calculated and is compared with the one
from the previous model. If it is greater than or equal to the previous
one, then we stop and use the previous model. Otherwise we go to step 2.
This way we do not do a blind search, going forward or backwards, but
we follow some sort of “trace” of a good model: if the residuals contain
a significant part of variance that can be explained by one of the
exogenous variables, then that variable is included in the model.
Following partial correlations makes sure that we include only
meaningful (from a technical point of view) variables in the model. In
general the function guarantees that you will have the model with the
lowest information criterion. However, this does not guarantee that you
will end up with a meaningful model or with a model that produces the
most accurate forecasts. So analyse what you get as a result.
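The iterative procedure described above can be sketched in base R as follows. This is a simplified illustration and not the code of stepwise(): the function name stepwiseSketch, the use of AIC() as the default criterion and the simulated data are all assumptions made for the example.

```r
# Hypothetical helper (not part of greybox): forward selection that follows
# the correlations between the residuals and the remaining variables.
stepwiseSketch <- function(data, ic = AIC) {
  response <- colnames(data)[1]
  candidates <- colnames(data)[-1]
  selected <- character(0)
  # Step 1: the model with the constant only
  bestModel <- lm(reformulate("1", response = response), data = data)
  bestIC <- ic(bestModel)
  while (length(candidates) > 0) {
    # Step 2: correlations of the residuals with the remaining variables
    correlations <- abs(cor(residuals(bestModel),
                            data[, candidates, drop = FALSE]))
    newVariable <- candidates[which.max(correlations)]
    # Step 3: add the most correlated variable to the model
    newModel <- lm(reformulate(c(selected, newVariable), response = response),
                   data = data)
    newIC <- ic(newModel)
    # Step 4: stop if the information criterion has not decreased
    if (newIC >= bestIC) break
    selected <- c(selected, newVariable)
    candidates <- setdiff(candidates, newVariable)
    bestModel <- newModel
    bestIC <- newIC
  }
  bestModel
}

# Simulated data: y depends on x1 and x2, while noise is irrelevant
set.seed(41)
n <- 100
simulated <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
simulated <- cbind(y = 2 + 3 * simulated$x1 - 2 * simulated$x2 + rnorm(n),
                   simulated)
sketchModel <- stepwiseSketch(simulated)
names(coef(sketchModel))
```

Note that the sketch never revisits a variable once it has been added, which is what keeps the search fast compared to trying all combinations.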
Let’s see how the function works with the Box-Jenkins data. First we
expand the data and form the matrix with all the variables:
BJxreg <- as.data.frame(xregExpander(BJsales.lead,lags=c(-10:10)))
BJxreg <- cbind(as.matrix(BJsales),BJxreg)
colnames(BJxreg)[1] <- "y"
This way we have a data frame with short, readable variable names
rather than long automatically generated ones. It is important to note
that the response variable should be in the first column of the
resulting matrix. After that we use the stepwise() function:
ourModel <- stepwise(BJxreg)
And here’s what it returns (the object of class lm):
ourModel
#> Time elapsed: 0.08 seconds
#> Call:
#> alm(formula = y ~ xLag4 + xLag9 + xLag3 + xLag10 + xLag5 + xLead9 + 
#>     xLag6 + xLag7 + xLag8, data = data, distribution = "dnorm")
#> 
#> Coefficients:
#> (Intercept)       xLag4       xLag9       xLag3      xLag10       xLag5 
#>  18.0199167   3.3525614   1.3795732   4.6442650   1.5498201   2.3144543 
#>      xLead9       xLag6       xLag7       xLag8 
#>   0.4011588   1.7013350   1.4006062   1.3334322
The variables in the model are listed in the order in which they were
added, from the most correlated with the response variable to the least
correlated ones. The function works very fast because it does not need
to go through all the variables and their combinations in the dataset.
All the basic methods can be used together with the final model
(e.g. predict(), forecast(),
summary() etc).
Furthermore, the greybox package implements the
extract() method from the texreg package for the
production of printable outputs from the regression. Here is an
example:
texreg::htmlreg(ourModel)
Statistical models
|  | Model 1 | 
| (Intercept) | 18.02* | 
|  | [16.46; 19.58] | 
| xLag4 | 3.35* | 
|  | [ 2.74; 3.97] | 
| xLag9 | 1.38* | 
|  | [ 0.76; 2.00] | 
| xLag3 | 4.64* | 
|  | [ 4.07; 5.22] | 
| xLag10 | 1.55* | 
|  | [ 0.99; 2.11] | 
| xLag5 | 2.31* | 
|  | [ 1.68; 2.95] | 
| xLead9 | 0.40* | 
|  | [ 0.15; 0.65] | 
| xLag6 | 1.70* | 
|  | [ 1.06; 2.34] | 
| xLag7 | 1.40* | 
|  | [ 0.76; 2.04] | 
| xLag8 | 1.33* | 
|  | [ 0.70; 1.97] | 
| Num. obs. | 150.00 | 
| Num. param. | 11.00 | 
| Num. df | 139.00 | 
| AIC | 413.75 | 
| AICc | 415.66 | 
| BIC | 446.87 | 
| BICc | 451.66 | 
| * 0 outside the confidence interval. | 
Similarly, you can produce pdf tables via the texreg()
function from that package. Alternatively, you can use the
kable() function from the knitr package on the
summary to get a table for LaTeX / HTML.
 
lmCombine
The lmCombine() function creates a pool of linear models
using lm(), writes down the parameters, standard errors and
information criteria and then combines the models using IC weights. The
resulting model is of the class “lm.combined”. The speed of the function
deteriorates exponentially with the increase of the number of variables
\(k\) in the dataset, because the
number of combined models is equal to \(2^k\). The advanced mechanism that uses
stepwise() and removes a large chunk of redundant models is
also implemented in the function and can be switched on by setting the
bruteforce parameter to FALSE.
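For reference, IC weights are commonly written in the form given by Burnham and Anderson (2004). The expression below states that standard form and assumes that lmCombine() follows it, where \(m\) is the number of models in the pool and \(\mathrm{IC}_i\) is the information criterion of model \(i\):

\[
\Delta_i = \mathrm{IC}_i - \min_{j} \mathrm{IC}_j, \qquad
w_i = \frac{\exp\left(-\frac{1}{2}\Delta_i\right)}{\sum_{j=1}^{m} \exp\left(-\frac{1}{2}\Delta_j\right)}
\]

The weights are positive and sum to one, so the model with the lowest IC gets the largest weight, and models that are far from it in terms of IC contribute almost nothing to the combination.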
Here’s an example of the reduced data with combined model and the
parameter bruteforce=TRUE:
ourModel <- lmCombine(BJxreg[,-c(3:7,18:22)],bruteforce=TRUE)
summary(ourModel)
#> The AICc combined model
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Coefficients:
#>             Estimate Std. Error Importance Lower 2.5% Upper 97.5%  
#> (Intercept)  21.0828     0.2294     1.0000    20.6293     21.5364 *
#> x            -0.0526     0.0290     0.2618    -0.1100      0.0048  
#> xLag5         6.4014     0.0829     1.0000     6.2375      6.5653 *
#> xLag4         5.8425     0.0888     1.0000     5.6669      6.0181 *
#> xLag3         5.6732     0.0890     1.0000     5.4973      5.8492 *
#> xLag2         0.1198     0.0371     0.2850     0.0465      0.1932 *
#> xLag1        -0.0924     0.0348     0.2750    -0.1612     -0.0237 *
#> xLead1       -0.0994     0.0331     0.2822    -0.1648     -0.0340 *
#> xLead2       -0.0363     0.0256     0.2604    -0.0868      0.0143  
#> xLead3       -0.1202     0.0344     0.2970    -0.1881     -0.0522 *
#> xLead4        0.0048     0.0228     0.2595    -0.0402      0.0498  
#> xLead5        0.1359     0.0322     0.3166     0.0722      0.1996 *
#> 
#> Error standard deviation: 2.2076
#> Sample size: 150
#> Number of estimated parameters: 7.2375
#> Number of degrees of freedom: 142.7625
#> Approximate combined information criteria:
#>      AIC     AICc      BIC     BICc 
#> 670.7379 671.5791 692.5275 694.6348
The summary() function provides the table with the
parameters, their standard errors, their relative importance and the 95%
confidence intervals. Relative importance indicates in how many cases
the variable was included in the model with high weight. So, in the
example above variables xLag5, xLag4, xLag3 were included in the models
with the highest weights, while all the others were in the models with
lower ones. This may indicate that only these variables are needed for
the purposes of analysis and forecasting.
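To illustrate how brute force combination and the importance column can arise, here is a minimal base-R sketch on the mtcars data. It is not the code of lmCombine(): the choice of data, the use of AIC() and the definition of importance as the sum of the weights of the models that include the variable are assumptions made for the example.

```r
# Three candidate regressors give 2^3 = 8 models in the pool
data <- mtcars[, c("mpg", "wt", "hp", "qsec")]
regressors <- c("wt", "hp", "qsec")

# Each row is one model: TRUE means that the variable is included
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(regressors)))
colnames(subsets) <- regressors

models <- lapply(seq_len(nrow(subsets)), function(i) {
  terms <- regressors[unlist(subsets[i, ])]
  if (length(terms) == 0) terms <- "1"
  lm(reformulate(terms, response = "mpg"), data = data)
})

# IC weights based on the differences from the lowest IC
icValues <- sapply(models, AIC)
icDelta <- icValues - min(icValues)
icWeights <- exp(-0.5 * icDelta) / sum(exp(-0.5 * icDelta))

# Importance: sum of the weights of the models that include the variable
importance <- colSums(as.matrix(subsets) * icWeights)

# Combined parameters: weighted sum, with zero when the variable is excluded
combined <- sapply(regressors, function(v) {
  sum(sapply(seq_along(models), function(i) {
    if (subsets[i, v]) icWeights[i] * coef(models[[i]])[[v]] else 0
  }))
})

round(cbind(Estimate = combined, Importance = importance), 4)
```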
The more realistic situation is when the number of variables is high.
In the following example we use the data with 21 variables. So if we use
brute force and estimate every model in the dataset, we will end up with
\(2^{21}\) = 2097152
combinations of models, which cannot be estimated in
reasonable time. That is why we use bruteforce=FALSE:
ourModel <- lmCombine(BJxreg,bruteforce=FALSE)
summary(ourModel)
#> The AICc combined model
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Coefficients:
#>             Estimate Std. Error Importance Lower 2.5% Upper 97.5%  
#> (Intercept)  18.0324     0.7736     1.0000    16.5028     19.5620 *
#> xLag4         3.3549     0.3044     1.0000     2.7530      3.9567 *
#> xLag9         1.3788     0.3053     0.9998     0.7751      1.9824 *
#> xLag3         4.6484     0.2831     1.0000     4.0887      5.2081 *
#> xLag10        1.5503     0.2770     1.0000     1.0025      2.0981 *
#> xLag5         2.3151     0.3143     1.0000     1.6937      2.9365 *
#> xLead9        0.3944     0.1244     0.9832     0.1484      0.6404 *
#> xLag6         1.7015     0.3170     1.0000     1.0747      2.3283 *
#> xLag7         1.4001     0.3177     0.9997     0.7720      2.0281 *
#> xLag8         1.3328     0.3158     0.9995     0.7085      1.9571 *
#> 
#> Error standard deviation: 0.9277
#> Sample size: 150
#> Number of estimated parameters: 10.9822
#> Number of degrees of freedom: 139.0178
#> Approximate combined information criteria:
#>      AIC     AICc      BIC     BICc 
#> 413.9082 415.8151 446.9716 451.7489
In this case, the stepwise() function is used first,
which finds the best model in the pool. Then each variable that is not
in the model is added to the model and then removed iteratively. IC,
parameter values and standard errors are all written down for each of
these expanded models. Finally, in a similar manner each variable is
removed from the optimal model and then added back. As a result the pool
of combined models becomes much smaller than it could be in the case of
brute force, but it contains only meaningful models that are close to
the optimal one. The rationale for this is that the marginal
contribution of variables deteriorates with the increase of the number
of parameters in the case of the stepwise function, and the IC weights
become close to each other around the optimal model. So, whenever the
models are combined, there are a lot of redundant models with very low
weights. By using the mechanism described above we remove those
redundant models.
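The difference in the size of the pool can be sketched as follows. This is one reading of the description above rather than the actual code of lmCombine(), and both the variable names and the set of selected variables are hypothetical.

```r
# Hypothetical inputs: 21 candidate variables, three of which were selected
allVariables <- paste0("x", 1:21)
selected <- c("x3", "x4", "x5")

pool <- c(
  # The optimal model itself
  list(selected),
  # The optimal model plus one of the variables that are not in it
  lapply(setdiff(allVariables, selected), function(v) c(selected, v)),
  # The optimal model without one of its variables
  lapply(selected, function(v) setdiff(selected, v))
)

# Size of the reduced pool versus the brute force pool
length(pool)
2^length(allVariables)
```

Under these assumptions the reduced pool grows linearly with the number of variables, while the brute force pool grows exponentially.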
There are several methods for the lm.combined class,
including:
- predict.greybox() - returns the point and interval
predictions.
- forecast.greybox() - a wrapper around predict(). The forecast horizon
is defined by the length of the provided sample of newdata.
- plot.lm.combined() - plots actuals and fitted
values.
- plot.predict.greybox() - uses the graphmaker() function in order to
produce graphs of actuals and forecasts.
As an example, let’s split the whole sample with Box-Jenkins data
into in-sample and the holdout:
BJInsample <- BJxreg[1:130,]
BJHoldout <- BJxreg[-(1:130),]
ourModel <- lmCombine(BJInsample,bruteforce=FALSE)
A summary and a plot of the model:
summary(ourModel)
#> The AICc combined model
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Coefficients:
#>             Estimate Std. Error Importance Lower 2.5% Upper 97.5%  
#> (Intercept)  19.7922     0.8494     1.0000    18.1103     21.4742 *
#> xLag4         3.3355     0.2953     1.0000     2.7509      3.9202 *
#> xLag9         1.3372     0.2969     0.9992     0.7492      1.9251 *
#> xLag3         4.7372     0.2774     1.0000     4.1879      5.2866 *
#> xLag10        1.5414     0.2688     1.0000     1.0091      2.0737 *
#> xLag5         2.3147     0.3049     1.0000     1.7109      2.9185 *
#> xLag6         1.6574     0.3076     1.0000     1.0484      2.2665 *
#> xLead9        0.2954     0.1251     0.8958     0.0478      0.5430 *
#> xLag8         1.3704     0.3070     0.9991     0.7624      1.9783 *
#> xLag7         1.3281     0.3079     0.9985     0.7185      1.9376 *
#> 
#> Error standard deviation: 0.9458
#> Sample size: 130
#> Number of estimated parameters: 10.8924
#> Number of degrees of freedom: 119.1076
#> Approximate combined information criteria:
#>      AIC     AICc      BIC     BICc 
#> 365.5125 367.7060 396.7469 402.0855
plot(ourModel)



 Importance tells us how important the respective variable is in the
combination. 1 means 100% important, 0 means not important at all.
And the forecast using the holdout sample:
ourForecast <- predict(ourModel,BJHoldout)
plot(ourForecast)

These are the main functions implemented in the package for now. If
you want to read more about IC model selection and combinations, I would
recommend the textbook by Burnham and Anderson (2004).
 
References
Burnham, Kenneth P, and David R Anderson. 2004. 
Model Selection and Multimodel Inference.
Edited by Kenneth P Burnham and David R Anderson. Springer New York. 
https://doi.org/10.1007/b97636.