library(abn)This vignette covers the whole process of Bayesian network structure learning to parameter estimation and data simulation. # Find the best fitting graphical structure using an exact search algorithm
abn packageThe package abn is a collection of functions for modelling of additive Bayesian networks. It contains routines to score Bayesian Networks based on Bayesian (default) or information-theoretic formulation of generalized linear models. Depending on the type of data, the package supports a possible mixture of continuous, discrete, and count data. The following table shows which of distribution types are supported by for each method of estimation:
| Distribution type | method = "bayes" | method = "mle" | 
|---|---|---|
| Gaussian | ✅ | ✅ | 
| Binomial | ✅ | ✅ | 
| Poisson | ✅ | ✅ | 
| Multinomial | ❌ | ✅ | 
Structure learning of additive Bayesian networks with abn is a three-step process. Based on a set of model specifications (data, maximal number of possible parent nodes, restricted or enforced arcs, etc.), abn calculates in a first step the score of the data given the model (buildScoreCache()). This list of scores is then used to estimate the most probable Bayesian network structure ("structure learning") and to infer the network structure in a third step (fitAbn()). Four structure-learning algorithms have been implemented in abn: the hill-climbing algorithm, the "exact search" algorithm, the simulated annealing algorithm and tabu search algorithm. With the network structure inferred, the package provides routines to estimate the parameters of the network and to simulate data from the fitted additive Bayesian network model.
The following example shows how to find the best fitting graphical structure using an exact search algorithm.
ex1.dag.dataThis artificial data set comes with abn and contains 10000 observations of 10 variables. The variables are a mixture of continuous (gaussian), binary (binomial), and count (poisson) data. The data set is a simulated data set from a known network structure.
mydat <- ex1.dag.data
str(mydat)abn requires a list of the type of distribution for each node in the data set.
mydists <- list(b1="binomial", 
                p1="poisson", 
                g1="gaussian", 
                b2="binomial", 
                p2="poisson", 
                b3="binomial", 
                g2="gaussian", 
                b4="binomial", 
                b5="binomial", 
                g3="gaussian") The max.par argument sets the maximum number of parent nodes for each node in the data set. It can be set to a single value for all nodes or to a list with the node names as keys and the maximum number of parent nodes as values. This is a crucial parameter to speed up the model estimation in abn as it limits the number of possible combinations.
# max.par <- list("b1"=1,"p1"=2,"g1"=3,"b2"=4,"p2"=1,"b3"=2,"g2"=3,"b4"=4,"b5"=1,"g3"=2) # set different max parents for each node
max.par <- 4 # set the same max parents for all nodesThe score cache is a list of scores for each possible parent combination for each node in the data set. It is used to learn the structure of the Bayesian network in the next step.
mycache <- buildScoreCache(data.df = mydat, 
                           data.dists = mydists,
                           method = "bayes", # the default method is "bayes"
                           max.parents = max.par)The minimal number of input arguments for buildScoreCache() is the data set and the distribution list. By default, the function uses the Bayesian score which is based on the posterior probability of the model given the data. To use the Log-Likelihood score, Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) instead, the method argument can be set to "mle".
The function buildScoreCache() also accepts a list of banned and retained arcs, which can be used to enforce or restrict the presence of certain arcs in the network structure. This can be useful if prior knowledge about the network structure is available, e.g. from expert knowledge or from previous analyses it is known that certain arcs must be present or have to be absent.
The max.parents argument sets the maximum number of parent nodes for each node in the data set and together with the dag.banned and dag.retained arguments, it restricts the model search space and can speed up the model estimation in abn.
The next step is to find the best fitting graphical structure of the Bayesian network. In this example, we use the exact search algorithm to find the most probable Bayesian network structure given the score cache from the previous step. We supply the score cache as abnCache object from the previous step to the structure learning function.
mp.dag <- mostProbable(score.cache = mycache)The mostProbable() function returns an object of class abnLearned which contains the most probable Bayesian network structure and the score of the model given the data.
The best fitting graphical structure can be plotted using the plotAbn() function.
plot(mp.dag)The plot() function requires the Rgraphviz package to be installed.
The parameters of the network can be estimated using the fitAbn() function.
myfit <- fitAbn(object = mp.dag)The fitAbn() function returns an object of class abnFit which contains the estimated parameters of the network.
summary(myfit)
plot(myfit)The simulateAbn() function can be used to simulate data from the fitted model.
simdat <- simulateAbn(object = myfit, 
                      n.iter = 10000L)
summary(simdat)