Below is the hands-on exercises on the advanced methods. This will cover the heuristic search, how to handle clustering and how to perform Bayesian model averaging.
Note: in the following text: Bayesian network, structure and DAG are synonyms.
Mixed models – correction for grouped data (or clustering)
Note: In the following: clustering and grouped are taken as synonym.
In some situations, the way the data were collected has a clear grouping aspect, and therefore there is a potential risk for non-independence between data points from the same group that could cause over-dispersion. This can lead to analyses which are over-optimistic as the true level of variation in the data is under-estimated. It could have an impact … or not. But this is a good practice to check!
In practice we will introduce a random effect to account for this additional variability. Thus each node will become a GLMM (Generalized Linear Mixed Model (Faraway, 2016)) instead of a GLM (Generalized Linear Model (McCullagh, 2018)) but in a Bayesian setting. We will then compute the posterior distribution and check if they widen. In such case, we will have to take into account the clustering in the scoring scheme we use.
In practice, the major problem is the computational complexity of this approach. Indeed, if the clustering do not affect too much the result, it is preferable to not take it into account. The model is then much simpler and parsimonious then more generalizable.
The grouping variable is farm
and we apply the random effect to every nodes. On a personal computer following code takes 15 minutes where the previous code ran in less than a second! The grouping variable has 15 levels.
marg.f.grouped <- fitabn(dag.m = trim.dag,
data.df = as.data.frame(abndata),
data.dists = dist,
group.var = "farm",
cor.vars = c("AR", "pneumS", "female", "livdam", "eggs", "wormCount", "age", "adg"),
compute.fixed = TRUE,
n.grid = 1000)
Calculate the area under the density curve
|
auc
|
AR|(Intercept)
|
0.9999986
|
AR|age
|
0.9999994
|
AR|group.precision
|
0.9999853
|
pneumS|(Intercept)
|
0.9999977
|
pneumS|group.precision
|
0.9999532
|
female|(Intercept)
|
0.9999996
|
female|group.precision
|
0.9992713
|
livdam|(Intercept)
|
0.9999978
|
livdam|eggs
|
0.9999976
|
livdam|group.precision
|
0.9999974
|
eggs|(Intercept)
|
0.9999957
|
eggs|adg
|
0.9999990
|
eggs|group.precision
|
0.9999999
|
wormCount|(Intercept)
|
0.9999972
|
wormCount|AR
|
0.9999995
|
wormCount|eggs
|
0.9999991
|
wormCount|age
|
0.9999997
|
wormCount|adg
|
0.9999997
|
wormCount|group.precision
|
0.9999976
|
age|(Intercept)
|
0.9999984
|
age|female
|
0.9999996
|
age|group.precision
|
0.9999982
|
age|precision
|
0.9999995
|
adg|(Intercept)
|
0.9999993
|
adg|age
|
0.9999997
|
adg|group.precision
|
0.9999725
|
adg|precision
|
0.9999980
|
Here the shape of some of the parameter distribution looks weird. Indeed, there is a deal of information between the random strata. However, the AUC looks good. So we can proceed getting the estimates.
Get the table of quantiles for the marginals
As one can see, except for a precision parameter (female|group.precision
) the quantiles seems to not have too much widen.
Non-adjusted marginal densities
|
2.5%
|
50%
|
97.5%
|
AR|(Intercept)
|
0.891
|
1.155
|
1.435
|
AR|age
|
0.509
|
0.798
|
1.110
|
pneumS|(Intercept)
|
-2.134
|
-1.814
|
-1.519
|
female|(Intercept)
|
-0.196
|
0.018
|
0.230
|
livdam|(Intercept)
|
0.976
|
1.265
|
1.567
|
livdam|eggs
|
0.663
|
1.547
|
2.656
|
eggs|(Intercept)
|
-1.508
|
-1.232
|
-0.971
|
eggs|adg
|
0.346
|
0.614
|
0.895
|
wormCount|(Intercept)
|
-1.646
|
-1.400
|
-1.170
|
wormCount|AR
|
0.152
|
0.273
|
0.395
|
wormCount|eggs
|
3.280
|
3.504
|
3.741
|
wormCount|age
|
-0.824
|
-0.685
|
-0.548
|
wormCount|adg
|
-0.374
|
-0.240
|
-0.107
|
age|(Intercept)
|
-0.349
|
-0.200
|
-0.052
|
age|female
|
0.188
|
0.397
|
0.606
|
age|precision
|
0.893
|
1.042
|
1.207
|
adg|(Intercept)
|
-0.053
|
0.000
|
0.053
|
adg|age
|
-0.921
|
-0.868
|
-0.816
|
adg|precision
|
3.485
|
4.070
|
4.712
|
Marginal densities adjusted with random effects
|
2.5%
|
50%
|
97.5%
|
AR|(Intercept)
|
0.761
|
1.236
|
1.766
|
AR|age
|
0.371
|
0.758
|
1.163
|
AR|group.precision
|
0.633
|
2.105
|
10.084
|
pneumS|(Intercept)
|
-2.664
|
-2.006
|
-1.521
|
pneumS|group.precision
|
0.533
|
2.339
|
12.875
|
female|(Intercept)
|
-0.209
|
0.025
|
0.256
|
female|group.precision
|
166.132
|
2333.598
|
4707.719
|
livdam|(Intercept)
|
1.067
|
1.826
|
2.748
|
livdam|eggs
|
-0.116
|
1.040
|
2.346
|
livdam|group.precision
|
0.229
|
0.666
|
1.762
|
eggs|(Intercept)
|
-3.071
|
-1.687
|
-0.413
|
eggs|adg
|
-0.344
|
0.100
|
0.544
|
eggs|group.precision
|
0.071
|
0.207
|
0.515
|
wormCount|(Intercept)
|
-2.562
|
-1.756
|
-1.043
|
wormCount|AR
|
0.325
|
0.452
|
0.579
|
wormCount|eggs
|
2.428
|
2.700
|
2.980
|
wormCount|age
|
-0.266
|
-0.093
|
0.080
|
wormCount|adg
|
-0.244
|
-0.100
|
0.044
|
wormCount|group.precision
|
0.240
|
0.631
|
1.425
|
age|(Intercept)
|
-0.629
|
-0.221
|
0.184
|
age|female
|
0.312
|
0.446
|
0.579
|
age|group.precision
|
0.816
|
1.854
|
3.580
|
age|precision
|
2.214
|
2.594
|
3.012
|
adg|(Intercept)
|
-0.162
|
-0.017
|
0.125
|
adg|age
|
-0.904
|
-0.838
|
-0.772
|
adg|group.precision
|
6.610
|
15.856
|
33.294
|
adg|precision
|
4.706
|
5.514
|
6.402
|
Heuristic search
The mostprobable()
function is limited to 20 to 25 nodes (if computed on a computing cluster even with a limited number of parent). Then for larger problem this approach becomes intractable form a computational point of view. The main advantages of the mostprobable()
function is that the returned structure is the one that has the maximal possible score, thus it is called exact in that sense (Koivisto & Sood, 2004). It compares all possible structure to select the optimal one.
The heuristic searches are techniques that tends to perform a greedy optimization. Then there is no guarantees to reach the maximum score. In this context greedy means: perform local optimization and hoping to get the overall maximum.
We will use a Hill-Climber (Korb & Nicholson, 2010). We need to set up the number of searches and the number of steps per searches. The starting point (here we choose a random Directed Acyclic Graph (DAG)). The rest of parameters is equivalent to the ones used in buildscorecache()
function.
abndata <- dt[, -drop]
abndata[,1:5] <- as.data.frame(lapply(abndata[,1:5], factor))
dist <- list(AR = "binomial", pneumS = "binomial", female="binomial",
livdam= "binomial", eggs = "binomial",wormCount = "poisson",
age= "gaussian", adg = "gaussian")
#set maximum number of possible parent per node
max.par <- 4
# compute a cache of scores
mycache <- buildscorecache(data.df = as.data.frame(abndata),
data.dists = dist,
max.parents = max.par)
# set number of searches and number of steps
num.searches <- 200
max.steps <- 150
# Hill-climber
heur.res <- quiet(search.heuristic(score.cache = mycache,
score = "mlik",
data.dists = dist,
max.parents = 4,
start.dag = "random",
num.searches = num.searches,
max.steps = max.steps,
seed = 3213,
verbose = TRUE,
algo = "hc"))
# for comparison let us compute the maximum exact score
mydag <- quiet(mostprobable(score.cache = mycache))
fabn <- fitabn(dag.m = mydag, data.df = as.data.frame(abndata), data.dists = dist)
# plot Hill-Climber scores
df.heur <- unlist(heur.res$scores)
plot(NULL,lty=1, xlab="Index of heuristic search",ylab="BN score", ylim = c(max(unlist(df.heur)),min(unlist(df.heur))), xlim = c(0,num.searches))
for(i in 1:num.searches){
if(sum(i==order(df.heur,decreasing = TRUE)[1:10])){
points(x=i,y=df.heur[i],type="p",pch=19, col=rgb(0,0,1, 0.8),lwd = 2)
}else{
points(x=i,y=df.heur[i],type="p",pch=19, col=rgb(0,0,0, 0.3))
}
}
points(x = max(unlist(df.heur)),y = max(unlist(df.heur)),col="red",pch=19)
abline(h = fabn$mlik, col="red",lty = 3)
Above we plot the maximum scores after 150 steps out of 200 different searches of a Hill-Climber. The blues dotes are the 10 best searches. The red dashed line is the maximum possible score.
# let us plot the evolution of scores during optimization
Long <- (heur.res$detailed.score)
Long.arr <- array(unlist(Long), dim = c(nrow(Long[[1]]), ncol(Long[[1]]), length(Long)))
plot(NULL,lty=1, xlab="Number of Steps",ylab="BN score", ylim = c(max(unlist(df.heur)),min(unlist(df.heur))), xlim = c(0,max.steps))
for(i in 1:num.searches){
if(sum(i==order(unlist(heur.res$scores),decreasing = TRUE)[1:10])){
points(x=1:(max.steps-1),y=Long.arr[1,,i],type="l",lty=1, col=rgb(0,0,1, 0.8),lwd = 2)
}else{
points(x=1:(max.steps-1),y=Long.arr[1,,i],type="l",lty=1, col=rgb(0,0,0, 0.25))
}
}
lines(x=1:(max.steps-1),y=Long.arr[1,,which.max(unlist(heur.res$scores))],type="l",col="red",lwd=3)
abline(h = fabn$mlik, col="red",lty = 3)
Above we plot the scores in function of the steps for the 200 searches. The blue lines are the 10 best and the red one is the one that has the highest score. As one can see, in searches that work well the optimization is already good after 60 steps. This plot is a diagnostic plot used to set up the number of required steps.
MCMC over the structures
Usually, the output of a Bayesian network analysis of a dataset ends-up with a single well adjusted DAG. From the researcher point of view this could be frustrating. Indeed, the model is the one that is best supported by the data but the uncertainty quantification is missing. Classically in epidemiology, researchers are used to express point estimate with an uncertainty measure. An arc in a Bayesian network is a point estimate, we will see how to perform model averaging. The link strength measure is designed to account for that. An more natural alternative is to perform MCMC over structures (Friedman & Koller, 2003).
We use the mcmcabn()
function on the cache of pre-computed networks scores. One needs to define the type of score used (here is the marginal likelihood mlik
). The maximum of number of parents per node (same as the one used in buildscorecache()
). The MCMC learning scheme, defined as: number of MCMC samples, number of thinned sampled (to avoid autocorrelation) and the length of the burn-in phase. Possibly a starting DAG and a structural prior. We also need to select the relative probability of performing radical moves (shuffling). Indeed, a naive MCMC approach is known to get very easily stuck in local maximum (for more details see: (Grzegorczyk & Husmeier, 2008; Su & Borsuk, 2016)).
mcmc.out <- mcmcabn(score.cache = mycache,
score = "mlik",
data.dists = dist,
max.parents = 4,
mcmc.scheme = c(1000,9,500),
seed = 321,
verbose = FALSE,
start.dag = "random",
prob.rev = 0.07,
prob.mbr = 0.07,
prior.choice = 1)
This is again computationally complex!
Her is a plot of the MCMC samples. One can see the scores of the structures on the y-axis in function of the index steps. The dots are the radical moves. One the right side a histogram shows the number of structures with a given score. As one can see the histogram is very peaked on the maximum possible score.
One can also display the cumulative maximum score (used for network score optimization).
But the major advantage of this method is the possibility of querying the MCMC sample using a formula statement.
# average individual arc support
query(mcmcabn = mcmc.out)
AR pneumS female livdam eggs
AR 0.00000000 0.01069893 0.01139886 0.01159884 0.07599240
pneumS 0.00589941 0.00000000 0.01539846 0.00829917 0.10398960
female 0.00579942 0.01879812 0.00000000 0.01769823 0.00659934
livdam 0.00689931 0.00999900 0.01119888 0.00000000 0.44275572
eggs 0.14908509 0.09239076 0.01019898 0.29377062 0.00000000
wormCount 0.76062394 0.08119188 0.03349665 0.07039296 0.91700830
age 0.13568643 0.07349265 0.15768423 0.11338866 0.09739026
adg 0.03019698 0.01589841 0.08359164 0.00959904 0.17288271
wormCount age adg
AR 0.00349965 0.77212279 0.08649135
pneumS 0.00429957 0.19238076 0.03989601
female 0.00249975 0.29857014 0.12168783
livdam 0.00439956 0.06649335 0.01169883
eggs 0.02959704 0.25057494 0.36466353
wormCount 0.00000000 0.88481152 0.31056894
age 0.01399860 0.00000000 0.47095290
adg 0.00349965 0.52864714 0.00000000
# probability that worm count being linked to age but not to female directly
query(mcmcabn = mcmc.out,formula = ~wormCount|age-wormCount|female)
[1] 0.01889811
# probability that worm count being directly linked to age and adg and that adg is link to age (undirected)
query(mcmcabn = mcmc.out,formula = ~wormCount|age + wormCount|adg + age|adg)+
query(mcmcabn = mcmc.out,formula = ~wormCount|age + wormCount|adg + adg|age)
[1] 0.2451755
References
Faraway, J. J. (2016). Extending the linear model with r: Generalized linear, mixed effects and nonparametric regression models. Chapman; Hall/CRC.
Friedman, N., & Koller, D. (2003). Being bayesian about network structure. a bayesian approach to structure discovery in bayesian networks. Machine Learning, 50(1-2), 95–125.
Grzegorczyk, M., & Husmeier, D. (2008). Improving the structure mcmc sampler for bayesian networks by introducing a new edge reversal move. Machine Learning, 71(2-3), 265.
Koivisto, M., & Sood, K. (2004). Exact bayesian structure discovery in bayesian networks. Journal of Machine Learning Research, 5(May), 549–573.
Korb, K. B., & Nicholson, A. E. (2010). Bayesian artificial intelligence. CRC press.
McCullagh, P. (2018). Generalized linear models. Routledge.
Su, C., & Borsuk, M. E. (2016). Improving structure mcmc for bayesian networks through markov blanket resampling. The Journal of Machine Learning Research, 17(1), 4042–4061.
---
title: "Hands-on exercise: advanced methods"
fontsize: 12pt
output:
  html_document:
    toc: true
    toc_depth: 2
    code_download: true
bibliography: bib_advanced.bib
csl: apa.csl
---

&nbsp;


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, collapse=FALSE,
                      fig.align='center', fig.height=4, fig.width=10, comment = NA)

options(scipen=999)

quiet <- function(x) { 
  sink(tempfile()) 
  on.exit(sink()) 
  invisible(force(x)) 
} 

## load data, list of dist etc
dt <- readRDS("dataOK.RDS")

drop <- which(colnames(dt)%in% c("pneum", "epg5", "worms", "farm"))
drop.clustring <- which(colnames(dt)%in% c("pneum", "epg5", "worms"))

##end of old stuff

library(knitr)
library(kableExtra)
library(abn)
library(mcmcabn)
```

Below is the hands-on exercises on the advanced methods. This will cover the **heuristic search**, how to handle **clustering** and how to perform **Bayesian model averaging**.

*Note: in the following text: Bayesian network, structure and DAG are synonyms.*

# Mixed models – correction for grouped data (or clustering)

*Note: In the following: clustering and grouped are taken as synonym.*

In some situations, the way the data were collected has a clear grouping aspect, and therefore there is a potential risk for non-independence between data points from the same group that could cause over-dispersion. This can lead to analyses which are over-optimistic as the true level of variation in the data is under-estimated. It could have an impact ... or not. But this is a good practice to check!

In practice we will introduce a random effect to account for this additional variability. Thus each node will become a **GLMM** (Generalized Linear Mixed Model [@faraway2016extending]) instead of a **GLM** (Generalized Linear Model [@mccullagh2018generalized]) but in a Bayesian setting. We will then compute the posterior distribution and check if they **widen**. In such case, we will have to take into account the clustering in the scoring scheme we use.

In practice, the major problem is the computational complexity of this approach. Indeed, if the clustering do not affect too much the result, it is preferable to not take it into account. The model is then much simpler and parsimonious then more generalizable.  

```{r}
# incorporate grouping factor in the data.
abndata <- dt[, -drop.clustring]

# set up factors
abndata[,c(1:5,9)] <- as.data.frame(lapply(abndata[,c(1:5,9)], factor))

# recompute the trimed DAG (i.e. arc supported by at least 50% of the bootstrap samples)
dags <- readRDS("BootDAGs5000.RDS")
dags <- array(data = unlist(dags),dim = c(8,8,5000))
dag <- apply(dags,1:2, mean)

trim.dag <- dag
trim.dag[dag>0.5] <- 1
trim.dag[dag<=0.5] <- 0

colnames(trim.dag) <- rownames(trim.dag) <- names(dist)
```

The grouping variable is `farm` and we apply the random effect to every nodes. On a personal computer following code takes 15 minutes where the previous code ran in less than a second! The grouping variable has 15 levels. 

```{r, eval=FALSE, echo=TRUE}
marg.f.grouped <- fitabn(dag.m = trim.dag, 
                 data.df = as.data.frame(abndata),
                 data.dists = dist, 
                 group.var = "farm",
                 cor.vars = c("AR", "pneumS", "female", "livdam", "eggs", "wormCount", "age", "adg"),
                 compute.fixed = TRUE, 
                 n.grid = 1000)
```

## Visually inspect the marginal posterior distributions of the parameters

```{r, fig.width=10, fig.height=4}
# alrady computed marginals
load(file = "marg.grouped.RData")

par(mfrow=c(1,4), mar=c(2,2,1.5,1))
for(i in 1:length(marg.f.grouped$marginals)){

# get the marginal for current node, which is a matrix [x, f(x)]
  cur.node <- marg.f.grouped$marginals[i]
  nom1 <- names(marg.f.grouped$marginals)[i]

# pick the first value (for models wothout random effects)
  cur.node <- cur.node[[1]]
  for(j in 1:length(cur.node) ) {
    nom2 <- names(cur.node)[j]
    cur.param <- cur.node[[j]]
    plot(cur.param,type="l",main=paste(nom1, ":", nom2), cex=0.7)
  }
}
```

## Calculate the area under the density curve

```{r, fig.width=10, fig.height=4}
#extract marginals 
marg.dens <- marg.f$marginals[[1]]
for (i in 2:length(marg.f$marginals)) {
  marg.dens <- c(marg.dens, marg.f$marginals[[i]])
}

# extract marginals adjusted for grouped data
marg.dens.grouped <- marg.f.grouped$marginals[[1]]
for (i in 2:length(marg.f.grouped$marginals)) {
  marg.dens.grouped <- c(marg.dens.grouped, marg.f.grouped$marginals[[i]])
}

# calculate AUC --> should be ~1
auc <- rep(NA, length(marg.dens.grouped))
names(auc) <- names(marg.dens.grouped)
for(i in 1:length(marg.dens.grouped)) {
  tmp <- spline(marg.dens.grouped[[i]])
  auc[i] <- sum(diff(tmp$x[-length(tmp$x)])*tmp$y[-1])
}

kable(cbind(auc)) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

par(las=2, mar=c(8.1,4.1,4.1,2.1))
barplot(auc, ylab="Area under Density", ylim=c(0,1.2))
```

&nbsp;

Here the *shape* of some of the parameter distribution looks weird. Indeed, there is a deal of information between the random strata. However, the *AUC* looks good. So we can proceed getting the estimates.

&nbsp;

## Get the table of quantiles for the marginals

As one can see, except for a precision parameter (`female|group.precision`) the quantiles seems to not have too much widen. 

```{r , echo=FALSE}

mat <- matrix(rep(NA, length(marg.dens.grouped)*3), ncol=3)
rownames(mat) <- names(marg.dens.grouped)
colnames(mat) <- c("2.5%", "50%", "97.5%")
ignore.me <- union(grep("\\(Int", names(marg.dens.grouped)), grep("prec", names(marg.dens.grouped))) # take away background k and precision
comment <- rep("", length(marg.dens.grouped))
for (i in 1:length(marg.dens.grouped)) {
  tmp <- marg.dens.grouped[[i]]
  tmp2 <- cumsum(tmp[,2])/sum(tmp[,2])
  mat[i, ] <-c(tmp[which(tmp2>0.025)[1]-1,1],## -1 is so use value on the left of the 2.5%
               tmp[which(tmp2>0.5)[1],1],
               tmp[which(tmp2>0.975)[1],1])
  vec <- mat[i,]

  if( !(i%in%ignore.me) && (vec[1]<0 && vec[3]>0)){comment[i]<-"not sig. at 5%"}

  ## truncate for printing
  mat[i,] <- as.numeric(formatC(mat[i,],digits=3,format="f"))
}

mat1 <- matrix(rep(NA, length(marg.dens)*3), ncol=3)
rownames(mat1) <- names(marg.dens)
colnames(mat1) <- c("2.5%", "50%", "97.5%")
ignore.me <- union(grep("\\(Int", names(marg.dens)), grep("prec", names(marg.dens))) # take away background k and precision
comment <- rep("", length(marg.dens))
for (i in 1:length(marg.dens)) {
  tmp <- marg.dens[[i]]
  tmp2 <- cumsum(tmp[,2])/sum(tmp[,2])
  mat1[i, ] <-c(tmp[which(tmp2>0.025)[1]-1,1],## -1 is so use value on the left of the 2.5%
               tmp[which(tmp2>0.5)[1],1],
               tmp[which(tmp2>0.975)[1],1])
  vec <- mat1[i,]

  if( !(i%in%ignore.me) && (vec[1]<0 && vec[3]>0)){comment[i]<-"not sig. at 5%"}

  ## truncate for printing
  mat1[i,] <- as.numeric(formatC(mat1[i,],digits=3,format="f"))
}
 
mat1 %>%
  kable("html", align = 'clc', caption = 'Non-adjusted marginal densities') %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "float_left")

mat %>%
  kable("html", align = 'clc', caption = 'Marginal densities adjusted with random effects ') %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

```

# Heuristic search

The `mostprobable()` function is limited to 20 to 25 nodes (if computed on a computing cluster even with a limited number of parent). Then for larger problem this approach becomes intractable form a computational point of view. The main advantages of the `mostprobable()` function is that the returned structure is the one that has the maximal possible score, thus it is called **exact** in that sense [@koivisto2004exact]. It compares all possible structure to select the optimal one. 

The heuristic searches are techniques that tends to perform a greedy optimization. Then there is no guarantees to reach the maximum score. In this context greedy means: perform local optimization and hoping to get the overall maximum.

We will use a Hill-Climber [@korb2010bayesian]. We need to set up the number of searches and the number of steps per searches. The starting point (here we choose a random Directed Acyclic Graph (DAG)). The rest of parameters is equivalent to the ones used in `buildscorecache()` function.

```{r heuristic1, echo=TRUE, fig.width=10, fig.height=4}

abndata <- dt[, -drop]

abndata[,1:5] <- as.data.frame(lapply(abndata[,1:5], factor))
dist <- list(AR = "binomial", pneumS = "binomial", female="binomial", 
             livdam= "binomial", eggs = "binomial",wormCount = "poisson",
             age= "gaussian", adg = "gaussian")

#set maximum number of possible parent per node
max.par <- 4

# compute a cache of scores
mycache <- buildscorecache(data.df = as.data.frame(abndata), 
                           data.dists = dist, 
                           max.parents = max.par)

# set number of searches and number of steps
num.searches <- 200
max.steps <- 150

# Hill-climber
heur.res <- quiet(search.heuristic(score.cache = mycache,
                           score = "mlik",
                           data.dists = dist,
                           max.parents = 4,
                           start.dag = "random",
                           num.searches = num.searches,
                           max.steps = max.steps,
                           seed = 3213,
                           verbose = TRUE,
                           algo = "hc"))

# for comparison let us compute the maximum exact score
mydag <- quiet(mostprobable(score.cache = mycache))
fabn <- fitabn(dag.m = mydag, data.df = as.data.frame(abndata), data.dists = dist)

# plot Hill-Climber scores
df.heur <- unlist(heur.res$scores)
plot(NULL,lty=1, xlab="Index of heuristic search",ylab="BN score", ylim = c(max(unlist(df.heur)),min(unlist(df.heur))), xlim = c(0,num.searches))
for(i in 1:num.searches){
  if(sum(i==order(df.heur,decreasing = TRUE)[1:10])){
    points(x=i,y=df.heur[i],type="p",pch=19, col=rgb(0,0,1, 0.8),lwd = 2)
    }else{
    points(x=i,y=df.heur[i],type="p",pch=19, col=rgb(0,0,0, 0.3))
  }
}
points(x = max(unlist(df.heur)),y = max(unlist(df.heur)),col="red",pch=19)
abline(h = fabn$mlik, col="red",lty = 3)

```
Above we plot the maximum scores after 150 steps out of 200 different searches of a Hill-Climber. The blues dotes are the 10 best searches. The red dashed line is the maximum possible score. 

```{r heuristic2, echo=TRUE, fig.width=10, fig.height=4}
# let us plot the evolution of scores during optimization
Long <- (heur.res$detailed.score)
Long.arr <- array(unlist(Long), dim = c(nrow(Long[[1]]), ncol(Long[[1]]), length(Long)))
plot(NULL,lty=1, xlab="Number of Steps",ylab="BN score", ylim = c(max(unlist(df.heur)),min(unlist(df.heur))), xlim = c(0,max.steps))
for(i in 1:num.searches){
  if(sum(i==order(unlist(heur.res$scores),decreasing = TRUE)[1:10])){
    points(x=1:(max.steps-1),y=Long.arr[1,,i],type="l",lty=1, col=rgb(0,0,1, 0.8),lwd = 2)
    }else{
    points(x=1:(max.steps-1),y=Long.arr[1,,i],type="l",lty=1, col=rgb(0,0,0, 0.25))
  }
}
lines(x=1:(max.steps-1),y=Long.arr[1,,which.max(unlist(heur.res$scores))],type="l",col="red",lwd=3)
abline(h = fabn$mlik, col="red",lty = 3)

```
Above we plot the scores in function of the steps for the 200 searches. The blue lines are the 10 best and the red one is **the** one that has the highest score. As one can see, in searches that work well the optimization is already good after 60 steps. This plot is a diagnostic plot used to set up the number of required steps. 

# MCMC over the structures

Usually, the output of a Bayesian network analysis of a dataset ends-up with a single well adjusted DAG. From the researcher point of view this could be frustrating. Indeed, the model is the one that is best supported by the data but the uncertainty quantification is missing. Classically in epidemiology, researchers are used to express point estimate with an uncertainty measure. An arc in a Bayesian network is a point estimate, we will see how to perform model averaging. The *link strength* measure is designed to account for that. An more natural alternative is to perform MCMC over structures [@friedman2003being]. 

We use the `mcmcabn()` function on the cache of pre-computed networks scores. One needs to define the type of score used (here is the marginal likelihood `mlik`). The maximum of number of parents per node (same as the one used in `buildscorecache()`). The MCMC learning scheme, defined as: number of MCMC samples, number of thinned sampled (to avoid autocorrelation) and the length of the burn-in phase. Possibly a starting DAG and a structural prior. We also need to select the relative probability of performing radical moves (shuffling). Indeed, a naive MCMC approach is known to get very easily stuck in local maximum (for more details see: [@grzegorczyk2008improving;@su2016improving]). 

```{r,eval=FALSE, echo=TRUE}
mcmc.out <- mcmcabn(score.cache = mycache,
                  score = "mlik",
                  data.dists = dist,
                  max.parents = 4,
                  mcmc.scheme = c(1000,9,500),
                  seed = 321,
                  verbose = FALSE,
                  start.dag = "random",
                  prob.rev = 0.07,
                  prob.mbr = 0.07,
                  prior.choice = 1)
```

This is again computationally complex!

Her is a plot of the MCMC samples. One can see the scores of the structures on the y-axis in function of the index steps. The dots are the radical moves. One the right side a histogram shows the number of structures with a given score. As one can see the histogram is very peaked on the maximum possible score. 

```{r, fig.width=10, fig.height=4}
load("mcmc.Rdata")
plot(mcmc.out)
```

One can also display the cumulative maximum score (used for network score optimization).

```{r, fig.width=10, fig.height=4}
plot(mcmc.out,max.score=TRUE)
```

But the major advantage of this method is the possibility of querying the MCMC sample using a formula statement.

```{r, echo=TRUE, eval=TRUE}
# average individual arc support
query(mcmcabn = mcmc.out)

# probability that worm count being linked to age but not to female directly
query(mcmcabn = mcmc.out,formula = ~wormCount|age-wormCount|female)

# probability that worm count being directly linked to age and adg and that adg is link to age (undirected)
query(mcmcabn = mcmc.out,formula = ~wormCount|age + wormCount|adg + age|adg)+
  query(mcmcabn = mcmc.out,formula = ~wormCount|age + wormCount|adg + adg|age)
```
# References