Install the package devtools
(with install.packages("devtools")
) and run the following line of code
devtools::install_github("retodomax/cowfit", dependencies=TRUE)
Load the package with
library("cowfit")
Your data has to be organized in two objects
pedigree
from package pedigreemm
. Type ?pedigreemm::pedigree
for more information of how construct an object of class pedigree
.data.frame
. Make sure to define the desired data type for each column (numeric, factor, …).lme4
lme4
compared to cowfit
.lme4
.cowfit
are allowed with lme4
(animal model with single observations).anova()
, AIC()
, Cross validation, …)The data set sim_milk
contains 8375 observed milk yields from 1675 animals. They are descendants from 38 sires for which we want to predict breeding values. Our goal is to select for animals which have a higher milk yield under high-protein feeding.
We start fitting models of increasing complexity without considering the correlation of the random effects.
fit1 <- lmer(formula = y ~ (1|sire),
data = sim_milk)
fit2 <- lmer(formula = y ~ protein + (1|sire),
data = sim_milk)
fit3 <- lmer(formula = y ~ protein + (1|sire) + (1|herd),
data = sim_milk)
fit4 <- lmer(formula = y ~ protein + (1|sire) + (protein|herd),
data = sim_milk)
fit5 <- lmer(formula = y ~ protein + (protein|sire) + (1|herd),
data = sim_milk)
Models with (protein|sire) + (protein|herd) will end up being singular and are therefore not considered. Next we do hierarchical model comparison with anova
.
anova(fit1, fit2, fit3, fit4, fit5)
Data: sim_milk
Models:
fit1: y ~ (1 | sire)
fit2: y ~ protein + (1 | sire)
fit3: y ~ protein + (1 | sire) + (1 | herd)
fit4: y ~ protein + (1 | sire) + (protein | herd)
fit5: y ~ protein + (protein | sire) + (1 | herd)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fit1 3 118636 118657 -59315 118630
fit2 4 78937 78965 -39465 78929 39701.01 1 < 2.2e-16 ***
fit3 5 75701 75736 -37845 75691 3238.39 1 < 2.2e-16 ***
fit4 7 74874 74924 -37430 74860 830.23 2 < 2.2e-16 ***
fit5 7 57698 57748 -28842 57684 17176.06 0 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The additional parameters in the most complex model (fit5) seem to have a significant effect on the target variable. In the next step we refit the last two models and include the correlation structure due to the pedigree. We can fit the model with the one step approach because this is a rather small example.
fit6 <- cowfit_lmer(formula = y ~ protein + (1|sire) + (protein|herd),
data = sim_milk, pedigree = list(sire = pedSires),
cowfit_verbose = FALSE)
fit7 <- cowfit_lmer(formula = y ~ protein + (protein|sire) + (1|herd),
data = sim_milk, pedigree = list(sire = pedSires),
cowfit_verbose = FALSE)
anova(fit6, fit7)
Data: sim_milk
Models:
fit6: y ~ protein + (1 | sire) + (protein | herd)
fit7: y ~ protein + (protein | sire) + (1 | herd)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fit6 7 74877 74926 -37432 74863
fit7 7 57702 57751 -28844 57688 17176 0 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, we observe that fit6
is better also with considering the correlation structure in the random effects.
Finally we use the function ranef()
to extract the random effects of the sires (aka breeding values). Each sire will have two breeding values. The random intercept breeding value is defined as two times the deviation from the population mean of an average offspring at low protein level. The random slope breeding value is associated with the expected effect on higher protein levels
re <- ranef(fit7)
bv <- re$sire
plot(1, xlim = c(0, 5), ylim = c(-150, 150),
xlab = "protein", ylab = "effect on milk yield",
type = "n")
z <- apply(bv, 1, function(x) abline(a = x[1], b = x[2]))
bv$eff_prot5 <- bv$`(Intercept)` + bv$protein*5
ind <- which(bv$eff_prot5 > 130)
rownames(bv[ind,])
[1] "321" "324"
z <- apply(bv[ind,], 1, function(x) abline(a = x[1], b = x[2],
col = "red"))
We should select the animal 321 and 324 (red lines) if we select animals which perform better with high protein feedstuff.
Next we consider problems where we want to fit an animal model with a larger pedigree. This implies a large number of random effects which makes the estimation computationally difficult. We use the two step approach in order to fit the model in reasonable time. The data set sim_fat
and pedigree pedCows
are still relatively small, however, the techniques we are using also apply to larger problems. As before, we start fitting the whole data set without considering the correlation of the random effects.
fit1 <- lmer(formula = y ~ (1|animal),
data = sim_fat)
fit2 <- lmer(formula = y ~ lact + (1|animal),
data = sim_fat)
fit3 <- lmer(formula = y ~ (1|animal) + (1|herd),
data = sim_fat)
anova(fit1, fit2, fit3)
Data: sim_fat
Models:
fit1: y ~ (1 | animal)
fit2: y ~ lact + (1 | animal)
fit3: y ~ (1 | animal) + (1 | herd)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fit1 3 26613 26631 -13303 26607
fit2 4 26615 26639 -13303 26607 0.1958 1 0.6581
fit3 4 26615 26639 -13303 26607 0.0000 0 1.0000
We choose the first model and fit the model to a small subset of the data and pedigree.
## Maybe use subPed() ??
library(optiSel)
myp <- subPed(Pedig = data.frame(ID = pedCows@label, Sire = pedCows@sire, Dam = pedCows@dam),
keep = "2000", prevGen = 0, succGen = 10)
## However we need less random effects and more observations (ideally)
## How can we sample to have MANY observations and LESS random effects?
## 2. Option
## Use ANIMAL model
# BUT: are variance components of animal model the same?
# In my opinion variance components are 0.25 times variance components of animal model
# (effect is only halfe the effect in animal model)
Let’s try an example where our target variable is not continuous but a binary. Assume we want to have less multiple birth events in a population of animals because multiple births are associated with complications. We do not care about the exact number of offspring therefore we just model a binary trait, multiple birth true or false. We also expect an influence of predictor variable lact
on the probability for multiple birth. The data is stored in data frame sim_twin
and pedSires
is the associated pedigree of the sires.
We start by fitting models with increasing complexity with the function glmer()
fit1 <- glmer(formula = y ~ (1|sire),
data = sim_twin, family = "binomial")
fit2 <- glmer(formula = y ~ lact + (1|sire),
data = sim_twin, family = "binomial")
fit3 <- glmer(formula = y ~ lact + (1|sire) + (1|herd),
data = sim_twin, family = "binomial")
fit4 <- glmer(formula = y ~ lact + (1|sire) + (lact|herd),
data = sim_twin, family = "binomial")
fit5 <- glmer(formula = y ~ lact + (lact|sire) + (1|herd),
data = sim_twin, family = "binomial")
fit6 <- glmer(formula = y ~ lact + (lact|sire),
data = sim_twin, family = "binomial")
anova(fit1, fit2, fit3, fit4, fit5, fit6)
Data: sim_twin
Models:
fit1: y ~ (1 | sire)
fit2: y ~ lact + (1 | sire)
fit3: y ~ lact + (1 | sire) + (1 | herd)
fit6: y ~ lact + (lact | sire)
fit4: y ~ lact + (1 | sire) + (lact | herd)
fit5: y ~ lact + (lact | sire) + (1 | herd)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fit1 2 5918.2 5932.2 -2957.1 5914.2
fit2 3 5580.6 5601.7 -2787.3 5574.6 339.61 1 <2e-16 ***
fit3 4 5582.6 5610.7 -2787.3 5574.6 0.00 1 1
fit6 5 5010.1 5045.2 -2500.0 5000.1 574.49 1 <2e-16 ***
fit4 6 5486.2 5528.4 -2737.1 5474.2 0.00 1 1
fit5 6 5012.1 5054.3 -2500.0 5000.1 474.13 0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 5 and 6 are very close according to AIC
.
anova(fit6, fit5)
Data: sim_twin
Models:
fit6: y ~ lact + (lact | sire)
fit5: y ~ lact + (lact | sire) + (1 | herd)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
fit6 5 5010.1 5045.2 -2500 5000.1
fit5 6 5012.1 5054.3 -2500 5000.1 0 1 1
The additional parameters of model 5 do not have a significant effect, which is why we prefer the simpler model. Fitting this model with cowfit_glmer()
already takes some seconds but is still possible without prespecified variance components.
system.time(fit7 <- cowfit_glmer(formula = y ~ lact + (lact|sire)
, data = sim_twin, family = "binomial"
, pedigree = list(sire = pedSires)
, cowfit_verbose = FALSE))
user system elapsed
64.403 0.048 64.509
Our goal is to exclude all sires which inherit a breeding value associated with probability > 0.5 for twins at third lactation.
re <- ranef(fit7)
bv <- re$sire
bv$sire <- rownames(bv)
betas <- fixef(fit7)
bv$lin_pred <- betas[1] + betas[2]*3 + bv$`(Intercept)` + bv$lact*3
bv$p <- inv_logit(bv$lin_pred)
bv$sire[which(bv$p > 0.5)]
[1] "2" "4" "320" "323" "324" "325" "334" "336" "338" "339" "342" "345"
[13] "348" "349"
Note that we have to calculate the linear predictor manually. Using predict()
will be wrong because it does use the random effects on a transformed scale.