Install and Load Package

Install the package devtools (with install.packages("devtools")) and run the following line of code

devtools::install_github("retodomax/cowfit", dependencies=TRUE)

Load the package with

library("cowfit")

General Procedure

Prepare Data and Pedigree

Your data has to be organized in two objects

  1. pedigree Store the pedigree in an object of class pedigree from package pedigreemm. Type ?pedigreemm::pedigree for more information of how construct an object of class pedigree.
  2. data table Store the observations together with predictor variables as columns in a data.frame. Make sure to define the desired data type for each column (numeric, factor, …).

Get First Impression of your Data

  • Think about model
    • Which variables might have a possible effect on the target variable?
    • Should a predictor be considered as fixed or random?
    • Do we have enoght data to fit all kinds of interactions?
  • Try different models with lme4
    • Fitting a model is much faster with lme4 compared to cowfit.
    • A first impression of which predictors might be relevant can be obtained with using lme4.
    • Note that not all models which can be fitted with cowfit are allowed with lme4 (animal model with single observations).

Estimate Model

  • One step approach
    • Fit full model and estimate variance components
    • Best option for small data sets with not so many grouping factor levels (sire model)
  • Two Step approach
    • First step: estimate variance components with a small subset of your data
    • Second step: Fit entire data set using estimated variance components
    • Best option for large data sets with many grouping factor levels (animal model)

Model Selection

  • Estimate different models using the one step or the two step approach
  • Compare different models (anova(), AIC(), Cross validation, …)
  • Choose best model

Predict Breeding Values

  • Use final model to predict breeding values

Example 1: LMM, small data set, sire model

The data set sim_milk contains 8375 observed milk yields from 1675 animals. They are descendants from 38 sires for which we want to predict breeding values. Our goal is to select for animals which have a higher milk yield under high-protein feeding.

We start fitting models of increasing complexity without considering the correlation of the random effects.

fit1 <- lmer(formula = y ~ (1|sire),
             data = sim_milk)
fit2 <- lmer(formula = y ~ protein + (1|sire),
             data = sim_milk)
fit3 <- lmer(formula = y ~ protein + (1|sire) + (1|herd),
             data = sim_milk)
fit4 <- lmer(formula = y ~ protein + (1|sire) + (protein|herd),
             data = sim_milk)
fit5 <- lmer(formula = y ~ protein + (protein|sire) + (1|herd),
             data = sim_milk)

Models with (protein|sire) + (protein|herd) will end up being singular and are therefore not considered. Next we do hierarchical model comparison with anova.

anova(fit1, fit2, fit3, fit4, fit5)
Data: sim_milk
Models:
fit1: y ~ (1 | sire)
fit2: y ~ protein + (1 | sire)
fit3: y ~ protein + (1 | sire) + (1 | herd)
fit4: y ~ protein + (1 | sire) + (protein | herd)
fit5: y ~ protein + (protein | sire) + (1 | herd)
     npar    AIC    BIC logLik deviance    Chisq Df Pr(>Chisq)    
fit1    3 118636 118657 -59315   118630                           
fit2    4  78937  78965 -39465    78929 39701.01  1  < 2.2e-16 ***
fit3    5  75701  75736 -37845    75691  3238.39  1  < 2.2e-16 ***
fit4    7  74874  74924 -37430    74860   830.23  2  < 2.2e-16 ***
fit5    7  57698  57748 -28842    57684 17176.06  0  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The additional parameters in the most complex model (fit5) seem to have a significant effect on the target variable. In the next step we refit the last two models and include the correlation structure due to the pedigree. We can fit the model with the one step approach because this is a rather small example.

fit6 <- cowfit_lmer(formula = y ~ protein + (1|sire) + (protein|herd),
                    data = sim_milk, pedigree = list(sire = pedSires),
                    cowfit_verbose = FALSE)

fit7 <- cowfit_lmer(formula = y ~ protein + (protein|sire) + (1|herd),
                    data = sim_milk, pedigree = list(sire = pedSires),
                    cowfit_verbose = FALSE)
anova(fit6, fit7)
Data: sim_milk
Models:
fit6: y ~ protein + (1 | sire) + (protein | herd)
fit7: y ~ protein + (protein | sire) + (1 | herd)
     npar   AIC   BIC logLik deviance Chisq Df Pr(>Chisq)    
fit6    7 74877 74926 -37432    74863                        
fit7    7 57702 57751 -28844    57688 17176  0  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we observe that fit6 is better also with considering the correlation structure in the random effects.

Finally we use the function ranef() to extract the random effects of the sires (aka breeding values). Each sire will have two breeding values. The random intercept breeding value is defined as two times the deviation from the population mean of an average offspring at low protein level. The random slope breeding value is associated with the expected effect on higher protein levels

re <- ranef(fit7)
bv <- re$sire
plot(1, xlim = c(0, 5), ylim = c(-150, 150),
     xlab = "protein", ylab = "effect on milk yield",
     type = "n")
z <- apply(bv, 1, function(x) abline(a = x[1], b = x[2]))
bv$eff_prot5 <- bv$`(Intercept)` + bv$protein*5
ind <- which(bv$eff_prot5 > 130)
rownames(bv[ind,])
[1] "321" "324"
z <- apply(bv[ind,], 1, function(x) abline(a = x[1], b = x[2],
                                           col = "red"))

We should select the animal 321 and 324 (red lines) if we select animals which perform better with high protein feedstuff.

Example 2: LMM, large data set, animal model

Next we consider problems where we want to fit an animal model with a larger pedigree. This implies a large number of random effects which makes the estimation computationally difficult. We use the two step approach in order to fit the model in reasonable time. The data set sim_fat and pedigree pedCows are still relatively small, however, the techniques we are using also apply to larger problems. As before, we start fitting the whole data set without considering the correlation of the random effects.

fit1 <- lmer(formula = y ~ (1|animal),
             data = sim_fat)
fit2 <- lmer(formula = y ~ lact + (1|animal),
             data = sim_fat)
fit3 <- lmer(formula = y ~ (1|animal) + (1|herd),
             data = sim_fat)
anova(fit1, fit2, fit3)
Data: sim_fat
Models:
fit1: y ~ (1 | animal)
fit2: y ~ lact + (1 | animal)
fit3: y ~ (1 | animal) + (1 | herd)
     npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
fit1    3 26613 26631 -13303    26607                     
fit2    4 26615 26639 -13303    26607 0.1958  1     0.6581
fit3    4 26615 26639 -13303    26607 0.0000  0     1.0000

We choose the first model and fit the model to a small subset of the data and pedigree.

## Maybe use subPed() ??
library(optiSel)
myp <- subPed(Pedig = data.frame(ID = pedCows@label, Sire = pedCows@sire, Dam = pedCows@dam),
              keep = "2000", prevGen = 0, succGen = 10)
## However we need less random effects and more observations (ideally)
## How can we sample to have MANY observations and LESS random effects?

## 2. Option
## Use ANIMAL model
# BUT: are variance components of animal model the same?
# In my opinion variance components are 0.25 times variance components of animal model
# (effect is only halfe the effect in animal model)

Example 3: GLMM, small data set, sire model

Let’s try an example where our target variable is not continuous but a binary. Assume we want to have less multiple birth events in a population of animals because multiple births are associated with complications. We do not care about the exact number of offspring therefore we just model a binary trait, multiple birth true or false. We also expect an influence of predictor variable lact on the probability for multiple birth. The data is stored in data frame sim_twin and pedSires is the associated pedigree of the sires.

We start by fitting models with increasing complexity with the function glmer()

fit1 <- glmer(formula = y ~ (1|sire),
              data = sim_twin, family = "binomial")
fit2 <- glmer(formula = y ~ lact + (1|sire),
              data = sim_twin, family = "binomial")
fit3 <- glmer(formula = y ~ lact + (1|sire) + (1|herd),
              data = sim_twin, family = "binomial")
fit4 <- glmer(formula = y ~ lact + (1|sire) + (lact|herd),
              data = sim_twin, family = "binomial")
fit5 <- glmer(formula = y ~ lact + (lact|sire) + (1|herd),
              data = sim_twin, family = "binomial")
fit6 <- glmer(formula = y ~ lact + (lact|sire),
              data = sim_twin, family = "binomial")
anova(fit1, fit2, fit3, fit4, fit5, fit6)
Data: sim_twin
Models:
fit1: y ~ (1 | sire)
fit2: y ~ lact + (1 | sire)
fit3: y ~ lact + (1 | sire) + (1 | herd)
fit6: y ~ lact + (lact | sire)
fit4: y ~ lact + (1 | sire) + (lact | herd)
fit5: y ~ lact + (lact | sire) + (1 | herd)
     npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
fit1    2 5918.2 5932.2 -2957.1   5914.2                         
fit2    3 5580.6 5601.7 -2787.3   5574.6 339.61  1     <2e-16 ***
fit3    4 5582.6 5610.7 -2787.3   5574.6   0.00  1          1    
fit6    5 5010.1 5045.2 -2500.0   5000.1 574.49  1     <2e-16 ***
fit4    6 5486.2 5528.4 -2737.1   5474.2   0.00  1          1    
fit5    6 5012.1 5054.3 -2500.0   5000.1 474.13  0     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 5 and 6 are very close according to AIC.

anova(fit6, fit5)
Data: sim_twin
Models:
fit6: y ~ lact + (lact | sire)
fit5: y ~ lact + (lact | sire) + (1 | herd)
     npar    AIC    BIC logLik deviance Chisq Df Pr(>Chisq)
fit6    5 5010.1 5045.2  -2500   5000.1                    
fit5    6 5012.1 5054.3  -2500   5000.1     0  1          1

The additional parameters of model 5 do not have a significant effect, which is why we prefer the simpler model. Fitting this model with cowfit_glmer() already takes some seconds but is still possible without prespecified variance components.

system.time(fit7 <- cowfit_glmer(formula = y ~ lact + (lact|sire)
                                 , data = sim_twin, family = "binomial"
                                 , pedigree = list(sire = pedSires)
                                 , cowfit_verbose = FALSE))
   user  system elapsed 
 64.403   0.048  64.509 

Our goal is to exclude all sires which inherit a breeding value associated with probability > 0.5 for twins at third lactation.

re <- ranef(fit7)
bv <- re$sire
bv$sire <- rownames(bv)
betas <- fixef(fit7)
bv$lin_pred <- betas[1] + betas[2]*3 + bv$`(Intercept)` + bv$lact*3
bv$p <- inv_logit(bv$lin_pred)
bv$sire[which(bv$p > 0.5)]
 [1] "2"   "4"   "320" "323" "324" "325" "334" "336" "338" "339" "342" "345"
[13] "348" "349"

Note that we have to calculate the linear predictor manually. Using predict() will be wrong because it does use the random effects on a transformed scale.

Example 4: GLMM, large data set, animal model