Genomic evaluations with many more genotypes

RESEARCH Open Access

Paul M VanRaden1*, Jeffrey R O’Connell2, George R Wiggans1, Kent A Weigel3


Background: Genomic evaluations in Holstein dairy cattle have quickly become more reliable over the last two years in many countries as more animals have been genotyped for 50,000 markers. Evaluations can also include animals genotyped with more or fewer markers using new tools such as the 777,000 or 2,900 marker chips recently introduced for cattle. Gains from more markers can be predicted using simulation, whereas strategies to use fewer markers have been compared using subsets of actual genotypes. The overall cost of selection is reduced by genotyping most animals at less than the highest density and imputing their missing genotypes using haplotypes. Algorithms to combine different densities need to be efficient because numbers of genotyped animals and markers may continue to grow quickly.

Methods: Genotypes for 500,000 markers were simulated for the 33,414 Holsteins that had 50,000 marker genotypes in the North American database. Another 86,465 non-genotyped ancestors were included in the pedigree file, and linkage disequilibrium was generated directly in the base population. Mixed density datasets were created by keeping 50,000 (every tenth) of the markers for most animals. Missing genotypes were imputed using a combination of population haplotyping and pedigree haplotyping. Reliabilities of genomic evaluations using linear and nonlinear methods were compared.

Results: Differing marker sets for a large population were combined with just a few hours of computation. About 95% of paternal alleles were determined correctly, and > 95% of missing genotypes were called correctly. Reliability of breeding values was already high (84.4%) with 50,000 simulated markers. The gain in reliability from increasing the number of markers to 500,000 was only 1.6%, but more than half of that gain resulted from genotyping just 1,406 young bulls at higher density. Linear genomic evaluations had reliabilities 1.5% lower than the nonlinear evaluations with 50,000 markers and 1.6% lower with 500,000 markers.

Conclusions: Methods to impute genotypes and compute genomic evaluations were affordable with many more markers. Reliabilities for individual animals can be modified to reflect success of imputation. Breeders can improve reliability at lower cost by combining marker densities to increase both the numbers of markers and animals included in genomic evaluation. Larger gains are expected from increasing the number of animals than the number of markers.

* Correspondence: 1Animal Improvement Programs Laboratory, USDA, Building 5 BARC-West, Beltsville, MD 20705-2350, USA
Full list of author information is available at the end of the article

VanRaden et al. Genetics Selection Evolution 2011, 43:10

© 2011 VanRaden, et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Breeders now use thousands of genetic markers to select and improve animals. Previously only phenotypes and pedigrees were used in selection, but performance and parentage information was collected, stored, and evaluated affordably and routinely for many traits and many millions of animals. Genetic markers had limited use during the century after Mendel’s principles of genetic inheritance were rediscovered because few major QTL were identified and because marker genotypes were expensive to obtain before 2008. Genomic evaluations implemented in the last two years for dairy cattle have greatly improved reliability of selection, especially for younger animals, by using many markers to trace the inheritance of many QTL with small effects.

More genetic markers can increase both reliability and cost of genomic selection. Genotypes for 50,000 markers now cost <US$200 per animal for cattle, pigs, chickens, and sheep. Lower cost chips containing fewer (2,900) markers and higher cost chips with more (777,000) markers are already available for cattle, and additional genotyping tools will become available for cattle and other species in the near future. All three billion DNA base pairs of several Holstein bulls have been fully sequenced and costs of sequence data are rapidly declining.

Reliabilities of genomic predictions were compared in previous studies for up to 50,000 actual or 1 million simulated markers. Reliabilities for young animals increased gradually as marker numbers increased from a few hundred up to 50,000 [1-3], and increased slightly when markers with low minor allele frequency were included [4]. For low- to medium-density panels (300 to 3,000 markers), selection of markers with large effects preserves more reliability if only the selected markers are used in the evaluation [5], but evenly spaced markers preserve more reliability for all traits if imputation is used [6]. Reliabilities increased from 81 up to 83% as numbers of simulated markers increased from 50,000 to 100,000 using 40,000 predictor bulls [7]; however, base population alleles in that study were in equilibrium rather than disequilibrium.

Increasing marker numbers above 20,000 up to 1 million linked markers resulted in almost no gains in reliability in a simulation of 10 chromosomes and 1,500 QTL [8]. Larger gains resulted in a simulation of only one chromosome containing three to 30 QTL that accounted for all of the additive variance [9]. Many genome-wide association studies of human traits have combined large numbers of markers from different chips [10], but those studies almost always estimated effects of individual loci rather than including all the loci to estimate the total genetic effect.

Many genotypes will be missing in the future when data from denser or less dense chips are merged with current genotypes from 50,000-marker chips or when two different 50,000-marker sets are merged, as is being done in the EuroGenomics project [1,12]. Missing genotypes of descendants can be imputed accurately using low-density marker sets if ancestor haplotypes are available [13-15]. At low marker densities, haplotypes provide higher accuracy than genotypes when included in genomic evaluation [1,16]. Missing genotypes were not an immediate problem with data from a 50,000-marker set because >99% of genotypes were read correctly [17].

Fewer markers can be used to trace chromosome segments within a population once identified by high-density haplotyping. Without haplotyping, regressions could simply be computed for available SNP and the rest disregarded. With haplotyping, effects of both observed and unobserved SNP can be included. Transition to higher density chips will require including multiple marker sets in one analysis because breeders will not regenotype most animals.

Simulated genotypes and haplotypes can be more useful than real data to test programs and hypotheses. Examples are analyses of larger data sets than are currently available or comparison of estimated haplotypes with true haplotypes, which are not observable in real data. Most simulations begin with all alleles in the founding generation in Hardy-Weinberg equilibrium and then introduce linkage disequilibrium (LD) using many non-overlapping generations of hypothetical pedigrees [18] or fewer generations of actual pedigree [19]. Simulations can also include selection [20] or model divergent populations such as breeds [21]. Many genomic evaluation studies simulated shorter genomes and fewer chromosomes than in actual populations, presumably because computing times for obtaining complete data were too long.

Goals of this study are to 1) impute genotypes using a combination of population and pedigree haplotyping, 2) compute genomic evaluations with up to 500,000 simulated markers, and 3) evaluate potential gains in reliability from increasing numbers of markers.

Methods

Haplotyping program

Unknown genotypes can be made known (imputed) from observed genotypes at the same or nearby loci of relatives using pedigree haplotyping or from matching allele patterns (regardless of pedigree) using population haplotyping. Haplotypes indicate which alleles are on each chromosome and can distinguish the maternal chromosome provided by the ovum from the paternal chromosome provided by the sperm. Genotypes indicate only how many copies of each allele an individual inherited from its two parents.

Fortran program findhap.f90 was designed to combine population and pedigree haplotyping. Genotypes were coded numerically as 0 if homozygous for the first allele, 2 if homozygous for the second allele, and 1 if heterozygous or not known; haplotypes were coded as 0 for the first allele, 2 for the second allele, and 1 for unknown to simplify matching. The algorithm began by creating a list of haplotypes from the genotypes in the first pass, and the process was iterated so genotypes earlier in the file could be matched again using haplotype refinements that occurred later.

Steps used in the population haplotyping algorithm were:

1) each chromosome was divided into segments of about 500 markers each when analyzing the 500,000 marker or mixed datasets and 100 markers each for 50,000 marker data;
2) the first genotype was entered into the haplotype list as if it was a haplotype;
3) any subsequent genotypes that shared a haplotype were then used to split the previous genotypes into haplotypes;
4) as each genotype was compared to the list, a match was declared if no homozygous loci conflicted with the stored haplotype;
5) any remaining unknown alleles in that haplotype were imputed from homozygous alleles in the genotype;
6) the individual’s second haplotype was obtained by subtracting its first haplotype from its genotype, and the second haplotype was checked against remaining haplotypes in the list;
7) if no match was found, the new genotype (or haplotype) was added to the end of the list, and unknown alleles in the genotype were stored as unknown alleles in the haplotype;
8) the list of currently known haplotypes was sorted from most to least frequent as haplotypes were found, for efficiency and so that more probable haplotypes were preferred.
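The list-building steps above can be sketched for a single segment as follows. This is a minimal single-pass illustration in Python, not the findhap.f90 code. The function names and the toy genotypes are hypothetical, every genotype is assumed to be observed (so code 1 in a genotype always means heterozygous), and "subtracting" in step 6 is read as 2 × genotype − haplotype under the 0/1/2 coding. A fully homozygous animal is allowed to match the same list entry twice, which is also an assumption.

```python
from typing import Dict, List, Optional, Tuple

UNKNOWN = 1  # genotype: heterozygous; haplotype: allele not yet known


def conflicts(alleles: List[int], haplotype: List[int]) -> bool:
    """Step 4: a conflict is a locus where both codes are known (0 or 2) and differ."""
    return any(a != UNKNOWN and h != UNKNOWN and a != h
               for a, h in zip(alleles, haplotype))


def find_match(alleles: List[int], entries: List[Dict], start: int = 0) -> Optional[int]:
    """Return the position of the first stored haplotype without conflicts."""
    for k in range(start, len(entries)):
        if not conflicts(alleles, entries[k]["alleles"]):
            return k
    return None


def record_match(alleles: List[int], entry: Dict) -> None:
    """Step 5: fill unknown stored alleles from known ones, then count the match."""
    entry["alleles"] = [a if h == UNKNOWN else h
                        for a, h in zip(alleles, entry["alleles"])]
    entry["count"] += 1


def haplotype_pass(genotypes: List[List[int]]) -> Tuple[List[Dict], List[Tuple]]:
    entries: List[Dict] = []   # haplotype list, kept sorted by frequency
    carried = []               # (first, second) entries per animal
    for g in genotypes:
        k = find_match(g, entries)
        if k is None:
            # Steps 2 and 7: store the genotype as if it were a haplotype.
            entries.append({"alleles": list(g), "count": 1})
            carried.append((entries[-1], None))
        else:
            first = entries[k]
            record_match(g, first)
            # Step 6: second haplotype = 2 * genotype - first haplotype.
            second_alleles = [2 * a - h for a, h in zip(g, first["alleles"])]
            start = k if UNKNOWN not in g else k + 1
            m = find_match(second_alleles, entries, start)
            if m is None:
                entries.append({"alleles": second_alleles, "count": 1})
                second = entries[-1]
            else:
                second = entries[m]
                record_match(second_alleles, second)
            carried.append((first, second))
        # Step 8: most frequent haplotypes first (the sort is stable).
        entries.sort(key=lambda e: -e["count"])
    position = {id(e): i for i, e in enumerate(entries)}
    pairs = [(position[id(a)], None if b is None else position[id(b)])
             for a, b in carried]
    return entries, pairs


toy_genotypes = [[0, 2, 1, 1, 2, 0], [0, 2, 0, 2, 2, 0], [1, 2, 1, 2, 1, 0]]
entries, pairs = haplotype_pass(toy_genotypes)
```

In this toy run the first animal is stored as a partly unknown haplotype, the second (fully homozygous) animal fills in its unknown alleles, and the third animal contributes a new complementary haplotype. The first animal's second haplotype stays unresolved after one pass, which is why the text iterates over the data several times.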

Steps 4) and 6) of the algorithm for population haplotyping are demonstrated in Figure 1 for a shortened segment of 57 markers. The example genotype conflicted with the first four listed haplotypes but had no conflicts with haplotype number 5. After removing haplotype 5 from the genotype to obtain the animal’s complementary haplotype, the algorithm searched for the complementary haplotype in the remainder of the list until it was identified as haplotype 8. Instead of storing all 57 codes from the segments found, this animal’s haplotypes were stored simply as 5 and 8. In practice, some alleles in the least frequent haplotypes remain unknown because few or no matches were found or because each matching genotype happened to be heterozygous at that locus.
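The arithmetic of this example can be checked with the two 57-marker strings printed with Figure 1. The short Python sketch below assumes that "removing" one haplotype from a genotype means 2 × genotype − haplotype at each locus under the 0/1/2 coding. The haplotype list itself is not reproduced in the text, so the first haplotype (number 5 in the figure) is recovered here from the genotype and the second haplotype rather than read from the figure. The function names are hypothetical.

```python
# Strings copied from the Figure 1 labels (57 markers each).
GENOTYPE = "022112222011221022021110220010110212202000102020120002021"
SECOND_HAPLOTYPE = "022002222002220022022020220020200202202000202020020002020"


def subtract(genotype: str, haplotype: str) -> str:
    """Assumed reading of step 6: other haplotype = 2 * genotype - haplotype."""
    return "".join(str(2 * int(g) - int(h)) for g, h in zip(genotype, haplotype))


def conflicts(genotype: str, haplotype: str) -> bool:
    """Step 4: conflict where a homozygous locus disagrees with a known allele."""
    return any(g != "1" and h != "1" and g != h
               for g, h in zip(genotype, haplotype))


# Recover the first haplotype, then repeat the subtraction in the direction
# described in the text to get the second haplotype back.
first_haplotype = subtract(GENOTYPE, SECOND_HAPLOTYPE)
recovered_second = subtract(GENOTYPE, first_haplotype)

# Compact storage as described in the text: list positions, not 57 codes.
stored = (5, 8)
```

Both haplotypes agree with the genotype at every homozygous locus and differ from each other only where the genotype is heterozygous, which is the property the matching rule in step 4 relies on.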

Iteration proceeded as follows. The first two iterations used only population haplotyping and not the pedigree. The first used only the highest density genotypes, and later iterations used all genotypes. The third and fourth iterations used both pedigree and population methods to locate matching haplotypes. Known haplotypes of genotyped parents (or grandparents if parents were not genotyped) were checked first, and if either of the individual’s haplotypes was not found with this quick check, then checking restarted from the top of the sorted list. For example, the algorithm in Figure 1 could check haplotypes 5 and 8 first if parent genotypes are known to contain these haplotypes. The last two iterations did not search sequentially through the haplotype list and instead used only pedigrees to impute haplotypes of non-genotyped ancestors from their genotyped descendants, locate crossovers that created new haplotypes, and resolve conflicts between parent and progeny haplotypes. If parent and progeny haplotypes differed at just one marker, the difference was assumed to be genotyping error, and the more frequent haplotype was substituted for the less frequent.
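Two of the pedigree rules in this paragraph lend themselves to a short sketch: checking the parents' haplotypes before scanning the sorted list, and replacing the rarer of two haplotypes that differ at a single marker. The Python below is an illustration under stated assumptions, not the findhap.f90 logic. The function names and the four-marker haplotype list are hypothetical, and a lower list position is taken to mean a more frequent haplotype because the list is sorted that way.

```python
from typing import List, Optional, Sequence, Tuple


def conflicts(genotype: str, haplotype: str) -> bool:
    """A conflict is a locus where both codes are known (0 or 2) and differ."""
    return any(g != "1" and h != "1" and g != h
               for g, h in zip(genotype, haplotype))


def find_haplotype(genotype: str, sorted_haps: List[str],
                   parent_ids: Sequence[int]) -> Optional[int]:
    """Quick check of the parents' haplotypes, then a scan from the top."""
    for i in parent_ids:
        if not conflicts(genotype, sorted_haps[i]):
            return i
    for i, hap in enumerate(sorted_haps):
        if not conflicts(genotype, hap):
            return i
    return None


def reconcile(parent_id: int, progeny_id: int,
              sorted_haps: List[str]) -> Tuple[int, int]:
    """Treat a single-marker difference as genotyping error and keep the
    more frequent haplotype (the lower position in the sorted list)."""
    differences = sum(a != b for a, b in
                      zip(sorted_haps[parent_id], sorted_haps[progeny_id]))
    if differences == 1:
        common = min(parent_id, progeny_id)
        return common, common
    return parent_id, progeny_id


# Hypothetical list, sorted from most to least frequent.
haps = ["0202", "2200", "0220", "2202"]
with_parent = find_haplotype("1212", haps, parent_ids=[3])
without_parent = find_haplotype("1212", haps, parent_ids=[])
fallback = find_haplotype("1212", haps, parent_ids=[1])
```

The genotype "1212" is compatible with both haplotype 0 and haplotype 3, so pedigree information changes which one is chosen. When the parent's haplotype conflicts, the search falls back to the top of the list.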

Search for 1st haplotype that matches genotype: 022112222011221022021110220010110212202000102020120002021
Get 2nd haplotype by removing 1st from genotype: 022002222002220022022020220020200202202000202020020002020
Figure 1 Demonstration of algorithm to find first and second haplotypes.

Imputation success was measured in several ways. Percentages of alleles missing before and after imputation indicated the amount of fill needed and remaining. Percentages of incorrect genotypes were calculated across all loci including the genotypes observed, the haplotypes imputed, and the remaining haplotypes not imputed but simply assigned alleles using allele frequency. An alternative error rate counted differences between heterozygous and homozygous genotypes as only half errors and differences between opposite homozygotes as full errors across the imputed and assigned loci but not including the observed loci [1]. The percentage of true linkages between consecutive heterozygous markers that differed from estimated linkages was determined, as well as the percentage of heterozygous loci at which the allele estimated to be paternal was actually maternally inherited.
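The two genotype error rates described in this passage can be written down compactly. The Python sketch below assumes that true and estimated genotypes are both fully called and coded 0, 1, 2 (1 = heterozygous), so that a heterozygote-versus-homozygote difference has absolute value 1 (half an error) and an opposite-homozygote difference has absolute value 2 (a full error). The function name and the six-locus example are hypothetical.

```python
from typing import Sequence, Tuple


def imputation_error_rates(true_genotypes: Sequence[int],
                           estimated_genotypes: Sequence[int],
                           observed: Sequence[bool]) -> Tuple[float, float]:
    """Return (percent incorrect over all loci,
               percent weighted error over non-observed loci)."""
    n = len(true_genotypes)
    incorrect = sum(t != e for t, e in zip(true_genotypes, estimated_genotypes))
    percent_incorrect = 100.0 * incorrect / n

    # Only imputed or frequency-assigned loci enter the alternative rate.
    unobserved = [(t, e) for t, e, o in
                  zip(true_genotypes, estimated_genotypes, observed) if not o]
    weighted = sum(abs(t - e) / 2 for t, e in unobserved)
    percent_weighted = 100.0 * weighted / len(unobserved) if unobserved else 0.0
    return percent_incorrect, percent_weighted


true_g = [0, 1, 2, 2, 0, 1]
est_g = [0, 1, 2, 1, 2, 1]
was_observed = [True, True, False, False, False, False]
pct_incorrect, pct_weighted = imputation_error_rates(true_g, est_g, was_observed)
```

In the example, two of six genotypes are wrong (about 33.3%), while the alternative rate over the four non-observed loci is (0.5 + 1.0) / 4 = 37.5%. The two measures use different denominators, so they are not directly comparable.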

Simulating linkage disequilibrium

Methods to simulate LD were derived, and the simulation program of [19] was modified to generate LD directly in the earliest known ancestors in the pedigree (the founding population). Previously, marker alleles were simulated in equilibrium and uncorrelated across loci in the founding population, but genotypes at adjacent markers become more correlated as marker densities increase. Most other studies [18] used thousands of generations of random mating to establish a balance between recombination, drift, and mutation in small populations with actual size set equal to effective size. Fewer rare and more common haplotypes would occur than in actual populations with unbalanced contributions to the next generation. Neither the standard nor the new approach may provide exactly the same LD pattern as in actual genotypes.

Initial LD was generated by establishing marker properties for the population, simulating underlying, unobservable, linked bi-allelic markers that each have an allele frequency of 0.5, and setting minor allele frequencies for observed markers to <0.5 by randomly replacing a corresponding fraction of the underlying alleles by the major allele.
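The last step, lowering the frequency from 0.5 to a target minor allele frequency, can be illustrated with a short calculation. If each underlying allele is replaced by the major allele with probability r, the minor allele frequency becomes 0.5 × (1 − r), so a target frequency q < 0.5 corresponds to r = 1 − 2q. This reading of "a corresponding fraction" is an assumption, as are the function name and the 0 = major, 1 = minor coding in the Python sketch below. The demonstration also draws the underlying alleles independently as a placeholder, so it does not reproduce the linkage between underlying markers that the simulation generates.

```python
import random
from typing import List, Sequence


def thin_to_maf(underlying: List[List[int]], maf: Sequence[float],
                rng: random.Random) -> List[List[int]]:
    """Replace each underlying allele by the major allele (0) with
    probability 1 - 2q, where q is that marker's target frequency."""
    observed = []
    for haplotype in underlying:
        observed.append([0 if rng.random() < 1.0 - 2.0 * q else allele
                         for allele, q in zip(haplotype, maf)])
    return observed


rng = random.Random(2011)
target_maf = [0.4, 0.2, 0.05]
# Placeholder input: independent alleles with frequency 0.5 (no linkage).
underlying = [[rng.randint(0, 1) for _ in target_maf] for _ in range(20000)]
observed = thin_to_maf(underlying, target_maf, rng)
frequencies = [sum(column) / len(observed) for column in zip(*observed)]
```

With 20,000 simulated haplotypes, the observed minor allele frequencies should fall close to the targets. Because a replacement can only turn a minor allele into a major one, carriers of the minor allele at an observed marker are always a subset of carriers at the underlying marker.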
