Filter by type:


The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.


We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.

Availability and implementation

Our tool RecoverY is freely available at

Supplementary information

Supplementary data are available at Bioinformatics online.



Maternal breast milk (MBM) is enriched in microRNAs, factors that regulate protein translation throughout the human body. MBM from mothers of term and preterm infants differs in nutrient, hormone, and bioactive-factor composition, but the microRNA differences between these groups have not been compared. We hypothesized that gestational age at delivery influences microRNA in MBM, particularly microRNAs involved in immunologic and metabolic regulation.


MBM from mothers of premature infants (pMBM) obtained 3-4 weeks post delivery was compared with MBM from mothers of term infants obtained at birth (tColostrum) and 3-4 weeks post delivery (tMBM). The microRNA profile in lipid and skim fractions of each sample was evaluated with high-throughput sequencing.


The expression profiles of nine microRNAs in lipid and skim pMBM differed from those in tMBM. Gene targets of these microRNAs were functionally related to elemental metabolism and lipid biosynthesis. The microRNA profile of tColostrum was also distinct from that of pMBM, but it clustered closely with tMBM. Twenty-one microRNAs correlated with gestational age demonstrated limited relationships with method of delivery, but not other maternal-infant factors.


Premature delivery results in a unique MBM microRNA profile with metabolic targets. This suggests that preterm milk may have adaptive functions for growth in premature infants.


Hundreds of vertebrate genomes have been sequenced and assembled to date. However, most sequencing projects have ignored the sex chromosomes unique to the heterogametic sex – Y and W – that are known as sex-limited chromosomes (SLCs). Indeed, haploid and repetitive Y chromosomes in species with male heterogamety (XY), and W chromosomes in species with female heterogamety (ZW), are difficult to sequence and assemble. Nevertheless, obtaining their sequences is important for understanding the intricacies of vertebrate genome function and evolution. Recent progress has been made towards the adaptation of next-generation sequencing (NGS) techniques to deciphering SLC sequences. We review here currently available methodology and results with regard to SLC sequencing and assembly. We focus on vertebrates, but bring in some examples from other taxa.

Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. It has only been evaluated for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately 1 order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.

Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs’ integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.

The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.


Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3” half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.


Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies - and quite effectively reproduces empirically observed distributions. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. All together, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.


Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.


The uses of the Genome Reference Consortium’s human reference sequence can be roughly categorized into three related but distinct categories: as a representative species genome, as a coordinate systemfor identifying variants, and as an alignment reference for variation detection algorithms. However, the use of this reference sequence as simultaneously a representative species genome and as an alignment reference leads to unnecessary artifacts for structural variation detection algorithms and limits their accuracy.We show how decoupling these two references and developing a separate alignment reference can significantly improve the accuracy of structural variation detection, lead to improved genotyping of disease related genes, and decrease the cost of studying polymorphismin a population.
PLOS ONE, 2015

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCRfree protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

In this article we review a number of recent studies in which information derived from genomic alignments and data concerning composition, location and biochemical features of the nuclear DNA are used to investigate salient properties and determinants of change (mutations) in the human genome. The studies under review, all conducted by an interdisciplinary group of investigators at The Pennsylvania State University, required the use of a range of statistical techniques—from regression, to multivariate analysis, to the modeling of latent structures.

The variation in local rates of mutations can affect both the evolution of genes and their function in normal and cancer cells. Deciphering the molecular determinants of this variation will be aided by the elucidation of distinct types of mutations, as they differ in regional preferences and in associations with genomic features. Chromatin organization contributes to regional variation in mutation rates, but its contribution differs among mutation types. In both germline and somatic mutations, base substitutions are more abundant in regions of closed chromatin, perhaps reflecting error accumulation late in replication. By contrast, a distinctive mutational state with very high levels of insertions and deletions (indels) and substitutions is enriched in regions of open chromatin. These associations indicate an intricate interplay between the nucleotide sequence of DNA and its dynamic packaging into chromatin, and have important implications for current biomedical research. This Review focuses on recent studies showing associations between chromatin state and mutation rates, including pairwise and multivariate investigations of germline and somatic (particularly cancer) mutations.

The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000x per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the effective size of the germline mtDNA bottleneck at only ∼30-35 (interquartile range from 9 to 141). Accounting for heteroplasmies, we estimated the mtDNA germ-line mutation rate at 1.3 × 10-8 (interquartile range from 4.2 × 10-9 to 4.1 × 10-8) mutations per site per year, an order of magnitude higher than for nuclear DNA. Notably,we found a positive association between the number of heteroplasmies in a child andmaternal age at fertilization, likely attributable to oocyte aging. This study also took advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novomutation. Our results can be used to predict the transmission of disease-causing mtDNA variants and illuminate evolutionary dynamics of the mitochondrial genome.


Because early life growth has long-lasting metabolic and behavioral consequences, intervention during this period of developmental plasticity may alter long-term obesity risk. While modifiable factors during infancy have been identified, until recently, preventive interventions had not been tested. The Intervention Nurses Starting Infants Growing on Healthy Trajectories (INSIGHT). Study is a longitudinal, randomized, controlled trial evaluating a responsive parenting intervention designed for the primary prevention of obesity. This “parenting” intervention is being compared with a home safety control among first-born infants and their parents. INSIGHT’s central hypothesis is that responsive parenting and specifically responsive feeding promotes self-regulation and shared parent-child responsibility for feeding, reducing subsequent risk for overeating and overweight.


316 first-time mothers and their full-term newborns were enrolled from one maternity ward. Two weeks following delivery, dyads were randomly assigned to the “parenting” or “safety” groups. Subsequently, research nurses conduct study visits for both groups consisting of home visits at infant age 3-4, 16, 28, and 40 weeks, followed by annual clinic-based visits at 1, 2, and 3 years. Both groups receive intervention components framed around four behavior states: Sleeping, Fussy, Alert and Calm, and Drowsy. The main study outcome is BMI z-score at age 3 years; additional outcomes include those related to patterns of infant weight gain, infant sleep hygiene and duration, maternal responsiveness and soothing strategies for infant/toddler distress and fussiness, maternal feeding style and infant dietary content and physical activity. Maternal outcomes related to weight status, diet, mental health, and parenting sense of competence are being collected. Infant temperament will be explored as a moderator of parenting effects, and blood is collected to obtain genetic predictors of weight status. Finally, second-born siblings of INSIGHT participants will be enrolled in an observation-only study to explore parenting differences between siblings, their effect on weight outcomes, and carryover effects of INSIGHT interventions to subsequent siblings.


With increasing evidence suggesting the importance of early life experiences on long-term health trajectories, the INSIGHT trial has the ability to inform future obesity prevention efforts in clinical settings.

Trial registration

NCT01167270. Registered 21 July 2010.


Interruptions of microsatellite sequences impact genome evolution and can alter disease manifestation. However, human polymorphism levels at interrupted microsatellites (iMSs) are not known at a genome-wide scale, and the pathways for gaining interruptions are poorly understood. Using the 1000 Genomes Phase-1 variant call set, we interrogated mono-, di-, tri-, and tetranucleotide repeats up to 10 units in length. We detected ~26,000-40,000 iMSs within each of four human population groups (African, European, East Asian, and American). We identified population-specific iMSs within exonic regions, and discovered that known disease-associated iMSs contain alleles present at differing frequencies among the populations. By analyzing longer microsatellites in primate genomes, we demonstrate that single interruptions result in a genome-wide average two- to six-fold reduction in microsatellite mutability, as compared with perfect microsatellites. Centrally located interruptions lowered mutability dramatically, by two to three orders of magnitude. Using a biochemical approach, we tested directly whether the mutability of a specific iMS is lower because of decreased DNA polymerase strand slippage errors. Modeling the adenomatous polyposis coli tumor suppressor gene sequence, we observed that a single base substitution interruption reduced strand slippage error rates five- to 50-fold, relative to a perfect repeat, during synthesis by DNA polymerases α, β, or η. Computationally, we demonstrate that iMSs arise primarily by base substitution mutations within individual human genomes. Our biochemical survey of human DNA polymerase α, β, δ, κ, and η error rates within certain microsatellites suggests that interruptions are created most frequently by low fidelity polymerases. Our combined computational and biochemical results demonstrate that iMSs are abundant in human genomes and are sources of population-specific genetic variation that may affect genome stability. The genome-wide identification of iMSs in human populations presented here has important implications for current models describing the impact of microsatellite polymorphisms on gene expression.

The integration and fixation preferences of DNA transposons, one of the major classes of eukaryotic transposable elements, have never been evaluated comprehensively on a genome-wide scale. Here, we present a detailed study of the distribution of DNA transposons in the human and bat genomes. We studied three groups of DNA transposons that integrated at different evolutionary times: 1) ancient (>40 My) and currently inactive human elements, 2) younger (<40 My) bat elements, and 3) ex vivo integrations of piggyBat and Sleeping Beauty elements in HeLa cells. Although the distribution of ex vivo elements reflected integration preferences, the distribution of human and (to a lesser extent) bat elements was also affected by selection. We used regression techniques (linear, negative binomial, and logistic regression models with multiple predictors) applied to 20-kb and 1-Mb windows to investigate how the genomic landscape in the vicinity of DNA transposons contributes to their integration and fixation. Our models indicate that genomic landscape explains 16-79% of variability in DNA transposon genome-wide distribution. Importantly, we not only confirmed previously identified predictors (e.g., DNA conformation and recombination hotspots) but also identified several novel predictors (e.g., signatures of double-strand breaks and telomere hexamer). Ex vivo integrations showed a bias toward actively transcribed regions. Older DNA transposons were located in genomic regions scarce in most conserved elements - likely reflecting purifying selection. Our study highlights how DNA transposons are integral to the evolution of bat and human genomes, and has implications for the development of DNA transposon assays for gene therapy and mutagenesis applications.

The development of molecular tools to detect and report mitochondrial DNA (mtDNA) heteroplasmy will increase the discrimination potential of the testing method when applied to forensic cases. The inherent limitations of the current state-of-the-art, Sanger-based sequencing, including constrictions in speed, throughput, and resolution, have hindered progress in this area. With the advent of next-generation sequencing (NGS) approaches, it is now possible to clearly identify heteroplasmic variants, and at a much lower level than previously possible. However, in order to bring these approaches into forensic laboratories and subsequently as accepted scientific information in a court of law, validated methods will be required to produce and analyze NGS data. We report here on the development of an optimized approach to NGS analysis for the mtDNA genome (mtgenome) using the Illumina MiSeq instrument. This optimized protocol allows for the production of more than 5 gigabases of mtDNA sequence per run, sufficient for detection and reliable reporting of minor heteroplasmic variants down to approximately 0.5-1.0% when multiplexing twelve samples. Depending on sample throughput needs, sequence coverage rates can be set at various levels, but were optimized here for at least 5000 reads. In addition, analysis parameters are provided for a commercially available software package that identify the highest quality sequencing reads and effectively filter out sequencing-based noise. With this method it will be possible to measure the rates of low-level heteroplasmy across the mtgenome, evaluate the transmission of heteroplasmy between the generations of maternal lineages, and assess the drift of variant sequences between different tissue types within an individual.

Polymorphism discovery is a routine application of next-generation sequencing technology where multiple samples are sent to a service provider for library preparation, subsequent sequencing, and bioinformatic analyses. The decreasing cost and advances in multiplexing approaches have made it possible to analyze hundreds of samples at a reasonable cost. However, because of the manual steps involved in the initial processing of samples and handling of sequencing equipment, cross-contamination remains a significant challenge. It is especially problematic in cases where polymorphism frequencies do not adhere to diploid expectation, for example, heterogeneous tumor samples, organellar genomes, as well as during bacterial and viral sequencing. In these instances, low levels of contamination may be readily mistaken for polymorphisms, leading to false results. Here we describe practical steps designed to reliably detect contamination and uncover its origin, and also provide new, Galaxy-based, readily accessible computational tools and workflows for quality control. All results described in this report can be reproduced interactively on the web as described at

Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human-orangutan alignments of neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states - each uniquely characterized by specific combinations of divergence levels. We then parsed the mutagenic contributions of various biochemical processes associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes. We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants-including those associated with cancer and other diseases-and to improve computational predictions of noncoding functional elements.

Dinucleotide microsatellites are dynamic DNA sequences that affect genome stability. Here, we focused on mature microsatellites, defined as pure repeats of lengths above the threshold and unlikely to mutate below it in a single mutational event. We investigated the prevalence and mutational behavior of these sequences by using human genome sequence data, human cells in culture, and purified DNA polymerases. Mature dinucleotides (≥10 units) are present within exonic sequences of >350 genes, resulting in vulnerability to cellular genetic integrity. Mature dinucleotide mutagenesis was examined experimentally using ex vivo and in vitro approaches. We observe an expansion bias for dinucleotide microsatellites up to 20 units in length in somatic human cells, in agreement withprevious computational analyses of germline biases. Using purified DNA polymerases and human cell lines deficient for mismatch repair (MMR), we show that the expansion bias is caused by functional MMR and is not due to DNA polymerase error biases. Specifically, we observe that the MutSα and MutLα complexes protect against expansion mutations. Our data support a model wherein different MMR complexes shift the balance of mutations toward deletionor expansion. Finally, we show that replication fork progression is stalled within long dinucleotides, suggesting that mutational mechanisms within long repeats may be distinct from shorter lengths, depending on the biochemistry of fork resolution. Our work combines computational and experimental approaches to explain the complex mutational behavior of dinucleotide microsatellites in humans.

A tandem repeat’s (TR) propensity to mutate increases with repeat number, and can become very pronounced beyond a critical boundary, transforming it into a microsatellite (MS). However, a clear understanding of the mutational behavior of different TR classes and motifs and related mechanisms is lacking, as is a consensus on the existence of a boundary separating short TRs (STRs) from MSs. This hinders our understanding of MSs’ mutational properties and their effective use as genetic markers. Using indel calls for 179 individuals from 1000 Genomes Pilot-1 Project, we determined polymorphism incidence for four major TR classes, and formalized its varying relationship with repeat number using segmented regression. We observed a biphasic regime with a transition from a faster to a slower exponential growth at 9, 5, 4, and 4 repeats for mono-, di-, tri-, and tetranucleotide TRs, respectively. We used an in vitro mutagenesis assay to evaluate the contribution of strand slippage errors to mutability. STRs and MSs differ in their absolute polymorphism levels, but more importantly in their rates of mutability growth. Although strand slippage is a major factor driving mononucleotide polymorphism incidence, dinucleotide polymorphism incidence is greater than that expected due to strand slippage alone, indicating that additional cellular factors might be driving dinucleotide mutability in the human genome. Leveraging on hundreds of human genomes, we present the first comprehensive, genome-wide analysis of TR mutational behavior, encompassing several motif sizes and compositions.

Alu elements are trans-mobilized by the autonomous non-LTR retroelement, LINE-1 (L1). Alu-induced insertion mutagenesis contributes to about 0.1% human genetic disease and is responsible for the majority of the documented instances of human retroelement insertion-induced disease. Here we introduce a SINE recovery method that provides a complementary approach for comprehensive analysis of the impact and biological mechanisms of Alu retrotransposition. Using this approach, we recovered 226 de novo tagged Alu inserts in HeLa cells. Our analysis reveals that in human cells marked Alu inserts driven by either exogenously supplied full length L1 or ORF2 protein are indistinguishable. Four percent of de novo Alu inserts were associated with genomic deletions and rearrangements and lacked the hallmarks of retrotransposition. In contrast to L1 inserts, 5′ truncations of Alu inserts are rare, as most of the recovered inserts (96.5%) are full length. De novo Alus show a random pattern of insertion across chromosomes, but further characterization revealed an Alu insertion bias exists favoring insertion near other SINEs, highly conserved elements, with almost 60% landing within genes. De novo Alu inserts show no evidence of RNA editing. Priming for reverse transcription rarely occurred within the first 20 bp (most 5′) of the A-tail. The A-tails of recovered inserts show significant expansion, with many at least doubling in length. Sequence manipulation of the construct led to the demonstration that the A-tail expansion likely occurs during insertion due to slippage by the L1 ORF2 protein. We postulate that the A-tail expansion directly impacts Alu evolution by reintroducing new active source elements to counteract the natural loss of active Alus and minimizing Alu extinction.

Chromosomal common fragile sites (CFSs) are unstable genomic regions that break under replication stress and are involved in structural variation. They frequently are sites of chromosomal rearrangements in cancer and of viral integration. However, CFSs are undercharacterized at the molecular level and thus difficult to predict computationally. Newly available genome-wide profiling studies provide us with an unprecedented opportunity to associate CFSs with features of their local genomic contexts. Here, we contrasted the genomic landscape of cytogenetically defined aphidicolin-induced CFSs (aCFSs) to that of nonfragile sites, using multiple logistic regression. We also analyzed aCFS breakage frequencies as a function of their genomic landscape, using standard multiple regression. We show that local genomic features are effective predictors both of regions harboring aCFSs (explaining ∼77% of the deviance in logistic regression models) and of aCFS breakage frequencies (explaining ∼45% of the variance in standard regression models). In our optimal models (having highest explanatory power), aCFSs are predominantly located in G-negative chromosomal bands and away from centromeres, are enriched in Alu repeats, and have high DNA flexibility. In alternative models, CpG island density, transcription start site density, H3K4me1 coverage, and mononucleotide microsatellite coverage are significant predictors. Also, aCFSs have high fragility when colocated with evolutionarily conserved chromosomal breakpoints. Our models are predictive of the fragility of aCFSs mapped at a higher resolution. Importantly, the genomic features we identified here as significant predictors of fragility allow us to draw valuable inferences on the molecular mechanisms underlying aCFSs.

Microsatellites-tandem repeats of short DNA motifs-are abundant in the human genome and have high mutation rates. While microsatellite instability is implicated in numerous genetic diseases, the molecular processes involved in their emergence and disappearance are still not well understood. Microsatellites are hypothesized to follow a life cycle, wherein they are born and expand into adulthood, until their degradation and death. Here we identified microsatellite births/deaths in human, chimpanzee, and orangutan genomes, using macaque and marmoset as outgroups.We inferred mutations causing births/deaths based on parsimony, and investigated local genomic environments affecting them. We also studied birth/death patterns within transposable elements (Alus and L1s), coding regions, and disease-associated loci. We observed that substitutions were the predominant cause for births of short microsatellites, while insertions and deletions were important for births of longermicrosatellites. Substitutions were the cause for deaths ofmicrosatellites of virtually all lengths. AT-rich L1 sequences exhibited elevated frequency of births/deaths over their entire length, while GC-rich Alus only in their 3′ poly(A) tails and middle A-stretches, with differences depending on transposable element integration timing. Births/deaths were strongly selected against in coding regions. Births/deaths occurred in genomic regions with high substitution rates, protomicrosatellite content, and L1 density, but low GC content and Alu density. The majority of the 17 disease-associated microsatellites examined are evolutionarily ancient (were acquired by the common ancestor of simians). Our genome-wide investigation of microsatellite life cycle has fundamental applications for predicting the susceptibility of birth/death of microsatellites, including many disease-causing loci.



Originally believed to be a rare phenomenon, heteroplasmy - the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual - is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission.


Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, we devised a set of criteria for detecting polymorphic sites in heterogeneous genetic samples that is resistant to the noise originating from massively parallel sequencing technologies. Application of these criteria to nine human mtDNA samples revealed four heteroplasmic sites.


Our results suggest that the incidence of heteroplasmy may be lower than estimated in some other recent re-sequencing studies, and that mtDNA allelic frequencies differ significantly both between tissues of the same individual and between a mother and her offspring. We designed our study in such a way that the complete analysis described here can be repeated by anyone either at our site or directly on the Amazon Cloud. Our computational pipeline can be easily modified to accommodate other applications, such as viral re-sequencing.



While the abundance of available sequenced genomes has led to many studies of regional heterogeneity in mutation rates, the co-variation among rates of different mutation types remains largely unexplored, hindering a deeper understanding of mutagenesis and genome dynamics. Here, utilizing primate and rodent genomic alignments, we apply two multivariate analysis techniques (principal components and canonical correlations) to investigate the structure of rate co-variation for four mutation types and simultaneously explore the associations with multiple genomic features at different genomic scales and phylogenetic distances.


We observe a consistent, largely linear co-variation among rates of nucleotide substitutions, small insertions and small deletions, with some non-linear associations detected among these rates on chromosome X and near autosomal telomeres. This co-variation appears to be shaped by a common set of genomic features, some previously investigated and some novel to this study (nuclear lamina binding sites, methylated non-CpG sites and nucleosome-free regions). Strong non-linear relationships are also detected among genomic features near the centromeres of large chromosomes. Microsatellite mutability co-varies with other mutation rates at finer scales, but not at 1 Mb, and shows varying degrees of association with genomic features at different scales.


Our results allow us to speculate about the role of different molecular mechanisms, such as replication, recombination, repair and local chromatin environment, in mutagenesis. The software tools developed for our analyses are available through Galaxy, an open-source genomics portal, to facilitate the use of multivariate techniques in future large-scale genomics studies.


The cell response to virus infection and virus perturbation of that response is dynamic and is reflected by changes in cell susceptibility to infection. In this study, we evaluated the response of human epithelial cells to sequential infections with human respiratory syncytial virus strains A2 and B to determine if a primary infection with one strain will impact the ability of cells to be infected with the second as a function of virus strain and time elapsed between the two exposures. Infected cells were visualized with fluorescent markers, and location of all cells in the tissue culture well were identified using imaging software. We employed tools from spatial statistics to investigate the likelihood of a cell being infected given its proximity to a cell infected with either the homologous or heterologous virus. We used point processes, K-functions, and simulation procedures designed to account for specific features of our data when assessing spatial associations. Our results suggest that intrinsic cell properties increase susceptibility of cells to infection, more so for RSV-B than for RSV-A. Further, we provide evidence that the primary infection can decrease susceptibility of cells to the heterologous challenge virus but only at the 16 h time point evaluated in this study. Our research effort highlights the merits of integrating empirical and statistical approaches to gain greater insight on in vitro dynamics of virus-host interactions.

To achieve dosage balance of X-linked genes between mammalian males and females, one female X chromosome becomes inactivated. However, approximately 15% of genes on this inactivated chromosome escape X chromosome inactivation (XCI). Here, using a chromosome-wide analysis of primate X-linked orthologs, we test a hypothesis that such genes evolve under a unique selective pressure. We find that escape genes are subject to stronger purifying selection than inactivated genes and that positive selection does not significantly affect the evolution of these genes. The strength of selection does not differ between escape genes with similar versus different expression levels in males versus females. Intriguingly, escape genes possessing Y homologs evolve under the strongest purifying selection. We also found evidence of stronger conservation in gene expression levels in escape than inactivated genes. We hypothesize that divergence in function and expression between X and Y gametologs is driving such strong purifying selection for escape genes.

The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial and small sets of nuclear markers have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans. However, until now, fully sequenced human genomes have been limited to recently diverged populations. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
NATURE, 2010

Microsatellites are abundant in eukaryotic genomes and have high rates of strand slippage-induced repeat number alterations. They are popular genetic markers, and their mutations are associated with numerous neurological diseases. However, the minimal number of repeats required to constitute a microsatellite has been debated, and a definition of a microsatellite that considers its mutational behavior has been lacking. To define a microsatellite, we investigated slippage dynamics for a range of repeat sizes, utilizing two approaches. Computationally, we assessed length polymorphism at repeat loci in ten ENCODE regions resequenced in four human populations, assuming that the occurrence of polymorphism reflects strand slippage rates. Experimentally, we determined the in vitro DNA polymerase-mediated strand slippage error rates as a function of repeat number. In both approaches, we compared strand slippage rates at tandem repeats with the background slippage rates. We observed two distinct modes of mutational behavior. At small repeat numbers, slippage rates were low and indistinguishable from background measurements. A marked transition in mutability was observed as the repeat array lengthened, such that slippage rates at large repeat numbers were significantly higher than the background rates. For both mononucleotide and dinucleotide microsatellites studied, the transition length corresponded to a similar number of nucleotides (approximately 10). Thus, microsatellite threshold is determined not by the presence/absence of strand slippage at repeats but by an abrupt alteration in slippage rates relative to background. These findings have implications for understanding microsatellite mutagenesis, standardization of genome-wide microsatellite analyses, and predicting polymorphism levels of individual microsatellite loci.


Feline immunodeficiency virus (FIV) and human immunodeficiency virus (HIV) are recently identified lentiviruses that cause progressive immune decline and ultimately death in infected cats and humans. It is of great interest to understand how to prevent immune system collapse caused by these lentiviruses. We recently described that disease caused by a virulent FIV strain in cats can be attenuated if animals are first infected with a feline immunodeficiency virus derived from a wild cougar. The detailed temporal tracking of cat immunological parameters in response to two viral infections resulted in high-dimensional datasets containing variables that exhibit strong co-variation. Initial analyses of these complex data using univariate statistical techniques did not account for interactions among immunological response variables and therefore potentially obscured significant effects between infection state and immunological parameters.

Methodology and Principal Findings

Here, we apply a suite of multivariate statistical tools, including Principal Component Analysis, MANOVA and Linear Discriminant Analysis, to temporal immunological data resulting from FIV superinfection in domestic cats. We investigated the co-variation among immunological responses, the differences in immune parameters among four groups of five cats each (uninfected, single and dual infected animals), and the “immune profiles” that discriminate among them over the first four weeks following superinfection. Dual infected cats mount an immune response by 24 days post superinfection that is characterized by elevated levels of CD8 and CD25 cells and increased expression of IL4 and IFNγ, and FAS. This profile discriminates dual infected cats from cats infected with FIV alone, which show high IL-10 and lower numbers of CD8 and CD25 cells.


Multivariate statistical analyses demonstrate both the dynamic nature of the immune response to FIV single and dual infection and the development of a unique immunological profile in dual infected cats, which are protected from immune decline.

PLOS ONE, 2009

Recent studies have revealed that insertions and deletions (indels) are more different in their formation than previously assumed. What remains enigmatic is how the local DNA sequence context contributes to these differences. To investigate the relative impact of various molecular mechanisms to indel formation, we analyzed sequence contexts of indels in the non protein- or RNA-coding, nonrepetitive (NCNR) portion of the human genome. We considered small (≤30-bp) indels occurring in the human lineage since its divergence from chimpanzee and used wavelet techniques to study, simultaneously for multiple scales, the spatial patterns of short sequence motifs associated with indel mutagenesis. In particular, we focused on motifs associated with DNA polymerase activity, topoisomerase cleavage, double-strand breaks (DSBs), and their repair. We came to the following conclusions. First, many motifs are characterized by unique enrichment profiles in the vicinity of indels vs. indel-free portions of the genome, verifying the importance of sequence context in indel mutagenesis. Second, only limited similarity in motif frequency profiles is evident flanking insertions vs. deletions, confirming differences in their mutagenesis. Third, substantial similarity in frequency profiles exists between pairs of individual motifs flanking insertions (and separately deletions), suggesting “cooperation” among motifs, and thus molecular mechanisms, during indel formation. Fourth, the wavelet analyses demonstrate that all these patterns are highly dependent on scale (the size of an interval considered). Finally, our results depict a model of indel mutagenesis comprising both replication and recombination (via repair of paused replication forks and site-specific recombination).


The evolutionary distance between human and macaque is particularly attractive for investigating local variation in neutral substitution rates, because substitutions can be inferred more reliably than in comparisons with rodents and are less influenced by the effects of current and ancient diversity than in comparisons with closer primates. Here we investigate the human-macaque neutral substitution rate as a function of a number of genomic parameters.


Using regression analyses we find that male mutation bias, male (but not female) recombination rate, distance to telomeres and substitution rates computed from orthologous regions in mouse-rat and dog-cow comparisons are prominent predictors of the neutral rate. Additionally, we demonstrate that the previously observed biphasic relationship between neutral rate and GC content can be accounted for by properly combining rates at CpG and non-CpG sites. Finally, we find the neutral rate to be negatively correlated with the densities of several classes of computationally predicted functional elements, and less so with the densities of certain classes of experimentally verified functional elements.


Our results suggest that while female recombination may be mainly responsible for driving evolution in GC content, male recombination may be mutagenic, and that other mutagenic mechanisms acting near telomeres, and mechanisms whose effects are shared across mammalian genomes, play significant roles. We also have evidence that the nonlinear increase in rates at high GC levels may be largely due to hyper-mutability of CpG dinucleotides. Finally, our results suggest that the performance of conservation-based prediction methods can be improved by accounting for neutral rates.


Mutation rates of microsatellites vary greatly among loci. The causes of this heterogeneity remain largely enigmatic yet are crucial for understanding numerous human neurological diseases and genetic instability in cancer. In this first genome-wide study, the relative contributions of intrinsic features and regional genomic factors to the variation in mutability among orthologous human-chimpanzee microsatellites are investigated with resampling and regression techniques. As a result, we uncover the intricacies of microsatellite mutagenesis as follows. First, intrinsic features (repeat number, length, and motif size), which all influence the probability and rate of slippage, are the strongest predictors of mutability. Second, mutability increases nonuniformly with length, suggesting that processes additional to slippage, such as faulty repair, contribute to mutations. Third, mutability varies among microsatellites with different motif composition likely due to dissimilarities in secondary DNA structure formed by their slippage intermediates. Fourth, mutability of mononucleotide microsatellites is impacted by their location on sex chromosomes vs. autosomes and inside vs. outside of Alu repeats, the former confirming the importance of replication and the latter suggesting a role for gene conversion. Fifth, transcription status and location in a particular isochore do not influence microsatellite mutability. Sixth, compared with intrinsic features, regional genomic factors have only minor effects. Finally, our regression models explain ∼90% of variation in microsatellite mutability and can generate useful predictions for the studies of human diseases, forensics, and conservation genetics.

Insertions and deletions (indels) cause numerous genetic diseases and lead to pronounced evolutionary differences among genomes. The macaque sequences provide an opportunity to gain insights into the mechanisms generating these mutations on a genome-wide scale by establishing the polarity of indels occurring in the human lineage since its divergence from the chimpanzee. Here we apply novel regression techniques and multiscale analyses to demonstrate an extensive regional indel rate variation stemming from local fluctuations in divergence, GC content, male and female recombination rates, proximity to telomeres, and other genomic factors. We find that both replication and, surprisingly, recombination are significantly associated with the occurrence of small indels. Intriguingly, the relative inputs of replication versus recombination differ between insertions and deletions, thus the two types of mutations are likely guided in part by distinct mechanisms. Namely, insertions are more strongly associated with factors linked to recombination, while deletions are mostly associated with replication-related features. Indel as a term misleadingly groups the two types of mutations together by their effect on a sequence alignment. However, here we establish that the correct identification of a small gap as an insertion or a deletion (by use of an outgroup) is crucial to determining its mechanism of origin. In addition to providing novel insights into insertion and deletion mutagenesis, these results will assist in gap penalty modeling and eventually lead to more reliable genomic alignments.

What genomic landmarks render most genes silent while leaving others expressed on the inactive X chromosome in mammalian females? To date, signals determining expression status of genes on the inactive X remain enigmatic despite the availability of complete genomic sequences. Long interspersed repeats (L1s), particularly abundant on the X, are hypothesized to spread the inactivation signal and are enriched in the vicinity of inactive genes. However, both L1s and inactive genes are also more prevalent in ancient evolutionary strata. Did L1s accumulate there because of their role in inactivation or simply because they spent more time on the rarely recombining X? Here we utilize an experimentally derived inactivation profile of the entire human X chromosome to uncover sequences important for its inactivation, and to predict expression status of individual genes. Focusing on Xp22, where both inactive and active genes reside within evolutionarily young strata, we compare neighborhoods of genes with different inactivation states to identify enriched oligomers. Occurrences of such oligomers are then used as features to train a linear discriminant analysis classifier. Remarkably, expression status is correctly predicted for 84% and 91% of active and inactive genes, respectively, on the entire X, suggesting that oligomers enriched in Xp22 capture most of the genomic signal determining inactivation. To our surprise, the majority of oligomers associated with inactivated genes fall within L1 elements, even though L1 frequency in Xp22 is low. Moreover, these oligomers are enriched in parts of L1 sequences that are usually underrepresented in the genome. Thus, our results strongly support the role of L1s in X inactivation, yet indicate that a chromatin microenvironment composed of multiple genomic sequence elements determines expression status of X chromosome genes.

Male mutation bias is a higher mutation rate in males than in females thought to result from the greater number of germ line cell divisions in males. If errors in DNA replication cause most mutations, then the magnitude of male mutation bias, measured as the male-to-female mutation rate ratio (α), should reflect the relative excess of male versus female germ line cell divisions. Evolutionary rates averaged among all sites in a sequence and compared between mammalian sex chromosomes were shown to be indeed higher in males than in females. However, it is presently unknown whether individual classes of substitutions exhibit such bias. To address this issue, we investigated male mutation bias separately at non-CpG and CpG sites using human-chimpanzee whole-genome alignments. We observed strong male mutation bias at non-CpG sites: α in the X-autosome comparison was ∼6-7, which was similar to the male-to-female ratio in the number of germ line cell divisions. In contrast, mutations at CpG sites exhibited weak male mutation bias: α in the X-autosome comparison was only ∼2-3. This is consistent with the methylation-induced and replication-independent mechanism of CpG transitions, which constitute the majority of mutations at CpG sites. Interestingly, our study also indicated weak male mutation bias for transversions at CpG sites, implying a spontaneous mechanism largely not associated with replication. Male mutation bias was equally strong at CpG and non-CpG sites located within unmethylated “CpG islands,” suggesting the replication-dependent origin of these mutations. Thus, we found that the strength of male mutation bias is nonuniform in the primate genomes. Importantly, we discovered that male mutation bias depends on the proportion of CpG sites in the loci compared. This might explain the differences in the magnitude of primate male mutation bias observed among studies.

It is presently accepted that, in mammals, due to the greater number of cell divisions in the male germline than in the female germline, nucleotide substitutions occur more frequently in males. The data on mutation bias in insertions and deletions (indels) are contradictory, with some studies indicating no sex bias and others indicating either female or male bias. The sequenced rat and mouse genomes provide a unique opportunity to investigate a potential sex bias for different types of mutations. Indeed, mutation rates can be accurately estimated from a large number of orthologous loci in organisms similar in generation time and in the number of germline cell divisions. Here we compare the mutation rates between chromosome X and autosomes for likely neutral sites in eutherian ancestral interspersed repetitive elements present at orthologous locations in the rat and mouse genomes. We find that small indels are male biased: The male-to-female mutation rate ratio (α) for indels in rodents is ∼2. Similarly, our whole-genome analysis in rodents indicates an approximately twofold excess of nucleotide substitutions originating in males over that in females. This is the same as the male-to-female ratio of the number of germline cell divisions in rat and mouse. Thus, this is consistent with nucleotide substitutions and small indels occurring primarily during DNA replication.