Publications

Evolutionary dynamics of predicted G-quadruplexes in human and other great apes

Background: G-quadruplexes (G4s) are non-canonical DNA structures that can form at approximately 1% of the human genome. They facilitate genomic instability by increasing point mutations and structural variation. Numerous G4s participate in telomere maintenance and in regulating transcription and replication, and evolve under purifying selection. Despite these important functions, G4s have remained under-studied in human and ape genomes due to incomplete assemblies. Results: Here, we conduct a comprehensive analysis of predicted G4s (pG4s) in the recently released telomere-to-telomere (T2T) genomes of human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan. We annotate 41,232–174,442 new pG4s in these T2T assemblies compared to previous ape genome assemblies (5%–21% increase). Analyzing inter-species whole-genome alignments, we identify pG4s shared across apes (approximately one-third of all pG4s) and thousands of species-specific pG4s. pG4s accumulate and diverge at rates consistent with divergence times between species, following a molecular clock. pG4s shared across apes are enriched and hypomethylated at regulatory regions (enhancers, promoters, UTRs, and origins of replication), suggesting their conserved formation and functions. Species-specific pG4s (constituting 11–27% of all pG4s) are located in regulatory regions, potentially contributing to adaptations, and in repeats, likely driving genome expansions. Conclusions: Our findings illuminate the evolutionary dynamics of G4s, conservation of their role in gene regulation, and their contributions to ape genome evolution. Our study highlights the utility of high-resolution T2T genomes in revealing elusive yet likely functionally relevant genomic features previously hidden by incomplete assemblies.

SK Mohanty, F Chiaromonte, KD Makova

Genome Biol, 2025

DOI

Letter: Are Antispasmodics Truly Ineffective in IBD? Considerations on Nuanced Interpretation and Stratified Analysis. Authors' Reply

This article relates to The Impact of Antispasmodic Use on Abdominal Pain and Opioid Use in Inflammatory Bowel Disease: A Population-Based Study

C Khunsriraksakul, O Ziegler, DJ Liu, AS Kulaylat, MD Coates

Aliment Pharmacol Ther, 2025

DOI

The Impact of Antispasmodic Use on Abdominal Pain and Opioid Use in Inflammatory Bowel Disease: A Population-Based Study

Background: Patients with inflammatory bowel disease (IBD) are often prescribed antispasmodics for chronic abdominal pain. Large-scale data regarding efficacy and impact on clinical outcomes are lacking. Aim: To examine the association between antispasmodic use and outcomes of abdominal pain and opioid use before and after propensity matching key demographic and clinical characteristics. Methods: We used TriNetX Diamond Network, a medical and claims database. Patients were stratified by baseline abdominal pain and opioid use. Secondary outcomes were corticosteroid use, IBD-related complications and surgeries, emergency room (ER) visits, hospitalisation and mortality. Results: We included 85,859 patients (median age 50; 53.8% female) with IBD; 5661 used antispasmodics. On follow-up, those with antispasmodic use had higher rates of abdominal pain and opioid use (p < 0.001) regardless of baseline abdominal pain or opioid use. After matching, 5629 patients remained per group. Patients who used antispasmodics had higher rates of abdominal pain at 1 month, regardless of baseline abdominal pain. Opioid-naïve patients who used antispasmodics had higher rates of opioid use at follow-up (1.1% vs. 0.2%; p < 0.001). The likelihood of corticosteroid use, clinic visits, ER visits and hospitalisation was higher in those with antispasmodic use. No differences in IBD-related complications, surgery or mortality were observed. Conclusions: Antispasmodic use in patients with IBD was associated with increased abdominal pain and opioid use in opioid-naïve patients. Antispasmodic use was associated with increased likelihood of corticosteroid use, clinic and ER visits and hospitalisation.

C Khunsriraksakul, O Ziegler, DJ Liu, AS Kulaylat, MD Coates

Aliment Pharmacol Ther, 2025

DOI

Infection risk of rituximab monotherapy versus combination therapy with rituximab and mycophenolic acid in systemic sclerosis: A retrospective cohort study

Key words: complex medical dermatology; immunomodulation; infection; mycophenolic acid; rituximab; systemic sclerosis

JB Kang, KN Smith, EM Meara, M Cho, JD Silverman, AH LaChance, JS Smith

JAAD, 2025

DOI

IGLoo enables comprehensive analysis and assembly of immunoglobulin heavy-chain loci in lymphoblastoid cell lines using PacBio high-fidelity reads

High-quality human genome assemblies derived from lymphoblastoid cell lines (LCLs) provide reference genomes and pangenomes for genomics studies. However, LCLs pose technical challenges for profiling immunoglobulin (IG) genes, as their IG loci contain a mixture of germline and somatically recombined haplotypes, making genotyping and assembly difficult with widely used frameworks. To address this, we introduce IGLoo, a software tool that analyzes sequence data and assemblies derived from LCLs, characterizing somatic V(D)J recombination events and identifying breakpoints and missing IG genes in the assemblies. Furthermore, IGLoo implements a reassembly framework to improve germline assembly quality by integrating information on somatic events and population structural variations in IG loci. Applying IGLoo to the assemblies from the Human Pangenome Reference Consortium, we gained valuable insights into the mechanisms, gene usage, and patterns of V(D)J recombination and the causes of assembly artifacts in the IG heavy-chain (IGH) locus, and we improved the representation of IGH assemblies.

MJ Lin, B Langmead, Y Safonova

Cell Rep Methods, 2025

DOI

invertiaDB: a database of inverted repeats across organismal genomes

Inverted repeats are repetitive elements that can form hairpin and cruciform structures. They are linked to genomic instability; however, they also have various biological functions. Their distribution differs markedly across taxonomic groups in the tree of life, and they exhibit high polymorphism due to their inherent genomic instability. Advances in sequencing technologies and declining costs have enabled the generation of an ever-growing number of complete genomes for organisms across taxonomic groups in the tree of life. However, a comprehensive database encompassing inverted repeats across diverse organismal genomes has been lacking. We present invertiaDB, the first comprehensive database of inverted repeats spanning multiple taxa, featuring repeats identified in the genomes of 118 101 organisms across all major taxonomic groups. For each organism, we derived inverted repeats with arm lengths of at least 10 bp, spacer lengths up to 8 bp, and no mismatches in the arms. The database currently hosts 34 330 450 inverted repeat sequences, serving as a centralized, user-friendly repository to perform searches and interactive visualizations, and download existing inverted repeat data for independent analysis. invertiaDB is implemented as a web portal for browsing, analyzing, and downloading inverted repeat data. invertiaDB is publicly available at https://invertiadb.netlify.app/homepage.html.
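The search criteria quoted in the abstract (arms of at least 10 bp, spacers up to 8 bp, no mismatches) can be illustrated with a brute-force scan. This is a minimal sketch and not the invertiaDB pipeline: the function names are hypothetical, and the scan only reports seed hits whose arm is exactly `min_arm` long rather than extending each hit to its maximal arm, so a longer repeat shows up as several overlapping seeds.

```python
# Illustrative only: a naive inverted-repeat seed finder, not the tool used
# to build invertiaDB. Function names and the seed-only strategy are assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, min_arm: int = 10, max_spacer: int = 8):
    """Return (start, arm_length, spacer_length) for every exact seed hit.

    A hit is a left arm of length `min_arm`, followed by 0..max_spacer
    bases, followed by the exact reverse complement of the left arm.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for start in range(n - 2 * min_arm + 1):
        left = seq[start:start + min_arm]
        if set(left) - set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + min_arm + spacer
            right_end = right_start + min_arm
            if right_end > n:
                break  # larger spacers would run past the sequence end
            if seq[right_start:right_end] == target:
                hits.append((start, min_arm, spacer))
    return hits


if __name__ == "__main__":
    arm = "ACGTTGCAAG"
    example = arm + "TTT" + reverse_complement(arm)  # 10 bp arms, 3 bp spacer
    print(find_inverted_repeats(example))
```

The scan is quadratic in the worst case, which is fine for illustration but would need an indexed or extension-based approach at genome scale.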

K Provatas, N Chantzi, N Amptazi, M Patsakis, A Nayak, I Mouratidis, A Zaravinos, GA Pavlopoulos, I Georgakopoulos-Soares

Nucleic Acids Res, 2025

DOI

Replicative DNA polymerase epsilon and delta holoenzymes show wide-ranging inhibition at G-quadruplexes in the human genome

G-quadruplexes (G4s) are functional elements of the human genome, some of which inhibit DNA replication. We investigated replication of G4s within highly abundant microsatellite (GGGA, GGGT) and transposable element (L1 and SVA) sequences. We found that genome-wide, numerous motifs are located preferentially on the replication leading strand and the transcribed strand templates. We directly tested replicative polymerase ϵ and δ holoenzyme inhibition at these G4s, compared to low abundant motifs. For all G4s, DNA synthesis inhibition was higher on the G-rich than C-rich strand or control sequence. No single G4 was an absolute block for either holoenzyme; however, the inhibitory potential varied over an order of magnitude. Biophysical analyses showed the motifs form varying topologies, but replicative polymerase inhibition did not correlate with a specific G4 structure. Addition of the G4 stabilizer pyridostatin severely inhibited forward polymerase synthesis specifically on the G-rich strand, enhancing G/C strand asynchrony. Our results reveal that replicative polymerase inhibition at every G4 examined is distinct, causing complementary strand synthesis to become asynchronous, which could contribute to slowed fork elongation. Altogether, we provide critical information regarding how replicative eukaryotic holoenzymes navigate synthesis through G4s naturally occurring thousands of times in functional regions of the human genome.

SE Hile, MH Weissensteiner, KG Pytko, J Dahl, E Kejnovsky, I Kejnovská, M Hedglin, I Georgakopoulos-Soares, KD Makova, KA Eckert

Nucleic Acids Res, 2025

DOI

Complete sequencing of ape genomes

We present haplotype-resolved reference genomes and comparative analyses of six ape species, namely: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. We achieve chromosome-level contiguity with unparalleled sequence accuracy (<1 error in 500,000 base pairs), completely sequencing 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, providing more in-depth evolutionary insights. Comparative analyses, including human, allow us to investigate the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference. This includes newly minted gene families within lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes, and subterminal heterochromatin. This resource should serve as a definitive baseline for all future evolutionary studies of humans and our closest living ape relatives.

D Yoo, A Rhie, P Hebbar, F Antonacci, GA Logsdon, SJ Solar, D Antipov, BD Pickett, Y Safonova, F Montinaro, Y Luo, J Malukiewicz, JM Storer, J Lin, AN Sequeira, RJ Mangan, G Hickey, GM Anez, P Balachandran, A Bankevich, CR Beck, A Biddanda, M Borchers, GG Bouffard, E Brannan, SY Brooks, L Carbone, L Carrel, AP Chan, J Crawford, M Diekhans, E Engelbrecht, C Feschotte, G Formenti, GH Garcia, L de Gennaro, D Gilbert, RE Green, A Guarracino, I Gupta, D Haddad, J Han, RS Harris, GA Hartley, WT Harvey, M Hiller, K Hoekzema, ML Houck, H Jeong, K Kamali, M Kellis, B Kille, C Lee, Y Lee, W Lees, AP Lewis, Q Li, M Loftus, YHE Loh, H Loucks, J Ma, Y Mao, JFI Martinez, P Masterson, RC McCoy, B McGrath, S McKinney, BS Meyer, KH Miga, SK Mohanty, KM Munson, K Pal, M Pennell, PA Pevzner, D Porubsky, T Potapova, FR Ringeling, JL Rocha, OA Ryder, S Sacco, S Saha, T Sasaki, MC Schatz, NJ Schork, C Shanks, L Smeds, DR Son, C Steiner, AP Sweeten, MG Tassia, F Thibaud-Nissen, E Torres-González, M Trivedi, W Wei, J Wertz, M Yang, P Zhang, S Zhang, Y Zhang, Z Zhang, SA Zhao, Y Zhu, ED Jarvis, JL Gerton, I Rivas-González, B Paten, ZA Szpiech, CD Huber, TL Lenz, MK Konkel, SV Yi, S Canzar, CT Watson, PH Sudmant, E Molloy, E Garrison, CB Lowe, M Ventura, RJ O’Neill, S Koren, KD Makova, AM Phillippy, EE Eichler

Nature, 2025

DOI

MyD88-mediated signaling in intestinal fibroblasts regulates macrophage antimicrobial defense and prevents dysbiosis in the gut

Fibroblasts that reside in the gut mucosa are among the key regulators of innate immune cells, but their role in the regulation of the defense functions of macrophages remains unknown. MyD88 is suggested to shape fibroblast responses in the intestinal microenvironment. We found that mice lacking MyD88 in fibroblasts showed a decrease in the colonic antimicrobial defense, developing dysbiosis and aggravated dextran sulfate sodium (DSS)-induced colitis. These pathological changes were associated with the accumulation of Arginase 1+ macrophages with low antimicrobial defense capability. Mechanistically, the production of interleukin (IL)-6 and CCL2 downstream of MyD88 was critically involved in fibroblast-mediated support of macrophage antimicrobial function, and IL-6/CCL2 neutralization resulted in the generation of macrophages with decreased production of the antimicrobial peptide cathelicidin and impaired bacterial clearance. Collectively, these findings revealed a critical role of fibroblast-intrinsic MyD88 signaling in regulating macrophage antimicrobial defense under colonic homeostasis, and its disruption results in dysbiosis, predisposing the host to the development of intestinal inflammation.

M Chulkina, H Tran, G Uribe, SB McAninch, C McAninch, A Seideneck, B He, M Lanza, K Khanipov, G Golovko, DW Powell, ER Davenport, IV Pinchuk

Cell Rep, 2025

DOI

Transcriptome signatures of the medial prefrontal cortex underlying GABAergic control of resilience to chronic stress exposure

Analyses of postmortem human brains and preclinical studies of rodents have identified somatostatin (SST)-positive, dendrite-targeting GABAergic interneurons as key elements that regulate the vulnerability to stress-related psychiatric disorders. Conversely, genetically induced disinhibition of SST neurons (induced by Cre-mediated deletion of the γ2 GABAA receptor subunit gene selectively from SST neurons, SSTCre:γ2f/f mice) results in stress resilience. Similarly, chronic chemogenetic activation of SST neurons in the medial prefrontal cortex (mPFC) results in stress resilience but only in male and not in female mice. Here, we used RNA sequencing of the mPFC of SSTCre:γ2f/f mice to characterize the transcriptome changes underlying GABAergic control of stress resilience. We found that stress resilience of male but not female SSTCre:γ2f/f mice is characterized by resilience to chronic stress-induced transcriptome changes in the mPFC. Interestingly, the transcriptome of non-stressed SSTCre:γ2f/f (stress-resilient) male mice resembled that of chronic stress-exposed SSTCre (stress-vulnerable) mice. However, the behavior and the serum corticosterone levels of non-stressed SSTCre:γ2f/f mice showed no signs of physiological stress. Most strikingly, chronic stress exposure of SSTCre:γ2f/f mice was associated with an almost complete reversal of their chronic stress-like transcriptome signature, along with pathway changes suggesting stress-induced enhancement of mRNA translation. Behaviorally, the SSTCre:γ2f/f mice were not only resilient to chronic stress-induced anhedonia — they also showed an inversed, anxiolytic-like behavioral response to chronic stress exposure that mirrored the chronic stress-induced reversal of the chronic stress-like transcriptome signature. 
We conclude that GABAergic dendritic inhibition by SST neurons exerts bidirectional control over behavioral vulnerability and resilience to chronic stress exposure that is mirrored in bidirectional changes in the expression of putative stress resilience genes, through a sex-specific brain substrate.

M Shao, J Botvinov, D Banerjee, S Girirajan, B Lüscher

Mol Psychiatry, 2025

DOI

Performance of qpAdm-based screens for genetic admixture on graph–shaped histories and stepping stone landscapes

qpAdm is a statistical tool that is often used for testing large sets of alternative admixture models for a target population. Despite its popularity, qpAdm remains untested on 2D stepping stone landscapes and in situations with low prestudy odds (low ratio of true to false models). We tested high-throughput qpAdm protocols with typical properties such as number of source combinations per target, model complexity, model feasibility criteria, etc. Those protocols were applied to admixture graph–shaped and stepping stone simulated histories sampled randomly or systematically. We demonstrate that false discovery rates of high-throughput qpAdm protocols exceed 50% for many parameter combinations since: (1) prestudy odds are low and fall rapidly with increasing model complexity; (2) complex migration networks violate the assumptions of the method; hence, there is poor correlation between qpAdm P-values and model optimality, contributing to low but nonzero false-positive rate and low power; and (3) although admixture fraction estimates between 0 and 1 are largely restricted to symmetric configurations of sources around a target, a small fraction of asymmetric highly nonoptimal models have estimates in the same interval, contributing to the false-positive rate. We also reinterpret large sets of qpAdm models from 2 studies in terms of source–target distance and symmetry and suggest improvements to qpAdm protocols: (1) temporal stratification of targets and proxy sources in the case of admixture graph–shaped histories, (2) focused exploration of few models for increasing prestudy odds; and (3) dense landscape sampling for increasing power and stringent conditions on estimated admixture fractions for decreasing the false-positive rate.

O Flegontova, U Işıldak, E Yüncü, MP Williams, CD Huber, J Kočí, LA Vyazov, P Changmai, P Flegontov

Genetics, 2025

DOI

Differentiating mechanism from outcome for ancestry-assortative mating in admixed human populations

Population genetic theory, and the empirical methods built upon it, often assumes that individuals pair randomly for reproduction. However, natural populations frequently violate this assumption, which may potentially confound genome-wide association studies, selection scans, and demographic inference. Within several recently admixed human populations, empirical genetic studies have reported a correlation in global ancestry proportion between spouses, referred to as ancestry-assortative mating. Here, we use forward genomic simulations to link correlations in global ancestry proportion between mates to the underlying mechanistic mate choice process. We consider the impacts of 2 types of mate choice model, using either ancestry-based preferences or social groups as the basis for mate pairing. We find that multiple mate choice models can produce the same correlations in global ancestry proportion between spouses; however, we also highlight alternative analytic approaches and circumstances in which these models may be distinguished. With this work, we seek to highlight potential pitfalls when interpreting correlations in empirical data as evidence for a particular model of human mating practices and to offer suggestions toward development of new best practices for analysis of human ancestry-assortative mating.

DJ Massey, ZA Szpiech, A Goldberg

Genetics, 2025

DOI

A multi-omics analysis of effector and resting Treg cells in pan-cancer

Regulatory T cells (Tregs) are critical for maintaining the stability of the immune system and facilitating tumor escape through various mechanisms. Resting T cells are involved in cell-mediated immunity and remain in a resting state until stimulated, while effector T cells promote immune responses. Here, we investigated the roles of two gene signatures, one for resting Tregs (FOXP3 and IL2RA) and another for effector Tregs (FOXP3, CTLA-4, CCR8 and TNFRSF9) in pan-cancer. Using data from The Cancer Genome Atlas (TCGA), The Cancer Proteome Atlas (TCPA) and Gene Expression Omnibus (GEO), we focused on the expression profile of the two signatures, the existence of single nucleotide variants (SNVs) and copy number variants (CNVs), methylation, infiltration of immune cells in the tumor and sensitivity to different drugs. Our analysis revealed that both signatures are differentially expressed across different cancer types, and correlate with patient survival. Furthermore, both types of Tregs influence important pathways in cancer development and progression, like apoptosis, epithelial-to-mesenchymal transition (EMT) and the DNA damage pathway. Moreover, a positive correlation was highlighted between the expression of gene markers in both resting and effector Tregs and immune cell infiltration in adrenocortical carcinoma, while mutations in both signatures correlated with enrichment of specific immune cells, mainly in skin melanoma and endometrial cancer. In addition, we reveal the existence of widespread CNVs and hypomethylation affecting both Treg signatures in most cancer types. Last, we identified a few correlations between the expression of CCR8 and TNFRSF9 and sensitivity to several drugs, including COL-3, Chlorambucil and GSK1070916, in pan-cancer. Overall, these findings highlight new evidence that both Treg signatures are crucial regulators of cancer progression, providing potential clinical outcomes for cancer therapy.

AM Chalepaki, M Gkoris, I Chondrou, M Kourti, I Georgakopoulos-Soares, A Zaravinos

Comput Biol Med, 2025

DOI

A data-driven personalized approach to predict blood glucose levels in type-1 diabetes patients exercising in free-living conditions

Objective: The development of new technologies has generated vast amounts of data that can be analyzed to better understand and predict the glycemic behavior of people living with type 1 diabetes. This paper aims to assess whether a data-driven approach can accurately and safely predict blood glucose levels in patients with type 1 diabetes exercising in free-living conditions. Methods: Multiple machine learning (XGBoost, Random Forest) and deep learning (LSTM, CNN-LSTM, Dual-encoder with Attention layer) regression models were considered. Each deep-learning model was implemented twice: first, as a personalized model trained solely on the target patient’s data, and second, as a fine-tuned model of a population-based training model. The datasets used for training and testing the models were derived from the Type 1 Diabetes Exercise Initiative (T1DEXI). A total of 79 patients in T1DEXI met our inclusion criteria. Our models used various features related to continuous glucose monitoring, insulin pumps, carbohydrate intake, exercise (intensity and duration), and physical activity-related information (steps and heart rate). This data was available for four weeks for each of the 79 included patients. Three prediction horizons (10, 20, and 30 min) were tested and analyzed. Results: For each patient, there always exists either a machine learning or a deep learning model that conveniently predicts blood glucose levels (BGLs) for up to 30 min. The best performing model differs from one patient to another. When considering the best performing model for each patient, the median and the mean Root Mean Squared Error (RMSE) values (across the 79 patients) for predictions made 10 min ahead were 6.99 mg/dL and 7.46 mg/dL, respectively. For predictions made 30 min ahead, the median and mean RMSE values were 16.85 mg/dL and 17.74 mg/dL, respectively.
The majority of the predictions output by the best model of each patient fell within the clinically safe zones A and B of the Clarke Error Grid (CEG), with almost no predictions falling into the unsafe zone E. The most challenging patient to predict 30 min ahead achieved an RMSE value of 32.31 mg/dL (with the corresponding best performing model). The best-predicted patient had an RMSE value of 10.48 mg/dL. Predicting blood glucose levels was more difficult during and after exercise, resulting in higher RMSE values on average. Prediction errors during and after physical activity (two hours and four hours after) generally remained within the clinically safe zones of the CEG, with less than 0.5% of predictions falling into the harmful zones D and E, regardless of the exercise category. Conclusions: Data-driven approaches can accurately predict blood glucose levels in type 1 diabetes patients exercising in free-living conditions. The best-performing model varies across patients. Approaches in which a population-based model is initially trained and then fine-tuned for each individual patient generally achieve the best performance for the majority of patients. Some patients remain challenging to predict, with no straightforward explanation of why one patient is more challenging to predict than another.
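The abstract summarizes accuracy as per-patient RMSE, then reports the median and mean of those values across patients. A minimal sketch of that bookkeeping follows; the function names and the toy glucose values are hypothetical and are not taken from the T1DEXI data or the authors' code.

```python
# Illustrative only: per-patient RMSE and its cross-patient median and mean.
# All names and numbers here are made up for the example.
import math
from statistics import mean, median


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences (mg/dL)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def summarize_across_patients(per_patient):
    """Map {patient_id: (actual, predicted)} to (median RMSE, mean RMSE)."""
    scores = [rmse(actual, predicted) for actual, predicted in per_patient.values()]
    return median(scores), mean(scores)


if __name__ == "__main__":
    toy = {
        "patient_a": ([100.0, 120.0], [103.0, 117.0]),
        "patient_b": ([90.0, 150.0], [95.0, 145.0]),
    }
    print(summarize_across_patients(toy))
```

Reporting both the median and the mean, as the abstract does, shows whether a few hard-to-predict patients are pulling the average upward.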

A Neumann, Y Zghal, MA Cremona, A Hajji, M Morin, M Rekik

Comput Biol Med, 2025

DOI

TRANSCENDENT (Transforming Research by Assessing Neuroinformatics across the Spectrum of Concussion by Embedding iNterdisciplinary Data-collection to Enable Novel Treatments): protocol for a prospective observational cohort study of concussion patients with embedded comparative effectiveness research within a network of learning health system concussion clinics in Canada

Introduction: Concussion affects over 400 000 Canadians annually, with a range of causes and impacts on health-related quality of life. Research to date has disproportionately focused on athletes, military personnel and level I trauma centre patients, and may not be applicable to the broader community. The TRANSCENDENT Concussion Research Program aims to address patient- and clinician-identified research priorities, through the integration of clinical data from patients of all ages and injury mechanisms, patient-reported outcomes and objective biomarkers across factors of intersectionality. Seeking guidance from our Community Advisory Committee will ensure meaningful patient partnership and research findings that are relevant to the wider concussion community. Methods and analysis: This prospective observational cohort study will recruit 5500 participants over 5 years from three 360 Concussion Care clinic locations across Ontario, Canada, with a subset of participants enrolling in specific objective assessments including testing of autonomic function, exercise tolerance, vision, advanced neuroimaging and fluid biomarkers. Analysis will be predicated on pre-specified research questions, and data shared with the Ontario Brain Institute’s Brain-CODE database. This work will represent one of the largest concussion databases to date, and by sharing it, we will advance the field of concussion and prevent siloing within brain health research.

R Zemek, LM Albrecht, S Johnston, J Leddy, AA Ledoux, N Reed, N Silverberg, K Yeates, M Lamoureux, C Anderson, N Barrowman, MH Beauchamp, K Chen, A Chintoh, A Cortel-LeBlanc, M Cortel-LeBlanc, DJ Corwin, S Cowle, K Dalton, J Dawson, A Dodd, KE Emam, C Emery, E Fox, P Fuselli, IJ Gagnon, C Giza, S Hicks, DR Howell, SA Kutcher, C Lalonde, RC Mannix, CL Master, AR Mayer, MH Osmond, R Robillard, KJ Schneider, P Tanuseputro, I Terekhov, R Webster, CL Wellington, TRANSCENDENT Concussion Integrated Discovery Program

BMJ Open, 2025

DOI

MPRAbase: a Massively Parallel Reporter Assay database

Massively parallel reporter assays (MPRAs) represent a set of high-throughput technologies that measure the functional effects of thousands of sequences/variants on gene regulatory activity. There are several different variations of MPRA technology and they are used for numerous applications, including regulatory element discovery, variant effect measurement, saturation mutagenesis, synthetic regulatory element generation or characterization of evolutionary gene regulatory differences. Despite their many designs and uses, there is no comprehensive database that incorporates the results of these experiments. To address this, we developed MPRAbase, a manually curated database that currently harbors 130 experiments, encompassing 17,718,677 elements tested across 35 cell types and 4 organisms. The MPRAbase web interface serves as a centralized user-friendly repository to examine online the activity of regulatory elements across cell types and organisms, and to download MPRA data for independent analysis.

J Zhao, FA Baltoumas, MA Konnaris, I Mouratidis, Z Liu, J Sims, V Agarwal, GA Pavlopoulos, I Georgakopoulos-Soares, N Ahituv

Genome Res, 2025

DOI

Non-canonical DNA in human and other ape telomere-to-telomere genomes

Non-canonical (non-B) DNA structures, e.g., bent DNA, hairpins, G-quadruplexes (G4s), Z-DNA, etc., which form at certain sequence motifs (e.g., A-phased repeats, inverted repeats, etc.), have emerged as important regulators of cellular processes and drivers of genome evolution. Yet, they have been understudied due to their repetitive nature and potentially inaccurate sequences generated with short-read technologies. Here we comprehensively characterize such motifs in the long-read telomere-to-telomere (T2T) genomes of human, bonobo, chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. Non-B DNA motifs are enriched at the genomic regions added to T2T assemblies, and occupy 9-15%, 9-11%, and 12-38% of autosomes, and chromosomes X and Y, respectively. G4s and Z-DNA are enriched at promoters and enhancers, as well as at origins of replication. Repetitive sequences harbor more non-B DNA motifs than non-repetitive sequences, especially in the short arms of acrocentric chromosomes. Most centromeres and/or their flanking regions are enriched in at least one non-B DNA motif type, consistent with a potential role of non-B structures in determining centromeres. Our results highlight the uneven distribution of predicted non-B DNA structures across ape genomes and suggest their novel functions in previously inaccessible genomic regions.

L Smeds, K Kamali, I Kejnovská, E Kejnovský, F Chiaromonte, KD Makova

Nucleic Acids Res, 2025

DOI

An atlas of single-cell eQTLs dissects autoimmune disease genes and identifies novel drug classes for treatment

Most variants identified from genome-wide association studies (GWASs) are non-coding and regulate gene expression. However, many risk loci fail to colocalize with expression quantitative trait loci (eQTLs), potentially due to limited GWAS and eQTL analysis power or cellular heterogeneity. Population-scale single-cell RNA-sequencing (scRNA-seq) datasets are emerging, enabling mapping of eQTLs in different cell types (sc-eQTLs). Compared to eQTL data from bulk tissues (bk-eQTLs), sc-eQTL datasets are smaller. We propose a joint model of bk-eQTLs as a weighted sum of sc-eQTLs (JOBS) from constituent cell types to improve power. Applying JOBS to OneK1K and eQTLGen data, we identify 586% more eQTLs, matching the power of 4× the sample sizes of OneK1K. Integrating sc-eQTLs with GWAS data creates an atlas for 14 immune-mediated disorders, colocalizing 29.9% or 32.2% more loci than using sc-eQTL or bk-eQTL alone. Extending JOBS, we develop a drug-repurposing pipeline and identify novel drugs validated by real-world data.
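The phrase "bk-eQTLs as a weighted sum of sc-eQTLs" can be read schematically as below. This is only a sketch of the idea stated in the abstract, not the published JOBS model: the symbols are assumptions, with $\beta^{\mathrm{bk}}_{gv}$ the bulk eQTL effect of variant $v$ on gene $g$, $\beta^{(c)}_{gv}$ the effect in cell type $c$, $w_c$ a non-negative cell-type weight, and $\varepsilon_{gv}$ a residual term.

```latex
% Schematic reading of "weighted sum"; notation is assumed, not from the paper.
\beta^{\mathrm{bk}}_{gv} \;=\; \sum_{c=1}^{C} w_c \, \beta^{(c)}_{gv} \;+\; \varepsilon_{gv},
\qquad w_c \ge 0 .
```

Under this reading, the large bulk dataset constrains the sum, which is how it could lend power to the smaller single-cell estimates.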

L Wang, H Markus, D Chen, S Chen, F Zhang, S Gao, C Khunsriraksakul, F Chen, N Olsen, G Foulke, B Jiang, L Carrel, DJ Liu

Cell Genom, 2025

DOI

Family Functioning and Pubertal Maturation in Hispanic/Latino Children from the HCHS/SOL Youth

Previous studies have examined the association between family dysfunction and pubertal timing in adolescent girls. However, the evidence is lacking on the role of family dysfunction during sensitive developmental periods in both boys and girls from racial and ethnic minority groups. This study aimed to determine the effect of family dysfunction on the timing of pubertal maturation among US Hispanic/Latino children and adolescents. Participants were 1466 youths (50% female; ages 8-16 years) from the Hispanic Community Children’s Health Study/Study of Latino Youth (SOL Youth). Pubertal maturation was measured using self-administered Pubertal Development Scale (PDS) items for boys and girls. Family dysfunction included measures of single-parent family structure, unhealthy family functioning, low parental closeness, and neglectful parenting style. We used multivariable ordinal logistic and linear regression analyses to examine the associations between family dysfunction and pubertal maturation (individual and cumulative measures), with adjustment for childhood BMI and socioeconomic factors, design effects (strata and clustering), and sample weights. Multivariable models of individual PDS items showed that family dysfunction was negatively associated with growth in height (OR = 0.66, 95% CI: 0.44, 0.99) in girls; no associations were found in boys. In the assessment of cumulative PDS scores, family dysfunction was associated with a lower average pubertal maturation score (b = -0.63, 95% CI: -1.21, -0.05) in boys, while no associations were found in girls. Pubertal timing lies at the intersection of associations between childhood adversity and adult health and warrants further investigation to understand the factors affecting timing and differences across sex and sociocultural background.

AK April-Sanders, P Tehranifar, MB Terry, DM Crookes, CR Isasi, LC Gallo, L Fernandez-Rhodes, KM Perreira, ML Daviglus, SF Suglia

Int J Environ Res Public Health, 2025

DOI

Cystic Fibrosis Newborn Screening: A Systematic Review-Driven Consensus Guideline from the United States Cystic Fibrosis Foundation

Newborn screening for cystic fibrosis (CF) has been universal in the US since 2010; however, there is significant variation among newborn screening algorithms. Systematic reviews were used to develop seven recommendations for newborn screening program practices to improve timeliness, sensitivity, and equity in diagnosing infants with CF: (1) The CF Foundation recommends the use of a floating immunoreactive trypsinogen (IRT) cutoff over a fixed IRT cutoff; (2) The CF Foundation recommends using a very high IRT referral strategy in CF newborn screening programs whose variant panel does not include all CF-causing variants in CFTR2 or does not have a variant panel that achieves at least 95% sensitivity in all ancestral groups within the state; (3) The CF Foundation recommends that CF newborn screening algorithms should not limit CFTR variant detection to the F508del variant or variants included in the American College of Medical Genetics-23 panel; (4) The CF Foundation recommends that CF newborn screening programs screen for all CF-causing CFTR variants in CFTR2; (5) The CF Foundation recommends conducting CFTR variant screening twice weekly or more frequently as resources allow; (6) The CF Foundation recommends the inclusion of a CFTR sequencing tier following IRT and CFTR variant panel testing to improve the specificity and positive predictive value of CF newborn screening; (7) The CF Foundation recommends that both the primary care provider and the CF specialist be notified of abnormal newborn screening results. Through implementation, it is anticipated that these recommendations will result in improved sensitivity, equity, and timeliness of CF newborn screening, leading to improved health outcomes for all individuals diagnosed with CF following newborn screening and a decreased burden on families.

ME McGarry, KS Raraigh, P Farrell, F Shropshire, K Padding, C White, MC Dorley, S Hicks, CL Ren, K Tullis, D Freedenberg, QE Wafford, SE Hempstead, MA Taylor, A Faro, MK Sontag, SA McColley

Int J Neonatal Screen, 2025

DOI

A human iPSC-derived midbrain neural stem cell model of prenatal opioid exposure and withdrawal: A proof of concept study

A growing body of clinical literature has described neurodevelopmental delays in infants with chronic prenatal opioid exposure and withdrawal. Despite this, the mechanism of how opioids impact the developing brain remains unknown. Here, we developed an in vitro model of prenatal morphine exposure and withdrawal using healthy human induced pluripotent stem cell (iPSC)-derived midbrain neural progenitors in monolayer. To optimize our model, we identified that a longer neural induction and regional patterning period increases expression of canonical opioid receptors mu and kappa in midbrain neural progenitors compared to a shorter protocol (OPRM1, two-tailed t-test, p = 0.004; OPRK1, p = 0.0003). Next, we showed that the midbrain neural progenitors derived from a longer iPSC neural induction also have scant toll-like receptor 4 (TLR4) expression, a key player in neonatal opioid withdrawal syndrome pathophysiology. During morphine withdrawal, differentiating neural progenitors experience cyclic adenosine monophosphate overshoot compared to cells exposed to vehicle (p = 0.0496) and morphine exposure conditions (p = 0.0136, 1-way ANOVA). Finally, we showed that morphine exposure and withdrawal alters proportions of differentiated progenitor cell fates (2-way ANOVA, F = 16.05, p < 0.0001). Chronic morphine exposure increased proportions of nestin positive progenitors (p = 0.0094), and decreased proportions of neuronal nuclear antigen positive neurons (NEUN) (p = 0.0047) compared to those exposed to vehicle. Morphine withdrawal decreased proportions of glial fibrillary acidic protein positive cells of astrocytic lineage (p = 0.044), and increased proportions of NEUN-positive neurons (p < 0.0001) compared to those exposed to morphine only. Applications of this paradigm include mechanistic studies underscoring neural progenitor cell fate commitments in early neurodevelopment during morphine exposure and withdrawal.

R Sullivan, Q Ahrens, SL Mills-Huffnagle, IA Elcheva, SD Hicks

PLoS One, 2025

DOI

All Together Now: Data Work to Advance Privacy, Science, and Health in the Age of Synthetic Data

There is a disconnect between data practices in biomedicine and public understanding of those data practices, and this disconnect is expanding rapidly every day (with the emergence of synthetic data and digital twins and more widely adopted Artificial Intelligence (AI)/Machine Learning tools). Transparency alone is insufficient to bridge this gap. Concurrently, there is an increasingly complex landscape of laws, regulations, and institutional/programmatic policies to navigate when engaged in biocomputing and digital health research, which makes it increasingly difficult for those wanting to ‘get it right’ or ‘do the right thing.’ Mandatory data protection obligations vary widely, sometimes focused on the type of data (and nuanced definition and scope parameters), the actor/entity involved, or the residency of the data subjects. Additional challenges come from attempts to celebrate biocomputing discoveries and digital health innovations, which frequently transform fair and accurate communications into exaggerated hype (e.g., to secure financial investment in future projects or lead to more favorable tenure and promotion decisions). Trust in scientists and scientific expertise can be quickly eroded if, for example, synthetic data is perceived by the public as ‘fake data’ or if digital twins are perceived as ‘imaginary’ patients. Researchers appear increasingly aware of the scientific and moral imperative to strengthen their work and facilitate its sustainability through increased diversity and community engagement. Moreover, there is a growing appreciation for the ‘data work’ necessary to have scientific data become meaningful, actionable information, knowledge, and wisdom–not only for scientists but also for the individuals from whom those data were derived or to whom those data relate. 
Equity in the process of biocomputing and equity in the distribution of benefits and burdens of biocomputing both demand ongoing development, implementation, and refinement of embedded Ethical, Legal and Social Implications (ELSI) research practices. This workshop is intended to nurture interdisciplinary discussion of these issues and to highlight the skills and competencies all too often considered ‘soft skills’ peripheral to other skills prioritized in traditional training and professional development programs. Data scientists attending this workshop will become better equipped to embed ELSI practices into their research.

L Fernández-Rhodes, JK Wagner

Biocomput, 2025

DOI

Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages

Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may have a shared genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict the disease progression risk. Via penalized regression, GPS incorporates PRS weights for case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.

C Wang, H Markus, AR Diwadkar, C Khunsriraksakul, L Carrel, B Li, X Zhong, X Wang, X Zhan, GT Foulke, NJ Olsen, DJ Liu, B Jiang

Nat Commun, 2025

DOI
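The shrink-toward-a-prior idea in the GPS abstract above can be illustrated with a small sketch. This is not the authors' GPS implementation: it assumes a plain squared-error loss and a ridge-type penalty on the distance from the prior weights, and the function name `ridge_toward_prior`, the penalty strength `lam`, and the simulated data are all hypothetical. In practice `lam` would presumably be tuned (for example by cross-validation) so that the prior is trusted only when it helps prediction.

```python
import numpy as np

def ridge_toward_prior(X, y, prior, lam):
    """Minimise ||y - X b||^2 + lam * ||b - prior||^2 in closed form.

    lam = 0 gives ordinary least squares; a large lam pulls b to the prior.
    (Illustrative penalty only; the GPS paper may use a different one.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior = np.asarray(prior, dtype=float)
    p = X.shape[1]
    lhs = X.T @ X + lam * np.eye(p)
    rhs = X.T @ y + lam * prior
    return np.linalg.solve(lhs, rhs)

# Simulated stand-ins: a small "biobank" sample and prior weights that are
# close to, but not identical to, the true effects.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_beta = np.array([0.5, -0.3, 0.0, 0.2, 0.1])
y = X @ true_beta + rng.normal(scale=0.5, size=50)
prior_weights = true_beta + rng.normal(scale=0.05, size=5)

beta_ols = ridge_toward_prior(X, y, prior_weights, lam=0.0)
beta_mid = ridge_toward_prior(X, y, prior_weights, lam=10.0)
beta_prior = ridge_toward_prior(X, y, prior_weights, lam=1e9)
```

Intermediate values of `lam` give estimates between the least-squares fit on the small sample and the prior weights.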

funBIalign: a hierarchical algorithm for functional motif discovery based on mean squared residue scores

Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical shapes or patterns that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies; one on food price inflation and one on temperature changes.

J Di Iorio, MA Cremona, F Chiaromonte

Stat Comput, 2024

DOI
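As a rough illustration of the mean squared residue score mentioned in the funBIalign abstract above, the sketch below computes the classical biclustering version of the score on a matrix whose rows are curve portions evaluated on a common grid. This discretized form and the name `mean_squared_residue` are assumptions made for illustration; the paper defines a functional version that may differ in detail. Under an additive model (a shared shape plus a portion-specific vertical shift), the score is zero.

```python
import numpy as np

def mean_squared_residue(block: np.ndarray) -> float:
    """Mean squared residue of a (portions x grid points) matrix.

    The residue of entry (i, j) is a_ij - rowmean_i - colmean_j + overall mean,
    which vanishes when every row is the same shape up to a vertical shift.
    """
    row_means = block.mean(axis=1, keepdims=True)
    col_means = block.mean(axis=0, keepdims=True)
    overall = block.mean()
    residue = block - row_means - col_means + overall
    return float(np.mean(residue ** 2))

# Three curve portions sharing one shape, differing only by vertical shifts.
shape = np.sin(np.linspace(0.0, np.pi, 20))
aligned = np.vstack([shape + shift for shift in (0.0, 1.5, -0.7)])

# The same portions with added noise no longer fit the additive model exactly.
noisy = aligned + np.random.default_rng(0).normal(scale=0.3, size=aligned.shape)

msr_aligned = mean_squared_residue(aligned)
msr_noisy = mean_squared_residue(noisy)
```

A low score therefore flags a set of portions as a candidate motif, while a higher score indicates that the portions deviate from a common shape.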

Assessing Assembly Errors in Immunoglobulin Loci: A Comprehensive Evaluation of Long-read Genome Assemblies Across Vertebrates

Long-read sequencing technologies have revolutionized genome assembly producing near-complete chromosome assemblies for numerous organisms, which are invaluable to research in many fields. However, regions with complex repetitive structure continue to represent a challenge for genome assembly algorithms, particularly in areas with high heterozygosity. Robust and comprehensive solutions for the assessment of assembly accuracy and completeness in these regions do not exist. In this study we focus on the assembly of biomedically important antibody-encoding immunoglobulin (IG) loci, which are characterized by complex duplications and repeat structures. High-quality full-length assemblies for these loci are critical for resolving haplotype-level annotations of IG genes, without which, functional and evolutionary studies of antibody immunity across vertebrates are not tractable. To address these challenges, we developed a pipeline, CloseRead, that generates multiple assembly verification metrics for analysis and visualization. These metrics expand upon those of existing quality assessment tools and specifically target complex and highly heterozygous regions. Using CloseRead, we systematically assessed the accuracy and completeness of IG loci in publicly available assemblies of 74 vertebrate species, identifying problematic regions. We also demonstrated that inspecting assembly graphs for problematic regions can both identify the root cause of assembly errors and illuminate solutions for improving erroneous assemblies. For a subset of species, we were able to correct assembly errors through targeted reassembly. Together, our analysis demonstrated the utility of assembly assessment in improving the completeness and accuracy of IG loci across species.

Y Zhu, C Watson, Y Safonova, M Pennell, A Bankevich

bioRxiv, 2024

DOI

Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

Large national-level electronic health record (EHR) datasets offer new opportunities for disentangling the role of genes and environment through deep phenotype information and approximate pedigree structures. Here we use the approximate geographical locations of patients as a proxy for spatially correlated community-level environmental risk factors. We develop a spatial mixed linear effect (SMILE) model that incorporates both genetics and environmental contribution. We extract EHR and geographical locations from 257,620 nuclear families and compile 1083 disease outcome measurements from the MarketScan dataset. We augment the EHR with publicly available environmental data, including levels of particulate matter 2.5 (PM2.5), nitrogen dioxide (NO2), climate, and sociodemographic data. We refine the estimates of genetic heritability and quantify community-level environmental contributions. We also use wind speed and direction as instrumental variables to assess the causal effects of air pollution. In total, we find PM2.5 or NO2 have statistically significant causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders, where PM2.5 and NO2 tend to affect biologically distinct disease categories. These analyses showcase several robust strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets and will benefit upcoming biobank studies in the era of precision medicine.

D McGuire, H Markus, L Yang, J Xu, A Montgomery, A Berg, Q Li, L Carrel, DJ Liu, B Jiang

Nat Commun, 2024

DOI
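The instrumental-variable reasoning in the MarketScan abstract above can be sketched on simulated data. This is a minimal single-instrument example, not the SMILE model or the authors' analysis: the variable names (`wind`, `pollution`, `confounder`, `outcome`), the effect sizes, and the helper functions are hypothetical. It shows why an instrument that affects exposure, but is unrelated to the unmeasured confounder, can recover a causal slope that a naive regression overstates.

```python
import numpy as np

def iv_slope(z, x, y):
    """Single-instrument IV (Wald) estimate: cov(z, y) / cov(z, x)."""
    return float(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

def ols_slope(x, y):
    """Naive regression slope of y on x: cov(x, y) / var(x)."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

rng = np.random.default_rng(1)
n = 20_000
wind = rng.normal(size=n)        # instrument: shifts exposure only
confounder = rng.normal(size=n)  # unmeasured factor affecting both
pollution = wind + confounder + rng.normal(size=n)
# The true causal effect of pollution on the outcome is set to 2.0 here.
outcome = 2.0 * pollution + 3.0 * confounder + rng.normal(size=n)

iv_estimate = iv_slope(wind, pollution, outcome)
naive_estimate = ols_slope(pollution, outcome)
```

In this simulation the naive slope absorbs part of the confounder's effect, whereas the IV estimate stays close to the simulated causal effect.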

The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes

Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution. 
These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.

KD Makova, BD Pickett, RS Harris, GA Hartley, M Cechova, K Pal, S Nurk, D Yoo, Q Li, P Hebbar, BC McGrath, F Antonacci, M Aubel, A Biddanda, M Borchers, E Bomberg, GG Bouffard, SY Brooks, L Carbone, L Carrel, A Carroll, PC Chang, CS Chin, DE Cook, SJC Craig, L de Gennaro, M Diekhans, A Dutra, GH Garcia, PGS Grady, RE Green, D Haddad, P Hallast, WT Harvey, G Hickey, DA Hillis, SJ Hoyt, H Jeong, K Kamali, SLK Pond, TM LaPolice, C Lee, AP Lewis, YE Loh, P Masterson, RC McCoy, P Medvedev, KH Miga, KM Munson, E Pak, B Paten, BJ Pinto, T Potapova, A Rhie, JL Rocha, F Ryabov, OA Ryder, S Sacco, K Shafin, VA Shepelev, V Slon, SJ Solar, JM Storer, PH Sudmant, Sweetalana, A Sweeten, MG Tassia, F Thibaud-Nissen, M Ventura, MA Wilson, AC Young, H Zeng, X Zhang, ZA Szpiech, CD Huber, JL Gerton, SV Yi, MC Schatz, IA Alexandrov, S Koren, RJ O’Neill, E Eichler, AM Phillippy

Nature, 2024

DOI

Integrating single cell expression quantitative trait loci summary statistics to understand complex trait risk genes

Transcriptome-wide association study (TWAS) is a popular approach to dissect the functional consequence of disease associated non-coding variants. Most existing TWAS use bulk tissues and may not have the resolution to reveal cell-type specific target genes. Single-cell expression quantitative trait loci (sc-eQTL) datasets are emerging. The largest bulk- and sc-eQTL datasets are most conveniently available as summary statistics, but have not been broadly utilized in TWAS. Here, we present a new method EXPRESSO (EXpression PREdiction with Summary Statistics Only), to analyze sc-eQTL summary statistics, which also integrates 3D genomic data and epigenomic annotation to prioritize causal variants. EXPRESSO substantially improves existing methods. We apply EXPRESSO to analyze multi-ancestry GWAS datasets for 14 autoimmune diseases. EXPRESSO uniquely identifies 958 novel gene x trait associations, which is 26% more than the second-best method. Among them, 492 are unique to cell type level analysis and missed by TWAS using whole blood. We also develop a cell type aware drug repurposing pipeline, which leverages EXPRESSO results to identify drug compounds that can reverse disease gene expressions in relevant cell types. Our results point to multiple drugs with therapeutic potentials, including metformin for type 1 diabetes, and vitamin K for ulcerative colitis.

L Wang, C Khunsriraksakul, H Markus, D Chen, F Zhang, F Chen, X Zhan, L Carrel, DJ Liu, B Jiang

Nat Commun, 2024

DOI

Methylation profiles at birth linked to early childhood obesity

Childhood obesity represents a significant global health concern and identifying risk factors is crucial for developing intervention programs. Many ‘omics’ factors associated with the risk of developing obesity have been identified, including genomic, microbiomic, and epigenomic factors. Here, using a sample of 48 infants, we investigated how the methylation profiles in cord blood and placenta at birth were associated with weight outcomes (specifically, conditional weight gain, body mass index, and weight-for-length ratio) at age six months. We characterized genome-wide DNA methylation profiles using the Illumina Infinium MethylationEpic chip, and incorporated information on child and maternal health, and various environmental factors into the analysis. We used regression analysis to identify genes with methylation profiles most predictive of infant weight outcomes, finding a total of 23 relevant genes in cord blood and 10 in placenta. Notably, in cord blood, the methylation profiles of three genes (PLIN4, UBE2F, and PPP1R16B) were associated with all three weight outcomes, which are also associated with weight outcomes in an independent cohort suggesting a strong relationship with weight trajectories in the first six months after birth. Additionally, we developed a Methylation Risk Score (MRS) that could be used to identify children most at risk for developing childhood obesity. While many of the genes identified by our analysis have been associated with weight-related traits (e.g., glucose metabolism, BMI, or hip-to-waist ratio) in previous genome-wide association and variant studies, our analysis implicated several others, whose involvement in the obesity phenotype should be evaluated in future functional investigations.

D Lariviere, SJC Craig, IM Paul, EE Hohman, JS Savage, RO Wright, F Chiaromonte, KD Makova, ML Reimherr

J Dev Orig Health Dis, 2024

DOI

In vivo detection of DNA secondary structures using Permanganate/S1 Footprinting with Direct Adapter Ligation and Sequencing (PDAL-Seq)

DNA secondary structures are essential elements of the genomic landscape, playing a critical role in regulating various cellular processes. These structures refer to G-quadruplexes, cruciforms, Z-DNA or H-DNA structures, amongst others (collectively called ‘non-B DNA’), which DNA molecules can adopt beyond the B conformation. DNA secondary structures have significant biological roles, and their landscape is dynamic and can rearrange due to various factors, including changes in cellular conditions, temperature, and DNA-binding proteins. Understanding this dynamic nature is crucial for unraveling their functions in cellular processes. Detecting DNA secondary structures remains a challenge. Conventional methods, such as gel electrophoresis and chemical probing, have limitations in terms of sensitivity and specificity. Emerging techniques, including next-generation sequencing and single-molecule approaches, offer promise but face challenges since these techniques are mostly limited to only one type of secondary structure. Here we describe an updated version of a technique, permanganate/S1 nuclease footprinting, which uses potassium permanganate to trap single-stranded DNA regions as found in non-B structures, in combination with S1 nuclease digest and adapter ligation to detect genome-wide non-B formation. To overcome technical hurdles, we combined this method with direct adapter ligation and sequencing (PDAL-Seq). Furthermore, we established a user-friendly pipeline available on Galaxy to standardize PDAL-Seq data analysis. This optimized method allows the analysis of many types of DNA secondary structures that form in a living cell and will advance our knowledge of their roles in health and disease.

A Lahnsteiner, SJC Craig, K Kamali, B Weissensteiner, B McGrath, A Risch, KD Makova

Methods in Enzymology, 2024

DOI

Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes

Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this data set resulted in several findings. First, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Second, our results suggest that BPY2 transcripts and proteins originate from separate genomic regions in bonobo versus human, which is possibly facilitated by acquiring new promoters. Third, our analysis indicates that the PRY gene family, having the highest representation of noncoding transcripts, has been undergoing pseudogenization. Fourth, we have not detected signatures of selection in the five YAG families shared among great apes, even though we identified many species-specific protein-coding transcripts. Fifth, we predicted consensus disorder regions across most gene families and species, which could be used for future investigations of male infertility. Overall, our work illuminates the YAG isoform landscape and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

M Tomaszkiewicz, K Sahlin, P Medvedev, KD Makova

Genome Biol Evol, 2023

DOI

The complete sequence of a human Y chromosome

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.

A Rhie, S Nurk, M Cechova, SJ Hoyt, DJ Taylor, N Altemose, PW Hook, S Koren, M Rautiainen, IA Alexandrov, J Allen, M Asri, AV Bzikadze, NC Chen, CS Chin, M Diekhans, P Flicek, G Formenti, A Fungtammasan, CG Giron, E Garrison, A Gershman, JL Gerton, PGS Grady, A Guarracino, L Haggerty, R Halabian, NF Hansen, R Harris, GA Hartley, WT Harvey, M Haukness, J Heinz, T Hourlier, RM Hubley, SE Hunt, S Hwang, M Jain, RK Kesharwani, AP Lewis, H Li, GA Logsdon, JK Lucas, W Makalowski, C Markovic, FJ Martin, AMM Cartney, RC McCoy, J McDaniel, BM McNulty, P Medvedev, A Mikheenko, KM Munson, TD Murphy, H Olsen, ND Olson, LF Paulin, D Porubsky, T Potapova, F Ryabov, SL Salzberg, MEG Sauria, FJ Sedlazeck, K Shafin, VA Shepelev, A Shumate, JM Storer, L Surapaneni, AMT Oill, F Thibaud-Nissen, W Timp, M Tomaszkiewicz, MR Vollger, BP Walenz, AC Watwood, MH Weissensteiner, AM Wenger, MA Wilson, S Zarate, Y Zhu, JM Zook, EE Eichler, RJ O’Neill, MC Schatz, KH Miga, KD Makova, AM Phillippy

Nature, 2023

DOI

Native American genetic ancestry and pigmentation allele contributions to skin color in a Caribbean population

Our interest in the genetic basis of skin color variation between populations led us to seek a Native American population with genetically African admixture but low frequency of European light skin alleles. Analysis of 458 genomes from individuals residing in the Kalinago Territory of the Commonwealth of Dominica showed approximately 55% Native American, 32% African, and 12% European genetic ancestry, the highest Native American genetic ancestry among Caribbean populations to date. Skin pigmentation ranged from 20 to 80 melanin units, averaging 46. Three albino individuals were determined to be homozygous for a causative multi-nucleotide polymorphism OCA2 NW273KV contained within a haplotype of African origin; its allele frequency was 0.03 and single allele effect size was –8 melanin units. Derived allele frequencies of SLC24A5 A111T and SLC45A2 L374F were 0.14 and 0.06, with single allele effect sizes of –6 and –4, respectively. Native American genetic ancestry by itself reduced pigmentation by more than 20 melanin units (range 24–29). The responsible hypopigmenting genetic variants remain to be identified, since none of the published polymorphisms predicted in prior literature to affect skin color in Native Americans caused detectable hypopigmentation in the Kalinago.

KC Ang, VA Canfield, TC Foster, TD Harbaugh, KA Early, RL Harter, KP Reid, SL Leong, Y Kawasawa, DJ Liu, JW Hawley, KC Cheng

eLife, 2023

DOI

Accurate sequencing of DNA motifs able to form alternative (non-B) structures

Approximately 13% of the human genome at certain motifs have the potential to form noncanonical (non-B) DNA structures (e.g., G-quadruplexes, cruciforms, and Z-DNA), which regulate many cellular processes but also affect the activity of polymerases and helicases. Because sequencing technologies use these enzymes, they might possess increased errors at non-B structures. To evaluate this, we analyzed error rates, read depth, and base quality of Illumina, Pacific Biosciences (PacBio) HiFi, and Oxford Nanopore Technologies (ONT) sequencing at non-B motifs. All technologies showed altered sequencing success for most non-B motif types, although this could be owing to several factors, including structure formation, biased GC content, and the presence of homopolymers. Single-nucleotide mismatch errors had low biases in HiFi and ONT for all non-B motif types but were increased for G-quadruplexes and Z-DNA in all three technologies. Deletion errors were increased for all non-B types but Z-DNA in Illumina and HiFi, as well as only for G-quadruplexes in ONT. Insertion errors for non-B motifs were highly, moderately, and slightly elevated in Illumina, HiFi, and ONT, respectively. Additionally, we developed a probabilistic approach to determine the number of false positives at non-B motifs depending on sample size and variant frequency, and applied it to publicly available data sets (1000 Genomes, Simons Genome Diversity Project, and gnomAD). We conclude that elevated sequencing errors at non-B DNA motifs should be considered in low-read-depth studies (single-cell, ancient DNA, and pooled-sample population sequencing) and in scoring rare variants. Combining technologies should maximize sequencing accuracy in future studies of non-B DNA.

MH Weissensteiner, MA Cremona, WM Guiblet, N Stoler, RS Harris, M Cechova, KA Eckert, F Chiaromonte, YF Huang, KD Makova

Genome Res, 2023

DOI

Whole-genome sequence and assembly of the Javan gibbon (Hylobates moloch)

The Javan gibbon, Hylobates moloch, is an endangered gibbon species restricted to the forest remnants of western and central Java, Indonesia, and one of the rarest of the Hylobatidae family. Hylobatids consist of 4 genera (Hoolock, Hylobates, Symphalangus, and Nomascus) that are characterized by different numbers of chromosomes, ranging from 38 to 52. The underlying cause of this karyotype plasticity is not entirely understood, at least in part, due to the limited availability of genomic data. Here we present the first scaffold-level assembly for H. moloch using a combination of whole-genome Illumina short reads, 10X Chromium linked reads, PacBio, and Oxford Nanopore long reads and proximity-ligation data. This Hylobates genome represents a valuable new resource for comparative genomics studies in primates.

M Escalona, J VanCampen, NW Maurer, M Haukness, M Okhovat, RS Harris, A Watwood, GA Hartley, RJ O’Neill, P Medvedev, KD Makova, C Vollmers, L Carbone, RE Green

J Hered, 2023

DOI

Probabilistic K-means with Local Alignment for Clustering and Motif Discovery in Functional Data

We develop a new method to locally cluster curves and discover functional motifs, that is, typical shapes that may recur several times along and across the curves capturing important local characteristics. In order to identify these shared curve portions, our method leverages ideas from functional data analysis (joint clustering and alignment of curves), bioinformatics (local alignment through the extension of high similarity seeds) and fuzzy clustering (curves belonging to more than one cluster, if they contain more than one typical shape). It can employ various dissimilarity measures and incorporate derivatives in the discovery process, thus exploiting complex facets of shapes. We demonstrate the performance of our method with an extensive simulation study, and show how it generalizes other clustering methods for functional data. Finally, we provide real data applications to Italian Covid-19 death curves and Omics data related to mutagenesis.

MA Cremona, F Chiaromonte

JCGS, 2023

DOI

Multi-ancestry and multi-trait genome-wide association meta-analyses inform clinical risk prediction for systemic lupus erythematosus

Systemic lupus erythematosus is a heritable autoimmune disease that predominantly affects young women. To improve our understanding of genetic etiology, we conduct multi-ancestry and multi-trait meta-analysis of genome-wide association studies, encompassing 12 systemic lupus erythematosus cohorts from 3 different ancestries and 10 genetically correlated autoimmune diseases, and identify 16 novel loci. We also perform transcriptome-wide association studies, computational drug repurposing analysis, and cell type enrichment analysis. We discover putative drug classes, including a histone deacetylase inhibitor that could be repurposed to treat lupus. We also identify multiple cell types enriched with putative target genes, such as non-classical monocytes and B cells, which may be targeted for future therapeutics. Using this newly assembled result, we further construct polygenic risk score models and demonstrate that integrating polygenic risk score with clinical lab biomarkers improves the diagnostic accuracy of systemic lupus erythematosus using the Vanderbilt BioVU and Michigan Genomics Initiative biobanks.

C Khunsriraksakul, Q Li, H Markus, MT Patrick, R Sauteraud, D McGuire, X Wang, C Wang, L Wang, S Chen, G Shenoy, B Li, X Zhong, NJ Olsen, L Carrel, LC Tsoi, B Jiang, DJ Liu

Nat Commun, 2023

DOI

Constructing a polygenic risk score for childhood obesity using functional data analysis

Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain—a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response. 
Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes, and it is shown how FDA approaches can be used as an alternative to the traditional GWAS.

SJC Craig, AM Kenney, J Lin, IM Paul, LL Birch, JS Savage, ME Marini, F Chiaromonte, ML Reimherr, KD Makova

Econom Stat, 2023

DOI

Variation in G-quadruplex sequence and topology differentially impacts human DNA polymerase fidelity

G-quadruplexes (G4s), a type of non-B DNA, play important roles in a wide range of molecular processes, including replication, transcription, and translation. Genome integrity relies on efficient and accurate DNA synthesis, and is compromised by various stressors, to which non-B DNA structures such as G4s can be particularly vulnerable. However, the impact of G4 structures on DNA polymerase fidelity is largely unknown. Using an in vitro forward mutation assay, we investigated the fidelity of human DNA polymerases delta (δ4, four-subunit), eta (η), and kappa (κ) during synthesis of G4 motifs representing those in the human genome. The motifs differ in sequence, topology, and stability, features that may affect DNA polymerase errors. Polymerase error rate hierarchy (δ4 < κ < η) is largely maintained during G4 synthesis. Importantly, we observed unique polymerase error signatures during synthesis of VEGF G4 motifs, stable G4s which form parallel topologies. These statistically significant errors occurred within, immediately flanking, and encompassing the G4 motif. For pol δ4, the errors were deletions, insertions and complex errors within the G4 or encompassing the G4 motif and surrounding sequence. For pol η, the errors occurred in 3’ sequences flanking the G4 motif. For pol κ, the errors were frameshift mutations within G-tracts of the G4. Because these error signatures were not observed during synthesis of an antiparallel G4 and, to a lesser extent, a hybrid G4, we suggest that G4 topology and/or stability could influence polymerase fidelity. Using in silico analyses, we show that most polymerase errors are predicted to have minimal effects on predicted G4 stability. Our results provide a unique view of G4s not previously elucidated, showing that G4 motif heterogeneity differentially influences polymerase fidelity within the motif and flanking sequences. Thus, our study advances the understanding of how DNA polymerase errors contribute to G4 mutagenesis.

ME Stein, SE Hile, MH Weissensteiner, M Lee, S Zhang, E Kejnovský, I Kejnovská, KD Makova, KA Eckert

DNA Repair, 2022

DOI

Construction and Application of Polygenic Risk Scores in Autoimmune Diseases

Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with autoimmune diseases and provided unique mechanistic insights and informed novel treatments. These individual genetic variants on their own typically confer a small effect on disease risk with limited predictive power; however, when aggregated (e.g., via polygenic risk score method), they could provide meaningful risk predictions for a myriad of diseases. In this review, we describe the recent advances in GWAS for autoimmune diseases and the practical application of this knowledge to predict an individual’s susceptibility/severity for autoimmune diseases such as systemic lupus erythematosus (SLE) via the polygenic risk score method. We provide an overview of methods for deriving different polygenic risk scores and discuss the strategies to integrate additional information from correlated traits and diverse ancestries. We further advocate for the need to integrate clinical features (e.g., anti-nuclear antibody status) with genetic profiling to better identify patients at high risk of disease susceptibility/severity even before clinical signs or symptoms develop. We conclude by discussing future challenges and opportunities of applying polygenic risk score methods in clinical care.

C Khunsriraksakul, H Markus, NJ Olsen, L Carrel, B Jiang, DJ Liu

Front Immunol, 2022

DOI

Integrating 3D genomic and epigenomic data to enhance target gene discovery and drug repurposing in transcriptome-wide association studies

Transcriptome-wide association studies (TWAS) are popular approaches to test for association between imputed gene expression levels and traits of interest. Here, we propose an integrative method PUMICE (Prediction Using Models Informed by Chromatin conformations and Epigenomics) to integrate 3D genomic and epigenomic data with expression quantitative trait loci (eQTL) to more accurately predict gene expressions. PUMICE helps define and prioritize regions that harbor cis-regulatory variants, which outperforms competing methods. We further describe an extension to our method PUMICE+, which jointly combines TWAS results from single- and multi-tissue models. Across 79 traits, PUMICE+ identifies 22% more independent novel genes and increases median chi-square statistics values at known loci by 35% compared to the second-best method, as well as achieves the narrowest credible interval size. Lastly, we perform computational drug repurposing and confirm that PUMICE+ outperforms other TWAS methods.

C Khunsriraksakul, D McGuire, R Sauteraud, F Chen, L Yang, L Wang, J Hughey, S Eckert, JD Weissenkampen, G Shenoy, O Marx, L Carrel, B Jiang, DJ Liu

Nat Commun, 2022

DOI

Advanced age increases frequencies of de novo mitochondrial mutations in macaque oocytes and somatic tissues

Mutations in mitochondrial DNA (mtDNA) contribute to multiple diseases. However, how new mtDNA mutations arise and accumulate with age remains understudied because of the high error rates of current sequencing technologies. Duplex sequencing reduces error rates by several orders of magnitude via independently tagging and analyzing each of the two template DNA strands. Here, using duplex sequencing, we obtained high-quality mtDNA sequences for somatic tissues (liver and skeletal muscle) and single oocytes of 30 unrelated rhesus macaques, from 1 to 23 y of age. Sequencing single oocytes minimized effects of natural selection on germline mutations. In total, we identified 17,637 tissue-specific de novo mutations. Their frequency increased ∼3.5-fold in liver and ∼2.8-fold in muscle over the ∼20 y assessed. Mutation frequency in oocytes increased ∼2.5-fold until the age of 9 y, but did not increase after that, suggesting that oocytes of older animals maintain the quality of their mtDNA. We found the light-strand origin of replication (OriL) to be a hotspot for mutation accumulation with aging in liver. Indeed, the 33-nucleotide-long OriL harbored 12 variant hotspots, 10 of which likely disrupt its hairpin structure and affect replication efficiency. Moreover, in somatic tissues, protein-coding variants were subject to positive selection (potentially mitigating toxic effects of mitochondrial activity), the strength of which increased with the number of macaques harboring variants. Our work illuminates the origins and accumulation of somatic and germline mtDNA mutations with aging in primates and has implications for delayed reproduction in modern human societies.

B Arbeithuber, MA Cremona, J Hester, A Barrett, B Higgins, K Anthony, F Chiaromonte, FJ Diaz, KD Makova

Proc Natl Acad Sci, 2022

DOI

Metabolomic profiling of stool of two-year old children from the INSIGHT study reveals links between butyrate and child weight outcomes

Background: Metabolomic analysis is commonly used to understand the biological underpinning of diseases such as obesity. However, our knowledge of gut metabolites related to weight outcomes in young children is currently limited. Objectives: To (1) explore the relationships between metabolites and child weight outcomes, (2) determine the potential effect of covariates (e.g., child’s diet, maternal health/habits during pregnancy, etc.) in the relationship between metabolites and child weight outcomes, and (3) explore the relationship between selected gut metabolites and gut microbiota abundance. Methods: Using ¹H-NMR, we quantified 30 metabolites from stool samples of 170 two-year-old children. To identify metabolites and covariates associated with children’s weight outcomes (BMI [weight/height²], BMI z-score [BMI adjusted for age and sex], and growth index [weight/height]), we analysed the ¹H-NMR data, along with 20 covariates recorded on children and mothers, using LASSO and best subset selection regression techniques. Previously characterized microbiota community information from the same stool samples was used to determine associations between selected gut metabolites and gut microbiota. Results: At age 2 years, stool butyrate concentration had a significant positive association with child BMI (p-value = 3.58 × 10⁻⁴), BMI z-score (p-value = 3.47 × 10⁻⁴), and growth index (p-value = 7.73 × 10⁻⁴). Covariates such as maternal smoking during pregnancy are important to consider. Butyrate concentration was positively associated with the abundance of the bacterial genus Faecalibacterium (p-value = 9.61 × 10⁻³). Conclusions: Stool butyrate concentration is positively associated with increased child weight outcomes and should be investigated further as a factor affecting childhood obesity.

D Nandy, SJC Craig, J Cai, Y Tian, IM Paul, JS Savage, ME Marini, EE Hohman, ML Reimherr, AD Patterson, KD Makova, F Chiaromonte

Pediatr Obes, 2022

DOI

INSIGHT responsive parenting educational intervention for firstborns is associated with growth of second-born siblings

Objective: The aim of this study was to test whether the Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) responsive parenting (RP) intervention, delivered to parents of firstborn children, is associated with the BMI of first- and second-born siblings during infancy. Methods: Participants included 117 firstborn infants enrolled in a randomized controlled trial and their second-born siblings enrolled in an observation-only ancillary study. The RP curriculum for firstborn children included guidance on feeding, sleep, interactive play, and emotion regulation. The control curriculum focused on safety. Anthropometrics were measured in both siblings at ages 3, 16, 28, and 52 weeks. Growth curve models for BMI by child age were fit. Results: Second-born children were delivered 2.5 (SD 0.9) years after firstborns. Firstborn and second-born children whose parents received the RP intervention with their first child had BMI that was 0.44 kg/m² (95% CI: -0.82 to 0.06) and 0.36 kg/m² (95% CI: -0.75 to 0.03) lower than controls, respectively. Linear and quadratic growth rates for BMI for firstborn and second-born cohorts were similar, but second-born children had a greater average BMI at 1 year of age (difference = -0.33 [95% CI: -0.52 to -0.15]). Conclusions: A RP educational intervention for obesity prevention delivered to parents of firstborns appears to spill over to second-born siblings.

JS Savage, AK Hochgraf, E Loken, ME Marini, SJC Craig, KD Makova, LL Birch, IM Paul

Obesity, 2022

DOI

Associations between stool micro-transcriptome, gut microbiota, and infant growth

Rapid infant growth increases the risk for adult obesity. The gut microbiome is associated with early weight status; however, no study has examined how interactions between microbial and host ribonucleic acid (RNA) expression influence infant growth. We hypothesized that dynamics in infant stool micro-ribonucleic acids (miRNAs) would be associated with both microbial activity and infant growth via putative metabolic targets. Stool was collected twice from 30 full-term infants, at 1 month and again between 6 and 12 months. Stool RNA were measured with high-throughput sequencing and aligned to human and microbial databases. Infant growth was measured by weight-for-length z-score at birth and 12 months. Increased RNA transcriptional activity of Clostridia (R = 0.55; Adj p = 3.7E-2) and Burkholderia (R = -0.820, Adj p = 2.62E-3) were associated with infant growth. Of the 25 human RNAs associated with growth, 16 were miRNAs. The miRNAs demonstrated significant target enrichment (Adj p < 0.05) for four metabolic pathways. There were four associations between growth-related miRNAs and growth-related phyla. We have shown that longitudinal trends in gut microbiota activity and human miRNA levels are associated with infant growth and the metabolic targets of miRNAs suggest these molecules may regulate the biosynthetic landscape of the gut and influence microbial activity.

MC Carney, X Zhan, A Rangnekar, MZ Chroneos, SJC Craig, KD Makova, IM Paul, SD Hicks

J Dev Orig Health Dis, 2021

DOI

Inferring genes that escape X-Chromosome inactivation reveals important contribution of variable escape genes to sex-biased diseases

The X Chromosome plays an important role in human development and disease. However, functional genomic and disease association studies of X genes greatly lag behind autosomal gene studies, in part owing to the unique biology of X-Chromosome inactivation (XCI). Because of XCI, most genes are only expressed from one allele. Yet, ∼30% of X genes ‘escape’ XCI and are transcribed from both alleles, many only in a proportion of the population. Such interindividual differences are likely to be disease relevant, particularly for sex-biased disorders. To understand the functional biology for X-linked genes, we developed X-Chromosome inactivation for RNA-seq (XCIR), a novel approach to identify escape genes using bulk RNA-seq data. Our method, available as an R package, is more powerful than alternative approaches and is computationally efficient to handle large population-scale data sets. Using annotated XCI states, we examined the contribution of X-linked genes to the disease heritability in the United Kingdom Biobank data set. We show that escape and variable escape genes explain the largest proportion of X heritability, which is in large part attributable to X genes with Y homology. Finally, we investigated the role of each XCI state in sex-biased diseases and found that although XY homologous gene pairs have a larger overall effect size, enrichment for variable escape genes is significantly increased in female-biased diseases. Our results, for the first time, quantitate the importance of variable escape genes for the etiology of sex-biased disease, and our pipeline allows analysis of larger data sets for a broad range of phenotypes.

R Sauteraud, JM Stahl, J James, M Englebright, F Chen, X Zhan, L Carrel, DJ Liu

Genome Res, 2021

DOI

Functional data analysis characterizes the shapes of the first COVID-19 epidemic wave in Italy

We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and socio-demographic, infrastructural and environmental covariates. Notwithstanding limitations in accuracy and resolution of the data available from public sources, we pinpoint significant trends exploiting information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics: an “exponential” one unfolding in Lombardia and the worst hit areas of the north, and a milder, “flat(tened)” one in the rest of the country, including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, also when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and workplaces to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.

T Boschi, J Di Iorio, L Testa, MA Cremona, F Chiaromonte

Sci Rep, 2021

DOI

Selection and thermostability suggest G-quadruplexes are novel functional elements of the human genome

Approximately 1% of the human genome has the ability to fold into G-quadruplexes (G4s), noncanonical strand-specific DNA structures forming at G-rich motifs. G4s regulate several key cellular processes (e.g., transcription) and have been hypothesized to participate in others (e.g., firing of replication origins). Moreover, G4s differ in their thermostability, and this may affect their function. Yet, G4s may also hinder replication, transcription, and translation and may increase genome instability and mutation rates. Therefore, depending on their genomic location, thermostability, and functionality, G4 loci might evolve under different selective pressures, which has never been investigated. Here we conducted the first genome-wide analysis of G4 distribution, thermostability, and selection. We found an overrepresentation, high thermostability, and purifying selection for G4s within genic components in which they are expected to be functional: promoters, CpG islands, and 5’ and 3’ UTRs. A similar pattern was observed for G4s within replication origins, enhancers, eQTLs, and TAD boundary regions, strongly suggesting their functionality. In contrast, G4s on the nontranscribed strand of exons were underrepresented, were unstable, and evolved neutrally. In general, G4s on the nontranscribed strand of genic components had lower density and were less stable than those on the transcribed strand, suggesting that the former are avoided at the RNA level. Across the genome, purifying selection was stronger at stable G4s. Our results suggest that purifying selection preserves the sequences of functional G4s, whereas nonfunctional G4s are too costly to be tolerated in the genome. Thus, G4s are emerging as fundamental, functional genomic elements.

WM Guiblet, M DeGiorgio, X Cheng, F Chiaromonte, KA Eckert, YF Huang, KD Makova

Genome Res, 2021

DOI

Prothrombotic variants as modifiers of clinical phenotype in four related individuals with haemophilia A

Haemophilia A (HA) is an X-linked bleeding disorder that results from coagulation factor VIII deficiency. Residual factor VIII activity levels (FVIII:C) largely reflect F8 gene mutations and are used to classify HA as severe (<1%), moderate (1–5%) or mild (>5%). However, FVIII:C may differ among individuals carrying the same F8 mutation [1]. Furthermore, bleeding phenotypes and FVIII:C can be discordant [2], which poses a particular challenge for mild/moderate individuals [3]. Identification of variants in additional genes involved in haemostasis is important for improving classification and treatment guidelines for individuals with HA. Here, we describe four related males carrying F8 mutation c.494C>T (p.Pro146Leu) with moderate FVIII:C levels. However, clinical severity differs: two are mild and two are severe. The aim of this study was to identify gene variants in these individuals that may explain discordant bleeding phenotypes.

L Carrel, S Arnold-Croop, T Achtermann, F Chen, Y Cheng, D Liu, ME Eyster

Haemophilia, 2021

DOI

Towards complete and error-free genome assemblies of all vertebrate species

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

A Rhie, SA McCarthy, O Fedrigo, J Damas, G Formenti, S Koren, M Uliano-Silva, W Chow, A Fungtammasan, J Kim, C Lee, BJ Ko, M Chaisson, GL Gedman, LJ Cantin, F Thibaud-Nissen, L Haggerty, I Bista, M Smith, B Haase, J Mountcastle, S Winkler, S Paez, J Howard, SC Vernes, TM Lama, F Grutzner, WC Warren, CN Balakrishnan, D Burt, JM George, MT Biegler, D Iorns, A Digby, D Eason, B Robertson, T Edwards, M Wilkinson, G Turner, A Meyer, AF Kautt, P Franchini, HW Detrich 3rd, H Svardal, M Wagner, GJP Naylor, M Pippel, M Malinsky, M Mooney, M Simbirsky, BT Hannigan, T Pesout, M Houck, A Misuraca, SB Kingan, R Hall, Z Kronenberg, I Sović, C Dunn, Z Ning, A Hastie, J Lee, S Selvaraj, RE Green, NH Putnam, I Gut, J Ghurye, E Garrison, Y Sims, J Collins, S Pelan, J Torrance, A Tracey, J Wood, RE Dagnew, D Guan, SE London, DF Clayton, CV Mello, SR Friedrich, PV Lovell, E Osipova, FO Al-Ajli, S Secomandi, H Kim, C Theofanopoulou, M Hiller, Y Zhou, RS Harris, KD Makova, P Medvedev, J Hoffman, P Masterson, K Clark, F Martin, K Howe, P Flicek, BP Walenz, W Kwak, H Clawson, M Diekhans, L Nassar, B Paten, RHS Kraus, AJ Crawford, MTP Gilbert, G Zhang, B Venkatesh, RW Murphy, KP Koepfli, B Shapiro, WE Johnson, F Di Palma, T Marques-Bonet, EC Teeling, T Warnow, JM Graves, OA Ryder, D Haussler, SJ O’Brien, J Korlach, HA Lewin, K Howe, EW Myers, R Durbin, AM Phillippy, ED Jarvis

Nature, 2021

DOI

Model-based assessment of replicability for genome-wide association meta-analysis

Genome-wide association meta-analysis (GWAMA) is an effective approach to enlarge sample sizes and empower the discovery of novel associations between genotype and phenotype. Independent replication has been used as a gold-standard for validating genetic associations. However, as current GWAMA often seeks to aggregate all available datasets, it becomes impossible to find a large enough independent dataset to replicate new discoveries. Here we introduce a method, MAMBA (Meta-Analysis Model-based Assessment of replicability), for assessing the ‘posterior-probability-of-replicability’ for identified associations by leveraging the strength and consistency of association signals between contributing studies. We demonstrate using simulations that MAMBA is more powerful and robust than existing methods, and produces more accurate genetic effects estimates. We apply MAMBA to a large-scale meta-analysis of addiction phenotypes with 1.2 million individuals. In addition to accurately identifying replicable common variant associations, MAMBA also pinpoints novel replicable rare variant associations from imputation-based GWAMA and hence greatly expands the set of analyzable variants.

D McGuire, Y Jiang, M Liu, JD Weissenkampen, S Eckert, L Yang, F Chen, A Berg, S Vrieze, B Jiang, Q Li, DJ Liu

Nat Commun, 2021

DOI

Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome

Approximately 13% of the human genome can fold into non-canonical (non-B) DNA structures (e.g. G-quadruplexes, Z-DNA, etc.), which have been implicated in vital cellular processes. Non-B DNA also hinders replication, increasing errors and facilitating mutagenesis, yet its contribution to genome-wide variation in mutation rates remains unexplored. Here, we conducted a comprehensive analysis of nucleotide substitution frequencies at non-B DNA loci within noncoding, non-repetitive genome regions, their ±2 kb flanking regions, and 1-Megabase windows, using human-orangutan divergence and human single-nucleotide polymorphisms. Functional data analysis at single-base resolution demonstrated that substitution frequencies are usually elevated at non-B DNA, with patterns specific to each non-B DNA type. Mirror, direct and inverted repeats have higher substitution frequencies in spacers than in repeat arms, whereas G-quadruplexes, particularly stable ones, have higher substitution frequencies in loops than in stems. Several non-B DNA types also affect substitution frequencies in their flanking regions. Finally, non-B DNA explains more variation than any other predictor in multiple regression models for diversity or divergence at 1-Megabase scale. Thus, non-B DNA substantially contributes to variation in substitution frequencies at small and large scales. Our results highlight the role of non-B DNA in germline mutagenesis with implications to evolution and genetic diseases.

WM Guiblet, MA Cremona, RS Harris, D Chen, KA Eckert, F Chiaromonte, YF Huang, KD Makova

Nucleic Acids Res, 2021

DOI

Selective synthetic augmentation with HistoGAN for improved histopathology image classification

Histopathological analysis is the present gold standard for precancerous lesion diagnosis. The goal of automated histopathological classification from digital images requires supervised training, which requires a large number of expert annotations that can be expensive and time-consuming to collect. Meanwhile, accurate classification of image patches cropped from whole-slide images is essential for standard sliding window based histopathology slide classification methods. To mitigate these issues, we propose a carefully designed conditional GAN model, namely HistoGAN, for synthesizing realistic histopathology image patches conditioned on class labels. We also investigate a novel synthetic augmentation framework that selectively adds new synthetic image patches generated by our proposed HistoGAN, rather than expanding directly the training set with synthetic images. By selecting synthetic images based on the confidence of their assigned labels and their feature similarity to real labeled images, our framework provides quality assurance to synthetic augmentation. Our models are evaluated on two datasets: a cervical histopathology image dataset with limited annotations, and another dataset of lymph node histopathology images with metastatic cancer. Here, we show that leveraging HistoGAN generated images with selective augmentation results in significant and consistent improvements of classification performance ( and higher accuracy, respectively) for cervical histopathology and metastatic cancer datasets.

Y Xue, J Ye, Q Zhou, LR Long, S Antani, Z Xue, C Cornwell, R Zaino, KC Cheng, X Huang

Med Image Anal, 2021

DOI

Human L1 Transposition Dynamics Unraveled with Functional Data Analysis

Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features—proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.—in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection—depleted of genes and noncoding most conserved elements. Intriguingly, our results suggest that L1 insertions modify local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.

D Chen, MA Cremona, Z Qi, RD Mitra, F Chiaromonte, KD Makova

Mol Biol Evol, 2020

DOI

Dynamic evolution of great ape Y chromosomes

The mammalian male-specific Y chromosome plays a critical role in sex determination and male fertility. However, because of its repetitive and haploid nature, it is frequently absent from genome assemblies and remains enigmatic. The Y chromosomes of great apes represent a particular puzzle: their gene content is more similar between human and gorilla than between human and chimpanzee, even though human and chimpanzee share a more recent common ancestor. To solve this puzzle, here we constructed a dataset including Ys from all extant great ape genera. We generated assemblies of bonobo and orangutan Ys from short and long sequencing reads and aligned them with the publicly available human, chimpanzee, and gorilla Y assemblies. Analyzing this dataset, we found that the genus Pan, which includes chimpanzee and bonobo, experienced accelerated substitution rates. Pan also exhibited elevated gene death rates. These observations are consistent with high levels of sperm competition in Pan. Furthermore, we inferred that the great ape common ancestor already possessed multicopy sequences homologous to most human and chimpanzee palindromes. Nonetheless, each species also acquired distinct ampliconic sequences. We also detected increased chromatin contacts between and within palindromes (from Hi-C data), likely facilitating gene conversion and structural rearrangements. Our results highlight the dynamic mode of Y chromosome evolution and open avenues for studies of male-specific dispersal in endangered great ape species.

M Cechova, R Vegesna, M Tomaszkiewicz, RS Harris, D Chen, S Rangavittal, P Medvedev, KD Makova

Proc Natl Acad Sci, 2020

DOI

Investigation of discordant phenotype in mild Hemophilia A using whole exome sequencing

Keywords: Factor V; Hemophilia A; Prothrombotic gene variants; Rebalanced hemostasis; von Willebrand factor.

PH Cygan, SE Arnold-Croop, EA Weidman, F Chen, DJ Liu, ME Eyster, L Carrel

Thromb Res, 2020

DOI

Age-related accumulation of de novo mitochondrial mutations in mammalian oocytes and somatic tissues

Mutations create genetic variation for other evolutionary forces to operate on and cause numerous genetic diseases. Nevertheless, how de novo mutations arise remains poorly understood. Progress in the area is hindered by the fact that error rates of conventional sequencing technologies (1 in 100 or 1,000 base pairs) are several orders of magnitude higher than de novo mutation rates (1 in 10,000,000 or 100,000,000 base pairs per generation). Moreover, previous analyses of germline de novo mutations examined pedigrees (and not germ cells) and thus were likely affected by selection. Here, we applied highly accurate duplex sequencing to detect low-frequency, de novo mutations in mitochondrial DNA (mtDNA) directly from oocytes and from somatic tissues (brain and muscle) of 36 mice from two independent pedigrees. We found mtDNA mutation frequencies 2- to 3-fold higher in 10-month-old than in 1-month-old mice, demonstrating mutation accumulation during the period of only 9 mo. Mutation frequencies and patterns differed between germline and somatic tissues and among mtDNA regions, suggestive of distinct mutagenesis mechanisms. Additionally, we discovered a more pronounced genetic drift of mitochondrial genetic variants in the germline of older versus younger mice, arguing for mtDNA turnover during oocyte meiotic arrest. Our study deciphered for the first time the intricacies of germline de novo mutagenesis using duplex sequencing directly in oocytes, which provided unprecedented resolution and minimized selection effects present in pedigree studies. Moreover, our work provides important information about the origins and accumulation of mutations with aging/maturation and has implications for delayed reproduction in modern human societies. Furthermore, the duplex sequencing method we optimized for single cells opens avenues for investigating low-frequency mutations in other studies.

B Arbeithuber, J Hester, MA Cremona, N Stoler, A Zaidi, B Higgins, K Anthony, F Chiaromonte, FJ Diaz, KD Makova

PLoS Biol, 2020

DOI

Ampliconic Genes on the Great Ape Y Chromosomes: Rapid Evolution of Copy Number but Conservation of Expression Levels

Multicopy ampliconic gene families on the Y chromosome play an important role in spermatogenesis. Thus, studying their genetic variation in endangered great ape species is critical. We estimated the sizes (copy number) of nine Y ampliconic gene families in population samples of chimpanzee, bonobo, and orangutan with droplet digital polymerase chain reaction, combined these estimates with published data for human and gorilla, and produced genome-wide testis gene expression data for great apes. Analyzing this comprehensive data set within an evolutionary framework, we, first, found high inter- and intraspecific variation in gene family size, with larger families exhibiting higher variation as compared with smaller families, a pattern consistent with random genetic drift. Second, for four gene families, we observed significant interspecific size differences, sometimes even between sister species—chimpanzee and bonobo. Third, despite substantial variation in copy number, Y ampliconic gene families’ expression levels did not differ significantly among species, suggesting dosage regulation. Fourth, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression. Our results indicate high variability in size but conservation in gene expression levels in Y ampliconic gene families, significantly advancing our understanding of Y-chromosome evolution in great apes.

R Vegesna, M Tomaszkiewicz, OA Ryder, R Campos-Sánchez, P Medvedev, M DeGiorgio, KD Makova

GENOME BIOL EVOL, 2020

DOI

Pronounced somatic bottleneck in mitochondrial DNA of human hair

Heteroplasmy is the presence of variable mitochondrial DNA (mtDNA) within the same individual. The dynamics of heteroplasmy allele frequency among tissues of the human body is not well understood. Here, we measured allele frequency at heteroplasmic sites in two to eight hairs from each of 11 humans using next-generation sequencing. We observed a high variance in heteroplasmic allele frequency among separate hairs from the same individual—much higher than that for blood and cheek tissues. Our population genetic modelling estimated the somatic bottleneck during embryonic follicle development of separate hairs to be only 11.06 (95% confidence interval 0.6–34.0) mtDNA segregating units. This bottleneck is much more drastic than somatic bottlenecks for blood and cheek tissues (136 and 458 units, respectively), as well as more drastic than, or comparable to, the germline bottleneck (equal to 25–32 or 7–10 units, depending on the study). We demonstrated that hair undergoes additional genetic drift before and after the divergence of mtDNA lineages of individual hair follicles. Additionally, we showed a positive correlation between donor’s age and variance in heteroplasmy allele frequency in hair. These findings have important implications for forensics and for our understanding of mtDNA dynamics in the human body. This article is part of the theme issue ‘Linking the mitochondrial genotype to phenotype: a complex endeavour’.
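As a rough illustration of why a smaller bottleneck produces more hair-to-hair variance, the following Python sketch simulates a single round of binomial sampling of N segregating units from a starting allele frequency p, for which the variance of the resulting frequency is p(1 − p)/N. The one-step sampling model, the starting frequency of 0.3, the rounding of 11.06 to 11 units, and the function names are all assumptions made for illustration; this is not the population genetic model used in the study.

```python
import random
import statistics


def simulate_frequencies(n_units, start_freq, n_samples, seed=0):
    """Draw n_samples post-bottleneck allele frequencies.

    Each sample picks n_units segregating units independently; a unit
    carries the variant allele with probability start_freq.
    """
    rng = random.Random(seed)
    freqs = []
    for _ in range(n_samples):
        mutant = sum(1 for _ in range(n_units) if rng.random() < start_freq)
        freqs.append(mutant / n_units)
    return freqs


def frequency_variance(n_units, start_freq=0.3, n_samples=2000, seed=0):
    """Variance of allele frequency across simulated samples (e.g. hairs)."""
    freqs = simulate_frequencies(n_units, start_freq, n_samples, seed)
    return statistics.pvariance(freqs)


if __name__ == "__main__":
    # Bottleneck sizes quoted in the abstract (hair value rounded to 11).
    for tissue, n_units in [("hair", 11), ("blood", 136), ("cheek", 458)]:
        simulated = frequency_variance(n_units)
        expected = 0.3 * 0.7 / n_units  # binomial variance p(1 - p) / N
        print(f"{tissue}: N={n_units} simulated={simulated:.5f} expected={expected:.5f}")
```

Under these assumptions the simulated variance tracks p(1 − p)/N, so a bottleneck of about 11 units yields a variance more than ten times larger than one of 136 units.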

A Barrett, B Arbeithuber, A Zaidi, P Wilton, IM Paul, R Nielsen, KD Makova

PHILOS T R SOC B, 2020

DOI

On the bias of H-scores for comparing biclusters, and how to correct it

The H-score (or Mean Squared Residue score) underlies Cheng and Church’s (2000) biclustering algorithm, one of the best-known and most widely employed algorithms in bioinformatics and computational biology, and many subsequent algorithms (e.g. FLOC, Yang et al., 2005 and CBEB, Huang et al., 2012). Cheng and Church’s algorithm has ∼2600 citations to date, 650 since 2015 and 230 in 2018–2019 alone. It was the first to be applied to gene microarray data, and it is one of the main tools available in biclustering packages (e.g. the ‘biclust’ R library) and in gene expression data analysis packages (e.g. IRIS-EDA, Monier et al., 2019). In addition, it is widely used as a benchmark: almost all published biclustering algorithms include a comparison with it. Squared residue measures such as H-scores have a double role in biclustering methods. On the one hand, they are employed by many algorithms as merit functions to guide the discovery of biclusters (see e.g. the reviews in Madeira and Oliveira, 2004; Pontes et al., 2015). On the other hand, they are used to assess solutions—in particular, H-scores are used to assess the ‘homogeneity’ of the discovered biclusters. Both uses involve the comparisons of biclusters which may have different numbers of rows and columns. Our findings document a bias that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster—even in the ‘ideal’ (and simplest) case of a single bicluster generated by a constant model plus a white noise. This biases the H-score, and hence all H-score based algorithms, toward small biclusters. Importantly, our analytical proof provides a straightforward way to correct this bias.
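The size dependence described above can be reproduced with a short simulation. The Python sketch below computes the H-score (mean squared residue) of a matrix and averages it over biclusters drawn from a constant-plus-white-noise model. Under that model, with independent noise of variance σ², a standard degrees-of-freedom argument gives an expected H-score of σ²(n − 1)(m − 1)/(nm) for n rows and m columns, so rescaling by nm/((n − 1)(m − 1)) removes the dependence on size. The function names, the constant 5.0, and this particular rescaling are assumptions for illustration; the correction derived in the paper may be formulated differently.

```python
import random


def h_score(matrix):
    """Mean squared residue of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


def rescaled_h_score(matrix):
    """H-score rescaled by nm / ((n - 1)(m - 1)); an illustrative correction."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows < 2 or n_cols < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    return h_score(matrix) * (n_rows * n_cols) / ((n_rows - 1) * (n_cols - 1))


def simulate_mean(score, n_rows, n_cols, sigma=1.0, reps=2000, seed=0):
    """Average a score over biclusters of constant value plus Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        bicluster = [[5.0 + rng.gauss(0.0, sigma) for _ in range(n_cols)]
                     for _ in range(n_rows)]
        total += score(bicluster)
    return total / reps


if __name__ == "__main__":
    for n_rows, n_cols in [(2, 2), (5, 5), (10, 10)]:
        raw = simulate_mean(h_score, n_rows, n_cols)
        fixed = simulate_mean(rescaled_h_score, n_rows, n_cols)
        print(f"{n_rows}x{n_cols}: raw={raw:.3f} rescaled={fixed:.3f}")
```

With unit noise variance, the raw average should sit near 0.25 for a 2×2 bicluster and near 0.81 for a 10×10 one, while the rescaled score stays near 1 for both, which is why an uncorrected comparison favors small biclusters.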

J Di Iorio, F Chiaromonte, MA Cremona

BIOINFORMATICS, 2020

DOI

Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes

Author summary: The human genome harbors two sex chromosomes—X and Y. Among them, the Y chromosome is present only in males. Deletions of portions of this chromosome have been linked to male infertility; however, exactly why the loss of these genes leads to this condition is not well understood. Here we study a group of Y chromosome genes called ampliconic genes, which are expressed in testis and are frequently deleted in males with infertility. These genes are organized in nine gene families, each of which harbors multiple copies of genes highly similar in sequence. In this study, we aimed to establish a baseline of their variation in copy number and in gene expression—one measure of genes’ functional output—by studying 149 healthy men. We found that testis tolerates a wide range of copy number and expression variation of Y ampliconic genes. Additionally, we demonstrated that gene expression within most Y ampliconic gene families depends on the expression levels of gene family members located outside of the Y chromosome, i.e. they undergo dosage regulation.

R Vegesna, M Tomaszkiewicz, P Medvedev, KD Makova

PLOS GENET, 2019

DOI

A High-Resolution View of Adaptive Event Dynamics in a Plasmid

Coadaptation between bacterial hosts and plasmids frequently results in adaptive changes restricted exclusively to host genome leaving plasmids unchanged. To better understand this remarkable stability, we transformed naïve Escherichia coli cells with a plasmid carrying an antibiotic-resistance gene and forced them to adapt in a turbidostat environment. We then drew population samples at regular intervals and subjected them to duplex sequencing—a technique specifically designed for identification of low-frequency mutations. Variants at ten sites implicated in plasmid copy number control emerged almost immediately, tracked consistently across the experiment’s time points, and faded below detectable frequencies toward the end. This variation crash coincided with the emergence of mutations on the host chromosome. Mathematical modeling of trajectories for adaptive changes affecting plasmid copy number showed that such mutations cannot readily fix or even reach appreciable frequencies. We conclude that there is a strong selection against alterations of copy number even if it can provide a degree of growth advantage. This incentive is likely rooted in the complex interplay between mutated and wild-type plasmids constrained within a single cell and underscores the importance of understanding of intracellular plasmid variability.

H Mei, B Arbeithuber, MA Cremona, M DeGiorgio, A Nekrutenko

GENOME BIOL EVOL, 2019

DOI

DiscoverY: a classifier for identifying Y chromosome sequences in male assemblies

S Rangavittal, N Stopa, M Tomaszkiewicz, K Sahlin, KD Makova, P Medvedev

BMC GENOMICS, 2019

DOI

High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies

Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

M Cechova, RS Harris, M Tomaszkiewicz, B Arbeithuber, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2019

DOI

Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA

The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that the mechanism must be designed with respect to a specific base measure over the output space, such as a Gaussian process. We provide a positive result that establishes a Central Limit Theorem for the exponential mechanism quite broadly. We also provide a negative result, showing that the magnitude of noise introduced for privacy is asymptotically non-negligible relative to the statistical estimation error. We develop an ε-DP mechanism for functional principal component analysis, applicable in separable Hilbert spaces, and demonstrate its performance via simulations and applications to two datasets.

J Awan, A Kenney, M Reimherr, A Slavković

PMLR, 2019

DOI

Functional data analysis for computational biology

MA Cremona, H Xu, KD Makova, M Reimherr, F Chiaromonte, P Madrigal

BIOINFORMATICS, 2019

DOI

Bottleneck and selection in the germline and maternal age influence transmission of mitochondrial DNA in human pedigrees

Mitochondria frequently carry different DNA—a state called heteroplasmy. Heteroplasmic mutations can cause mitochondrial diseases and are involved in cancer and aging, but they are also common in healthy people. Here, we study heteroplasmy in 96 multigenerational healthy families. We show that mothers effectively transmit very few mitochondrial DNA to their offspring. Because of this bottleneck, which intensifies with increasing maternal age at childbirth, mutation frequencies can change dramatically between a mother and her child. Thus, a child might inherit a disease-causing mutation at high frequency from an asymptomatic carrier mother and might develop a disease. We also demonstrate that natural selection acts against disease-causing mutations during germline development. Our study has important implications for genetic counseling of mitochondrial diseases. Heteroplasmy—the presence of multiple mitochondrial DNA (mtDNA) haplotypes in an individual—can lead to numerous mitochondrial diseases. The presentation of such diseases depends on the frequency of the heteroplasmic variant in tissues, which, in turn, depends on the dynamics of mtDNA transmissions during germline and somatic development. Thus, understanding and predicting these dynamics between generations and within individuals is medically relevant. Here, we study patterns of heteroplasmy in 2 tissues from each of 345 humans in 96 multigenerational families, each with, at least, 2 siblings (a total of 249 mother–child transmissions). This experimental design has allowed us to estimate the timing of mtDNA mutations, drift, and selection with unprecedented precision. Our results are remarkably concordant between 2 complementary population-genetic approaches. We find evidence for a severe germline bottleneck (7–10 mtDNA segregating units) that occurs independently in different oocyte lineages from the same mother, while somatic bottlenecks are less severe.
We demonstrate that divergence between mother and offspring increases with the mother’s age at childbirth, likely due to continued drift of heteroplasmy frequencies in oocytes under meiotic arrest. We show that this period is also accompanied by mutation accumulation leading to more de novo mutations in children born to older mothers. We show that heteroplasmic variants at intermediate frequencies can segregate for many generations in the human population, despite the strong germline bottleneck. We show that selection acts during germline development to keep the frequency of putatively deleterious variants from rising. Our findings have important applications for clinical genetics and genetic counseling.

AA Zaidi, PR Wilton, MS-W Su, IM Paul, B Arbeithuber, K Anthony, A Nekrutenko, R Nielsen, KD Makova

P NATL ACAD SCI USA, 2019

DOI

Correcting palindromes in long reads after whole-genome amplification

S Warris, E Schijlen, H van de Geest, R Vegesna, T Hesselink, B Te Lintel Hekkert, G Sanchez Perez, P Medvedev, KD Makova, D de Ridder

BMC GENOMICS, 2018

DOI

Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon

K Sahlin, M Tomaszkiewicz, KD Makova, P Medvedev

NAT COMMUN, 2018

DOI

Child Weight Gain Trajectories Linked To Oral Microbiota Composition

SJC Craig, D Blankenberg, ACL Parodi, IM Paul, LL Birch, JS Savage, ME Marini, JL Stokes, A Nekrutenko, M Reimherr, F Chiaromonte, KD Makova

SCI REP-UK, 2018

DOI

High Levels of Copy Number Variation of Ampliconic Genes across Major Human Y Haplogroups

Because of its highly repetitive nature, the human male-specific Y chromosome remains understudied. It is important to investigate variation on the Y chromosome to understand its evolution and contribution to phenotypic variation, including infertility. Approximately 20% of the human Y chromosome consists of ampliconic regions which include nine multi-copy gene families. These gene families are expressed exclusively in testes and usually implicated in spermatogenesis. Here, to gain a better understanding of the role of the Y chromosome in human evolution and in determining sexually dimorphic traits, we studied ampliconic gene copy number variation in 100 males representing ten major Y haplogroups world-wide. Copy number was estimated with droplet digital PCR. In contrast to low nucleotide diversity observed on the Y in previous studies, here we show that ampliconic gene copy number diversity is very high. A total of 98 copy-number-based haplotypes were observed among 100 individuals, and haplotypes were sometimes shared by males from very different haplogroups, suggesting homoplasies. The resulting haplotypes did not cluster according to major Y haplogroups. Overall, only two gene families (RBMY and TSPY) showed significant differences in copy number among major Y haplogroups, and the haplogroup of a male could not be predicted based on his ampliconic gene copy numbers. Finally, we did not find significant correlations either between copy number variation and individual’s height, or between the former and facial masculinity/femininity. Our results suggest rapid evolution of ampliconic gene copy numbers on the human Y, and we discuss its causes.

D Ye, AA Zaidi, M Tomaszkiewicz, K Anthony, C Liebowitz, M DeGiorgio, MD Shriver, KD Makova

GENOME BIOL EVOL, 2018

DOI

IWTomics: testing high-resolution sequence-based ‘Omics’ data at multiple locations and scales

Summary

With increased generation of high-resolution sequence-based ‘Omics’ data, detecting statistically significant effects at different genomic locations and scales has become key to addressing several scientific questions. IWTomics is an R/Bioconductor package (integrated in Galaxy) that, exploiting sophisticated Functional Data Analysis techniques (i.e. statistical techniques that deal with the analysis of curves), allows users to pre-process, visualize and test these data at multiple locations and scales. The package provides a friendly, flexible and complete workflow that can be employed in many genomic and epigenomic applications.

Availability and implementation

IWTomics is freely available at the Bioconductor website (http://bioconductor.org/packages/IWTomics) and on the main Galaxy instance (https://usegalaxy.org/).

Supplementary information

Supplementary data are available at Bioinformatics online.

MA Cremona, A Pini, F Cumbo, KD Makova, F Chiaromonte, S Vantini

BIOINFORMATICS, 2018

DOI

Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

DNA conformation may deviate from the classical B-form in ∼13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here, we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule Real-Time (SMRT) technology. We show that polymerization speed differs between non-B and B-DNA: It decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates), and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.

WM Guiblet, MA Cremona, M Cechova, RS Harris, I Kejnovská, E Kejnovsky, KA Eckert, F Chiaromonte, KD Makova

GENOME RES, 2018

DOI

RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly

Motivation

The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.

Results

We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.
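The general idea of abundance-based read selection can be sketched in a few lines of Python. The code below counts k-mers across all reads, derives an abundance threshold from k-mers that also occur in trusted Y sequences, and keeps reads in which most k-mers reach that threshold. This is a toy illustration only: the threshold rule (half the median abundance of trusted k-mers), the `min_fraction` parameter, and all function names are hypothetical and are not taken from RecoverY, whose actual procedure is described in the paper and its repository.

```python
from collections import Counter
import statistics


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer abundances across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def choose_threshold(counts, trusted_y_seqs, k):
    """Hypothetical rule: half the median abundance of trusted Y k-mers."""
    trusted = {km for seq in trusted_y_seqs for km in kmers(seq, k)}
    abundances = [counts[km] for km in trusted if km in counts]
    if not abundances:
        raise ValueError("no trusted Y k-mers found in the reads")
    return statistics.median(abundances) / 2


def select_y_reads(reads, trusted_y_seqs, k, min_fraction=0.5):
    """Keep reads whose share of high-abundance k-mers is at least min_fraction."""
    counts = count_kmers(reads, k)
    threshold = choose_threshold(counts, trusted_y_seqs, k)
    selected = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        high = sum(1 for km in read_kmers if counts[km] >= threshold)
        if high / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected
```

In an enriched library, k-mers from the Y should be sampled far more often than background k-mers, so a threshold anchored on trusted Y sequence separates the two groups without manual tuning. Real data would also need handling of reverse complements, sequencing errors, and repeats shared with other chromosomes, all of which this sketch ignores.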

Availability and implementation

Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Supplementary information

Supplementary data are available at Bioinformatics online.

S Rangavittal, RS Harris, M Cechova, M Tomaszkiewicz, R Chikhi, KD Makova, P Medvedev

BIOINFORMATICS, 2017

DOI

Metabolism-related microRNAs in maternal breast milk are influenced by premature delivery

Background

Maternal breast milk (MBM) is enriched in microRNAs, factors that regulate protein translation throughout the human body. MBM from mothers of term and preterm infants differs in nutrient, hormone, and bioactive-factor composition, but the microRNA differences between these groups have not been compared. We hypothesized that gestational age at delivery influences microRNA in MBM, particularly microRNAs involved in immunologic and metabolic regulation.

Methods

MBM from mothers of premature infants (pMBM) obtained 3-4 weeks post delivery was compared with MBM from mothers of term infants obtained at birth (tColostrum) and 3-4 weeks post delivery (tMBM). The microRNA profile in lipid and skim fractions of each sample was evaluated with high-throughput sequencing.

Results

The expression profiles of nine microRNAs in lipid and skim pMBM differed from those in tMBM. Gene targets of these microRNAs were functionally related to elemental metabolism and lipid biosynthesis. The microRNA profile of tColostrum was also distinct from that of pMBM, but it clustered closely with tMBM. Twenty-one microRNAs correlated with gestational age demonstrated limited relationships with method of delivery, but not other maternal-infant factors.

Conclusion

Premature delivery results in a unique MBM microRNA profile with metabolic targets. This suggests that preterm milk may have adaptive functions for growth in premature infants.

MC Carney, A Tarasiuk, SL Diangelo, P Silveyra, A Podany, LL Birch, IM Paul, S Kelleher, SD Hicks

PEDIATR RES, 2017

DOI

Y and W Chromosome Assemblies: Approaches and Discoveries

Hundreds of vertebrate genomes have been sequenced and assembled to date. However, most sequencing projects have ignored the sex chromosomes unique to the heterogametic sex – Y and W – that are known as sex-limited chromosomes (SLCs). Indeed, haploid and repetitive Y chromosomes in species with male heterogamety (XY), and W chromosomes in species with female heterogamety (ZW), are difficult to sequence and assemble. Nevertheless, obtaining their sequences is important for understanding the intricacies of vertebrate genome function and evolution. Recent progress has been made towards the adaptation of next-generation sequencing (NGS) techniques to deciphering SLC sequences. We review here currently available methodology and results with regard to SLC sequencing and assembly. We focus on vertebrates, but bring in some examples from other taxa.

M Tomaszkiewicz, P Medvedev, KD Makova

TRENDS GENET, 2017

DOI

Reverse Transcription Errors and RNA-DNA Differences at Short Tandem Repeats

Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. It has only been evaluated for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately 1 order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.

A Fungtammasan, M Tomaszkiewicz, R Campos-Sánchez, KA Eckert, M DeGiorgio, KD Makova

MOL BIOL EVOL, 2016

DOI

Corrigendum: A genome-wide analysis of common fragile sites: What features determine chromosomal instability in the human genome?

A Fungtammasan, E Walsh, F Chiaromonte, KA Eckert, KD Makova

GENOME RES, 2016

DOI

Integration and Fixation Preferences of Human and Mouse Endogenous Retroviruses Uncovered with Functional Data Analysis

Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs’ integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.

R Campos-Sánchez, MA Cremona, A Pini, F Chiaromonte, KD Makova

PLOS COMPUT BIOL, 2016

DOI

A time- and cost-effective strategy to sequence mammalian Y chromosomes: An application to the de novo assembly of gorilla Y

The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.

M Tomaszkiewicz, S Rangavittal, M Cechova, R Campos-Sánchez, HW Fescemyer, R Harris, D Ye, PCM O’Brien, R Chikhi, OA Ryder, MA Ferguson-Smith, P Medvedev, KD Makova

GENOME RES, 2016

DOI

Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data

Background

Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3′ half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.

Results

Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies - and quite effectively reproduces empirically observed distributions. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. All together, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.

Conclusions

Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.

Y Liu, F Chiaromonte, H Ross, R Malhotra, D Elleder, M Poss

BMC BIOINFORMATICS, 2015

DOI

Improving the power of structural variation detection by augmenting the reference

The uses of the Genome Reference Consortium’s human reference sequence can be roughly categorized into three related but distinct categories: as a representative species genome, as a coordinate system for identifying variants, and as an alignment reference for variation detection algorithms. However, the use of this reference sequence as simultaneously a representative species genome and as an alignment reference leads to unnecessary artifacts for structural variation detection algorithms and limits their accuracy. We show how decoupling these two references and developing a separate alignment reference can significantly improve the accuracy of structural variation detection, lead to improved genotyping of disease related genes, and decrease the cost of studying polymorphism in a population.

J Schröder, S Girirajan, AT Papenfuss, P Medvedev

PLOS ONE, 2015

DOI

Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

A Fungtammasan, G Ananda, SE Hile, MSW Su, C Sun, R Harris, P Medvedev, K Eckert, KD Makova

GENOME RES, 2015

DOI

Using Statistics to Shed Light on the Dynamics of the Human Genome: A Review

In this article we review a number of recent studies in which information derived from genomic alignments and data concerning composition, location and biochemical features of the nuclear DNA are used to investigate salient properties and determinants of change (mutations) in the human genome. The studies under review, all conducted by an interdisciplinary group of investigators at The Pennsylvania State University, required the use of a range of statistical techniques—from regression, to multivariate analysis, to the modeling of latent structures.

F Chiaromonte, KD Makova

CONTRIB STAT, 2015

DOI

The effects of chromatin organization on variation in mutation rates in the genome

The variation in local rates of mutations can affect both the evolution of genes and their function in normal and cancer cells. Deciphering the molecular determinants of this variation will be aided by the elucidation of distinct types of mutations, as they differ in regional preferences and in associations with genomic features. Chromatin organization contributes to regional variation in mutation rates, but its contribution differs among mutation types. In both germline and somatic mutations, base substitutions are more abundant in regions of closed chromatin, perhaps reflecting error accumulation late in replication. By contrast, a distinctive mutational state with very high levels of insertions and deletions (indels) and substitutions is enriched in regions of open chromatin. These associations indicate an intricate interplay between the nucleotide sequence of DNA and its dynamic packaging into chromatin, and have important implications for current biomedical research. This Review focuses on recent studies showing associations between chromatin state and mutation rates, including pairwise and multivariate investigations of germline and somatic (particularly cancer) mutations.

KD Makova, RC Hardison

NAT REV GENET, 2015

DOI

Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA

The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000x per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the effective size of the germline mtDNA bottleneck at only ∼30-35 (interquartile range from 9 to 141). Accounting for heteroplasmies, we estimated the mtDNA germ-line mutation rate at 1.3 × 10^-8 (interquartile range from 4.2 × 10^-9 to 4.1 × 10^-8) mutations per site per year, an order of magnitude higher than for nuclear DNA. Notably, we found a positive association between the number of heteroplasmies in a child and maternal age at fertilization, likely attributable to oocyte aging. This study also took advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novo mutation. Our results can be used to predict the transmission of disease-causing mtDNA variants and illuminate evolutionary dynamics of the mitochondrial genome.

B Rebolledo-Jaramillo, MSW Su, N Stoler, JA McElhoe, B Dickins, D Blankenberg, TS Korneliussen, F Chiaromonte, R Nielsen, MM Holland, IM Paul, A Nekrutenko, KD Makova

P NATL ACAD SCI USA, 2014

DOI

The Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) study

Background

Because early life growth has long-lasting metabolic and behavioral consequences, intervention during this period of developmental plasticity may alter long-term obesity risk. While modifiable factors during infancy have been identified, until recently, preventive interventions had not been tested. The Intervention Nurses Starting Infants Growing on Healthy Trajectories (INSIGHT) Study is a longitudinal, randomized, controlled trial evaluating a responsive parenting intervention designed for the primary prevention of obesity. This “parenting” intervention is being compared with a home safety control among first-born infants and their parents. INSIGHT’s central hypothesis is that responsive parenting and specifically responsive feeding promotes self-regulation and shared parent-child responsibility for feeding, reducing subsequent risk for overeating and overweight.

Methods/Design

316 first-time mothers and their full-term newborns were enrolled from one maternity ward. Two weeks following delivery, dyads were randomly assigned to the “parenting” or “safety” groups. Subsequently, research nurses conduct study visits for both groups consisting of home visits at infant age 3-4, 16, 28, and 40 weeks, followed by annual clinic-based visits at 1, 2, and 3 years. Both groups receive intervention components framed around four behavior states: Sleeping, Fussy, Alert and Calm, and Drowsy. The main study outcome is BMI z-score at age 3 years; additional outcomes include those related to patterns of infant weight gain, infant sleep hygiene and duration, maternal responsiveness and soothing strategies for infant/toddler distress and fussiness, maternal feeding style and infant dietary content and physical activity. Maternal outcomes related to weight status, diet, mental health, and parenting sense of competence are being collected. Infant temperament will be explored as a moderator of parenting effects, and blood is collected to obtain genetic predictors of weight status. Finally, second-born siblings of INSIGHT participants will be enrolled in an observation-only study to explore parenting differences between siblings, their effect on weight outcomes, and carryover effects of INSIGHT interventions to subsequent siblings.

Discussion

With increasing evidence suggesting the importance of early life experiences on long-term health trajectories, the INSIGHT trial has the ability to inform future obesity prevention efforts in clinical settings.

Trial registration

NCT01167270. Registered 21 July 2010.

IM Paul, JS Williams, S Anzman-Frasca, JS Beiler, KD Makova, ME Marini, LB Hess, SE Rzucidlo, N Verdiglione, JA Mindell, LL Birch

BMC PEDIATR, 2014

DOI

Microsatellite Interruptions Stabilize Primate Genomes and Exist as Population-Specific Single Nucleotide Polymorphisms within Individual Human Genomes

Interruptions of microsatellite sequences impact genome evolution and can alter disease manifestation. However, human polymorphism levels at interrupted microsatellites (iMSs) are not known at a genome-wide scale, and the pathways for gaining interruptions are poorly understood. Using the 1000 Genomes Phase-1 variant call set, we interrogated mono-, di-, tri-, and tetranucleotide repeats up to 10 units in length. We detected ~26,000-40,000 iMSs within each of four human population groups (African, European, East Asian, and American). We identified population-specific iMSs within exonic regions, and discovered that known disease-associated iMSs contain alleles present at differing frequencies among the populations. By analyzing longer microsatellites in primate genomes, we demonstrate that single interruptions result in a genome-wide average two- to six-fold reduction in microsatellite mutability, as compared with perfect microsatellites. Centrally located interruptions lowered mutability dramatically, by two to three orders of magnitude. Using a biochemical approach, we tested directly whether the mutability of a specific iMS is lower because of decreased DNA polymerase strand slippage errors. Modeling the adenomatous polyposis coli tumor suppressor gene sequence, we observed that a single base substitution interruption reduced strand slippage error rates five- to 50-fold, relative to a perfect repeat, during synthesis by DNA polymerases α, β, or η. Computationally, we demonstrate that iMSs arise primarily by base substitution mutations within individual human genomes. Our biochemical survey of human DNA polymerase α, β, δ, κ, and η error rates within certain microsatellites suggests that interruptions are created most frequently by low fidelity polymerases. Our combined computational and biochemical results demonstrate that iMSs are abundant in human genomes and are sources of population-specific genetic variation that may affect genome stability. 
The genome-wide identification of iMSs in human populations presented here has important implications for current models describing the impact of microsatellite polymorphisms on gene expression.

G Ananda, SE Hile, A Breski, Y Wang, Y Kelkar, KD Makova, KA Eckert

PLOS GENET, 2014

DOI

Genomic landscape of human, bat, and ex vivo DNA transposon integrations

The integration and fixation preferences of DNA transposons, one of the major classes of eukaryotic transposable elements, have never been evaluated comprehensively on a genome-wide scale. Here, we present a detailed study of the distribution of DNA transposons in the human and bat genomes. We studied three groups of DNA transposons that integrated at different evolutionary times: 1) ancient (>40 My) and currently inactive human elements, 2) younger (<40 My) bat elements, and 3) ex vivo integrations of piggyBat and Sleeping Beauty elements in HeLa cells. Although the distribution of ex vivo elements reflected integration preferences, the distribution of human and (to a lesser extent) bat elements was also affected by selection. We used regression techniques (linear, negative binomial, and logistic regression models with multiple predictors) applied to 20-kb and 1-Mb windows to investigate how the genomic landscape in the vicinity of DNA transposons contributes to their integration and fixation. Our models indicate that genomic landscape explains 16-79% of variability in DNA transposon genome-wide distribution. Importantly, we not only confirmed previously identified predictors (e.g., DNA conformation and recombination hotspots) but also identified several novel predictors (e.g., signatures of double-strand breaks and telomere hexamer). Ex vivo integrations showed a bias toward actively transcribed regions. Older DNA transposons were located in genomic regions scarce in most conserved elements - likely reflecting purifying selection. Our study highlights how DNA transposons are integral to the evolution of bat and human genomes, and has implications for the development of DNA transposon assays for gene therapy and mutagenesis applications.

R Campos-Sánchez, A Kapusta, C Feschotte, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2014

DOI

Development and assessment of an optimized next-generation DNA sequencing approach for the mtgenome using the Illumina MiSeq

The development of molecular tools to detect and report mitochondrial DNA (mtDNA) heteroplasmy will increase the discrimination potential of the testing method when applied to forensic cases. The inherent limitations of the current state-of-the-art, Sanger-based sequencing, including constrictions in speed, throughput, and resolution, have hindered progress in this area. With the advent of next-generation sequencing (NGS) approaches, it is now possible to clearly identify heteroplasmic variants, and at a much lower level than previously possible. However, in order to bring these approaches into forensic laboratories and subsequently as accepted scientific information in a court of law, validated methods will be required to produce and analyze NGS data. We report here on the development of an optimized approach to NGS analysis for the mtDNA genome (mtgenome) using the Illumina MiSeq instrument. This optimized protocol allows for the production of more than 5 gigabases of mtDNA sequence per run, sufficient for detection and reliable reporting of minor heteroplasmic variants down to approximately 0.5-1.0% when multiplexing twelve samples. Depending on sample throughput needs, sequence coverage rates can be set at various levels, but were optimized here for at least 5000 reads. In addition, analysis parameters are provided for a commercially available software package that identify the highest quality sequencing reads and effectively filter out sequencing-based noise. With this method it will be possible to measure the rates of low-level heteroplasmy across the mtgenome, evaluate the transmission of heteroplasmy between the generations of maternal lineages, and assess the drift of variant sequences between different tissue types within an individual.

JA McElhoe, MM Holland, KD Makova, MSW Su, IM Paul, CH Baker, SA Faith, B Young

FORENSIC SCI INT-GEN, 2014

DOI

Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach

Polymorphism discovery is a routine application of next-generation sequencing technology where multiple samples are sent to a service provider for library preparation, subsequent sequencing, and bioinformatic analyses. The decreasing cost and advances in multiplexing approaches have made it possible to analyze hundreds of samples at a reasonable cost. However, because of the manual steps involved in the initial processing of samples and handling of sequencing equipment, cross-contamination remains a significant challenge. It is especially problematic in cases where polymorphism frequencies do not adhere to diploid expectation, for example, heterogeneous tumor samples, organellar genomes, as well as during bacterial and viral sequencing. In these instances, low levels of contamination may be readily mistaken for polymorphisms, leading to false results. Here we describe practical steps designed to reliably detect contamination and uncover its origin, and also provide new, Galaxy-based, readily accessible computational tools and workflows for quality control. All results described in this report can be reproduced interactively on the web as described at http://usegalaxy.org/contamination.

B Dickins, B Rebolledo-Jaramillo, MSW Su, IM Paul, D Blankenberg, N Stoler, KD Makova, A Nekrutenko

BIOTECHNIQUES, 2014

DOI

Segmenting the human genome based on states of neutral genetic divergence

Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human-orangutan alignments of neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states - each uniquely characterized by specific combinations of divergence levels. We then parsed the mutagenic contributions of various biochemical processes associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes. We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants (including those associated with cancer and other diseases) and to improve computational predictions of noncoding functional elements.

P Kuruppumullage Don, G Ananda, F Chiaromonte, KD Makova

P NATL ACAD SCI USA, 2013

DOI

Mature microsatellites: Mechanisms underlying dinucleotide microsatellite mutational biases in human cells

Dinucleotide microsatellites are dynamic DNA sequences that affect genome stability. Here, we focused on mature microsatellites, defined as pure repeats of lengths above the threshold and unlikely to mutate below it in a single mutational event. We investigated the prevalence and mutational behavior of these sequences by using human genome sequence data, human cells in culture, and purified DNA polymerases. Mature dinucleotides (≥10 units) are present within exonic sequences of >350 genes, resulting in vulnerability to cellular genetic integrity. Mature dinucleotide mutagenesis was examined experimentally using ex vivo and in vitro approaches. We observe an expansion bias for dinucleotide microsatellites up to 20 units in length in somatic human cells, in agreement with previous computational analyses of germline biases. Using purified DNA polymerases and human cell lines deficient for mismatch repair (MMR), we show that the expansion bias is caused by functional MMR and is not due to DNA polymerase error biases. Specifically, we observe that the MutSα and MutLα complexes protect against expansion mutations. Our data support a model wherein different MMR complexes shift the balance of mutations toward deletion or expansion. Finally, we show that replication fork progression is stalled within long dinucleotides, suggesting that mutational mechanisms within long repeats may be distinct from shorter lengths, depending on the biochemistry of fork resolution. Our work combines computational and experimental approaches to explain the complex mutational behavior of dinucleotide microsatellites in humans.

BA Baptiste, G Ananda, N Strubczewski, A Lutzkanin, SJ Khoo, A Srikanth, N Kim, KD Makova, MM Krasilnikova, KA Eckert

G3-GENES GENOM GENET, 2013

DOI

Distinct mutational behaviors differentiate short tandem repeats from microsatellites in the human genome

A tandem repeat’s (TR) propensity to mutate increases with repeat number, and can become very pronounced beyond a critical boundary, transforming it into a microsatellite (MS). However, a clear understanding of the mutational behavior of different TR classes and motifs and related mechanisms is lacking, as is a consensus on the existence of a boundary separating short TRs (STRs) from MSs. This hinders our understanding of MSs’ mutational properties and their effective use as genetic markers. Using indel calls for 179 individuals from 1000 Genomes Pilot-1 Project, we determined polymorphism incidence for four major TR classes, and formalized its varying relationship with repeat number using segmented regression. We observed a biphasic regime with a transition from a faster to a slower exponential growth at 9, 5, 4, and 4 repeats for mono-, di-, tri-, and tetranucleotide TRs, respectively. We used an in vitro mutagenesis assay to evaluate the contribution of strand slippage errors to mutability. STRs and MSs differ in their absolute polymorphism levels, but more importantly in their rates of mutability growth. Although strand slippage is a major factor driving mononucleotide polymorphism incidence, dinucleotide polymorphism incidence is greater than that expected due to strand slippage alone, indicating that additional cellular factors might be driving dinucleotide mutability in the human genome. Leveraging on hundreds of human genomes, we present the first comprehensive, genome-wide analysis of TR mutational behavior, encompassing several motif sizes and compositions.

G Ananda, E Walsh, KD Jacob, M Krasilnikova, KA Eckert, F Chiaromonte, KD Makova

GENOME BIOL EVOL, 2013

DOI

Rescuing Alu: Recovery of New Inserts Shows LINE-1 Preserves Alu Activity through A-Tail Expansion

Alu elements are trans-mobilized by the autonomous non-LTR retroelement, LINE-1 (L1). Alu-induced insertion mutagenesis contributes to about 0.1% human genetic disease and is responsible for the majority of the documented instances of human retroelement insertion-induced disease. Here we introduce a SINE recovery method that provides a complementary approach for comprehensive analysis of the impact and biological mechanisms of Alu retrotransposition. Using this approach, we recovered 226 de novo tagged Alu inserts in HeLa cells. Our analysis reveals that in human cells marked Alu inserts driven by either exogenously supplied full length L1 or ORF2 protein are indistinguishable. Four percent of de novo Alu inserts were associated with genomic deletions and rearrangements and lacked the hallmarks of retrotransposition. In contrast to L1 inserts, 5′ truncations of Alu inserts are rare, as most of the recovered inserts (96.5%) are full length. De novo Alus show a random pattern of insertion across chromosomes, but further characterization revealed an Alu insertion bias exists favoring insertion near other SINEs, highly conserved elements, with almost 60% landing within genes. De novo Alu inserts show no evidence of RNA editing. Priming for reverse transcription rarely occurred within the first 20 bp (most 5′) of the A-tail. The A-tails of recovered inserts show significant expansion, with many at least doubling in length. Sequence manipulation of the construct led to the demonstration that the A-tail expansion likely occurs during insertion due to slippage by the L1 ORF2 protein. We postulate that the A-tail expansion directly impacts Alu evolution by reintroducing new active source elements to counteract the natural loss of active Alus and minimizing Alu extinction.

BJ Wagstaff, DJ Hedges, RS Derbes, R Campos-Sánchez, F Chiaromonte, KD Makova, AM Roy-Engel

PLOS GENET, 2012

DOI

A genome-wide analysis of common fragile sites: What features determine chromosomal instability in the human genome?

Chromosomal common fragile sites (CFSs) are unstable genomic regions that break under replication stress and are involved in structural variation. They frequently are sites of chromosomal rearrangements in cancer and of viral integration. However, CFSs are undercharacterized at the molecular level and thus difficult to predict computationally. Newly available genome-wide profiling studies provide us with an unprecedented opportunity to associate CFSs with features of their local genomic contexts. Here, we contrasted the genomic landscape of cytogenetically defined aphidicolin-induced CFSs (aCFSs) to that of nonfragile sites, using multiple logistic regression. We also analyzed aCFS breakage frequencies as a function of their genomic landscape, using standard multiple regression. We show that local genomic features are effective predictors both of regions harboring aCFSs (explaining ∼77% of the deviance in logistic regression models) and of aCFS breakage frequencies (explaining ∼45% of the variance in standard regression models). In our optimal models (having highest explanatory power), aCFSs are predominantly located in G-negative chromosomal bands and away from centromeres, are enriched in Alu repeats, and have high DNA flexibility. In alternative models, CpG island density, transcription start site density, H3K4me1 coverage, and mononucleotide microsatellite coverage are significant predictors. Also, aCFSs have high fragility when colocated with evolutionarily conserved chromosomal breakpoints. Our models are predictive of the fragility of aCFSs mapped at a higher resolution. Importantly, the genomic features we identified here as significant predictors of fragility allow us to draw valuable inferences on the molecular mechanisms underlying aCFSs.

A Fungtammasan, E Walsh, F Chiaromonte, KA Eckert, KD Makova

GENOME RES, 2012

DOI

A matter of life or death: How microsatellites emerge in and vanish from the human genome

Microsatellites, tandem repeats of short DNA motifs, are abundant in the human genome and have high mutation rates. While microsatellite instability is implicated in numerous genetic diseases, the molecular processes involved in their emergence and disappearance are still not well understood. Microsatellites are hypothesized to follow a life cycle, wherein they are born and expand into adulthood, until their degradation and death. Here we identified microsatellite births/deaths in human, chimpanzee, and orangutan genomes, using macaque and marmoset as outgroups. We inferred mutations causing births/deaths based on parsimony, and investigated local genomic environments affecting them. We also studied birth/death patterns within transposable elements (Alus and L1s), coding regions, and disease-associated loci. We observed that substitutions were the predominant cause for births of short microsatellites, while insertions and deletions were important for births of longer microsatellites. Substitutions were the cause for deaths of microsatellites of virtually all lengths. AT-rich L1 sequences exhibited elevated frequency of births/deaths over their entire length, while GC-rich Alus only in their 3′ poly(A) tails and middle A-stretches, with differences depending on transposable element integration timing. Births/deaths were strongly selected against in coding regions. Births/deaths occurred in genomic regions with high substitution rates, protomicrosatellite content, and L1 density, but low GC content and Alu density. The majority of the 17 disease-associated microsatellites examined are evolutionarily ancient (were acquired by the common ancestor of simians). Our genome-wide investigation of microsatellite life cycle has fundamental applications for predicting the susceptibility of birth/death of microsatellites, including many disease-causing loci.

YD Kelkar, KA Eckert, F Chiaromonte, KD Makova

GENOME RES, 2011

DOI

Harnessing cloud computing with Galaxy Cloud

E Afgan, D Baker, N Coraor, H Goto, IM Paul, KD Makova, A Nekrutenko, J Taylor

NAT BIOTECHNOL, 2011

DOI

Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study

Background

Originally believed to be a rare phenomenon, heteroplasmy - the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual - is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission.

Results

Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, we devised a set of criteria for detecting polymorphic sites in heterogeneous genetic samples that is resistant to the noise originating from massively parallel sequencing technologies. Application of these criteria to nine human mtDNA samples revealed four heteroplasmic sites.

Conclusions

Our results suggest that the incidence of heteroplasmy may be lower than estimated in some other recent re-sequencing studies, and that mtDNA allelic frequencies differ significantly both between tissues of the same individual and between a mother and her offspring. We designed our study in such a way that the complete analysis described here can be repeated by anyone either at our site or directly on the Amazon Cloud. Our computational pipeline can be easily modified to accommodate other applications, such as viral re-sequencing.

H Goto, B Dickins, E Afgan, IM Paul, J Taylor, KD Makova, A Nekrutenko

GENOME BIOL, 2011

DOI

A genome-wide view of mutation rate co-variation using multivariate analyses

Background

While the abundance of available sequenced genomes has led to many studies of regional heterogeneity in mutation rates, the co-variation among rates of different mutation types remains largely unexplored, hindering a deeper understanding of mutagenesis and genome dynamics. Here, utilizing primate and rodent genomic alignments, we apply two multivariate analysis techniques (principal components and canonical correlations) to investigate the structure of rate co-variation for four mutation types and simultaneously explore the associations with multiple genomic features at different genomic scales and phylogenetic distances.

Results

We observe a consistent, largely linear co-variation among rates of nucleotide substitutions, small insertions and small deletions, with some non-linear associations detected among these rates on chromosome X and near autosomal telomeres. This co-variation appears to be shaped by a common set of genomic features, some previously investigated and some novel to this study (nuclear lamina binding sites, methylated non-CpG sites and nucleosome-free regions). Strong non-linear relationships are also detected among genomic features near the centromeres of large chromosomes. Microsatellite mutability co-varies with other mutation rates at finer scales, but not at 1 Mb, and shows varying degrees of association with genomic features at different scales.

Conclusions

Our results allow us to speculate about the role of different molecular mechanisms, such as replication, recombination, repair and local chromatin environment, in mutagenesis. The software tools developed for our analyses are available through Galaxy, an open-source genomics portal, to facilitate the use of multivariate techniques in future large-scale genomics studies.

G Ananda, F Chiaromonte, KD Makova

GENOME BIOL, 2011

DOI

Exploratory spatial analysis of in vitro respiratory syncytial virus co-infections

The cell response to virus infection and virus perturbation of that response is dynamic and is reflected by changes in cell susceptibility to infection. In this study, we evaluated the response of human epithelial cells to sequential infections with human respiratory syncytial virus strains A2 and B to determine if a primary infection with one strain will impact the ability of cells to be infected with the second as a function of virus strain and time elapsed between the two exposures. Infected cells were visualized with fluorescent markers, and location of all cells in the tissue culture well were identified using imaging software. We employed tools from spatial statistics to investigate the likelihood of a cell being infected given its proximity to a cell infected with either the homologous or heterologous virus. We used point processes, K-functions, and simulation procedures designed to account for specific features of our data when assessing spatial associations. Our results suggest that intrinsic cell properties increase susceptibility of cells to infection, more so for RSV-B than for RSV-A. Further, we provide evidence that the primary infection can decrease susceptibility of cells to the heterologous challenge virus but only at the 16 h time point evaluated in this study. Our research effort highlights the merits of integrating empirical and statistical approaches to gain greater insight on in vitro dynamics of virus-host interactions.

I Simeonov, X Gong, O Kim, M Poss, F Chiaromonte, J Fricks

VIRUSES, 2010

DOI

Strong purifying selection at genes escaping X chromosome inactivation

To achieve dosage balance of X-linked genes between mammalian males and females, one female X chromosome becomes inactivated. However, approximately 15% of genes on this inactivated chromosome escape X chromosome inactivation (XCI). Here, using a chromosome-wide analysis of primate X-linked orthologs, we test a hypothesis that such genes evolve under a unique selective pressure. We find that escape genes are subject to stronger purifying selection than inactivated genes and that positive selection does not significantly affect the evolution of these genes. The strength of selection does not differ between escape genes with similar versus different expression levels in males versus females. Intriguingly, escape genes possessing Y homologs evolve under the strongest purifying selection. We also found evidence of stronger conservation in gene expression levels in escape than inactivated genes. We hypothesize that divergence in function and expression between X and Y gametologs is driving such strong purifying selection for escape genes.

C Park, L Carrel, KD Makova

MOL BIOL EVOL, 2010

DOI

Complete Khoisan and Bantu genomes from southern Africa

The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial and small sets of nuclear markers have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans. However, until now, fully sequenced human genomes have been limited to recently diverged populations. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.

SC Schuster, W Miller, A Ratan, LP Tomsho, B Giardine, LR Kasson, RS Harris, DC Petersen, F Zhao, J Qi, C Alkan, JM Kidd, Y Sun, DI Drautz, P Bouffard, DM Muzny, JG Reid, LV Nazareth, Q Wang, R Burhans, C Riemer, NE Wittekindt, P Moorjani, EA Tindall, CG Danko, WS Teo, AM Buboltz, Z Zhang, Q Ma, A Oosthuysen, AW Steenkamp, H Oostuisen, P Venter, J Gajewski, Y Zhang, BF Pugh, KD Makova, A Nekrutenko, ER Mardis, N Patterson, TH Pringle, F Chiaromonte, JC Mullikin, EE Eichler, RC Hardison, RA Gibbs, TT Harkins, VM Hayes

NATURE, 2010

DOI

What is a microsatellite: A computational and experimental definition based upon repeat mutational behavior at A/T and GT/AC repeats

Microsatellites are abundant in eukaryotic genomes and have high rates of strand slippage-induced repeat number alterations. They are popular genetic markers, and their mutations are associated with numerous neurological diseases. However, the minimal number of repeats required to constitute a microsatellite has been debated, and a definition of a microsatellite that considers its mutational behavior has been lacking. To define a microsatellite, we investigated slippage dynamics for a range of repeat sizes, utilizing two approaches. Computationally, we assessed length polymorphism at repeat loci in ten ENCODE regions resequenced in four human populations, assuming that the occurrence of polymorphism reflects strand slippage rates. Experimentally, we determined the in vitro DNA polymerase-mediated strand slippage error rates as a function of repeat number. In both approaches, we compared strand slippage rates at tandem repeats with the background slippage rates. We observed two distinct modes of mutational behavior. At small repeat numbers, slippage rates were low and indistinguishable from background measurements. A marked transition in mutability was observed as the repeat array lengthened, such that slippage rates at large repeat numbers were significantly higher than the background rates. For both mononucleotide and dinucleotide microsatellites studied, the transition length corresponded to a similar number of nucleotides (approximately 10). Thus, microsatellite threshold is determined not by the presence/absence of strand slippage at repeats but by an abrupt alteration in slippage rates relative to background. These findings have implications for understanding microsatellite mutagenesis, standardization of genome-wide microsatellite analyses, and predicting polymorphism levels of individual microsatellite loci.

YD Kelkar, N Strubczewski, SE Hile, F Chiaromonte, KA Eckert, KD Makova

GENOME BIOL EVOL, 2010

DOI

Multivariate statistical analyses demonstrate unique host immune responses to single and dual lentiviral infection

Background

Feline immunodeficiency virus (FIV) and human immunodeficiency virus (HIV) are recently identified lentiviruses that cause progressive immune decline and ultimately death in infected cats and humans. It is of great interest to understand how to prevent immune system collapse caused by these lentiviruses. We recently described that disease caused by a virulent FIV strain in cats can be attenuated if animals are first infected with a feline immunodeficiency virus derived from a wild cougar. The detailed temporal tracking of cat immunological parameters in response to two viral infections resulted in high-dimensional datasets containing variables that exhibit strong co-variation. Initial analyses of these complex data using univariate statistical techniques did not account for interactions among immunological response variables and therefore potentially obscured significant effects between infection state and immunological parameters.

Methodology and Principal Findings

Here, we apply a suite of multivariate statistical tools, including Principal Component Analysis, MANOVA and Linear Discriminant Analysis, to temporal immunological data resulting from FIV superinfection in domestic cats. We investigated the co-variation among immunological responses, the differences in immune parameters among four groups of five cats each (uninfected, single and dual infected animals), and the “immune profiles” that discriminate among them over the first four weeks following superinfection. Dual infected cats mount an immune response by 24 days post superinfection that is characterized by elevated levels of CD8 and CD25 cells and increased expression of IL4, IFNγ, and FAS. This profile discriminates dual infected cats from cats infected with FIV alone, which show high IL-10 and lower numbers of CD8 and CD25 cells.

Conclusions

Multivariate statistical analyses demonstrate both the dynamic nature of the immune response to FIV single and dual infection and the development of a unique immunological profile in dual infected cats, which are protected from immune decline.

S Roy, J Lavine, F Chiaromonte, J Terwee, S VandeWoude, O Bjornstad, M Poss

PLOS ONE, 2009

DOI

Ride the wavelet: A multiscale analysis of genomic contexts flanking small insertions and deletions

Recent studies have revealed that insertions and deletions (indels) are more different in their formation than previously assumed. What remains enigmatic is how the local DNA sequence context contributes to these differences. To investigate the relative impact of various molecular mechanisms to indel formation, we analyzed sequence contexts of indels in the non protein- or RNA-coding, nonrepetitive (NCNR) portion of the human genome. We considered small (≤30-bp) indels occurring in the human lineage since its divergence from chimpanzee and used wavelet techniques to study, simultaneously for multiple scales, the spatial patterns of short sequence motifs associated with indel mutagenesis. In particular, we focused on motifs associated with DNA polymerase activity, topoisomerase cleavage, double-strand breaks (DSBs), and their repair. We came to the following conclusions. First, many motifs are characterized by unique enrichment profiles in the vicinity of indels vs. indel-free portions of the genome, verifying the importance of sequence context in indel mutagenesis. Second, only limited similarity in motif frequency profiles is evident flanking insertions vs. deletions, confirming differences in their mutagenesis. Third, substantial similarity in frequency profiles exists between pairs of individual motifs flanking insertions (and separately deletions), suggesting “cooperation” among motifs, and thus molecular mechanisms, during indel formation. Fourth, the wavelet analyses demonstrate that all these patterns are highly dependent on scale (the size of an interval considered). Finally, our results depict a model of indel mutagenesis comprising both replication and recombination (via repair of paused replication forks and site-specific recombination).

EM Kvikstad, F Chiaromonte, KD Makova

GENOME RES, 2009

DOI

Human-macaque comparisons illuminate variation in neutral substitution rates

Background

The evolutionary distance between human and macaque is particularly attractive for investigating local variation in neutral substitution rates, because substitutions can be inferred more reliably than in comparisons with rodents and are less influenced by the effects of current and ancient diversity than in comparisons with closer primates. Here we investigate the human-macaque neutral substitution rate as a function of a number of genomic parameters.

Results

Using regression analyses we find that male mutation bias, male (but not female) recombination rate, distance to telomeres and substitution rates computed from orthologous regions in mouse-rat and dog-cow comparisons are prominent predictors of the neutral rate. Additionally, we demonstrate that the previously observed biphasic relationship between neutral rate and GC content can be accounted for by properly combining rates at CpG and non-CpG sites. Finally, we find the neutral rate to be negatively correlated with the densities of several classes of computationally predicted functional elements, and less so with the densities of certain classes of experimentally verified functional elements.

Conclusion

Our results suggest that while female recombination may be mainly responsible for driving evolution in GC content, male recombination may be mutagenic, and that other mutagenic mechanisms acting near telomeres, and mechanisms whose effects are shared across mammalian genomes, play significant roles. We also have evidence that the nonlinear increase in rates at high GC levels may be largely due to hyper-mutability of CpG dinucleotides. Finally, our results suggest that the performance of conservation-based prediction methods can be improved by accounting for neutral rates.

S Tyekucheva, KD Makova, JE Karro, RC Hardison, W Miller, F Chiaromonte

GENOME BIOL, 2008

DOI

The genome-wide determinants of human and chimpanzee microsatellite evolution

Mutation rates of microsatellites vary greatly among loci. The causes of this heterogeneity remain largely enigmatic yet are crucial for understanding numerous human neurological diseases and genetic instability in cancer. In this first genome-wide study, the relative contributions of intrinsic features and regional genomic factors to the variation in mutability among orthologous human-chimpanzee microsatellites are investigated with resampling and regression techniques. As a result, we uncover the intricacies of microsatellite mutagenesis as follows. First, intrinsic features (repeat number, length, and motif size), which all influence the probability and rate of slippage, are the strongest predictors of mutability. Second, mutability increases nonuniformly with length, suggesting that processes additional to slippage, such as faulty repair, contribute to mutations. Third, mutability varies among microsatellites with different motif composition likely due to dissimilarities in secondary DNA structure formed by their slippage intermediates. Fourth, mutability of mononucleotide microsatellites is impacted by their location on sex chromosomes vs. autosomes and inside vs. outside of Alu repeats, the former confirming the importance of replication and the latter suggesting a role for gene conversion. Fifth, transcription status and location in a particular isochore do not influence microsatellite mutability. Sixth, compared with intrinsic features, regional genomic factors have only minor effects. Finally, our regression models explain ∼90% of variation in microsatellite mutability and can generate useful predictions for the studies of human diseases, forensics, and conservation genetics.

YD Kelkar, S Tyekucheva, F Chiaromonte, KD Makova

GENOME RES, 2008

DOI

A macaque's-eye view of human insertions and deletions: Differences in mechanisms

Insertions and deletions (indels) cause numerous genetic diseases and lead to pronounced evolutionary differences among genomes. The macaque sequences provide an opportunity to gain insights into the mechanisms generating these mutations on a genome-wide scale by establishing the polarity of indels occurring in the human lineage since its divergence from the chimpanzee. Here we apply novel regression techniques and multiscale analyses to demonstrate an extensive regional indel rate variation stemming from local fluctuations in divergence, GC content, male and female recombination rates, proximity to telomeres, and other genomic factors. We find that both replication and, surprisingly, recombination are significantly associated with the occurrence of small indels. Intriguingly, the relative inputs of replication versus recombination differ between insertions and deletions, thus the two types of mutations are likely guided in part by distinct mechanisms. Namely, insertions are more strongly associated with factors linked to recombination, while deletions are mostly associated with replication-related features. Indel as a term misleadingly groups the two types of mutations together by their effect on a sequence alignment. However, here we establish that the correct identification of a small gap as an insertion or a deletion (by use of an outgroup) is crucial to determining its mechanism of origin. In addition to providing novel insights into insertion and deletion mutagenesis, these results will assist in gap penalty modeling and eventually lead to more reliable genomic alignments.

EM Kvikstad, S Tyekucheva, F Chiaromonte, KD Makova

PLOS COMPUT BIOL, 2007

DOI

Genomic environment predicts expression patterns on the human inactive X chromosome

What genomic landmarks render most genes silent while leaving others expressed on the inactive X chromosome in mammalian females? To date, signals determining expression status of genes on the inactive X remain enigmatic despite the availability of complete genomic sequences. Long interspersed repeats (L1s), particularly abundant on the X, are hypothesized to spread the inactivation signal and are enriched in the vicinity of inactive genes. However, both L1s and inactive genes are also more prevalent in ancient evolutionary strata. Did L1s accumulate there because of their role in inactivation or simply because they spent more time on the rarely recombining X? Here we utilize an experimentally derived inactivation profile of the entire human X chromosome to uncover sequences important for its inactivation, and to predict expression status of individual genes. Focusing on Xp22, where both inactive and active genes reside within evolutionarily young strata, we compare neighborhoods of genes with different inactivation states to identify enriched oligomers. Occurrences of such oligomers are then used as features to train a linear discriminant analysis classifier. Remarkably, expression status is correctly predicted for 84% and 91% of active and inactive genes, respectively, on the entire X, suggesting that oligomers enriched in Xp22 capture most of the genomic signal determining inactivation. To our surprise, the majority of oligomers associated with inactivated genes fall within L1 elements, even though L1 frequency in Xp22 is low. Moreover, these oligomers are enriched in parts of L1 sequences that are usually underrepresented in the genome. Thus, our results strongly support the role of L1s in X inactivation, yet indicate that a chromatin microenvironment composed of multiple genomic sequence elements determines expression status of X chromosome genes.

L Carrel, C Park, S Tyekucheva, J Dunn, F Chiaromonte, KD Makova

PLOS GENET, 2006

DOI

Strong and weak male mutation bias at different sites in the primate genomes: Insights from the human-chimpanzee comparison

Male mutation bias is a higher mutation rate in males than in females thought to result from the greater number of germ line cell divisions in males. If errors in DNA replication cause most mutations, then the magnitude of male mutation bias, measured as the male-to-female mutation rate ratio (α), should reflect the relative excess of male versus female germ line cell divisions. Evolutionary rates averaged among all sites in a sequence and compared between mammalian sex chromosomes were shown to be indeed higher in males than in females. However, it is presently unknown whether individual classes of substitutions exhibit such bias. To address this issue, we investigated male mutation bias separately at non-CpG and CpG sites using human-chimpanzee whole-genome alignments. We observed strong male mutation bias at non-CpG sites: α in the X-autosome comparison was ∼6-7, which was similar to the male-to-female ratio in the number of germ line cell divisions. In contrast, mutations at CpG sites exhibited weak male mutation bias: α in the X-autosome comparison was only ∼2-3. This is consistent with the methylation-induced and replication-independent mechanism of CpG transitions, which constitute the majority of mutations at CpG sites. Interestingly, our study also indicated weak male mutation bias for transversions at CpG sites, implying a spontaneous mechanism largely not associated with replication. Male mutation bias was equally strong at CpG and non-CpG sites located within unmethylated “CpG islands,” suggesting the replication-dependent origin of these mutations. Thus, we found that the strength of male mutation bias is nonuniform in the primate genomes. Importantly, we discovered that male mutation bias depends on the proportion of CpG sites in the loci compared. This might explain the differences in the magnitude of primate male mutation bias observed among studies.

J Taylor, S Tyekucheva, M Zody, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2006

DOI

Insertions and deletions are male biased too: A whole-genome analysis in rodents

It is presently accepted that, in mammals, due to the greater number of cell divisions in the male germline than in the female germline, nucleotide substitutions occur more frequently in males. The data on mutation bias in insertions and deletions (indels) are contradictory, with some studies indicating no sex bias and others indicating either female or male bias. The sequenced rat and mouse genomes provide a unique opportunity to investigate a potential sex bias for different types of mutations. Indeed, mutation rates can be accurately estimated from a large number of orthologous loci in organisms similar in generation time and in the number of germline cell divisions. Here we compare the mutation rates between chromosome X and autosomes for likely neutral sites in eutherian ancestral interspersed repetitive elements present at orthologous locations in the rat and mouse genomes. We find that small indels are male biased: The male-to-female mutation rate ratio (α) for indels in rodents is ∼2. Similarly, our whole-genome analysis in rodents indicates an approximately twofold excess of nucleotide substitutions originating in males over that in females. This is the same as the male-to-female ratio of the number of germline cell divisions in rat and mouse. Thus, this is consistent with nucleotide substitutions and small indels occurring primarily during DNA replication.

KD Makova, S Yang, F Chiaromonte

GENOME RES, 2004

DOI