Publications | Center for Medical Genomics

Competition in human genetic technologies: The current US legal landscape

Competition plays a crucial role in driving innovation in the industry of human genetics and genomics technologies. However, the US’s policy on competition (such as enforcement of antitrust laws) has shifted over time, affecting the level of regulatory scrutiny that business decisions in the industry will receive. Here, we offer an overview of this changing legal landscape, noting key policy changes at the Federal Trade Commission (FTC) and Department of Justice (DOJ) relevant to the human genetics and genomics industry. Focusing on the regulatory challenge of Illumina’s acquisition of Grail as a case study, we highlight how shifting anti-competition enforcement policy could affect spin-offs, startups, and industry consolidations. Balancing competition and consumer protection policies remains essential for the continued advancement of human genetic and genomic technologies. We offer this perspective in the hopes that it will help inform scientists in the field of the relevant legal considerations and stimulate discussions to shape policy—public and private alike—in ways that promote responsible innovation in genetic and genomic science and technology.

AI Rashid, NA Rincon, N Rihani, JK Wagner

Am J Hum Genet, 2026

DOI

PCR bias impacts microbiome ecological analyses

Polymerase chain reaction (PCR) is a critical step in amplicon-based microbial community profiling, allowing the selective amplification of marker genes such as 16S rRNA from environmental or host-associated samples. Despite its widespread use, PCR is known to introduce amplification bias, where some DNA sequences are preferentially amplified over others due to factors such as primer-template mismatches, sequence GC content, and secondary structures. Although these biases are known to affect transcript abundance, their implications for ecological metrics remain poorly understood. In this study, we conduct a comprehensive evaluation of how PCR bias influences both within-sample (α-diversity) and between-sample (β-diversity) analyses. We show that perturbation-invariant diversity measures remain unaffected by PCR bias, but widely used metrics such as Shannon diversity and weighted UniFrac are sensitive. To address this, we provide theoretical and empirical insight into how PCR-induced bias varies across ecological analyses and community structures, and we offer practical guidance on when bias-correction methods should be applied. Our findings highlight the importance of selecting appropriate diversity metrics for PCR-based microbial ecology workflows and offer guidance for improving the reliability of diversity analyses.

DR Rathod, JD Silverman

PLoS Comput Biol, 2026

DOI
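
As a rough illustration of why a metric such as Shannon diversity can be sensitive to PCR bias (see the entry above), the sketch below applies a simple multiplicative bias to a toy community. The multiplicative model, the function names, and all numbers are assumptions made for illustration; they are not taken from the paper and do not reproduce its analysis.

```python
import math


def shannon(proportions):
    """Shannon diversity (natural log) of a vector of relative abundances."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def apply_pcr_bias(true_props, efficiencies, cycles):
    """Hypothetical bias model: each taxon is multiplied by efficiency**cycles,
    then the result is renormalized to relative abundances."""
    amplified = [p * (e ** cycles) for p, e in zip(true_props, efficiencies)]
    total = sum(amplified)
    return [a / total for a in amplified]


# Illustrative, made-up inputs: a perfectly even four-taxon community and
# slightly different per-cycle relative amplification efficiencies.
true_props = [0.25, 0.25, 0.25, 0.25]
efficiencies = [1.00, 0.98, 0.95, 0.90]

observed = apply_pcr_bias(true_props, efficiencies, cycles=30)

print("true Shannon:    ", shannon(true_props))
print("observed Shannon:", shannon(observed))
```

Under this toy model, small per-cycle efficiency differences compound over cycles, so the observed community looks less even than the true one and its Shannon diversity drops.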

A Functional Approach to Testing Overall Effect of Interaction Between DNA Methylation and SNPs

We introduce a test for the overall effect of interaction between DNA methylation and a set of single nucleotide polymorphisms on a quantitative phenotype. The developed inference procedure is based on a functional approach that extends existing regression models in functional data analysis. Through extensive simulations, we show that the proposed test effectively controls type I error rates and has greater empirical power than existing methods, particularly when multiple interactions are present. The use of the proposed test is illustrated with an application to data from obesity patients and controls.

Y Gansou, K Oualkacha, MA Cremona, L Lakhal-Chaieb

Stat Med, 2026

DOI

The FDA’s plan to phase out animal testing

Replacing animal testing through ‘New Approach Methodologies’ holds promise for developing cheaper and safer drugs without animal suffering. However, such an approach should be implemented carefully, and it cannot be rushed. We discuss the FDA Modernization Act 2.0 and 3.0 and the FDA’s roadmap to phase out animal testing.

S Gerke, J Balamut, JK Wagner

Trends Biotechnol, 2026

DOI

Family Decision to Immunize Against Respiratory Syncytial Virus and Associations with Seasonal Influenza and COVID-19 Vaccination

Background: Nirsevimab, a monoclonal antibody for respiratory syncytial virus (RSV), reduces medically attended RSV infections. It was introduced in the 2023–24 RSV season. This study examined the association between caregiver vaccination (seasonal influenza vaccine (SIV), COVID-19, and boosters) and intent to immunize infants against RSV. Methods: Data from 118 caregivers with infants ≤ 8 months were analyzed. Chi-squared tests and logistic regression assessed the relationship between caregiver vaccination and intent to immunize against RSV. Results: In total, 74.6% of caregivers intended to immunize their infants against RSV. Intent was positively associated with caregiver receipt of a seasonal influenza vaccine (p < 0.001), COVID-19 vaccine (p < 0.001), and COVID-19 booster (p < 0.001). Intent was also associated with older child seasonal vaccination. Caregiver receipt of both COVID-19 vaccinations and boosters had a strong relationship with RSV immunization intent (OR 7.91 (1.90–33.0, p = 0.004)). Conclusions: Caregiver vaccination behaviors are linked to RSV immunization intent, helping physicians identify hesitant families and prepare for immunization conversations.

LD Kaye, BN Fogel, RE Gardner, BJ Lipsett, KE Shedlock, EW Schaefer, IM Paul, SD Hicks

Vaccines, 2026

DOI

An Overview of the Mechanisms of HPV-Induced Cervical Cancer: The Role of Kinase Targets in Pathogenesis and Drug Resistance

Despite a thorough understanding of the structure of human papillomavirus (HPV) and its genotypic variations (high-risk and low-risk variants), the mechanisms underlying HPV-induced cervical cancer (CC) pathogenesis and the molecular signatures of drug resistance remain to be fully understood. Accumulating evidence has shown the involvement of kinase targets in the induction of drug resistance in high-risk (HR) HPV-CC. Molecularly, the genome of high-risk HPV is reported to control the expression of host kinases. In particular, Aurora kinases A, B, and C (ARKA, ARKB, and ARKC), phosphatidylinositol 3-kinase (PI3K)-Akt, and glycogen synthase kinase 3-α/β (GSK3 α/β) promote the transformation of infected cells and also enhance the resistance of cells to various chemotherapeutic agents such as nelfinavir and cisplatin. However, the precise mechanisms through which HPV activates these kinases are yet to be fully elucidated. Furthermore, it remains unclear whether targeting HPV-induced kinases along with HPV-targeted therapies (such as phytopharmaceuticals and PROTAC/CRISPR-Cas-based systems) synergistically inhibits cervical tumor growth. Given the critical role of kinases in the pathogenesis and treatment of CC, a comprehensive review of current evidence is warranted. This review aims to provide key insights into the mechanisms of HPV-induced CC development, the involvement of kinases in drug resistance induction, and the rationale for combination therapies to improve clinical outcomes.

M Karnik, SV Tulimilli, PG Anantharaju, ADS Bettadapura, SM Natraj, HS Mohideen, S Dovat, A Sharma, SV Madhunapantula

Cancers, 2026

DOI

Monitoring diversity in genome-wide association studies requires measuring and reporting on immigration-related factors

Genome-wide association studies (GWAS) have made remarkable progress to date in deciphering the genetic foundations of complex traits, yet persistent gaps remain in how sample heterogeneity is measured and reported. Current practices typically emphasize diversity by broad ancestry categories or stratification by country of recruitment, but these dimensions alone fail to capture the immigration-related factors that contribute to the genetic or environmental origins of heterogeneity. We argue that incorporating variables, such as country of origin, in descriptions and analyses provides essential context for interpreting genetic associations, particularly in increasingly multi-population and trans-national GWAS samples. We highlight how neglected these variables are in the literature using the GWAS Catalog. We provide suggestions for reporting on these data in future studies. By advocating for a more comprehensive view of diversity in GWAS, we aim to address the under-representation of immigrants in GWAS and thereby strengthen the validity and interpretability of future genomic studies.

Y Tu, L Fernandez-Rhodes

Front Genet, 2026

DOI

Translating AI to the Bedside with Physician Buy-In: Recommendations from a Meta-Analysis and Systematic Review of the Literature

Background: Artificial intelligence (AI) is increasingly being used in healthcare. Despite its promise, physicians and trainees remain cautiously optimistic. This systematic review and meta-analysis aimed to assess knowledge and attitudes toward AI and to provide recommendations for AI buy-in by physicians. Methods: Searches of PubMed, OVID, IEEE, Scopus, and Web of Science for studies in 2013–2024 identified 11,437 records. One hundred and fifteen met inclusion criteria. Fifty-three studies reported quantitative data on physicians’/trainees’ knowledge and were included in the meta-analysis. Results: Our meta-analysis estimated that only 19.6% of physicians and trainees have high overall AI knowledge, while 36.3% have low knowledge. Fifty-five studies evaluated the depth of AI knowledge. These studies consistently concluded that most physicians or trainees possess only moderate conceptual knowledge of AI, and their technical knowledge is usually limited. Qualitative evaluations also highlighted that a high level of conceptual AI knowledge is associated with greater receptiveness to AI implementation in medicine. We identified five major barriers to translating AI to the bedside with physician buy-in. Conclusion: Although physicians and trainees are generally receptive to AI, many barriers hinder adoption. To address them, we recommend establishing standardized AI education and workforce training, involving clinicians early in AI design, clarifying legal and regulatory issues, leveraging insights from clinical decision support system implementation to reduce workflow challenges, and integrating patient-centered communication principles to enhance trust and transparency.

G Hwang, SS Hejazian, AV Sadr, JK Wagner, P Hao, A Vemuri, Y Kawamura, K Nawab, S Hijjawi, R Zand, V Abedi

Bioengineering, 2025

DOI

Moving the Fine Print to the Front Page: Transparent Communication of Facial Genetics Research

AI-enabled facial genetics research has transformative potential for biomedical and forensic applications, but raises serious ethical, legal, and social challenges. Candor and clarity promote the advancement of science and facilitate the development of evidence-informed and ethically sound policy guardrails for scientific applications. Expanding upon our recent Comment on Difface, here we examine some of the ethical dimensions of AI-enabled facial genetics research, illuminate a lack of practical guidance for AI-enabled facial genetics research, and contextualize potential impacts within the current legal and policy landscape. Given the lack of standards and benchmarks for DNA-based face generation, we use Difface as a case study to stress the need for transparent performance metrics and clear disclosure of data flows across AI training, validation, and testing pipelines, enabling non-experts to assess accuracy meaningfully. Finally, we offer practical guidance for scientists to promote trustworthiness and stimulate further discussion within professional societies; such guidance is urgently needed in sociopolitically turbulent and deregulatory environments. Risks of misinformation, disinformation, and discriminatory application of facial genetics research are too serious to ignore.

JK Wagner, N Claessens, CM Maloney, P Claes

Adv Genet, 2025

DOI

Contrasting pre-vaccine COVID-19 waves in Italy through functional data analysis

This study analyzes mortality patterns during two pre-vaccine COVID-19 waves in Italy, using data from its 107 provinces. Mortality is examined alongside information on mobility, governmental restrictions, and socio-demographic, infrastructural, and environmental factors using Functional Data Analysis tools. Publicly available differential mortality data and local mobility data from Google are processed with smoothing splines and aligned through landmark registration. The resulting curves are clustered to identify mortality patterns, and regression models are used to evaluate the impact of mobility, restrictions, and other factors on mortality. We find significant differences between the two waves: the first had higher, more concentrated mortality peaks, while the second was more widespread and asynchronous. Our results also support the effectiveness of timely restrictions in curbing mortality, and a strong positive association between local mobility and mortality in both pre-vaccine waves. Despite data quality limitations, our findings strengthen the evidence for the role of government restrictions and mobility controls during the pandemic.

T Boschi, J Di Iorio, L Testa, MA Cremona, F Chiaromonte

Sci Rep, 2025

DOI

Comment on “De Novo Reconstruction of 3D Human Facial Images from DNA Sequence”

The recent article by Jiao et al. presenting the Difface model for reconstructing faces from deoxyribonucleic acid (DNA), like other artificial intelligence (AI)-enabled facial genetics research, has transformative potential for biomedical and forensic applications. Yet such research raises serious ethical, legal, and social challenges, as such applications affect rights and safety. Methodological transparency, algorithmic explainability, and contextualization of findings are essential to ensure that AI-enabled facial genetics research, such as Difface, is conducted responsibly and rigorously and interpreted reasonably. Given the lack of standards and benchmarks for DNA-based face generation, we use Difface as a case study to stress the need for transparent performance metrics and clear disclosure of data flows across AI training, validation, and testing pipelines, enabling non-experts to assess accuracy meaningfully.

JK Wagner, N Claessens, CM Maloney, P Claes

Adv Sci, 2025

DOI

Neurotechnology Governance in the United States: Gaps and Opportunities

Neuroscience’s accelerating advances have reached a pivotal point in the study of the human brain, including neurotechnologies capable of recording large amounts of data and acting with greater precision. However, the use of neurotechnology has raised a number of ethical, legal, and social implications (ELSI). To that end, sufficiently robust policy and governance structures must be considered. To date, no published review of United States policies governing neuroscience and neurotechnology exists. To address this, we review US policies and various ethical frameworks overseeing neuroscience and neurotechnology. This policy review highlights where gaps in neuroscience and neurotechnology policy and governance might exist. Overall, our review shows that “soft policies” make up the present-day US neurotech-governance universe at the federal level, with neurodata-specific state legislation emerging. The included analysis can aid researchers, technology developers, neuroethicists, research ethicists, legal scholars, and others in facilitating ethically and socially responsible implementation of neuroscience and neurotechnology as they move from “bench to bedside and beyond.”

LY Cabrera, N Evereteze, EG Shank, JK Wagner, M Mekel, JB McCormick, MS Wright

Bioethics, 2025

DOI

Integrating axis quantitative trait loci looks beyond cell types and offers insights into brain-related traits

Genome-wide association studies have identified many loci for brain disorders, but most non-coding variants fail to colocalize with bulk expression quantitative trait loci. Single-cell expression quantitative trait loci studies capture cell-type-specific regulation but are often underpowered. We developed Bulk And Single cell expression quantitative trait loci Integration across Cell states (BASIC) to combine bulk and single-cell expression quantitative trait loci through “axis-quantitative trait loci,” which decompose bulk-tissue effects along orthogonal axes of cell-type expression. BASIC better distinguishes shared versus cell-type-specific effects and increases power. Analyzing single-cell expression quantitative trait loci with cortex bulk data from MetaBrain using BASIC identified 5644 additional genes with quantitative trait loci (74.5%), equivalent to a 76.8% increase in sample size. Integrating axis-quantitative trait loci with 12 brain-related traits improved colocalization by 53.5% versus single-cell studies and 111% versus bulk studies, revealing risk genes such as DEDD for Alzheimer’s disease and drug candidates including cabergoline.

L Wang, S Gao, S Chen, H Markus, G Wang, L Carrel, X Zhan, DJ Liu, B Jiang

Nat Commun, 2025

DOI

Immunogenomics Approaches to Studying Antibody Repertoires and Vaccine Responses in Ruminants

Ruminant species are vital for agriculture, ecosystems, and conservation and remain vulnerable to infectious and zoonotic diseases. Advances in genome sequencing and genomics now enable high-resolution analysis of immunoglobulin (IG) loci and antibody repertoires, uncovering extensive germline diversity, structural variation, and lineage-specific adaptations, such as ultralong cysteine-rich antibodies (Abs) in cattle. This review summarizes current knowledge of ruminant IG locus organization and repertoire generation and discusses the evolutionary origins of ultralong Abs. It also examines the challenges that highly repetitive IG loci pose for assembly, annotation, and nomenclature and highlights emerging solutions. Finally, it describes genomic approaches for linking immune genotypes to phenotypes that hold promise for improving ruminant health.

Y Safonova, A Collins, BM Murdoch, BD Rosen, TPL Smith, CT Watson

Annu Rev Anim Biosci, 2025

DOI

Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans

Haplotype-based statistics are widely used for finding genomic regions under positive selection. At the heart of many such statistics is the computation of extended haplotype homozygosity (EHH), which captures the decay of homozygosity away from a focal site. This computation, repeated for potentially millions of sites, is computationally demanding, as it involves tracking counts of unique haplotypes iteratively over long genomic distances and across many individuals. Because of these computational challenges, existing tools do not scale well when applied to large-scale population datasets, such as the 1000 Genomes Project, or the UK Biobank with 500,000 individuals. Optimizing computation becomes crucial when data sets grow large, especially when handling large sample sizes or generating training data for machine learning algorithms. Here, we propose a dynamic programming algorithm that substantially improves runtime and memory usage over existing tools on both real and simulated data. On real phased data, we achieve 5-50x speedup with minimal memory footprint. Our simulations show an even more pronounced performance gap with large populations (up to 15x speedup and 46x memory reduction). EHH-based statistics designed for unphased genotypes run an order of magnitude faster, and multi-parameter support results in 20x runtime improvement. Source code and binaries are available at https://github.com/szpiech/selscan as selscan v2.1.

A Rahman, TQ Smith, ZA Szpiech

Mol Biol Evol, 2025

DOI
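
For readers unfamiliar with the statistic in the entry above, the following is a minimal, naive sketch of extended haplotype homozygosity (EHH) for haplotypes carrying a chosen allele at a focal site. It recounts haplotype classes from scratch for every window, so it is not the dynamic programming algorithm the paper proposes and not selscan's implementation; the function name, the string encoding of haplotypes, and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, focal, end, core_allele):
    """Naive EHH between site `focal` and site `end` (inclusive).

    haplotypes: list of equal-length strings of '0'/'1' alleles (hypothetical encoding).
    Only haplotypes carrying `core_allele` at `focal` are considered. EHH is the
    probability that two such haplotypes, drawn without replacement, are
    identical across the whole window.
    """
    lo, hi = min(focal, end), max(focal, end)
    carriers = [h for h in haplotypes if h[focal] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # no pairs to compare
    # Count how many carriers share each distinct haplotype over the window.
    counts = Counter(h[lo:hi + 1] for h in carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy data: four phased haplotypes, five sites, focal site at index 2.
haps = ["00110", "00111", "10100", "11100"]

# EHH decays as the window extends away from the focal site in either direction.
for end in range(len(haps[0])):
    print(end, ehh(haps, focal=2, end=end, core_allele="1"))
```

Each call here rescans every carrier over the full window, which is the kind of repeated work that becomes expensive when the computation is run for millions of sites and many individuals.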

Defective but promising: evaluating the utility of currently available bioinformatic pipelines for detecting defective viral genomes in RNA-Seq data

Defective viral genomes (DVGs) affect viral dynamics, pathogenicity and evolution, have been found in many in vivo viral infections, and in theory can be detected from sequencing data. We explored the utility of the currently available bioinformatic programs ViReMa, DI-tector, DVGfinder, DG-Seq and VODKA2 for identifying junction points in plant virus high-throughput sequencing data, looking at whether the outputs from these bioinformatic tools generally agree and exploring the possibility of using these tools to help us understand whether DVGs are consistently generated and maintained in a specific virus-host combination. We conducted a meta-analysis of eight previously published RNA sequencing datasets utilizing all five programs and compared the degree of output overlap, the most common junctions present in each output and whether these junctions match previously reported junctions for that virus. Our results demonstrate a low degree of agreement regarding identified junctions between programs, including the most frequently identified one, although the most frequently identified junctions typically corresponded to large, disruptive deletions. We found preliminary support for our prevalence hypothesis, although we ultimately conclude that a more robust dataset generated expressly for testing this hypothesis will be required for a convincing answer. Finally, we suggest that when using bioinformatic programs to search for DVGs, it is best to run the same dataset through multiple programs and look at the overlap to inform decisions on downstream characterization.

A Taylor, C Rosa, M Archetti

J Gen Virol, 2025

DOI

Discovery of obesity genes through cross-ancestry analysis

Gene discoveries in obesity have largely relied on homogeneous populations, limiting their generalizability across ancestries. Here, we conduct a gene-based rare variant association study of BMI on 839,110 individuals from six ancestries across two population-scale biobanks. A cross-ancestry meta-analysis identifies 13 genes, including YLPM1, RIF1, GIGYF1, SLC5A3, and GRM7, that confer about three-fold risk for severe obesity, are expressed in the brain and adipose tissue, and are linked to obesity traits such as body-fat percentage. While YLPM1, MC4R, and SLTM show consistent effects, GRM7 and APBA1 show significant ancestral heterogeneity. Polygenic risk additively increases obesity penetrance, and phenome-wide studies reveal additional associations, including YLPM1 with altered mental status. These genes also influence cardiometabolic comorbidities, including GIGYF1 and SLTM towards type 2 diabetes with or without BMI as a mediator, and altered levels of plasma proteins, such as LECT2 and NCAN, which in turn affect BMI. Our findings provide insights into the genetic basis of obesity and its related comorbidities across ancestries and ascertainments.

D Banerjee, S Girirajan

Nat Commun, 2025

DOI

Wastewater surveillance of SARS-CoV-2 and influenza in a dynamic university community: understanding how wastewater measurements correspond to reported cases

Wastewater surveillance is an increasingly effective public health tool for responding to epidemics and preparing for annual cycles of respiratory illnesses. We measured genetic markers from Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), influenza A virus (IAV) and influenza B virus (IBV) in untreated wastewater of a university campus and its local residential community over a four-year period using digital polymerase chain reaction (PCR) methods. These data were then analyzed and compared to clinical case data reported to the state by zip code. The campus community differs from the surrounding community in its high degree of congregate living, its fluctuating population due to semester breaks, and its narrower age distribution. We show that semester breaks, when students depopulate campus, drive subsequent peaks in the transmission of SARS-CoV-2 upon their return. Campus and local community concentrations of SARS-CoV-2, IAV, and IBV are significantly correlated (0.48, 0.63, and 0.45, respectively), demonstrating connectivity between the two populations. While IAV and IBV concentrations are highly correlated with one another, we find no relationship between influenza and SARS-CoV-2 concentrations. As in previous studies, we found a high degree of correlation between wastewater concentrations and clinical case data, with strong alignment in peaks and hence no evidence of leading or lagging indicators. Our data highlight the effectiveness of wastewater surveillance, a passive monitoring method, for estimating the current trajectory of the epidemic curve.

MJ Jones, R Ibrahim, S Clark, YM Brooks, HE Preisendanz, TL Richard, AF Read, CJ Robinson, S Noorali, MJ Shreve, NL Dennington, JD Silverman, C Van Oost, EA McGraw

Sci Total Environ, 2025

DOI

Sociodemographic Factors, Intent-Uptake Disparities, and Nirsevimab Availability in Infant RSV Immunoprophylaxis

Background/Objectives: Respiratory syncytial virus (RSV) is the most common cause of bronchiolitis and infant hospitalization in the US. RSV prevention evolved in 2023 as nirsevimab and maternal RSV pre-fusion vaccine became available for healthy newborns and infants. This study investigates sociodemographic characteristics associated with RSV immunoprophylaxis. Methods: A cross-sectional survey was conducted from November 2023 through March 2024 among a convenience sample of parents of infants aged <8 months who received newborn care or pediatric ambulatory care at a single academic institution in Central Pennsylvania, USA. Logistic regression examined sociodemographic factors associated with RSV immunoprophylaxis uptake. Given the nirsevimab shortage during the 2023–2024 RSV season, a sensitivity analysis was completed for intended immunoprophylaxis. Results: Among 118 participants, 66.9% received RSV immunoprophylaxis while 74.5% intended to receive nirsevimab. Higher income, private insurance, out-of-home childcare, and an adult/partner working in healthcare were associated with intended nirsevimab receipt. Participation in the Women, Infants and Children program was associated with lower rates of intended nirsevimab receipt. Out-of-home childcare was associated with both RSV immunoprophylaxis uptake and intended nirsevimab receipt. Conclusions: Sociodemographic factors significantly influence the intent to receive nirsevimab and RSV immunoprophylaxis uptake. Having an adult/partner in healthcare was the most significant predictor for intent, suggesting that greater health literacy drives immunization intention. Enrollment in out-of-home childcare was the sole predictor of RSV immunoprophylaxis uptake. These findings highlight the importance of policy initiatives that promote equitable access to RSV immunoprophylaxis, including strategies to address socioeconomic barriers, improve health literacy, and ensure consistent availability of preventive agents for all infants.

BJ Lipsett, BN Fogel, KE Shedlock, IM Paul, EW Schaefer, RE Gardner, LD Kaye, SD Hicks

Pediatr Rep, 2025

DOI

Female fruit flies use social cues to make egg-clustering decisions

Background: The ability to respond plastically to environmental variation is a key determinant of fitness. Females may use cues to strategically place their eggs, for example adjusting the number or location of eggs according to whether other females are present and driving the dynamics of local competition or cooperation. The expression of plasticity in egg-laying patterns within individual patches (i.e. in contact clusters or not) represents an additional, under-researched, and potentially important opportunity for fitness gains. Clustered eggs might benefit from increased protection or defence, and clustering could facilitate cooperative feeding. However, increased clustering is also expected to increase the risk of overexploitation through direct competition. These potential benefits and costs likely covary with the number of individuals present; hence, egg-clustering behaviour within resource patches should be socially responsive. We investigate this new topic using the fruit fly Drosophila melanogaster. Results: Our mathematical model, parameterised by data, verified that females cluster their eggs non-randomly and increase clustering as group size increases. We also showed that as the density of adult females increased, females laid more eggs, laid them faster, and laid more eggs in clusters. Females also preferred to place eggs within existing clusters. Most egg clusters were of mixed maternity. Conclusions: Collectively, the results reveal that females express plasticity in egg clustering according to social environment cues and prefer to lay in clusters of mixed maternity, despite the potential for increased competition. These findings are consistent with egg-clustering plasticity being selected due to cooperative benefits.

ER Churchill, EK Fowler, LA Friend, M Archetti, DW Yu, AFG Bourke, T Chapman, A Bretman

BMC Biol, 2025

DOI

Genetic modifiers and ascertainment drive variable expressivity of complex disorders

Variable expressivity of disease-associated variants implies a role for secondary variants that modify clinical features. We assessed the effects of modifier variants on the clinical outcomes of 2,455 individuals with primary variants. Among 124 families with the 16p12.1 deletion, distinct rare and common variant classes conferred risks for specific developmental features, including short tandem repeats for neurological defects. Network analysis suggested distinct mechanisms involving 16p12.1 genes and secondary variants specific to each proband. Within disease and population cohorts of 976 individuals with the 16p12.1 deletion, we found opposing effects of secondary variants on clinical features across ascertainments. Additional analysis of 1,479 probands with other primary variants, such as the 16p11.2 deletion and CHD8 variants, and 1,528 probands without primary variants showed that phenotypic associations differed by primary variant context and were influenced by synergistic interactions between primary and secondary variants. Our study provides a paradigm to dissect the personalized genomic architecture of complex disorders.

M Jensen, C Smolen, A Tyryshkina, L Pizzo, J Sun, S Noss, D Banerjee, M Oetjens, H Shimelis, CM Taylor, VK Pounraja, H Song, L Rohan, E Huber, LE Khattabi, I van de Laar, R Tadros, CR Bezzina, M van Slegtenhorst, J Kammeraad, P Prontera, JH Caberg, H Fraser, S Banka, A Van Dijck, C Schwartz, E Voorhoeve, P Callier, AL Mosca-Boidron, N Marle, M Lefebvre, K Pope, P Snell, A Boys, PJ Lockhart, M Ashfaq, E McCready, M Nowacyzk, L Castiglia, O Galesi, E Avola, T Mattina, M Fichera, MG Bruccheri, GML Mandarà, F Mari, F Privitera, I Longo, A Curró, A Renieri, B Keren, P Charles, S Cuinat, M Nizon, O Pichon, C Bénéteau, R Stoeva, D Martin-Coignard, S Blesson, C Le Caignec, S Mercier, M Vincent, CL Martin, K Mannik, A Reymond, L Faivre, E Sistermans, RF Kooy, DJ Amor, C Romano, J Andrieux, S Girirajan

Cell, 2025

DOI

Accessible, realistic genome simulation with selection using stdpopsim

Selection is a fundamental evolutionary force that shapes patterns of genetic variation across species. However, simulations incorporating realistic selection along heterogeneous genomes in complex demographic histories are challenging, limiting our ability to benchmark statistical methods aimed at detecting selection and to explore theoretical predictions. stdpopsim is a community-maintained simulation library that already provides an extensive catalog of species-specific population genetic models. Here, we present a major extension to the stdpopsim framework that enables simulation of various modes of selection, including background selection, selective sweeps, and arbitrary distributions of fitness effects (DFE) acting on annotated subsets of the genome (for instance, exons). This extension maintains stdpopsim’s core principles of reproducibility and accessibility while adding support for species-specific genomic annotations and published DFE estimates. We demonstrate the utility of this framework by comparing methods for demographic inference, DFE estimation, and selective sweep detection across several species and scenarios. Our results demonstrate the robustness of demographic inference methods to selection on linked sites, reveal the sensitivity of DFE-inference methods to model assumptions, and show how genomic features, like recombination rate and functional sequence density, influence power to detect selective sweeps. This extension to stdpopsim provides a powerful new resource for the population genetics community to explore the interplay between selection and other evolutionary forces in a reproducible, user-friendly framework.

G Gower, NS Pope, MF Rodrigues, S Tittes, LN Tran, O Alam, MIA Cavassim, PD Fields, BC Haller, X Huang, B Jeffrey, K Korfmann, CC Kyriazis, J Min, I Rebollo, CT Rehmann, ST Small, CCR Smith, G Tsambos, Y Wong, Y Zhang, CD Huber, G Gorjanc, AP Ragsdale, I Gronau, RN Gutenkunst, J Kelleher, KE Lohmueller, DR Schrider, PL Ralph, AD Kern

Mol Biol Evol, 2025

DOI

Gut fungal profiles reveal phylosymbiosis and codiversification across humans and nonhuman primates

Fungi in the gut microbiome, collectively known as the mycobiome, are a prevalent yet neglected component of the human holobiont. A major question in the study of gut microbial communities is whether fungi exhibit eco-evolutionary patterns that are consistent with partner fidelity and long-term associations. We compared gut fungal profiles across natural populations of humans and nonhuman primates and identified significant degrees of primate-mycobiome phylosymbiosis as well as human-enriched fungal taxa. Notably, subsets of fungi are cophylogenetic and exhibit cospeciation patterns in hominids. These findings cautiously offer a new view of the eco-evolutionary potential that can shape the composition of human and primate gut mycobiomes.

EP Van Syoc, A Gomez, ER Davenport, SR Bordenstein

PLoS Biol, 2025

DOI

Parent-focused behavioural interventions for the prevention of early childhood obesity (TOPCHILD): a systematic review and individual participant data meta-analysis

Background: Childhood obesity is a global public health issue, which has prompted governments to invest in prevention programmes. We aimed to investigate the effectiveness of parent-focused early childhood obesity prevention interventions globally. Methods: We did a systematic review and individual participant data meta-analysis. We searched databases and trial registries (MEDLINE, Embase, CENTRAL, CINAHL, PsycInfo, ClinicalTrials.gov, and WHO International Clinical Trials Registry Platform) from inception until Sept 30, 2024, for randomised controlled trials commencing before 12 months of age examining parent-focused behavioural interventions to prevent obesity in children, compared with usual care, no intervention, or attention control. Individual participant data were checked, harmonised, and assessed for integrity and risk of bias. We excluded trials that were quasi-randomised, investigated pregnancy-only interventions, or did not collect any child weight-related outcomes. The primary outcome was BMI Z score at age 24 months (±6 months). We did an intention-to-treat, two-stage, random effects meta-analysis to examine effects overall and for prespecified subgroups. We assessed certainty of evidence using Grading of Recommendations Assessment, Development, and Evaluation. This study is registered with PROSPERO, CRD42020177408. Findings: Of 19 990 identified records, 47 (0·24%) trials were completed and eligible. Of these, 18 (38%) assessed our primary outcome, BMI Z score. We obtained individual participant data for 17 (94%; n=9128) of these 18 trials (n=9383), representing 97% of eligible participants. Of these 9128 participants, 4549 (50%) were boys, 4415 (48%) were girls, and 164 (2%) had unknown sex. We found no evidence of an effect of interventions on BMI Z score at age 24 months (±6 months; mean difference –0·01 [95% CI –0·08 to 0·05]; high certainty evidence, τ²=0·01; n=6505; 2623 missing). Findings were robust to prespecified sensitivity analyses (eg, different analysis methods and missing data), and we found no evidence of differential intervention effects for prespecified subgroups including priority populations and trial-level factors. Interpretation: These findings indicate that examined parent-focused behavioural interventions are insufficient to prevent obesity at age 24 months (±6 months). This evidence highlights a need to re-think childhood obesity prevention approaches.

KE Hunter, D Nguyen, S Libesman, JG Williams, M Aberoumand, J Aagerup, BJ Johnson, RK Golley, A Barba, JX Sotiropoulos, N Shrestha, T Palacios, SJ Pryde, L Wolfenden, RW Taylor, PJ Godolphin, K Matvienko-Sikar, LM Sanders, KP Robledo, V Brown, CT Wood, S Taki, HS Yin, AJ Hayes, DA O’Connor, W Smith, DE Espinoza, L Askie, PM Chadwick, C Rissel, AC Webster, KD Hesketh, M Bryant, JL Thomson, R Lakshman, AG Fiks, C Helle, CO Stough, KK Ong, EM Perrin, L Karssen, JK Larsen, AM Linares, MJ Messito, LM Wen, E Oken, NC Øverby, C Palacios, IM Paul, FE Rasmussen, EA Reifsnider, RL Rothman, RA Byrne, TM Rybak, SJ Salvy, HM Wasser, AL Thompson, A Ghaderi, BJ Taylor, C Maffeis, H Xu, JS Savage, KJ Joshipura, K de la Haye, M Røed, B Copsey, N Golova, RS Gross, S Anzman-Frasca, J Banna, LA Baur, AL Seidler, TOPCHILD Collaboration

Lancet, 2025

DOI

PatchWorkPlot: simultaneous visualization of local alignments across multiple sequences

Motivation: Revealing structural variations within and across populations is crucial for understanding their diversification mechanisms and roles. Existing tools for visualization of structural variations often require labor-intensive figure preparation and are limited in their ability to integrate annotations. Results: We developed PatchWorkPlot, a tool for automated visualization of pairwise alignments of multiple annotated sequences as dot plots combined into a single matrix. PatchWorkPlot enables exploration of positions, breakpoints, and architectures of structural variations across two or more sequences. The tool supports customization of visualization parameters and produces high-resolution, publication-ready figures. PatchWorkPlot significantly reduces manual work and simplifies the generation of complex plots for various cases, from individual loci to large-scale comparative projects. Availability: PatchWorkPlot is implemented using Python 3 and is publicly available at GitHub: github.com/yana-safonova/PatchWorkPlot. Supplementary information: Supplementary data are available at Bioinformatics online.

M Pospelova, Y Safonova

Bioinformatics, 2025

DOI

Gut fungi are associated with human genetic variation and disease risk

Human genetic determinants of the gut mycobiome remain uninvestigated despite decades of research highlighting tripartite relationships between gut bacteria, genetic background, and disease. Here, we present the first genome-wide association study on the number and types of human genetic loci influencing gut fungi relative abundance. We detect 148 fungi-associated variants (FAVs) across 7 chromosomes that statistically associate with 9 fungal taxa. Of these FAVs, several occur in the protein-coding genes PTPRC, ANAPC10, NAV2, and CDH13. Additional FAVs link to tissue-specific gene expression as fungi-associated expression quantitative trait loci. Notably, the relative abundance of gut yeast Kazachstania associates with genetic variation in CDH13 encoding T-cadherin, a protein linked to cardiovascular disease. Kazachstania shows a causal relationship with cardiovascular disease risk in a two-sample Mendelian randomization analysis. These findings establish previously unrecognized connections between human genetics, gut fungi, and chronic disease, broadening the paradigm of human-microbe interactions in the gut to the mycobiome.

EP Van Syoc, ER Davenport, SR Bordenstein

PLoS Biol, 2025

DOI

The microbiome and volatile organic compounds reflecting the state of decomposition in an indoor environment

Given that a variety of factors can affect the decomposition process, it can be difficult to determine the post-mortem interval (PMI). The process is highly dependent on microbial activity, and volatile organic compounds (VOCs) are a by-product of this activity. Given both have been proposed to assist in PMI determination, a deeper understanding of this relationship is needed. The current study investigates the temporal evolution of the microbiome and VOC profile of a decomposing human analog (swine) in a controlled, indoor environment. Microbial communities were sampled at six time points up to the active decay phase (72 swabs in total). VOC headspace samples were collected every six hours with six sampling times in common with the swab times. Sampling locations included the abdominal area, anus, right ear canal, and right nostril. Bacterial communities were found to significantly change during decomposition (p < 0.001), and communities shifted differently based on sample location. The families Moraxellaceae, Planococcaceae, Lactobacillaceae, and Staphylococcaceae drove these community shifts. From random forest analysis, the nostril sampling location was determined to be the best location to predict stage of decomposition. Individual VOCs exhibited large temporal shifts through decomposition stage in contrast to smaller shifts when evaluated based on functional groups. Finally, pairwise linear regression models between abdominal area bacteria and selected VOCs were assessed; Planococcaceae and Tissierellaceae were significantly correlated to indole. Overall, this study provides an exploratory analysis to support the connection between the microbiome, VOCs, and their relationship throughout decomposition.

VM Cappas, R Roy, ER Davenport, DG Sykes

Sci Justice, 2025

DOI

Factors Influencing Parental Decisions on Respiratory Syncytial Virus Immunoprophylaxis

Objectives: New respiratory syncytial virus (RSV) immunizations for infants and pregnant mothers recently became available to prevent severe RSV disease in infants. We aimed to determine the primary reasons for parental RSV immunization decisions. We further sought to evaluate the associations between vaccine receipt and source of health care information and trust in one’s health care provider. Study design: A convenience sample of parents and guardians of infants were surveyed during the 2023-2024 RSV season in one newborn nursery and three affiliated clinics that are part of an academic health system. Results: Among the 118 respondents, 79 (66.9%) chose to receive an RSV vaccine themselves (n = 42) or consented for infant immunization (n = 37). Thirty-nine (92.9%) parents who consented to maternal vaccination and 35 (87.5%) who consented to infant immunization stated a primary reason was protection for their infant. Among those that did not receive the maternal vaccine, the most common reasons were nonavailability (39.7%) or no provider immunization offer (22.2%). Infant immunoprophylaxis was most commonly refused due to the immunization being too new (66.7%). There were no significant associations between vaccine receipt and reported source of health information or between vaccine receipt and degree of trust in the health care provider. Conclusions: The desire to protect their infant from illness was the primary reason for parental RSV immunization intent, while the primary reasons for not immunizing were lack of availability, lack of provider recommendation, and the perception that the immunizations are too new. Ensuring availability and strong recommendations may improve immunization uptake.

KE Shedlock, SD Hicks, RE Gardner, LD Kaye, BJ Lipsett, EW Schaefer, IM Paul, BN Fogel

J Pediatr Clin Pract, 2025

DOI

Creation and Evaluation of New Growth Charts With a Gradual Transition From WHO to CDC Values

Background and objectives: At age 2 years, the Centers for Disease Control and Prevention (CDC) recommends switching from the World Health Organization (WHO) Growth Standards to the CDC 2000 Growth Reference. This abrupt switch may affect growth assessment, such as causing a clinically important change in growth z score in a child with a stable growth pattern. We sought to create growth charts that gradually transition between WHO and CDC and to evaluate their impact on growth assessment of young children using real-world data. Methods: We iteratively developed methods to create new charts for body mass index (BMI)-, weight-, and length/height-for-age. We performed a retrospective cohort study comparing the mean change in z scores for these parameters between 1.5 and 2 years for the CDC-recommended vs the new, gradually transitioned (gradual) charts. We also compared the prevalence of a large change in z score: <-1 (drop) or >1 (rise). Results: We transitioned between the charts using a weighted average from ages 2 to 5 years. In 7429 children, there was a large decrease in mean body mass index-for-age z score (BMIz) and weight-for-age z score (WTz) in the CDC-recommended charts (BMIz -0.59, WTz -0.35) that was not seen in the gradual charts (BMIz -0.09, WTz -0.01). Correspondingly, the proportion with a drop in BMIz or WTz was much higher for the CDC-recommended charts (BMIz 28.3%, WTz 6.0%) than for the gradual charts (BMIz 11.6%, WTz 0.85%). Conclusions: We created growth charts that gradually transition between WHO and CDC and may reduce overidentification of slow weight gain. These charts may be useful in clinical care and research.

C Daymont, W Hwang, IM Paul, N Shur, DS Freedman

Pediatrics, 2025

DOI
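
The weighted-average transition described in the growth-chart entry above can be illustrated with a minimal sketch. The abstract states only that a weighted average is used from ages 2 to 5 years; the linear weighting below, and the function names `cdc_weight` and `blended_value`, are assumptions made for illustration and are not taken from the paper.

```python
def cdc_weight(age_years: float, start: float = 2.0, end: float = 5.0) -> float:
    """Weight given to the CDC value at a given age.

    Assumption (not from the source): the weight rises linearly
    from 0 at `start` (pure WHO) to 1 at `end` (pure CDC).
    """
    if age_years <= start:
        return 0.0
    if age_years >= end:
        return 1.0
    return (age_years - start) / (end - start)


def blended_value(age_years: float, who_value: float, cdc_value: float) -> float:
    """Weighted average of a WHO and a CDC reference value at one age."""
    w = cdc_weight(age_years)
    return (1.0 - w) * who_value + w * cdc_value


# Hypothetical reference values, for illustration only.
midpoint = blended_value(3.5, who_value=10.0, cdc_value=12.0)
```

Under this assumed scheme a child at exactly 2 years is scored against the WHO value alone, so there is no jump at the changeover age.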

Modeling the European Neolithic expansion suggests predominant within-group mating and limited cultural transmission

The Neolithic Revolution initiated a pivotal change in human society, marking the shift from foraging to farming. Historically, the underlying mechanisms of agricultural expansion have been a topic of debate, centered around two primary models: cultural diffusion, involving the transfer of knowledge and practices, and demic diffusion, characterized by the migration and replacement of populations. More recently, ancient DNA analyses have revealed significant ancestry changes during Europe’s Neolithic transition, suggesting a primarily demic expansion. Nonetheless, the presence of 10-15% hunter-gatherer ancestry in modern Europeans indicates cultural transmission and between-group mating were additional contributing factors. Here, we integrate mathematical models, agent-based simulations, and ancient DNA analysis to dissect and quantify the roles of cultural diffusion and between-group mating in farming’s expansion. Our findings indicate limited cultural transmission and predominantly within-group mating. Additionally, we challenge the assumption that demic expansion always leads to ancestry turnover. These results offer insights into early agricultural society through the integration of ancient DNA with archaeological models.

TM LaPolice, MP Williams, CD Huber

Nat Commun, 2025

DOI

Leveraging sequences missing from the human genome to diagnose cancer

Background: Cancer diagnosis using cell-free DNA (cfDNA) has the potential to improve treatment and survival but has several technical limitations. Methods: In this study, we developed a prediction model based on neomers, DNA sequences 13–17 nucleotides in length that are predominantly absent from the genomes of healthy individuals and are created by tumor-associated mutations. Results: We show that neomer-based classifiers can accurately detect cancer, including early stages, and distinguish subtypes and features. Analysis of 2577 cancer genomes from 21 cancer types shows that neomers can distinguish tumor types with higher accuracy than state-of-the-art methods. Generation and analysis of 465 cfDNA whole-genome sequences demonstrates that neomers can precisely detect lung and ovarian cancer, including early stages, with an area under the curve ranging from 0.89 to 0.94. By testing various promoters or over 9000 candidate enhancer sequences with massively parallel reporter assays, we show that neomers can identify cancer-associated mutations that alter regulatory activity. Conclusions: Combined, our results identify a sensitive, specific, and simple cancer diagnostic tool that can also identify cancer-associated mutations in gene regulatory elements.

I Georgakopoulos-Soares, O Yizhar-Barnea, I Mouratidis, CSY Chan, M Patsakis, A Nayak, R Bradley, M Mahajan, J Sims, DL Cintron, R Easterlin, JS Kim, E Chen, G Pineda, GE Parada, JS Witte, CA Maher, F Feng, I Vathiotis, N Syrigos, E Panagiotou, A Charpidou, K Syrigos, J Chapman, M Kvale, M Hemberg, N Ahituv

Commun Med, 2025

DOI
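
The idea of neomers in the entry above, short sequences absent from a reference but present in a sample, can be sketched as a set difference over k-mers. This is a toy illustration only: the function names are hypothetical, the example uses k = 4 rather than the 13 to 17 nucleotides stated in the abstract, and it ignores reverse complements, sequencing error, and the classifier the study builds on top.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_neomers(reference: str, sample: str, k: int) -> set[str]:
    """k-mers seen in the sample that never occur in the reference."""
    return kmers(sample, k) - kmers(reference, k)


# Hypothetical toy sequences: the sample differs from the reference by one base.
reference = "ACGTACGT"
sample = "ACGTTCGT"
found = candidate_neomers(reference, sample, k=4)
```

A single substitution creates up to k new k-mers overlapping the changed base, which is why short absent sequences can act as markers of mutation.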

Explicit Scale Simulation for analysis of RNA-sequencing count data with ALDEx2

In high-throughput sequencing (HTS) studies, sample-to-sample variation in sequencing depth is driven by technical factors, and not by variation in the scale (size) of the biological system. Typically a statistical normalization removes unwanted technical variation in the data or the parameters of the model to enable differential abundance analyses. We recently showed that all normalizations make implicit assumptions about the unmeasured system scale and that errors in these assumptions can dramatically increase false positive and false negative rates. We demonstrated that these errors can be mitigated by accounting for uncertainty using a scale model, which we integrated into the ALDEx2 R package. This article provides new insights focusing on the application to transcriptomic analysis. We provide transcriptomic case studies demonstrating how scale models, rather than traditional normalizations, can reduce false positive and false negative rates in practice while enhancing the transparency and reproducibility of analyses. These scale models replace the need for dual cutoff approaches often used to address the disconnect between practical and statistical significance. We demonstrate the utility of scale models built based on known housekeeping genes in complex metatranscriptomic datasets. Thus this work provides guidance on how to incorporate scale into transcriptomic data sets.

GB Gloor, MP Nixon, JD Silverman

NAR Genom Bioinform, 2025

DOI

smoothEM: A new approach for the simultaneous assessment of smooth patterns and spikes

We consider functional data where an underlying smooth curve is composed not just with errors, but also with irregular spikes. We propose an approach that, combining regularized spline smoothing and an Expectation-Maximization (EM) algorithm, allows one to both identify spikes and estimate the smooth component. Imposing some assumptions on the error distribution, we prove consistency of EM estimates. Next, we demonstrate the performance of our proposal on finite samples and its robustness to assumption violations through simulations. Finally, we apply it to data on the annual heatwave index in the US and on weekly electricity consumption in Ireland. In both data sets, we are able to characterize underlying smooth trends and to pinpoint irregular/extreme behaviors.

H Dang, MA Cremona, F Chiaromonte

Electron J Statist, 2025

DOI

TransferTWAS: A transfer learning framework for cross-tissue transcriptome-wide association study

Transcriptome-wide association studies (TWASs) utilize gene-expression data to explore the genetic basis of complex traits. A key challenge in TWASs is developing robust imputation models for tissues with limited sample sizes. This paper introduces transfer learning-assisted TWAS (TransferTWAS), a framework that adaptively transfers information from multiple tissues to improve gene-expression prediction in the target tissue. TransferTWAS employs a data-driven strategy that assigns higher weights to genetically similar external tissues. It outperforms other multi-tissue TWAS methods, such as the Unified Test for Molecular Signatures (UTMOST), which neglects tissue similarity, and Joint-Tissue Imputation (JTI), which relies on functional annotations to represent tissue similarity. Simulation studies demonstrate that TransferTWAS achieves the highest imputation accuracy, and analyses using the ROS/MAP and GEUVADIS datasets show a substantial power gain while maintaining control over type-I errors. Furthermore, analysis of the low-density lipoprotein cholesterol GWAS dataset and other complex traits demonstrates that TransferTWAS effectively identifies more associations compared with existing methods.

D Lai, H Wang, T Gu, S Wu, DJ Liu, PC Sham, YD Zhang

Am J Hum Genet, 2025

DOI

Allele frequency selection and no age-related increase in human oocyte mitochondrial mutations

Mitochondria, cellular powerhouses, harbor DNA [mitochondrial DNA (mtDNA)] inherited from the mothers. mtDNA mutations can cause diseases, yet whether they increase with age in human oocytes remains understudied. Here, using highly accurate duplex sequencing, we detected de novo mutations in single oocytes, blood, and saliva in women 20 to 42 years of age. We found that, with age, mutations increased in blood and saliva but not in oocytes. In oocytes, mutations with high allele frequencies were less prevalent in coding than noncoding regions, whereas mutations with low allele frequencies were more uniformly distributed along the mtDNA, suggesting frequency-dependent purifying selection. Thus, mtDNA in human oocytes is protected against accumulation of mutations with aging and having functional consequences. These findings are particularly timely as humans tend to reproduce later in life.

B Arbeithuber, K Anthony, B Higgins, P Oppelt, O Shebl, I Tiemann-Boege, F Chiaromonte, T Ebner, KD Makova

Sci Adv, 2025

DOI

Multiplexed Salivary miRNA Quantification for Predicting Severe COVID-19 Symptoms in Children Using Ligation-RPA Amplification Assay

While most children with COVID-19 experience mild symptoms or remain asymptomatic, some may develop severe complications. Early identification of children at risk for severe outcomes is essential to ensuring timely and effective intervention. Recent studies have identified alterations in salivary microRNA (miRNA) expression levels as promising biomarkers for predicting severe complications in children. However, there remains a need for a rapid, noninvasive, and quantitative method to detect miRNA expression level changes, as their upregulation or downregulation serves as a hallmark of various diseases, providing an alternative to sequencing-based methods. Here, we developed a highly specific and sensitive ligation-coupled recombinase polymerase amplification (RPA) assay for quantitatively detecting multiplex severe and nonsevere miRNAs on a portable platform. The assay begins with an miRNA-templated annealing and ligation reaction of miR-1273, miR-296, and miR-29, followed by an RPA reaction. We quantified 100 pM to 1 fM, resolving 1 fM, with 100% specificity. Next, we validated portable extraction against benchtop extraction, achieving R² > 0.85 and r > 0.92 in clinical samples. Finally, testing 154 clinical samples revealed severe miRNA downregulation compared to nonsevere cases. The assay achieved high diagnostic accuracy with an area under the curve (AUC) of 0.98. This platform would empower clinicians to make informed decisions, optimize resource allocation, and improve outcomes, particularly in point-of-care (POC) settings.

MA Ahamed, Z Zhang, A Kshirsagar, AJ Politza, U Sethuraman, S Suresh, S Hicks, F Guo, W Guan

ACS Sens, 2025

DOI

Polygenic prediction of body mass index and obesity through the life course and across ancestries

Polygenic scores (PGSs) for body mass index (BMI) may guide early prevention and targeted treatment of obesity. Using genetic data from up to 5.1 million people (4.6% African ancestry, 14.4% American ancestry, 8.4% East Asian ancestry, 71.1% European ancestry and 1.5% South Asian ancestry) from the GIANT consortium and 23andMe, Inc., we developed ancestry-specific and multi-ancestry PGSs. The multi-ancestry score explained 17.6% of BMI variation among UK Biobank participants of European ancestry. For other populations, this ranged from 16% in East Asian-Americans to 2.2% in rural Ugandans. In the ALSPAC study, children with higher PGSs showed accelerated BMI gain from age 2.5 years to adolescence, with earlier adiposity rebound. Adding the PGS to predictors available at birth nearly doubled explained variance for BMI from age 5 onward (for example, from 11% to 21% at age 8). Up to age 5, adding the PGS to early-life BMI improved prediction of BMI at age 18 (for example, from 22% to 35% at age 5). Higher PGSs were associated with greater adult weight gain. In intensive lifestyle intervention trials, individuals with higher PGSs lost modestly more weight in the first year (0.55 kg per s.d.) but were more likely to regain it. Overall, these data show that PGSs have the potential to improve obesity prediction, particularly when implemented early in life.

RAJ Smit, et al.

Nat Med, 2025

DOI

Brain-wide connectivity and novelty response of the dorsal endopiriform nucleus in mice

The dorsal endopiriform nucleus (EPd) is a cortical subplate structure within the piriform cortex that shares similar developmental origins to those of the claustrum. Although implicated in epilepsy and olfaction, the EPd’s connectivity and function remain largely unclear due to the lack of specific molecular markers. Our recent mapping study identifies the oxytocin receptor (Oxtr) as highly enriched in the EPd. Immunohistochemical and spatial transcriptomic analyses confirm Oxtr enrichment and a distinct molecular profile of the EPd compared to the claustrum. Whole-brain input-output mapping of EPd-Oxtr neurons unveils extensive bidirectional connections with ventral brain regions, orchestrating circuits regulating olfaction, internal state, and emotion. Furthermore, in vivo miniscope recordings reveal that EPd-Oxtr neurons exhibit high baseline activity during exploration, with a sharp decrease in response to novel stimuli. Together, these findings suggest that EPd-Oxtr neurons integrate interoceptive and exteroceptive signals, contributing to internal state regulation and behavioral adaptation to novel environmental cues.

SB Manjila, S Son, D Parmaksiz, H Kline, R Betty, YT Wu, HJ Pi, D Shin, JK Liwang, FN Kronman, IE Bjerke, K McGovern, J Silverman, A Paul, Y Kim

Cell Rep, 2025

DOI

A side-by-side comparison of variant function measurements using deep mutational scanning and base editing

Variant annotation is a crucial objective in mammalian functional genomics. Deep mutational scanning (DMS) using saturation libraries of complementary DNAs (cDNAs) is a well-established method for annotating human gene variants, but CRISPR base editing (BE) is emerging as an alternative. However, questions remain about how well high-throughput BE measurements can annotate variant function and the extent of downstream experimental validation required. This study is the first direct comparison of cDNA DMS and BE in the same lab and cell line. We focus on how well short guide RNA (sgRNA) depletion or enrichment is explained by the predicted edits within the editing “window” defined by the sgRNA. The most likely predicted edits enhance the agreement between a “gold standard” DMS dataset and a BE screen. A simple filter for sgRNAs making single edits in their window could sufficiently annotate a large proportion of variants directly from sgRNA sequencing of large pools. When multi-edit guides are unavoidable, directly measuring edits in medium-sized validation pools can recover high-quality variant annotation data. Our data show a surprisingly high degree of correlation between base editor data and gold standard DMS. We suggest that the main variable measured in base editor screens is the desired base edits.

I Sokirniy, H Inam, M Tomaszkiewicz, J Reynolds, D McCandlish, J Pritchard

Nucleic Acids Res, 2025

DOI

The genomic footprints of migration: how ancient DNA reveals our history of mobility

Ancient DNA has emerged as a powerful tool for studying human migration through the detection of admixture signatures. Here, we present the theoretical principles and methodologies for admixture analysis, with an emphasis on f-statistics and qpAdm. We review case studies from the literature demonstrating how these methods uncover patterns of human mobility, and discuss challenges related to data quality, demographic complexity, and sample representativeness on admixture and migration inferences. Finally, we highlight promising advancements in admixture analysis and underscore the importance of integrating genetic, archaeological, and historical data to achieve a more interdisciplinary and nuanced reconstruction of human history.

MP Williams, CD Huber

Genome Biol, 2025

DOI
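
As a small illustration of the f-statistics mentioned in the entry above, the f4 statistic for four populations is commonly defined as the average over sites of the product of allele-frequency differences, (a - b)(c - d). The sketch below computes that average from plain lists; the function name and the frequencies are hypothetical, and it omits the standard errors and block jackknife that real analyses (including qpAdm workflows) rely on.

```python
def f4(a: list[float], b: list[float], c: list[float], d: list[float]) -> float:
    """Mean over sites of (a - b) * (c - d), given per-site allele frequencies."""
    if not (len(a) == len(b) == len(c) == len(d)) or not a:
        raise ValueError("all populations need the same non-zero number of sites")
    total = sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d))
    return total / len(a)


# Hypothetical allele frequencies at two sites in four populations.
value = f4([0.1, 0.5], [0.3, 0.5], [0.2, 0.4], [0.6, 0.1])
```

A value near zero is what one expects when the pair (A, B) and the pair (C, D) share no drift; a consistent departure from zero is the usual signal of admixture or shared ancestry.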

Characterization of extensive diversity in immunoglobulin light chain variable germline genes across biomedically important mouse strains

The light chain immunoglobulin (IG) genes of inbred mouse strains are poorly documented in current gene databases. We previously showed that IG heavy chain (IGH) loci of wild-derived mouse strains, representing the major mouse subspecies, contained 247 IGH variable (V) sequences not curated in the International ImMunoGeneTics (IMGT) information system database, commonly used for adaptive immune receptor repertoire sequencing (AIRR-seq) analysis. Despite containing levels of polymorphism similar to the IGH locus, the germline gene content and diversity of the light chain loci (kappa, IGK; lambda, IGL) have not been comprehensively cataloged. To explore the extent of germline light chain repertoire diversity across mouse strains commonly used in the biomedical sciences, we performed AIRR-seq analysis and germline gene inference for 18 inbred mouse strains, including 4 wild-derived strains with diverse sub-species origins. We inferred 1582 IGKV and 63 IGLV sequences, representing 459 and 22 unique IGKV and IGLV germline alleles. Of the unique germline IGKV and IGLV sequences, 67.8% and 59%, respectively, were undocumented in IMGT. Across strains we observed germline IGKV sequences shared by three distinct IGK haplotypes and a more conserved IGLV germline repertoire. In addition, joining (J) gene inference indicated a novel IGKJ2 allele shared between PWD/PhJ and MSM/MsJ, a novel IGLJ1 allele for LEWES/EiJ, and a novel IGLJ2 allele for MSM/MsJ. Finally, combined IGHV, IGKV, and IGLV phylogenetic analysis of wild-derived germline sets revealed reduced diversity for light chain sequences compared to the heavy chain, suggesting potential evolutionary differences between heavy and light chain loci.

JT Kos, Y Safonova, K Shields, CA Silver, WD Lees, AM Collins, CT Watson

ImmunoHorizons, 2025

DOI

Genetic ancestry influences gene-environment interactions with sociocultural factors: Results from the Hispanic Community Health Study/Study of Latinos

Often, studies will aggregate all participants identified as Hispanic/Latino, despite genetic and environmental substructures, preventing the meaningful interrogation of the roles of genetics and environment in human health. Using the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), we examined how self-identified background group and genetic ancestry influence gene-environment interactions between body mass index (BMI) and a polygenic score for BMI (PGSBMI). Participants (n = 7,075) identified with six background groups: Central American, Cuban, Dominican, Mexican, Puerto Rican, and South American. Generalized linear models incorporating complex survey weighting were used to model BMI through joint and stratified (background group, estimated Amerindigenous [AME] ancestry) analyses including PGSBMI and other health-related variables. Interaction effects were modeled between PGSBMI and diet and age at immigration. Comparing pooled to background group-stratified analyses, we observe heterogeneous distributions of environmental and sociocultural variables, as well as differing associations with AME ancestry. Within the multivariate model, PGSBMI performance decreased with increasing AME ancestry. After stratification, PGS-age-at-immigration interactions remained statistically significant in some strata: Mexican background individuals born in the US (50 states/DC) (β = 1.33, p < 0.01), Dominican background individuals 6-12 years old (β = 4.38, p < 0.001), and Cuban background individuals 0-5 years old (β = 2.20, p = 0.015) relative to those ≥ 21 years old at migration. It is vital to understand populations of interest to model them appropriately and prevent possible confounding or misinterpretation. While this work focuses specifically on Hispanic/Latino groups, these lessons are relevant to other groups as we diversify work to better understand gene-environment interactions.

J Sharma, CE McArdle, M Graff, C Cordero, M Daviglus, LC Gallo, CR Isasi, TN Kelly, KM Perreira, GA Talavera, J Cai, KE North, L Fernández-Rhodes, GL Wojcik

HGG Adv., 2025

DOI

Patterns of X-linked inheritance: A new approach for the genome era

Purpose: The concepts of X-linked (XL) dominant and recessive inheritance originated long before dosage compensation for X chromosome genes was understood, but now have no scientific basis. However, misunderstanding of the underlying biology persists, prompting our reassessment of XL inheritance. Methods: We reviewed data on penetrance, expressivity, and X chromosome inactivation (XCI) for 55 XL genes and 57 XL disorders, and examined variations in inheritance based on disease severity, XCI status, cell selection, and other factors. Results: Our analysis demonstrated widely varying penetrance among heterozygous females that was related to severity of the phenotype particularly in males, the degree of cell selection shown by XCI patterns, cell autonomous or non-cell autonomous function of the gene product, and rare cellular interference. Conclusion: The conventional classification of XL inheritance into dominant and recessive subtypes is biologically flawed and should be retired. A more nuanced framework for understanding XL disorders is needed that accounts for the underlying biological complexity, and we propose 4 new groups of XL disorders with different patterns that should improve genetic diagnosis and counseling in families with XL disorders.

S Basava, CJ Billington Jr, L Carrel, LG Biesecker, WB Dobyns

Genet Med, 2025

DOI

IRF7 controls spontaneous autoimmune germinal center and plasma cell checkpoints

How IRF7 promotes autoimmune B cell responses and systemic autoimmunity is unclear. Analysis of spontaneous SLE-prone mice deficient in IRF7 uncovered a role for IRF7 in regulating autoimmune germinal center (GC), plasma cell (PC), and autoantibody responses and disease. IRF7, however, was dispensable for foreign antigen-driven GC, PC, and antibody responses. Competitive bone marrow (BM) chimeras highlighted the importance of IRF7 in hematopoietic cells in spontaneous GC and PC differentiation. Single-cell RNAseq of SLE-prone B cells indicated IRF7-mediated B cell differentiation through GC and PC fates. Mechanistic studies revealed that IRF7 promoted B cell differentiation through GC and PC fates by regulating the transcriptome, translation, and metabolism of SLE-prone B cells. Mixed BM chimeras demonstrated a requirement for B cell-intrinsic IRF7 in IgG autoantibody production but not in the regulation of spontaneous GC and PC responses. Altogether, we delineate previously unknown B cell-intrinsic and -extrinsic mechanisms of IRF7-promoted spontaneous GC and PC responses, loss of tolerance, autoantibody production, and SLE development.

AJ Fike, KN Bricker, MV Gonzalez, A Maharjan, T Bui, K Nuon, SM Emrich, JL Weber, SA Luckenbill, NM Choi, R Sauteraud, DJ Liu, NJ Olsen, R Caricchio, M Trebak, SB Chodisetti, ZSM Rahman

J Exp Med, 2025

DOI

FDA Draft Guidelines for AI and the Need for Ethical Frameworks

N/A

AD Deshmukh, JK Wagner

JAMA Pediatr, 2025

DOI

Comparative Analysis of Mammalian Adaptive Immune Loci Revealed Spectacular Divergence and Common Genetic Patterns

Adaptive immune responses are mediated by the production of adaptive immune receptors, antibodies, and T-cell receptors, which bind antigens, thus causing their neutralization. Unlike other proteins, adaptive immune receptors are not fully encoded in the germline genome and result from a complex of somatic processes collectively called V(D)J recombination affecting germline immunoglobulin (IG) and T-cell receptor (TR) loci consisting of template genes. While various existing studies report extreme diversity of antibodies and T-cell receptors, little is known about the diversity of germline IG and TR loci. To overcome this gap, the first comparative analysis of full-length sequences of IG/TR loci across 46 mammalian species from 13 taxonomic orders was performed. First, germline gene counts were shown to correlate in immunoglobulin heavy chain (IGH)/immunoglobulin lambda (IGL) loci and T-cell receptor alpha (TRA)/T-cell receptor beta (TRB) and anticorrelate in immunoglobulin kappa (IGK)/IGL, possibly indicating coevolution between corresponding chains. Second, structures of IG/TR loci were analyzed, and it was shown that IG/TR loci formed by long arrays of high multiplicity repeats are more common for species that have experienced population bottlenecks. Finally, haplotypes of IG/TR loci with little or no sequence similarity within a species were found, suggesting that they may have a limited potential for homologous recombination. These results demonstrate that IG/TR loci are rapidly evolving genomic regions whose structural variation is shaped by the population history of the species and open new perspectives for immunogenomics studies.

M Pospelova, K Voss, A Zamyatin, CT Watson, KP Koepfli, A Bankevich, M Pennell, Y Safonova

Mol Biol Evol, 2025

DOI

Replacing normalizations with interval assumptions enhances differential expression and differential abundance analyses

Background: Methods for differential expression and differential abundance analysis often rely on normalization to address sample-to-sample variation in sequencing depth. However, normalizations imply strict, unrealistic assumptions about the unmeasured scale of biological systems (e.g., microbial load or total cellular transcription). Even slight errors in these assumptions introduce bias, leading to elevated false positive and negative rates. Results: We introduce interval assumptions as a generalization of normalizations. Unlike normalizations, our interval methods allow researchers to account for potential errors in assumptions about the system scale. Interval assumptions are also customizable and allow researchers to express more biologically plausible assumptions about scale. Interval assumptions even generalize Quantitative Microbiome Profiling (QMP), allowing researchers to account for errors in flow cytometry-based measurements of total cellular concentration. We develop a novel hypothesis testing framework that allows us to integrate interval assumptions into existing tools. We develop a modified version of the popular ALDEx2 method using interval assumptions rather than normalizations. Through real and simulated data analyses, we find that interval assumptions can dramatically decrease false positive rates (i.e., from 45% to 5%) while retaining or increasing statistical power. We also study interval assumptions under misspecification and show they still improve on normalizations. Conclusions: Interval assumptions enhance the rigor and reproducibility of differential expression and differential abundance analyses. Our results add to a growing body of literature arguing that normalizations should be replaced with alternative methods that allow researchers to account for scale uncertainty. However, compared to recent alternatives like scale models and sensitivity analyses, interval assumptions are easier to use, are more robust to misspecification, and have stronger and more interpretable inferential guarantees.

KC McGovern, JD Silverman

BMC Bioinformatics, 2025

DOI

Nested Admixture During and After the Trans-Atlantic Slave Trade on the Island of São Tomé

Human genetic admixture, involving the contact between two or more previously isolated populations, can be a complex process influenced by social dynamics. In this study, we aim to reconstruct complex admixture histories in São Tomé, an island in the Gulf of Guinea where the Portuguese established one of the first plantation-based slave societies. Since the 15th century, migration waves from Africa and Europe, slavery, marooning, and indentured labour led to profound demographic shifts and social stratification on the island. Examining 2.5 million SNPs newly genotyped in 96 São Toméans, we observed patterns of genetic differentiation that were more complex than those of other populations descended from enslaved Africans on either side of the Atlantic. Using local ancestry inference and Identical-by-Descent methods, we identified five genetic clusters in São Tomé and reconstructed shared ancestries between each cluster and 70 African and European population samples, including an extensive sample from the Cabo Verde archipelago. Our findings align with historical records, retracing the major slave trade routes and labour-driven migrations after the abolition of slavery. We also identified gene flow between recently admixed groups that were previously isolated on the island. We call this process, creating multiple layers of genetic ancestry in admixed genomes, nested admixture. We suggest that changing social structures in São Tomé transformed the genetic structure of its population and influenced the admixture process. This study demonstrates how successive admixture and isolation events during and after the Trans-Atlantic Slave Trade shaped extant genetic diversity patterns at local scale in Africa.

M Ciccarella, R Laurent, ZA Szpiech, E Patin, F Dessarps-Freichey, J Utgé, L Lémée, A Semo, J Rocha, P Verdu

Mol Biol Evol, 2025

DOI

Apollo: a comprehensive GPU-powered within-host simulator for viral evolution and infection dynamics across population, tissue, and cell

Modern sequencing instruments bring unprecedented opportunity to study within-host viral evolution in conjunction with viral transmissions between hosts. However, no computational simulators are available to assist the characterization of within-host dynamics. This limits our ability to interpret epidemiological predictions incorporating within-host evolution and to validate computational inference tools. To fill this need, we developed Apollo, a GPU-accelerated, out-of-core tool for within-host simulation of viral evolution and infection dynamics across population, tissue, and cellular levels. Apollo is scalable to hundreds of millions of viral genomes and can handle complex demographic and population genetic models. Apollo can replicate real within-host viral evolution, accurately recapturing observed viral sequences from HIV and SARS-CoV-2 cohorts derived from initial population-genetic configurations. For practical applications, using Apollo-simulated viral genomes and transmission networks, we validated and uncovered the limitations of a widely used viral transmission inference tool.

D Perera, E Li, PM Gordon, F van der Meer, T Lynch, J Gill, DL Church, APJ de Koning, CD Huber, G van Marle, A Platt, Q Long

Nat Commun, 2025

DOI

Critical roles of IKAROS and HDAC1 in regulation of heterochromatin and tumor suppression in T-cell acute lymphoblastic leukemia

The IKZF1 gene encodes IKAROS, a DNA-binding protein that acts as a tumor suppressor in T-cell acute lymphoblastic leukemia (T-ALL). IKAROS can act as a transcriptional repressor via recruitment of histone deacetylase 1 (HDAC1) and chromatin remodeling; however, the mechanisms through which IKAROS exerts its tumor suppressor function via heterochromatin in T-ALL are largely unknown. We studied human and mouse T-ALL using a loss-of-function and IKZF1 re-expression approach, along with primary human T-ALL, and normal human and mouse thymocytes to establish the role of IKAROS and HDAC1 in global regulation of facultative heterochromatin and transcriptional repression in T-ALL. Results identified novel IKAROS and HDAC1 functions in T-ALL: both IKAROS and HDAC1 are essential for EZH2 histone methyltransferase activity and formation of facultative heterochromatin; recruitment of HDAC1 by IKAROS is critical for establishment of H3K27me3 histone modification and repression of active enhancers; and IKAROS-HDAC1 complexes promote formation and expansion of H3K27me3 Large Organized Chromatin lysine (K) domains (LOCKs) and Broad Genic Repression Domains (BGRDs) in T-ALL. Our results establish the central role of IKAROS and HDAC1 in activation of EZH2, global regulation of the facultative heterochromatin landscape, and silencing of active enhancers that regulate oncogene expression.

Y Ding, B He, D Bogush, J Schramm, C Singh, K Dovat, J Randazzo, D Tukaramrao, J Hengst, C Annageldiyev, A Kudva, D Desai, A Sharma, VS Spiegelman, S Huang, CT Viet, G Dorsam, GS Scholler, J Broach, F Yue, S Dovat

Leukemia, 2025

DOI

CK2α Deletion in the Hematopoietic Compartment Shows a Mild Alteration in Terminally Differentiated Cells and the Expansion of Stem Cells

Casein Kinase II (CK2) is a ubiquitously present serine/threonine kinase essential for mammalian development. CK2 holoenzyme is a tetramer with two highly related catalytic subunits (α or α’) and two regulatory β subunits. Global deletion of the α or β subunit in mice is embryonically lethal. We and others have shown that CK2 is overexpressed in leukemia cells and plays an important role in cell cycle, survival, and resistance to the apoptosis of leukemia stem cells (LSCs). To study the role of CK2α in adult mouse hematopoiesis, we generated hematopoietic cell-specific CK2α-conditional knockout mice (Vav-iCreCK2α f/f). Here we report the generation and validation of a novel mouse model that lacks CK2α in the hematopoietic compartment. Vav-iCreCK2α f/f mice were viable without dysmorphic features and showed a mild phenotype under baseline conditions. In Vav-iCreCK2α f/f mice, the blood count showed a significant decrease in total red blood cells and platelets. The spleen was enlarged in Vav-iCreCK2α f/f mice with evidence of extramedullary hematopoiesis. HSC and early progenitor cell compartments showed expansion in CK2α-null bone marrow, suggesting that the absence of CK2α impaired their proliferation and differentiation. Given the established roles of CK2 in cell cycle regulation and the findings reported here, further functional studies are warranted to investigate the role of CK2α in HSC self-renewal and differentiation. This mouse model serves as a valuable tool for understanding the role of CK2α in normal and malignant hematopoiesis.

R Rajaiah, M Daniyal, MP Shanmugam, H Valensi, K Duke, K Mercer, M Klink, M Lanza, Y Uzun, S Huang, S Dovat, CG Behura

Cells, 2025

DOI

Analyzing Taiwanese Traffic Patterns on Consecutive Holidays Through Forecast Reconciliation and Prediction-Based Anomaly Detection Techniques

This study explores traffic patterns on Taiwanese highways during consecutive holidays, with a focus on understanding the behavior of Taiwanese highway traffic. We propose a prediction-based detection method for identifying highway traffic anomalies using reconciled ordinary least squares (OLS) forecasts and bootstrap prediction intervals. Two fundamental features of traffic flow time series – seasonality and spatial autocorrelation – are captured by adding Fourier terms in OLS models, spatial aggregation (as a hierarchical structure mimicking the geographical division into regions, cities, and stations), and a reconciliation step. Our approach, although simple, is capable of modeling complex traffic datasets with reasonable accuracy. Being based on OLS, it is efficient and permits avoiding the computational burden of more complex methods. Analyses of Taiwan’s consecutive holidays in 2019, 2020, and 2021 (73 days) showed strong variations in anomalies across different directions and highways. Specifically, we detected some areas and highways comprising a high number of traffic anomalies (north direction-central and southern regions-highways No. 1 and 3, south direction-southern region-highway No.3), and others with generally normal traffic (east and west direction). These results could provide important decision-support information to traffic authorities.

M Ashouri, FKH Phoa, MA Cremona

IEEE, 2025

DOI

Polycystic Ovary Syndrome, Metabolic Syndrome, and Inflammation in the Hispanic Community Health Study/Study of Latinos

Context: Polycystic ovary syndrome (PCOS) is a multifaceted endocrine disorder with reproductive and metabolic dysregulation. PCOS has been associated with inflammation and metabolic syndrome (MetS); however, the moderating effects of inflammation as measured by C-reactive protein (CRP) and menopause on the PCOS-MetS association have not been studied in Hispanic/Latinas with PCOS who have a higher metabolic burden. Objective: We studied the cross-sectional association between PCOS and (1) MetS in 7316 females of the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), (2) subcomponents of MetS including impaired fasting glucose (IFG) and elevated triglycerides (TGL), and (3) effect modification by menopausal status and CRP. Design: The HCHS/SOL is a multicenter, longitudinal, and observational study of US Hispanic/Latinos. Our study sample included females from visit 2 with self-reported PCOS and MetS (ages 23-82 years). Results: PCOS (prevalence = 18.8%) was significantly associated with MetS prevalence [odds ratio (OR) = 1.41 (95% confidence interval: 1.13-1.76)], IFG and TGL [OR = 1.42 (1.18-1.72), OR = 1.48 (1.20-1.83), respectively]. We observed effect modification by menopausal status (ORpre = 1.46, Pint = .02; ORpost = 1.34, Pint = .06) and CRP (ORelevated = 1.41, Pint = .04; ORnormal = 1.26, Pint = .16) on the PCOS-MetS association. We also observed a superadditive interaction between CRP and PCOS, adjusting for which resulted in an attenuated effect of PCOS on MetS (OR = 1.29 [0.93-1.78]). Conclusion: Hispanic/Latino females with PCOS had higher odds of MetS, IFG, and elevated TGL than their peers without PCOS. Interaction analyses revealed that the odds of MetS are higher among PCOS females who have premenopausal status or high inflammation. Interventions in Hispanic/Latinas should target these outcomes for effective management of the disease.

HC Rao, ML Meyer, MA Kominiarek, ML Daviglus, LC Gallo, C Cordero, R Syan, KM Perreira, GA Talavera, L Fernández-Rhodes

JCEM, 2025

DOI

Epitranscriptomics Regulation of CD70, CD80, and TIGIT in Cancer Immunity

Tumor development is mainly marked by the gradual transformation of cells that acquire capacities such as sustained growth signaling, evasion of growth suppression, resistance to cell death, and induction of angiogenesis, achieving replicative immortality and activating invasion and metastasis. How different epigenetic alterations like m1A, m5C, and m6A contribute to tumor development is a field that still needs to be investigated. The immune modulators, CD70, CD80, and TIGIT, mainly regulate T-cell activation and consequently the immune evasion of tumors. Here, we explored the presence and the potential consequences of RNA modifications in these regulators in pan-cancer. Our findings highlight the critical role of the m6A, m5C, and m1A in regulating CD70, CD80, and TIGIT across multiple solid tumors. By combining epitranscriptomics data with functional enrichment and survival modeling, we show that RNA modification enzymes not only modulate immune-related gene expression but also serve as potential biomarkers for patient prognosis. By constructing a robust four-gene prognostic signature involving YTHDF3, RBM15B, IGF2BP2, and TRMT61A, we demonstrate that RNA modification profiles can accurately stratify patients into risk groups with distinct overall survival outcomes. The performance of this model across eight cancer types underscores the translational promise of epitranscriptomic markers in both mechanistic understanding and personalized oncology. Altogether, our study bridges the gap between the mechanistic regulation of immune checkpoints and their clinical utility, offering novel insights into how the epitranscriptome can be leveraged to improve cancer prognosis and potentially enhance immunotherapeutic strategies.

CP Rigopoulos, M Gkoris, I Georgakopoulos-Soares, I Boulalas, A Zaravinos

Int J Mol Sci, 2025

DOI

Darling (v2.0): Mining disease-related databases for the detection of biomedical entity associations

Darling is a web application that employs literature mining to detect disease-related biomedical entity associations. Darling can detect sentence-based cooccurrences of biomedical entities such as genes, proteins, chemicals, functions, tissues, diseases, environments, and phenotypes from biomedical literature found in six disease-centric databases. In this version, we deploy additional query channels focusing on COVID-19, GWAS studies, cardiovascular, neurodegenerative, and cancer diseases. Compared to its predecessor, users now have extended query options including searches with PubMed identifiers, disease records, entity names, titles, single nucleotide polymorphisms, or the Entrez syntax. Furthermore, after applying named entity recognition, one can retrieve and mine the relevant literature from recognized terms for a free input text. Term associations are captured in customizable networks which can be further filtered by either term or co-occurrence frequency and visualized in 2D as weighted graphs or in 3D as multi-layered networks. The fetched terms are organized in searchable tables and clustered annotated documents. The reported genes can be further analyzed for functional enrichment using external applications called from within Darling. The Darling databases, including terms and their associations, are updated annually. Darling is available at: https://www.darling-miner.org/.

FA Baltoumas, E Karatzas, NK Venetsianou, E Aplakidou, K Giatras, MN Chasapi, IN Chasapi, I Iliopoulos, VA Iconomidou, IP Trougakos, F Psomopoulos, A Giannakakis, I Georgakopoulos-Soares, P Kontou, PG Bagos, GA Pavlopoulos

Comput Struct Biotechnol J., 2025

DOI

Representation and inclusion among members and affiliates of the Society for Epidemiologic Research: findings from the 2021 diversity and inclusion survey

Representation and inclusion are stated priorities for many scientific and professional organizations, including the Society for Epidemiologic Research (SER), which was founded in 1967 with the intention of bringing together epidemiologists across career stages and specialties. Representation and inclusion are necessary for fostering safe and equitable educational and professional environments, recruiting future generations of researchers and practitioners, and addressing critical public health questions. However, there has been persistent underrepresentation and systemic exclusion of marginalized groups in the sciences, including epidemiology, which are symptoms of interpersonal and structural racism, classism, sexism, ableism, heteronormativity, religious-based discrimination, and other dimensions of marginalization. To advance SER’s goals, the Diversity and Inclusion Committee has sought to characterize representation and inclusion among SER members and affiliates through surveys conducted in 2018 and 2021. In this study, we assessed trends in representation within SER, made comparisons with relevant benchmarks, and discussed barriers to inclusion. In the 2018 baseline survey, many groups were underrepresented relative to the US population, particularly transgender individuals, Black/African American and Hispanic/Latinx people, and first-generation college students. Moreover, women and people with certain racial/ethnic and religious identities were less likely to participate in SER activities or to report feeling welcomed. This letter provides primary findings from an updated assessment of representation and inclusion among SER members and affiliates and situates the experiences of SER members in broader literatures on diversity and inclusion. The complete report, including a more detailed discussion of recommendations informed in part by survey respondents, is available on the SER website and eScholarship Repository.

DJX González, BS Staley, SB Andrea, EA DeVilbiss, DS Fink, C Peña, DM Reed, MVD Santana, LO Fasehun, AJ Alvero, O Babalola, V Puac-Polanco, CA Thompson, CL Frankenfeld, L Fernández-Rhodes, DS Lopez, HSA Magid, Society for Epidemiologic Research Diversity and Inclusion Committee

Am J Epidemiol, 2025

DOI

Evolutionary dynamics of predicted G-quadruplexes in human and other great apes

Background: G-quadruplexes (G4s) are non-canonical DNA structures that can form at approximately 1% of the human genome. They facilitate genomic instability by increasing point mutations and structural variation. Numerous G4s participate in telomere maintenance and regulating transcription and replication, and evolve under purifying selection. Despite these important functions, G4s have remained under-studied in human and ape genomes due to incomplete assemblies. Results: Here, we conduct a comprehensive analysis of predicted G4s (pG4s) in the recently released, telomere-to-telomere (T2T) genomes of human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan. We annotate 41,232–174,442 new pG4s in these T2T genomes compared to previous ape genome assemblies (5%–21% increase). Analyzing inter-species whole-genome alignments, we identify pG4s shared across apes (approximately one-third of all pG4s) and thousands of species-specific pG4s. pG4s accumulate and diverge at rates consistent with divergence times between species, following a molecular clock. pG4s shared across apes are enriched and hypomethylated at regulatory regions (enhancers, promoters, UTRs, and origins of replication), suggesting their conserved formation and functions. Species-specific pG4s (constituting 11%–27% of all pG4s) are located in regulatory regions, potentially contributing to adaptations, and in repeats, likely driving genome expansions. Conclusions: Our findings illuminate the evolutionary dynamics of G4s, conservation of their role in gene regulation, and their contributions to ape genome evolution. Our study highlights the utility of high-resolution T2T genomes in revealing elusive yet likely functionally relevant genomic features previously hidden by incomplete assemblies.
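
As a rough illustration of what "predicted G4" can mean in practice, the sketch below scans a sequence for the widely used canonical motif of four runs of at least three guanines separated by loops of 1 to 7 bases. The abstract does not say which predictor the study used, so the motif, the loop bounds, and the function name `find_pg4s` are assumptions made for this example only.

```python
import re

# Canonical G4 motif (an assumption here, not necessarily the study's
# predictor): four G-runs of length >= 3 separated by loops of 1-7 bases.
G4_MOTIF = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_pg4s(seq):
    """Return (start, end) half-open intervals of non-overlapping motif
    matches on the given strand of seq."""
    seq = seq.upper()
    return [(m.start(), m.end()) for m in G4_MOTIF.finditer(seq)]


hits = find_pg4s("TTGGGAGGGTGGGAGGGTT")
no_hits = find_pg4s("ACGTACGTACGT")
```

This only searches the strand it is given; a hit on the opposite strand would appear as the same pattern with C in place of G, and would need a second pass over the reverse complement.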

SK Mohanty, F Chiaromonte, KD Makova

Genome Biol, 2025

DOI

Trans-amplifying mRNA vaccine expressing consensus spike elicits broad neutralization of SARS-CoV-2 variants

SARS-CoV-2 continues to evolve and evade vaccine immunity, necessitating vaccines that offer broad protection across variants. Conventional mRNA vaccines face cost and scalability challenges, prompting the exploration of alternative platforms like trans-amplifying (TA) mRNA that offer advantages in safety, manufacturability, and antigen dose optimization. Using a consensus sequence of immunodominant antigens is a promising antigen design strategy for broad cross-protection. Combining these two features, we designed and evaluated a TA mRNA vaccine encoding a consensus spike protein from SARS-CoV-2. Mice receiving the TA mRNA vaccine produced neutralizing antibody levels comparable to a conventional mRNA vaccine using 40 times less antigen mRNA. In hACE2 transgenic mice challenged with the Omicron BA.1 variant, the TA mRNA vaccine reduced lung viral titers by over 10-fold and induced broadly cross-neutralizing antibodies against multiple variants. These findings highlight the potential of TA mRNA vaccines with consensus antigen design to improve efficacy and adaptability against SARS-CoV-2 variants.

A Gontu, S Misra, SK Chothe, S Ramasamy, P Jakka, M Byukusenge, LC LaBella, MS Nair, BM Jayarao, M Archetti, RH Nissly, SV Kuchipudi

npj Vaccines, 2025

DOI

ZSeeker: an optimized algorithm for Z-DNA detection in genomic sequences

Z-deoxyribonucleic acid (Z-DNA) is an alternative left-handed DNA structure with a zigzag-shaped backbone that differs from the right-handed canonical B-DNA helix. Z-DNA has been implicated in various biological processes, including transcription, replication, and DNA repair, and can induce genetic instability. Repetitive sequences of alternating purines and pyrimidines have the potential to adopt Z-DNA structures. ZSeeker is a novel computational tool developed for the accurate detection of potential Z-DNA-forming sequences in genomes, addressing key limitations of prior methods, such as computational inefficiency, difficult interpretability and usability, and lack of experimentally generated data. By introducing a novel methodology informed and validated by experimental data, ZSeeker enables the refined detection of potential Z-DNA-forming sequences. Built both as a standalone Python package and as an accessible web interface, ZSeeker allows users to input genomic sequences, adjust detection parameters, and view potential Z-DNA sequence distributions and Z-scores via downloadable visualizations. Our web platform provides a no-code solution for Z-DNA identification, with a focus on accessibility, user-friendliness, speed, and customizability. By providing efficient, high-throughput analysis, and enhanced detection accuracy, ZSeeker has the potential to support significant advancements in understanding the roles of Z-DNA in normal cellular functions, genetic instability, and its implications in human diseases.
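
To make the sequence property mentioned above concrete, the following sketch lists stretches in which purines (A/G) and pyrimidines (C/T) strictly alternate. It is a deliberately simple stand-in and does not reproduce ZSeeker's scoring or its experimentally informed parameters; the function names and the `min_len` threshold are hypothetical choices for illustration.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def alternates(a, b):
    """True if one base is a purine and the other a pyrimidine."""
    return (a in PURINES and b in PYRIMIDINES) or (
        a in PYRIMIDINES and b in PURINES
    )


def alternating_runs(seq, min_len=8):
    """Return (start, end) half-open intervals of maximal runs of strictly
    alternating purine/pyrimidine bases that are at least min_len long."""
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # Close the current run at the end of the sequence or when two
        # neighbouring bases fail to alternate (ambiguous bases such as N
        # never alternate and therefore break a run).
        if i == len(seq) or not alternates(seq[i - 1], seq[i]):
            if i - start >= min_len:
                runs.append((start, i))
            start = i
    return runs


runs = alternating_runs("TTCGCGCGCGAA", min_len=8)
```

A real detector would also weigh which dinucleotides are involved, since not every alternating purine-pyrimidine stretch is equally prone to adopt the Z conformation.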

G Wang, I Mouratidis, K Provatas, N Chantzi, M Patsakis, I Georgakopoulos-Soares, KM Vasquez

Brief Bioinform, 2025

DOI

MAFcounter: an efficient tool for counting the occurrences of k-mers in MAF files

Motivation: With the rapid expansion of large-scale biological datasets, DNA and protein sequence alignments have become essential for comparative genomics and proteomics. These alignments facilitate the exploration of sequence similarity patterns, providing valuable insights into sequence conservation and evolutionary relationships and supporting functional analyses. Typically, sequence alignments are stored in formats such as the Multiple Alignment Format (MAF). Counting k-mer occurrences is a crucial task in many computational biology applications, but currently, there is no algorithm designed for k-mer counting in alignment files. Results: We have developed MAFcounter, the first k-mer counter dedicated to alignment files. MAFcounter is multithreaded, fast, and memory efficient, enabling k-mer counting in DNA and protein sequence alignment files with a wide variety of features for k-mer analysis. Availability: MAFcounter is released under the GPL license as a suite of binary C++ applications and is available at: https://github.com/Georgakopoulos-Soares-lab/MAFcounter. Keywords: Genomics; Multiple sequence alignment; Proteomics; k-mer counting.
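
For readers unfamiliar with the task, here is a minimal single-threaded sketch of k-mer counting over a MAF file. It is not MAFcounter's algorithm or interface: it simply reads the sequence ("s") lines of each alignment block, removes gap characters, and tallies k-mers with a `Counter`. The function name `count_kmers_in_maf`, the choice to drop gaps, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers_in_maf(maf_text, k):
    """Count k-mers across all aligned sequences in MAF-formatted text."""
    counts = Counter()
    for line in maf_text.splitlines():
        fields = line.split()
        # MAF sequence lines have seven fields:
        # s  src  start  size  strand  srcSize  text
        if len(fields) != 7 or fields[0] != "s":
            continue
        # Remove alignment gaps and normalise case before counting.
        seq = fields[6].replace("-", "").upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy alignment block with made-up coordinates.
example = """##maf version=1
a score=10
s hg38.chr1 0 6 + 100 ACG-TAC
s panTro6.chr1 5 5 + 90 ACG-TA-
"""
kmers = count_kmers_in_maf(example, 3)
```

Dropping gaps means a k-mer may span an alignment gap; a tool could instead split sequences at gaps or count per species, depending on the analysis.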

M Patsakis, K Provatas, A Karatzikos, C Koilakos, I Mouratidis, I Georgakopoulos-Soares

BMC Bioinformatics, 2025

DOI

Letter: Are Antispasmodics Truly Ineffective in IBD? Considerations on Nuanced Interpretation and Stratified Analysis. Authors' Reply

This article relates to The Impact of Antispasmodic Use on Abdominal Pain and Opioid Use in Inflammatory Bowel Disease: A Population-Based Study

C Khunsriraksakul, O Ziegler, DJ Liu, AS Kulaylat, MD Coates

Aliment Pharmacol Ther, 2025

DOI

The Impact of Antispasmodic Use on Abdominal Pain and Opioid Use in Inflammatory Bowel Disease: A Population-Based Study

Background: Patients with inflammatory bowel disease (IBD) are often prescribed antispasmodics for chronic abdominal pain. Large-scale data regarding efficacy and impact on clinical outcomes are lacking. Aim: To examine the association between antispasmodic use and outcomes of abdominal pain and opioid use before and after propensity matching key demographic and clinical characteristics. Methods: We used TriNetX Diamond Network, a medical and claims database. Patients were stratified by baseline abdominal pain and opioid use. Secondary outcomes were corticosteroid use, IBD-related complications and surgeries, emergency room (ER) visits, hospitalisation and mortality. Results: We included 85,859 patients (median age 50; 53.8% female) with IBD; 5661 used antispasmodics. On follow-up, those with antispasmodic use had higher rates of abdominal pain and opioid use (p < 0.001) regardless of baseline abdominal pain or opioid use. After matching, 5629 patients remained per group. Patients who used antispasmodics had higher rates of abdominal pain at 1 month, regardless of baseline abdominal pain. Opioid-naïve patients who used antispasmodics had higher rates of opioid use at follow-up (1.1% vs. 0.2%; p < 0.001). The likelihood of corticosteroid use, clinic visits, ER visits and hospitalisation was higher in those with antispasmodic use. No differences in IBD-related complications, surgery or mortality were observed. Conclusions: Antispasmodic use in patients with IBD was associated with increased abdominal pain and opioid use in opioid-naïve patients. Antispasmodic use was associated with increased likelihood of corticosteroid use, clinic and ER visits and hospitalisation.

C Khunsriraksakul, O Ziegler, DJ Liu, AS Kulaylat, MD Coates

Aliment Pharmacol Ther, 2025

DOI

Catalytic-Dependent Role of DNA Polymerase κ in Nucleotide Excision Repair

DNA polymerase kappa (pol κ) is an error-prone Y-family polymerase primarily associated with translesion DNA synthesis (TLS), a DNA damage tolerance mechanism that prevents replication fork stalling. Pol κ has been implicated in other DNA repair and tolerance pathways such as nucleotide excision repair (NER). However, the role of error-prone pol κ in the NER pathway remains unclear. We sought to investigate whether pol κ has a catalytic role in NER by using the pol κ selective nucleoside analogue, N2-(4-ethynylbenzyl)-2’-deoxyguanosine (EBndG). Here, we identified robust, cell cycle-independent catalytic activity of pol κ in cells not treated with DNA-damaging agents. We found that pol κ catalytic activity was reduced by approximately 40% with loss of either XPC or XPA, but not CSB, indicating pol κ has a role in global genome-NER. We monitored pol κ catalytic activity after treatment with benzo(a)pyrene diol epoxide and UVB radiation, and we observed that pol κ catalytic activity increased in an NER-dependent manner. Our study highlights that pol κ is consistently active in cells and possesses a key catalytic role in NER.

A Rebok, MC Torres, JR Ambrose, TE Spratt

Chem Res Toxicol, 2025

DOI

Massively parallel reporter assays and mouse transgenic assays provide correlated and complementary information about neuronal enhancer activity

High-throughput massively parallel reporter assays (MPRAs) and phenotype-rich in vivo transgenic mouse assays are two potentially complementary ways to study the impact of noncoding variants associated with psychiatric diseases. Here, we investigate the utility of combining these assays. Specifically, we carry out an MPRA in induced human neurons on over 50,000 sequences derived from fetal neuronal ATAC-seq datasets and enhancers validated in mouse assays. We also test the impact of over 20,000 variants, including synthetic mutations and 167 common variants associated with psychiatric disorders. We find a strong and specific correlation between MPRA and mouse neuronal enhancer activity. Four out of five tested variants with significant MPRA effects affected neuronal enhancer activity in mouse embryos. Mouse assays also reveal pleiotropic variant effects that could not be observed in MPRA. Our work provides a catalog of functional neuronal enhancers and variant effects and highlights the effectiveness of combining MPRAs and mouse transgenic assays.

M Kosicki, DL Cintrón, P Keukeleire, M Schubach, NF Page, I Georgakopoulos-Soares, JA Akiyama, I Plajzer-Frick, CS Novak, M Kato, RD Hunter, K von Maydell, S Barton, P Godfrey, E Beckman, SJ Sanders, M Kircher, LA Pennacchio, N Ahituv

Nat Commun, 2025

DOI

Incorporating scale uncertainty in microbiome and gene expression analysis as an extension of normalization

Statistical normalizations are used in differential analyses to address sample-to-sample variation in sequencing depth. Yet normalizations make strong, implicit assumptions about the scale of biological systems, such as microbial load, leading to false positives and negatives. We introduce scale models as a generalization of normalizations, which allows researchers to model potential errors in these modeling assumptions, thereby enhancing the transparency and robustness of data analyses. In practice, scale models can drastically reduce false-positive and false-negative rates. We introduce updates to the popular ALDEx2 software package, available on Bioconductor, facilitating scale model analysis.

MP Nixon, GB Gloor, JD Silverman

Genome Biol, 2025

DOI

Infection risk of rituximab monotherapy versus combination therapy with rituximab and mycophenolic acid in systemic sclerosis: A retrospective cohort study

Key words: complex medical dermatology; immunomodulation; infection; mycophenolic acid; rituximab; systemic sclerosis

JB Kang, KN Smith, EM Meara, M Cho, JD Silverman, AH LaChance, JS Smith

JAAD, 2025

DOI

CloseRead: a tool for assessing assembly errors in immunoglobulin loci applied to vertebrate long-read genome assemblies

Despite tremendous advances in long-read sequencing, some structurally complex and repeat-rich genomic regions remain challenging to assemble. Furthermore, we lack tools to assess local assembly quality, making it hard to identify problems and assess progress. Here we develop a new approach “CloseRead” for visualizing local assembly quality and diagnosing errors using multiple metrics. We apply CloseRead to evaluate how well immunoglobulin loci, paradigmatic cases of structurally complex regions, are assembled in 74 state-of-the-art vertebrate genomes. We then show that targeted, local re-assembly can correct the specific errors identified by CloseRead, highlighting the value of an iterative approach to genome assembly.

Y Zhu, C Watson, Y Safonova, M Pennell, A Bankevich

Genome Biol, 2025

DOI

IGLoo enables comprehensive analysis and assembly of immunoglobulin heavy-chain loci in lymphoblastoid cell lines using PacBio high-fidelity reads

High-quality human genome assemblies derived from lymphoblastoid cell lines (LCLs) provide reference genomes and pangenomes for genomics studies. However, LCLs pose technical challenges for profiling immunoglobulin (IG) genes, as their IG loci contain a mixture of germline and somatically recombined haplotypes, making genotyping and assembly difficult with widely used frameworks. To address this, we introduce IGLoo, a software tool that analyzes sequence data and assemblies derived from LCLs, characterizing somatic V(D)J recombination events and identifying breakpoints and missing IG genes in the assemblies. Furthermore, IGLoo implements a reassembly framework to improve germline assembly quality by integrating information on somatic events and population structural variations in IG loci. Applying IGLoo to the assemblies from the Human Pangenome Reference Consortium, we gained valuable insights into the mechanisms, gene usage, and patterns of V(D)J recombination and the causes of assembly artifacts in the IG heavy-chain (IGH) locus, and we improved the representation of IGH assemblies.

MJ Lin, B Langmead, Y Safonova

Cell Rep Methods, 2025

DOI

invertiaDB: a database of inverted repeats across organismal genomes

Inverted repeats are repetitive elements that can form hairpin and cruciform structures. They are linked to genomic instability; however, they also have various biological functions. Their distribution differs markedly across taxonomic groups in the tree of life, and they exhibit high polymorphism due to their inherent genomic instability. Advances in sequencing technologies and declining costs have enabled the generation of an ever-growing number of complete genomes for organisms across taxonomic groups in the tree of life. However, a comprehensive database encompassing inverted repeats across diverse organismal genomes has been lacking. We present invertiaDB, the first comprehensive database of inverted repeats spanning multiple taxa, featuring repeats identified in the genomes of 118 101 organisms across all major taxonomic groups. For each organism, we derived inverted repeats with arm lengths of at least 10 bp, spacer lengths up to 8 bp, and no mismatches in the arms. The database currently hosts 34 330 450 inverted repeat sequences, serving as a centralized, user-friendly repository to perform searches and interactive visualizations, and download existing inverted repeat data for independent analysis. invertiaDB is implemented as a web portal for browsing, analyzing, and downloading inverted repeat data. invertiaDB is publicly available at https://invertiadb.netlify.app/homepage.html.

K Provatas, N Chantzi, N Amptazi, M Patsakis, A Nayak, I Mouratidis, A Zaravinos, GA Pavlopoulos, I Georgakopoulos-Soares

Nucleic Acids Res, 2025

DOI

Replicative DNA polymerase epsilon and delta holoenzymes show wide-ranging inhibition at G-quadruplexes in the human genome

G-quadruplexes (G4s) are functional elements of the human genome, some of which inhibit DNA replication. We investigated replication of G4s within highly abundant microsatellite (GGGA, GGGT) and transposable element (L1 and SVA) sequences. We found that genome-wide, numerous motifs are located preferentially on the replication leading strand and the transcribed strand templates. We directly tested replicative polymerase ϵ and δ holoenzyme inhibition at these G4s, compared to low abundant motifs. For all G4s, DNA synthesis inhibition was higher on the G-rich than C-rich strand or control sequence. No single G4 was an absolute block for either holoenzyme; however, the inhibitory potential varied over an order of magnitude. Biophysical analyses showed the motifs form varying topologies, but replicative polymerase inhibition did not correlate with a specific G4 structure. Addition of the G4 stabilizer pyridostatin severely inhibited forward polymerase synthesis specifically on the G-rich strand, enhancing G/C strand asynchrony. Our results reveal that replicative polymerase inhibition at every G4 examined is distinct, causing complementary strand synthesis to become asynchronous, which could contribute to slowed fork elongation. Altogether, we provide critical information regarding how replicative eukaryotic holoenzymes navigate synthesis through G4s naturally occurring thousands of times in functional regions of the human genome.

SE Hile, MH Weissensteiner, KG Pytko, J Dahl, E Kejnovsky, I Kejnovská, M Hedglin, I Georgakopoulos-Soares, KD Makova, KA Eckert

Nucleic Acids Res, 2025

DOI

Complete sequencing of ape genomes

We present haplotype-resolved reference genomes and comparative analyses of six ape species, namely: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. We achieve chromosome-level contiguity with unparalleled sequence accuracy (<1 error in 500,000 base pairs), completely sequencing 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, providing more in-depth evolutionary insights. Comparative analyses, including human, allow us to investigate the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference. This includes newly minted gene families within lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes, and subterminal heterochromatin. This resource should serve as a definitive baseline for all future evolutionary studies of humans and our closest living ape relatives.

D Yoo, A Rhie, P Hebbar, F Antonacci, GA Logsdon, SJ Solar, D Antipov, BD Pickett, Y Safonova, F Montinaro, Y Luo, J Malukiewicz, JM Storer, J Lin, AN Sequeira, RJ Mangan, G Hickey, GM Anez, P Balachandran, A Bankevich, CR Beck, A Biddanda, M Borchers, GG Bouffard, E Brannan, SY Brooks, L Carbone, L Carrel, AP Chan, J Crawford, M Diekhans, E Engelbrecht, C Feschotte, G Formenti, GH Garcia, L de Gennaro, D Gilbert, RE Green, A Guarracino, I Gupta, D Haddad, J Han, RS Harris, GA Hartley, WT Harvey, M Hiller, K Hoekzema, ML Houck, H Jeong, K Kamali, M Kellis, B Kille, C Lee, Y Lee, W Lees, AP Lewis, Q Li, M Loftus, YHE Loh, H Loucks, J Ma, Y Mao, JFI Martinez, P Masterson, RC McCoy, B McGrath, S McKinney, BS Meyer, KH Miga, SK Mohanty, KM Munson, K Pal, M Pennell, PA Pevzner, D Porubsky, T Potapova, FR Ringeling, JL Rocha, OA Ryder, S Sacco, S Saha, T Sasaki, MC Schatz, NJ Schork, C Shanks, L Smeds, DR Son, C Steiner, AP Sweeten, MG Tassia, F Thibaud-Nissen, E Torres-González, M Trivedi, W Wei, J Wertz, M Yang, P Zhang, S Zhang, Y Zhang, Z Zhang, SA Zhao, Y Zhu, ED Jarvis, JL Gerton, I Rivas-González, B Paten, ZA Szpiech, CD Huber, TL Lenz, MK Konkel, SV Yi, S Canzar, CT Watson, PH Sudmant, E Molloy, E Garrison, CB Lowe, M Ventura, RJ O’Neill, S Koren, KD Makova, AM Phillippy, EE Eichler

Nature, 2025

DOI

MyD88-mediated signaling in intestinal fibroblasts regulates macrophage antimicrobial defense and prevents dysbiosis in the gut

Fibroblasts that reside in the gut mucosa are among the key regulators of innate immune cells, but their role in the regulation of the defense functions of macrophages remains unknown. MyD88 is suggested to shape fibroblast responses in the intestinal microenvironment. We found that mice lacking MyD88 in fibroblasts showed a decrease in the colonic antimicrobial defense, developing dysbiosis and aggravated dextran sulfate sodium (DSS)-induced colitis. These pathological changes were associated with the accumulation of Arginase 1+ macrophages with low antimicrobial defense capability. Mechanistically, the production of interleukin (IL)-6 and CCL2 downstream of MyD88 was critically involved in fibroblast-mediated support of macrophage antimicrobial function, and IL-6/CCL2 neutralization resulted in the generation of macrophages with decreased production of the antimicrobial peptide cathelicidin and impaired bacterial clearance. Collectively, these findings revealed a critical role of fibroblast-intrinsic MyD88 signaling in regulating macrophage antimicrobial defense under colonic homeostasis, and its disruption results in dysbiosis, predisposing the host to the development of intestinal inflammation.

M Chulkina, H Tran, G Uribe, SB McAninch, C McAninch, A Seideneck, B He, M Lanza, K Khanipov, G Golovko, DW Powell, ER Davenport, IV Pinchuk

Cell Rep, 2025

DOI

Transcriptome signatures of the medial prefrontal cortex underlying GABAergic control of resilience to chronic stress exposure

Analyses of postmortem human brains and preclinical studies of rodents have identified somatostatin (SST)-positive, dendrite-targeting GABAergic interneurons as key elements that regulate the vulnerability to stress-related psychiatric disorders. Conversely, genetically induced disinhibition of SST neurons (induced by Cre-mediated deletion of the γ2 GABAA receptor subunit gene selectively from SST neurons, SSTCre:γ2f/f mice) results in stress resilience. Similarly, chronic chemogenetic activation of SST neurons in the medial prefrontal cortex (mPFC) results in stress resilience but only in male and not in female mice. Here, we used RNA sequencing of the mPFC of SSTCre:γ2f/f mice to characterize the transcriptome changes underlying GABAergic control of stress resilience. We found that stress resilience of male but not female SSTCre:γ2f/f mice is characterized by resilience to chronic stress-induced transcriptome changes in the mPFC. Interestingly, the transcriptome of non-stressed SSTCre:γ2f/f (stress-resilient) male mice resembled that of chronic stress-exposed SSTCre (stress-vulnerable) mice. However, the behavior and the serum corticosterone levels of non-stressed SSTCre:γ2f/f mice showed no signs of physiological stress. Most strikingly, chronic stress exposure of SSTCre:γ2f/f mice was associated with an almost complete reversal of their chronic stress-like transcriptome signature, along with pathway changes suggesting stress-induced enhancement of mRNA translation. Behaviorally, the SSTCre:γ2f/f mice were not only resilient to chronic stress-induced anhedonia — they also showed an inversed, anxiolytic-like behavioral response to chronic stress exposure that mirrored the chronic stress-induced reversal of the chronic stress-like transcriptome signature. 
We conclude that GABAergic dendritic inhibition by SST neurons exerts bidirectional control over behavioral vulnerability and resilience to chronic stress exposure that is mirrored in bidirectional changes in the expression of putative stress resilience genes, through a sex-specific brain substrate.

M Shao, J Botvinov, D Banerjee, S Girirajan, B Lüscher

Mol Psychiatry, 2025

DOI

Performance of qpAdm-based screens for genetic admixture on graph–shaped histories and stepping stone landscapes

qpAdm is a statistical tool that is often used for testing large sets of alternative admixture models for a target population. Despite its popularity, qpAdm remains untested on 2D stepping stone landscapes and in situations with low prestudy odds (low ratio of true to false models). We tested high-throughput qpAdm protocols with typical properties such as number of source combinations per target, model complexity, model feasibility criteria, etc. Those protocols were applied to admixture graph–shaped and stepping stone simulated histories sampled randomly or systematically. We demonstrate that false discovery rates of high-throughput qpAdm protocols exceed 50% for many parameter combinations since: (1) prestudy odds are low and fall rapidly with increasing model complexity; (2) complex migration networks violate the assumptions of the method; hence, there is poor correlation between qpAdm P-values and model optimality, contributing to low but nonzero false-positive rate and low power; and (3) although admixture fraction estimates between 0 and 1 are largely restricted to symmetric configurations of sources around a target, a small fraction of asymmetric highly nonoptimal models have estimates in the same interval, contributing to the false-positive rate. We also reinterpret large sets of qpAdm models from 2 studies in terms of source–target distance and symmetry and suggest improvements to qpAdm protocols: (1) temporal stratification of targets and proxy sources in the case of admixture graph–shaped histories, (2) focused exploration of few models for increasing prestudy odds; and (3) dense landscape sampling for increasing power and stringent conditions on estimated admixture fractions for decreasing the false-positive rate.

O Flegontova, U Işıldak, E Yüncü, MP Williams, CD Huber, J Kočí, LA Vyazov, P Changmai, P Flegontov

Genetics, 2025

DOI

Differentiating mechanism from outcome for ancestry-assortative mating in admixed human populations

Population genetic theory, and the empirical methods built upon it, often assumes that individuals pair randomly for reproduction. However, natural populations frequently violate this assumption, which may potentially confound genome-wide association studies, selection scans, and demographic inference. Within several recently admixed human populations, empirical genetic studies have reported a correlation in global ancestry proportion between spouses, referred to as ancestry-assortative mating. Here, we use forward genomic simulations to link correlations in global ancestry proportion between mates to the underlying mechanistic mate choice process. We consider the impacts of 2 types of mate choice model, using either ancestry-based preferences or social groups as the basis for mate pairing. We find that multiple mate choice models can produce the same correlations in global ancestry proportion between spouses; however, we also highlight alternative analytic approaches and circumstances in which these models may be distinguished. With this work, we seek to highlight potential pitfalls when interpreting correlations in empirical data as evidence for a particular model of human mating practices and to offer suggestions toward development of new best practices for analysis of human ancestry-assortative mating.

DJ Massey, ZA Szpiech, A Goldberg

Genetics, 2025

DOI

A multi-omics analysis of effector and resting Treg cells in pan-cancer

Regulatory T cells (Tregs) are critical for maintaining the stability of the immune system and facilitating tumor escape through various mechanisms. Resting T cells are involved in cell-mediated immunity and remain in a resting state until stimulated, while effector T cells promote immune responses. Here, we investigated the roles of two gene signatures, one for resting Tregs (FOXP3 and IL2RA) and another for effector Tregs (FOXP3, CTLA-4, CCR8 and TNFRSF9) in pan-cancer. Using data from The Cancer Genome Atlas (TCGA), The Cancer Proteome Atlas (TCPA) and Gene Expression Omnibus (GEO), we focused on the expression profile of the two signatures, the existence of single nucleotide variants (SNVs) and copy number variants (CNVs), methylation, infiltration of immune cells in the tumor and sensitivity to different drugs. Our analysis revealed that both signatures are differentially expressed across different cancer types, and correlate with patient survival. Furthermore, both types of Tregs influence important pathways in cancer development and progression, like apoptosis, epithelial-to-mesenchymal transition (EMT) and the DNA damage pathway. Moreover, a positive correlation was highlighted between the expression of gene markers in both resting and effector Tregs and immune cell infiltration in adrenocortical carcinoma, while mutations in both signatures correlated with enrichment of specific immune cells, mainly in skin melanoma and endometrial cancer. In addition, we reveal the existence of widespread CNVs and hypomethylation affecting both Treg signatures in most cancer types. Last, we identified a few correlations between the expression of CCR8 and TNFRSF9 and sensitivity to several drugs, including COL-3, Chlorambucil and GSK1070916, in pan-cancer. Overall, these findings highlight new evidence that both Treg signatures are crucial regulators of cancer progression, providing potential clinical outcomes for cancer therapy.

AM Chalepaki, M Gkoris, I Chondrou, M Kourti, I Georgakopoulos-Soares, A Zaravinos

Comput Biol Med, 2025

DOI

A data-driven personalized approach to predict blood glucose levels in type-1 diabetes patients exercising in free-living conditions

Objective: The development of new technologies has generated vast amounts of data that can be analyzed to better understand and predict the glycemic behavior of people living with type 1 diabetes. This paper aims to assess whether a data-driven approach can accurately and safely predict blood glucose levels in patients with type 1 diabetes exercising in free-living conditions. Methods: Multiple machine learning (XGBoost, Random Forest) and deep learning (LSTM, CNN-LSTM, Dual-encoder with Attention layer) regression models were considered. Each deep-learning model was implemented twice: first, as a personalized model trained solely on the target patient’s data, and second, as a fine-tuned model of a population-based training model. The datasets used for training and testing the models were derived from the Type 1 Diabetes Exercise Initiative (T1DEXI). A total of 79 patients in T1DEXI met our inclusion criteria. Our models used various features related to continuous glucose monitoring, insulin pumps, carbohydrate intake, exercise (intensity and duration), and physical activity-related information (steps and heart rate). This data was available for four weeks for each of the 79 included patients. Three prediction horizons (10, 20, and 30 min) were tested and analyzed. Results: For each patient, there always exists either a machine learning or a deep learning model that conveniently predicts BGLs for up to 30 min. The best performing model differs from one patient to another. When considering the best performing model for each patient, the median and the mean Root Mean Squared Error (RMSE) values (across the 79 patients) for predictions made 10 min ahead were 6.99 mg/dL and 7.46 mg/dL, respectively. For predictions made 30 min ahead, the median and mean RMSE values were 16.85 mg/dL and 17.74 mg/dL, respectively.
The majority of the predictions output by the best model of each patient fell within the clinically safe zones A and B of the Clarke Error Grid (CEG), with almost no predictions falling into the unsafe zone E. The most challenging patient to predict 30 min ahead achieved an RMSE value of 32.31 mg/dL (with the corresponding best performing model). The best-predicted patient had an RMSE value of 10.48 mg/dL. Predicting blood glucose levels was more difficult during and after exercise, resulting in higher RMSE values on average. Prediction errors during and after physical activity (two hours and four hours after) generally remained within the clinical safe zones of the CEG with less than 0.5% of predictions falling into the harmful zones D and E, regardless of the exercise category. Conclusions: Data-driven approaches can accurately predict blood glucose levels in type 1 diabetes patients exercising in free-living conditions. The best-performing model varies across patients. Approaches in which a population-based model is initially trained and then fine-tuned for each individual patient generally achieve the best performance for the majority of patients. Some patients remain challenging to predict with no straightforward explanation of why a patient is more challenging to predict than another.

A Neumann, Y Zghal, MA Cremona, A Hajji, M Morin, M Rekik

Comput Biol Med, 2025

DOI

TRANSCENDENT (Transforming Research by Assessing Neuroinformatics across the Spectrum of Concussion by Embedding iNterdisciplinary Data-collection to Enable Novel Treatments): protocol for a prospective observational cohort study of concussion patients with embedded comparative effectiveness research within a network of learning health system concussion clinics in Canada

Introduction: Concussion affects over 400 000 Canadians annually, with a range of causes and impacts on health-related quality of life. Research to date has disproportionately focused on athletes, military personnel and level I trauma centre patients, and may not be applicable to the broader community. The TRANSCENDENT Concussion Research Program aims to address patient- and clinician-identified research priorities, through the integration of clinical data from patients of all ages and injury mechanisms, patient-reported outcomes and objective biomarkers across factors of intersectionality. Seeking guidance from our Community Advisory Committee will ensure meaningful patient partnership and research findings that are relevant to the wider concussion community. Methods and analysis: This prospective observational cohort study will recruit 5500 participants over 5 years from three 360 Concussion Care clinic locations across Ontario, Canada, with a subset of participants enrolling in specific objective assessments including testing of autonomic function, exercise tolerance, vision, advanced neuroimaging and fluid biomarkers. Analysis will be predicated on pre-specified research questions, and data shared with the Ontario Brain Institute’s Brain-CODE database. This work will represent one of the largest concussion databases to date, and by sharing it, we will advance the field of concussion and prevent siloing within brain health research.

R Zemek, LM Albrecht, S Johnston, J Leddy, AA Ledoux, N Reed, N Silverberg, K Yeates, M Lamoureux, C Anderson, N Barrowman, MH Beauchamp, K Chen, A Chintoh, A Cortel-LeBlanc, M Cortel-LeBlanc, DJ Corwin, S Cowle, K Dalton, J Dawson, A Dodd, KE Emam, C Emery, E Fox, P Fuselli, IJ Gagnon, C Giza, S Hicks, DR Howell, SA Kutcher, C Lalonde, RC Mannix, CL Master, AR Mayer, MH Osmond, R Robillard, KJ Schneider, P Tanuseputro, I Terekhov, R Webster, CL Wellington, TRANSCENDENT Concussion Integrated Discovery Program

BMJ Open, 2025

DOI

MPRAbase: a Massively Parallel Reporter Assay database

Massively parallel reporter assays (MPRAs) represent a set of high-throughput technologies that measure the functional effects of thousands of sequences/variants on gene regulatory activity. There are several different variations of MPRA technology and they are used for numerous applications, including regulatory element discovery, variant effect measurement, saturation mutagenesis, synthetic regulatory element generation or characterization of evolutionary gene regulatory differences. Despite their many designs and uses, there is no comprehensive database that incorporates the results of these experiments. To address this, we developed MPRAbase, a manually curated database that currently harbors 130 experiments, encompassing 17,718,677 elements tested across 35 cell types and 4 organisms. The MPRAbase web interface serves as a centralized user-friendly repository to examine online the activity of regulatory elements across cell types and organisms, and to download MPRA data for independent analysis.

J Zhao, FA Baltoumas, MA Konnaris, I Mouratidis, Z Liu, J Sims, V Agarwal, GA Pavlopoulos, I Georgakopoulos-Soares, N Ahituv

Genome Res, 2025

DOI

Non-canonical DNA in human and other ape telomere-to-telomere genomes

Non-canonical (non-B) DNA structures, e.g., bent DNA, hairpins, G-quadruplexes (G4s), Z-DNA, etc., which form at certain sequence motifs (e.g., A-phased repeats, inverted repeats, etc.), have emerged as important regulators of cellular processes and drivers of genome evolution. Yet, they have been understudied due to their repetitive nature and potentially inaccurate sequences generated with short-read technologies. Here we comprehensively characterize such motifs in the long-read telomere-to-telomere (T2T) genomes of human, bonobo, chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. Non-B DNA motifs are enriched at the genomic regions added to T2T assemblies, and occupy 9-15%, 9-11%, and 12-38% of autosomes, and chromosomes X and Y, respectively. G4s and Z-DNA are enriched at promoters and enhancers, as well as at origins of replication. Repetitive sequences harbor more non-B DNA motifs than non-repetitive sequences, especially in the short arms of acrocentric chromosomes. Most centromeres and/or their flanking regions are enriched in at least one non-B DNA motif type, consistent with a potential role of non-B structures in determining centromeres. Our results highlight the uneven distribution of predicted non-B DNA structures across ape genomes and suggest their novel functions in previously inaccessible genomic regions.

L Smeds, K Kamali, I Kejnovská, E Kejnovský, F Chiaromonte, KD Makova

Nucleic Acids Res, 2025

DOI

An atlas of single-cell eQTLs dissects autoimmune disease genes and identifies novel drug classes for treatment

Most variants identified from genome-wide association studies (GWASs) are non-coding and regulate gene expression. However, many risk loci fail to colocalize with expression quantitative trait loci (eQTLs), potentially due to limited GWAS and eQTL analysis power or cellular heterogeneity. Population-scale single-cell RNA-sequencing (scRNA-seq) datasets are emerging, enabling mapping of eQTLs in different cell types (sc-eQTLs). Compared to eQTL data from bulk tissues (bk-eQTLs), sc-eQTL datasets are smaller. We propose a joint model of bk-eQTLs as a weighted sum of sc-eQTLs (JOBS) from constituent cell types to improve power. Applying JOBS to OneK1K and eQTLGen data, we identify 586% more eQTLs, matching the power of 4× the sample sizes of OneK1K. Integrating sc-eQTLs with GWAS data creates an atlas for 14 immune-mediated disorders, colocalizing 29.9% or 32.2% more loci than using sc-eQTL or bk-eQTL alone. Extending JOBS, we develop a drug-repurposing pipeline and identify novel drugs validated by real-world data.

L Wang, H Markus, D Chen, S Chen, F Zhang, S Gao, C Khunsriraksakul, F Chen, N Olsen, G Foulke, B Jiang, L Carrel, DJ Liu

Cell Genom, 2025

DOI

Family Functioning and Pubertal Maturation in Hispanic/Latino Children from the HCHS/SOL Youth

Previous studies have examined the association between family dysfunction and pubertal timing in adolescent girls. However, the evidence is lacking on the role of family dysfunction during sensitive developmental periods in both boys and girls from racial and ethnic minority groups. This study aimed to determine the effect of family dysfunction on the timing of pubertal maturation among US Hispanic/Latino children and adolescents. Participants were 1466 youths (50% female; ages 8-16 years) from the Hispanic Community Children’s Health Study/Study of Latino Youth (SOL Youth). Pubertal maturation was measured using self-administered Pubertal Development Scale (PDS) items for boys and girls. Family dysfunction included measures of single-parent family structure, unhealthy family functioning, low parental closeness, and neglectful parenting style. We used multivariable ordinal logistic and linear regression analyses to examine the associations between family dysfunction and pubertal maturation (individual and cumulative measures), with adjustment for childhood BMI and socioeconomic factors, design effects (strata and clustering), and sample weights. Multivariable models of individual PDS items showed that family dysfunction was negatively associated with growth in height (OR = 0.66, 95% CI: 0.44, 0.99) in girls; no associations were found in boys. In the assessment of cumulative PDS scores, family dysfunction was associated with a lower average pubertal maturation score (b = -0.63, 95% CI: -1.21, -0.05) in boys, while no associations were found in girls. Pubertal timing lies at the intersection of associations between childhood adversity and adult health and warrants further investigation to understand the factors affecting timing and differences across sex and sociocultural background.

AK April-Sanders, P Tehranifar, MB Terry, DM Crookes, CR Isasi, LC Gallo, L Fernandez-Rhodes, KM Perreira, ML Daviglus, SF Suglia

Int J Environ Res Public Health, 2025

DOI

Cystic Fibrosis Newborn Screening: A Systematic Review-Driven Consensus Guideline from the United States Cystic Fibrosis Foundation

Newborn screening for cystic fibrosis (CF) has been universal in the US since 2010; however, there is significant variation among newborn screening algorithms. Systematic reviews were used to develop seven recommendations for newborn screening program practices to improve timeliness, sensitivity, and equity in diagnosing infants with CF: (1) The CF Foundation recommends the use of a floating immunoreactive trypsinogen (IRT) cutoff over a fixed IRT cutoff; (2) The CF Foundation recommends using a very high IRT referral strategy in CF newborn screening programs whose variant panel does not include all CF-causing variants in CFTR2 or does not have a variant panel that achieves at least 95% sensitivity in all ancestral groups within the state; (3) The CF Foundation recommends that CF newborn screening algorithms should not limit CFTR variant detection to the F508del variant or variants included in the American College of Medical Genetics-23 panel; (4) The CF Foundation recommends that CF newborn screening programs screen for all CF-causing CFTR variants in CFTR2; (5) The CF Foundation recommends conducting CFTR variant screening twice weekly or more frequently as resources allow; (6) The CF Foundation recommends the inclusion of a CFTR sequencing tier following IRT and CFTR variant panel testing to improve the specificity and positive predictive value of CF newborn screening; (7) The CF Foundation recommends that both the primary care provider and the CF specialist be notified of abnormal newborn screening results. Through implementation, it is anticipated that these recommendations will result in improved sensitivity, equity, and timeliness of CF newborn screening, leading to improved health outcomes for all individuals diagnosed with CF following newborn screening and a decreased burden on families.

ME McGarry, KS Raraigh, P Farrell, F Shropshire, K Padding, C White, MC Dorley, S Hicks, CL Ren, K Tullis, D Freedenberg, QE Wafford, SE Hempstead, MA Taylor, A Faro, MK Sontag, SA McColley

Int J Neonatal Screen, 2025

DOI

A human iPSC-derived midbrain neural stem cell model of prenatal opioid exposure and withdrawal: A proof of concept study

A growing body of clinical literature has described neurodevelopmental delays in infants with chronic prenatal opioid exposure and withdrawal. Despite this, the mechanism of how opioids impact the developing brain remains unknown. Here, we developed an in vitro model of prenatal morphine exposure and withdrawal using healthy human induced pluripotent stem cell (iPSC)-derived midbrain neural progenitors in monolayer. To optimize our model, we identified that a longer neural induction and regional patterning period increases expression of canonical opioid receptors mu and kappa in midbrain neural progenitors compared to a shorter protocol (OPRM1, two-tailed t-test, p = 0.004; OPRK1, p = 0.0003). Next, we showed that the midbrain neural progenitors derived from a longer iPSC neural induction also have scant toll-like receptor 4 (TLR4) expression, a key player in neonatal opioid withdrawal syndrome pathophysiology. During morphine withdrawal, differentiating neural progenitors experience cyclic adenosine monophosphate overshoot compared to cells exposed to vehicle (p = 0.0496) and morphine exposure conditions (p = 0.0136, 1-way ANOVA). Finally, we showed that morphine exposure and withdrawal alter proportions of differentiated progenitor cell fates (2-way ANOVA, F = 16.05, p < 0.0001). Chronic morphine exposure increased proportions of nestin positive progenitors (p = 0.0094), and decreased proportions of neuronal nuclear antigen positive neurons (NEUN) (p = 0.0047) compared to those exposed to vehicle. Morphine withdrawal decreased proportions of glial fibrillary acidic protein positive cells of astrocytic lineage (p = 0.044), and increased proportions of NEUN-positive neurons (p < 0.0001) compared to those exposed to morphine only. Applications of this paradigm include mechanistic studies underscoring neural progenitor cell fate commitments in early neurodevelopment during morphine exposure and withdrawal.

R Sullivan, Q Ahrens, SL Mills-Huffnagle, IA Elcheva, SD Hicks

PLoS One, 2025

DOI

All Together Now: Data Work to Advance Privacy, Science, and Health in the Age of Synthetic Data

There is a disconnect between data practices in biomedicine and public understanding of those data practices, and this disconnect is expanding rapidly every day (with the emergence of synthetic data and digital twins and more widely adopted Artificial Intelligence (AI)/Machine Learning tools). Transparency alone is insufficient to bridge this gap. Concurrently, there is an increasingly complex landscape of laws, regulations, and institutional/programmatic policies to navigate when engaged in biocomputing and digital health research, which makes it increasingly difficult for those wanting to ‘get it right’ or ‘do the right thing.’ Mandatory data protection obligations vary widely, sometimes focused on the type of data (and nuanced definition and scope parameters), the actor/entity involved, or the residency of the data subjects. Additional challenges come from attempts to celebrate biocomputing discoveries and digital health innovations, which frequently transform fair and accurate communications into exaggerated hype (e.g., to secure financial investment in future projects or lead to more favorable tenure and promotion decisions). Trust in scientists and scientific expertise can be quickly eroded if, for example, synthetic data is perceived by the public as ‘fake data’ or if digital twins are perceived as ‘imaginary’ patients. Researchers appear increasingly aware of the scientific and moral imperative to strengthen their work and facilitate its sustainability through increased diversity and community engagement. Moreover, there is a growing appreciation for the ‘data work’ necessary to have scientific data become meaningful, actionable information, knowledge, and wisdom–not only for scientists but also for the individuals from whom those data were derived or to whom those data relate.
Equity in the process of biocomputing and equity in the distribution of benefits and burdens of biocomputing both demand ongoing development, implementation, and refinement of embedded Ethical, Legal and Social Implications (ELSI) research practices. This workshop is intended to nurture interdisciplinary discussion of these issues and to highlight the skills and competencies all too often considered ‘soft skills’ peripheral to other skills prioritized in traditional training and professional development programs. Data scientists attending this workshop will become better equipped to embed ELSI practices into their research.

L Fernández-Rhodes, JK Wagner

Biocomput, 2025

DOI

Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages

Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may have a shared genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict the disease progression risk. Via penalized regression, GPS incorporates PRS weights for case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.

C Wang, H Markus, AR Diwadkar, C Khunsriraksakul, L Carrel, B Li, X Zhong, X Wang, X Zhan, GT Foulke, NJ Olsen, DJ Liu, B Jiang

Nat Commun, 2025

DOI

funBIalign: a hierachical algorithm for functional motif discovery based on mean squared residue scores

Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical shapes or patterns that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies: one on food price inflation and one on temperature changes.

J Di Iorio, MA Cremona, F Chiaromonte

Stat Comput, 2024

DOI

Assessing Assembly Errors in Immunoglobulin Loci: A Comprehensive Evaluation of Long-read Genome Assemblies Across Vertebrates

Long-read sequencing technologies have revolutionized genome assembly producing near-complete chromosome assemblies for numerous organisms, which are invaluable to research in many fields. However, regions with complex repetitive structure continue to represent a challenge for genome assembly algorithms, particularly in areas with high heterozygosity. Robust and comprehensive solutions for the assessment of assembly accuracy and completeness in these regions do not exist. In this study we focus on the assembly of biomedically important antibody-encoding immunoglobulin (IG) loci, which are characterized by complex duplications and repeat structures. High-quality full-length assemblies for these loci are critical for resolving haplotype-level annotations of IG genes, without which, functional and evolutionary studies of antibody immunity across vertebrates are not tractable. To address these challenges, we developed a pipeline, CloseRead, that generates multiple assembly verification metrics for analysis and visualization. These metrics expand upon those of existing quality assessment tools and specifically target complex and highly heterozygous regions. Using CloseRead, we systematically assessed the accuracy and completeness of IG loci in publicly available assemblies of 74 vertebrate species, identifying problematic regions. We also demonstrated that inspecting assembly graphs for problematic regions can both identify the root cause of assembly errors and illuminate solutions for improving erroneous assemblies. For a subset of species, we were able to correct assembly errors through targeted reassembly. Together, our analysis demonstrated the utility of assembly assessment in improving the completeness and accuracy of IG loci across species.

Y Zhu, C Watson, Y Safonova, M Pennell, A Bankevich

bioRxiv, 2024

DOI

Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

Large national-level electronic health record (EHR) datasets offer new opportunities for disentangling the role of genes and environment through deep phenotype information and approximate pedigree structures. Here we use the approximate geographical locations of patients as a proxy for spatially correlated community-level environmental risk factors. We develop a spatial mixed linear effect (SMILE) model that incorporates both genetics and environmental contribution. We extract EHR and geographical locations from 257,620 nuclear families and compile 1083 disease outcome measurements from the MarketScan dataset. We augment the EHR with publicly available environmental data, including levels of particulate matter 2.5 (PM2.5), nitrogen dioxide (NO2), climate, and sociodemographic data. We refine the estimates of genetic heritability and quantify community-level environmental contributions. We also use wind speed and direction as instrumental variables to assess the causal effects of air pollution. In total, we find PM2.5 or NO2 have statistically significant causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders, where PM2.5 and NO2 tend to affect biologically distinct disease categories. These analyses showcase several robust strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets and will benefit upcoming biobank studies in the era of precision medicine.

D McGuire, H Markus, L Yang, J Xu, A Montgomery, A Berg, Q Li, L Carrel, DJ Liu, B Jiang

Nat Commun, 2024

DOI

The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes

Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution.
These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.

KD Makova, BD Pickett, RS Harris, GA Hartley, M Cechova, K Pal, S Nurk, D Yoo, Q Li, P Hebbar, BC McGrath, F Antonacci, M Aubel, A Biddanda, M Borchers, E Bomberg, GG Bouffard, SY Brooks, L Carbone, L Carrel, A Carroll, PC Chang, CS Chin, DE Cook, SJC Craig, L de Gennaro, M Diekhans, A Dutra, GH Garcia, PGS Grady, RE Green, D Haddad, P Hallast, WT Harvey, G Hickey, DA Hillis, SJ Hoyt, H Jeong, K Kamali, SLK Pond, TM LaPolice, C Lee, AP Lewis, YE Loh, P Masterson, RC McCoy, P Medvedev, KH Miga, KM Munson, E Pak, B Paten, BJ Pinto, T Potapova, A Rhie, JL Rocha, F Ryabov, OA Ryder, S Sacco, K Shafin, VA Shepelev, V Slon, SJ Solar, JM Storer, PH Sudmant, Sweetalana, A Sweeten, MG Tassia, F Thibaud-Nissen, M Ventura, MA Wilson, AC Young, H Zeng, X Zhang, ZA Szpiech, CD Huber, JL Gerton, SV Yi, MC Schatz, IA Alexandrov, S Koren, RJ O’Neill, E Eichler, AM Phillippy

Nature, 2024

DOI

Integrating single cell expression quantitative trait loci summary statistics to understand complex trait risk genes

Transcriptome-wide association study (TWAS) is a popular approach to dissect the functional consequence of disease associated non-coding variants. Most existing TWAS use bulk tissues and may not have the resolution to reveal cell-type specific target genes. Single-cell expression quantitative trait loci (sc-eQTL) datasets are emerging. The largest bulk- and sc-eQTL datasets are most conveniently available as summary statistics, but have not been broadly utilized in TWAS. Here, we present a new method EXPRESSO (EXpression PREdiction with Summary Statistics Only), to analyze sc-eQTL summary statistics, which also integrates 3D genomic data and epigenomic annotation to prioritize causal variants. EXPRESSO substantially improves existing methods. We apply EXPRESSO to analyze multi-ancestry GWAS datasets for 14 autoimmune diseases. EXPRESSO uniquely identifies 958 novel gene x trait associations, which is 26% more than the second-best method. Among them, 492 are unique to cell type level analysis and missed by TWAS using whole blood. We also develop a cell type aware drug repurposing pipeline, which leverages EXPRESSO results to identify drug compounds that can reverse disease gene expressions in relevant cell types. Our results point to multiple drugs with therapeutic potentials, including metformin for type 1 diabetes, and vitamin K for ulcerative colitis.

L Wang, C Khunsriraksakul, H Markus, D Chen, F Zhang, F Chen, X Zhan, L Carrel, DJ Liu, B Jiang

Nat Commun, 2024

DOI

Methylation profiles at birth linked to early childhood obesity

Childhood obesity represents a significant global health concern and identifying risk factors is crucial for developing intervention programs. Many ‘omics’ factors associated with the risk of developing obesity have been identified, including genomic, microbiomic, and epigenomic factors. Here, using a sample of 48 infants, we investigated how the methylation profiles in cord blood and placenta at birth were associated with weight outcomes (specifically, conditional weight gain, body mass index, and weight-for-length ratio) at age six months. We characterized genome-wide DNA methylation profiles using the Illumina Infinium MethylationEpic chip, and incorporated information on child and maternal health, and various environmental factors into the analysis. We used regression analysis to identify genes with methylation profiles most predictive of infant weight outcomes, finding a total of 23 relevant genes in cord blood and 10 in placenta. Notably, in cord blood, the methylation profiles of three genes (PLIN4, UBE2F, and PPP1R16B) were associated with all three weight outcomes, which are also associated with weight outcomes in an independent cohort suggesting a strong relationship with weight trajectories in the first six months after birth. Additionally, we developed a Methylation Risk Score (MRS) that could be used to identify children most at risk for developing childhood obesity. While many of the genes identified by our analysis have been associated with weight-related traits (e.g., glucose metabolism, BMI, or hip-to-waist ratio) in previous genome-wide association and variant studies, our analysis implicated several others, whose involvement in the obesity phenotype should be evaluated in future functional investigations.

D Lariviere, SJC Craig, IM Paul, EE Hohman, JS Savage, RO Wright, F Chiaromonte, KD Makova, ML Reimherr

J Dev Orig Health Dis, 2024

DOI

In vivo detection of DNA secondary structures using Permanganate/S1 Footprinting with Direct Adapter Ligation and Sequencing (PDAL-Seq)

DNA secondary structures are essential elements of the genomic landscape, playing a critical role in regulating various cellular processes. These structures refer to G-quadruplexes, cruciforms, Z-DNA or H-DNA structures, amongst others (collectively called ‘non-B DNA’), which DNA molecules can adopt beyond the B conformation. DNA secondary structures have significant biological roles, and their landscape is dynamic and can rearrange due to various factors, including changes in cellular conditions, temperature, and DNA-binding proteins. Understanding this dynamic nature is crucial for unraveling their functions in cellular processes. Detecting DNA secondary structures remains a challenge. Conventional methods, such as gel electrophoresis and chemical probing, have limitations in terms of sensitivity and specificity. Emerging techniques, including next-generation sequencing and single-molecule approaches, offer promise but face challenges since these techniques are mostly limited to only one type of secondary structure. Here we describe an updated version of the permanganate/S1 nuclease footprinting technique, which uses potassium permanganate to trap single-stranded DNA regions as found in non-B structures, in combination with S1 nuclease digest and adapter ligation to detect genome-wide non-B formation. To overcome technical hurdles, we combined this method with direct adapter ligation and sequencing (PDAL-Seq). Furthermore, we established a user-friendly pipeline available on Galaxy to standardize PDAL-Seq data analysis. This optimized method allows the analysis of many types of DNA secondary structures that form in a living cell and will advance our knowledge of their roles in health and disease.

A Lahnsteiner, SJC Craig, K Kamali, B Weissensteiner, B McGrath, A Risch, KD Makova

Methods in Enzymology, 2024

DOI

Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes

Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this data set resulted in several findings. First, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Second, our results suggest that BPY2 transcripts and proteins originate from separate genomic regions in bonobo versus human, which is possibly facilitated by acquiring new promoters. Third, our analysis indicates that the PRY gene family, having the highest representation of noncoding transcripts, has been undergoing pseudogenization. Fourth, we have not detected signatures of selection in the five YAG families shared among great apes, even though we identified many species-specific protein-coding transcripts. Fifth, we predicted consensus disorder regions across most gene families and species, which could be used for future investigations of male infertility. Overall, our work illuminates the YAG isoform landscape and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

M Tomaszkiewicz, K Sahlin, P Medvedev, KD Makova

Genome Biol Evol, 2023

DOI

The complete sequence of a human Y chromosome

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.

A Rhie, S Nurk, M Cechova, SJ Hoyt, DJ Taylor, N Altemose, PW Hook, S Koren, M Rautiainen, IA Alexandrov, J Allen, M Asri, AV Bzikadze, NC Chen, CS Chin, M Diekhans, P Flicek, G Formenti, A Fungtammasan, CG Giron, E Garrison, A Gershman, JL Gerton, PGS Grady, A Guarracino, L Haggerty, R Halabian, NF Hansen, R Harris, GA Hartley, WT Harvey, M Haukness, J Heinz, T Hourlier, RM Hubley, SE Hunt, S Hwang, M Jain, RK Kesharwani, AP Lewis, H Li, GA Logsdon, JK Lucas, W Makalowski, C Markovic, FJ Martin, AMM Cartney, RC McCoy, J McDaniel, BM McNulty, P Medvedev, A Mikheenko, KM Munson, TD Murphy, H Olsen, ND Olson, LF Paulin, D Porubsky, T Potapova, F Ryabov, SL Salzberg, MEG Sauria, FJ Sedlazeck, K Shafin, VA Shepelev, A Shumate, JM Storer, L Surapaneni, AMT Oill, F Thibaud-Nissen, W Timp, M Tomaszkiewicz, MR Vollger, BP Walenz, AC Watwood, MH Weissensteiner, AM Wenger, MA Wilson, S Zarate, Y Zhu, JM Zook, EE Eichler, RJ O’Neill, MC Schatz, KH Miga, KD Makova, AM Phillippy

Nature, 2023

DOI

Native American genetic ancestry and pigmentation allele contributions to skin color in a Caribbean population

Our interest in the genetic basis of skin color variation between populations led us to seek a Native American population with genetically African admixture but low frequency of European light skin alleles. Analysis of 458 genomes from individuals residing in the Kalinago Territory of the Commonwealth of Dominica showed approximately 55% Native American, 32% African, and 12% European genetic ancestry, the highest Native American genetic ancestry among Caribbean populations to date. Skin pigmentation ranged from 20 to 80 melanin units, averaging 46. Three albino individuals were determined to be homozygous for a causative multi-nucleotide polymorphism OCA2 NW273KV contained within a haplotype of African origin; its allele frequency was 0.03 and single allele effect size was –8 melanin units. Derived allele frequencies of SLC24A5 A111T and SLC45A2 L374F were 0.14 and 0.06, with single allele effect sizes of –6 and –4, respectively. Native American genetic ancestry by itself reduced pigmentation by more than 20 melanin units (range 24–29). The responsible hypopigmenting genetic variants remain to be identified, since none of the published polymorphisms predicted in prior literature to affect skin color in Native Americans caused detectable hypopigmentation in the Kalinago.

KC Ang, VA Canfield, TC Foster, TD Harbaugh, KA Early, RL Harter, KP Reid, SL Leong, Y Kawasawa, DJ Liu, JW Hawley, KC Cheng

eLife, 2023

DOI

Accurate sequencing of DNA motifs able to form alternative (non-B) structures

Approximately 13% of the human genome at certain motifs have the potential to form noncanonical (non-B) DNA structures (e.g., G-quadruplexes, cruciforms, and Z-DNA), which regulate many cellular processes but also affect the activity of polymerases and helicases. Because sequencing technologies use these enzymes, they might possess increased errors at non-B structures. To evaluate this, we analyzed error rates, read depth, and base quality of Illumina, Pacific Biosciences (PacBio) HiFi, and Oxford Nanopore Technologies (ONT) sequencing at non-B motifs. All technologies showed altered sequencing success for most non-B motif types, although this could be owing to several factors, including structure formation, biased GC content, and the presence of homopolymers. Single-nucleotide mismatch errors had low biases in HiFi and ONT for all non-B motif types but were increased for G-quadruplexes and Z-DNA in all three technologies. Deletion errors were increased for all non-B types but Z-DNA in Illumina and HiFi, as well as only for G-quadruplexes in ONT. Insertion errors for non-B motifs were highly, moderately, and slightly elevated in Illumina, HiFi, and ONT, respectively. Additionally, we developed a probabilistic approach to determine the number of false positives at non-B motifs depending on sample size and variant frequency, and applied it to publicly available data sets (1000 Genomes, Simons Genome Diversity Project, and gnomAD). We conclude that elevated sequencing errors at non-B DNA motifs should be considered in low-read-depth studies (single-cell, ancient DNA, and pooled-sample population sequencing) and in scoring rare variants. Combining technologies should maximize sequencing accuracy in future studies of non-B DNA.

MH Weissensteiner, MA Cremona, WM Guiblet, N Stoler, RS Harris, M Cechova, KA Eckert, F Chiaromonte, YF Huang, KD Makova

Genome Res, 2023

DOI

Whole-genome sequence and assembly of the Javan gibbon (Hylobates moloch)

The Javan gibbon, Hylobates moloch, is an endangered gibbon species restricted to the forest remnants of western and central Java, Indonesia, and one of the rarest of the Hylobatidae family. Hylobatids consist of 4 genera (Hoolock, Hylobates, Symphalangus, and Nomascus) that are characterized by different numbers of chromosomes, ranging from 38 to 52. The underlying cause of this karyotype plasticity is not entirely understood, at least in part, due to the limited availability of genomic data. Here we present the first scaffold-level assembly for H. moloch using a combination of whole-genome Illumina short reads, 10X Chromium linked reads, PacBio, and Oxford Nanopore long reads and proximity-ligation data. This Hylobates genome represents a valuable new resource for comparative genomics studies in primates.

M Escalona, J VanCampen, NW Maurer, M Haukness, M Okhovat, RS Harris, A Watwood, GA Hartley, RJ O’Neill, P Medvedev, KD Makova, C Vollmers, L Carbone, RE Green

J Hered, 2023

DOI

Probabilistic K-means with Local Alignment for Clustering and Motif Discovery in Functional Data

We develop a new method to locally cluster curves and discover functional motifs, that is, typical shapes that may recur several times along and across the curves capturing important local characteristics. In order to identify these shared curve portions, our method leverages ideas from functional data analysis (joint clustering and alignment of curves), bioinformatics (local alignment through the extension of high similarity seeds) and fuzzy clustering (curves belonging to more than one cluster, if they contain more than one typical shape). It can employ various dissimilarity measures and incorporate derivatives in the discovery process, thus exploiting complex facets of shapes. We demonstrate the performance of our method with an extensive simulation study, and show how it generalizes other clustering methods for functional data. Finally, we provide real data applications to Italian Covid-19 death curves and Omics data related to mutagenesis.

MA Cremona, F Chiaromonte

JCGS, 2023

DOI

Multi-ancestry and multi-trait genome-wide association meta-analyses inform clinical risk prediction for systemic lupus erythematosus

Systemic lupus erythematosus is a heritable autoimmune disease that predominantly affects young women. To improve our understanding of genetic etiology, we conduct multi-ancestry and multi-trait meta-analysis of genome-wide association studies, encompassing 12 systemic lupus erythematosus cohorts from 3 different ancestries and 10 genetically correlated autoimmune diseases, and identify 16 novel loci. We also perform transcriptome-wide association studies, computational drug repurposing analysis, and cell type enrichment analysis. We discover putative drug classes, including a histone deacetylase inhibitor that could be repurposed to treat lupus. We also identify multiple cell types enriched with putative target genes, such as non-classical monocytes and B cells, which may be targeted for future therapeutics. Using this newly assembled result, we further construct polygenic risk score models and demonstrate that integrating polygenic risk score with clinical lab biomarkers improves the diagnostic accuracy of systemic lupus erythematosus using the Vanderbilt BioVU and Michigan Genomics Initiative biobanks.

C Khunsriraksakul, Q Li, H Markus, MT Patrick, R Sauteraud, D McGuire, X Wang, C Wang, L Wang, S Chen, G Shenoy, B Li, X Zhong, NJ Olsen, L Carrel, LC Tsoi, B Jiang, DJ Liu

Nat Commun, 2023

DOI

Constructing a polygenic risk score for childhood obesity using functional data analysis

Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain—a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response. 
Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes, and it is shown how FDA approaches can be used as an alternative to the traditional GWAS.

SJC Craig, AM Kenney, J Lin, IM Paul, LL Birch, JS Savage, ME Marini, F Chiaromonte, ML Reimherr, KD Makova

Econom Stat, 2023

DOI

Variation in G-quadruplex sequence and topology differentially impacts human DNA polymerase fidelity

G-quadruplexes (G4s), a type of non-B DNA, play important roles in a wide range of molecular processes, including replication, transcription, and translation. Genome integrity relies on efficient and accurate DNA synthesis, and is compromised by various stressors, to which non-B DNA structures such as G4s can be particularly vulnerable. However, the impact of G4 structures on DNA polymerase fidelity is largely unknown. Using an in vitro forward mutation assay, we investigated the fidelity of human DNA polymerases delta (δ4, four-subunit), eta (η), and kappa (κ) during synthesis of G4 motifs representing those in the human genome. The motifs differ in sequence, topology, and stability, features that may affect DNA polymerase errors. Polymerase error rate hierarchy (δ4 < κ < η) is largely maintained during G4 synthesis. Importantly, we observed unique polymerase error signatures during synthesis of VEGF G4 motifs, stable G4s which form parallel topologies. These statistically significant errors occurred within, immediately flanking, and encompassing the G4 motif. For pol δ4, the errors were deletions, insertions and complex errors within the G4 or encompassing the G4 motif and surrounding sequence. For pol η, the errors occurred in 3’ sequences flanking the G4 motif. For pol κ, the errors were frameshift mutations within G-tracts of the G4. Because these error signatures were not observed during synthesis of an antiparallel G4 and, to a lesser extent, a hybrid G4, we suggest that G4 topology and/or stability could influence polymerase fidelity. Using in silico analyses, we show that most polymerase errors are predicted to have minimal effects on predicted G4 stability. Our results provide a unique view of G4s not previously elucidated, showing that G4 motif heterogeneity differentially influences polymerase fidelity within the motif and flanking sequences. Thus, our study advances the understanding of how DNA polymerase errors contribute to G4 mutagenesis.

ME Stein, SE Hile, MH Weissensteiner, M Lee, S Zhang, E Kejnovský, I Kejnovská, KD Makova, KA Eckert

DNA Repair, 2022

DOI

Construction and Application of Polygenic Risk Scores in Autoimmune Diseases

Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with autoimmune diseases and provided unique mechanistic insights and informed novel treatments. These individual genetic variants on their own typically confer a small effect on disease risk with limited predictive power; however, when aggregated (e.g., via polygenic risk score method), they could provide meaningful risk predictions for a myriad of diseases. In this review, we describe the recent advances in GWAS for autoimmune diseases and the practical application of this knowledge to predict an individual’s susceptibility/severity for autoimmune diseases such as systemic lupus erythematosus (SLE) via the polygenic risk score method. We provide an overview of methods for deriving different polygenic risk scores and discuss the strategies to integrate additional information from correlated traits and diverse ancestries. We further advocate for the need to integrate clinical features (e.g., anti-nuclear antibody status) with genetic profiling to better identify patients at high risk of disease susceptibility/severity even before clinical signs or symptoms develop. We conclude by discussing future challenges and opportunities of applying polygenic risk score methods in clinical care.

C Khunsriraksakul, H Markus, NJ Olsen, L Carrel, B Jiang, DJ Liu

Front Immunol, 2022

DOI
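
The aggregation step mentioned in the abstract above (many small-effect variants combined into one score) can be illustrated with a minimal sketch. It assumes the simplest additive form of a polygenic risk score, a weighted sum of allele dosages; the function name, the variant identifiers, and the choice to treat missing genotypes as dosage 0 are illustrative assumptions, not the specific methods covered in the review.

```python
from typing import Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Additive polygenic risk score: sum of weight * allele dosage.

    dosages: effect-allele count per variant (0, 1, or 2, or an imputed value).
    weights: per-variant effect size, e.g. a GWAS log odds ratio.
    Variants without a genotype are treated as dosage 0 (an assumption
    made here for simplicity; real pipelines handle missingness explicitly).
    """
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())


# Hypothetical variant IDs and weights, for illustration only.
example_weights = {"variant_a": 0.2, "variant_b": -0.1, "variant_c": 0.05}
example_dosages = {"variant_a": 2, "variant_b": 1}
example_score = polygenic_risk_score(example_dosages, example_weights)
```

In practice the raw score is usually standardized against a reference population before it is interpreted, and, as the review argues, combined with clinical features.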

Integrating 3D genomic and epigenomic data to enhance target gene discovery and drug repurposing in transcriptome-wide association studies

Transcriptome-wide association studies (TWAS) are popular approaches to test for association between imputed gene expression levels and traits of interest. Here, we propose an integrative method PUMICE (Prediction Using Models Informed by Chromatin conformations and Epigenomics) to integrate 3D genomic and epigenomic data with expression quantitative trait loci (eQTL) to more accurately predict gene expression. PUMICE helps define and prioritize regions that harbor cis-regulatory variants and outperforms competing methods. We further describe an extension to our method, PUMICE+, which jointly combines TWAS results from single- and multi-tissue models. Across 79 traits, PUMICE+ identifies 22% more independent novel genes and increases median chi-square statistics values at known loci by 35% compared to the second-best method, as well as achieves the narrowest credible interval size. Lastly, we perform computational drug repurposing and confirm that PUMICE+ outperforms other TWAS methods.

C Khunsriraksakul, D McGuire, R Sauteraud, F Chen, L Yang, L Wang, J Hughey, S Eckert, JD Weissenkampen, G Shenoy, O Marx, L Carrel, B Jiang, DJ Liu

Nat Commun, 2022

DOI

Advanced age increases frequencies of de novo mitochondrial mutations in macaque oocytes and somatic tissues

Mutations in mitochondrial DNA (mtDNA) contribute to multiple diseases. However, how new mtDNA mutations arise and accumulate with age remains understudied because of the high error rates of current sequencing technologies. Duplex sequencing reduces error rates by several orders of magnitude via independently tagging and analyzing each of the two template DNA strands. Here, using duplex sequencing, we obtained high-quality mtDNA sequences for somatic tissues (liver and skeletal muscle) and single oocytes of 30 unrelated rhesus macaques, from 1 to 23 y of age. Sequencing single oocytes minimized effects of natural selection on germline mutations. In total, we identified 17,637 tissue-specific de novo mutations. Their frequency increased ∼3.5-fold in liver and ∼2.8-fold in muscle over the ∼20 y assessed. Mutation frequency in oocytes increased ∼2.5-fold until the age of 9 y, but did not increase after that, suggesting that oocytes of older animals maintain the quality of their mtDNA. We found the light-strand origin of replication (OriL) to be a hotspot for mutation accumulation with aging in liver. Indeed, the 33-nucleotide-long OriL harbored 12 variant hotspots, 10 of which likely disrupt its hairpin structure and affect replication efficiency. Moreover, in somatic tissues, protein-coding variants were subject to positive selection (potentially mitigating toxic effects of mitochondrial activity), the strength of which increased with the number of macaques harboring variants. Our work illuminates the origins and accumulation of somatic and germline mtDNA mutations with aging in primates and has implications for delayed reproduction in modern human societies.

B Arbeithuber, MA Cremona, J Hester, A Barrett, B Higgins, K Anthony, F Chiaromonte, FJ Diaz, KD Makova

P Natl Acad Sci, 2022

DOI
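
The idea behind duplex sequencing described in the abstract above, where a base is trusted only if both template strands support it, can be sketched as follows. This is a simplified illustration and not the authors' pipeline: it assumes each strand has already been collapsed into a single-strand consensus and that the second strand has been reverse-complemented into the orientation of the first. The function name and the use of "N" as the no-call symbol are assumptions.

```python
def duplex_consensus(strand_a: str, strand_b: str, no_call: str = "N") -> str:
    """Combine two single-strand consensus sequences into a duplex consensus.

    A base is kept only where both strands agree; disagreements, which
    typically reflect an error on one strand, become `no_call`.
    Both inputs must be in the same orientation and of equal length.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError("strand consensus sequences must have equal length")
    return "".join(a if a == b else no_call for a, b in zip(strand_a, strand_b))


# A mismatch at the third position is masked rather than called as a variant.
masked = duplex_consensus("ACGT", "ACCT")
```

A true mutation is present on both strands and therefore survives this filter, whereas an error introduced on one strand does not.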

Metabolomic profiling of stool of two-year old children from the INSIGHT study reveals links between butyrate and child weight outcomes

Background: Metabolomic analysis is commonly used to understand the biological underpinning of diseases such as obesity. However, our knowledge of gut metabolites related to weight outcomes in young children is currently limited. Objectives: To (1) explore the relationships between metabolites and child weight outcomes, (2) determine the potential effect of covariates (e.g., child’s diet, maternal health/habits during pregnancy, etc.) in the relationship between metabolites and child weight outcomes, and (3) explore the relationship between selected gut metabolites and gut microbiota abundance. Methods: Using ¹H-NMR, we quantified 30 metabolites from stool samples of 170 two-year-old children. To identify metabolites and covariates associated with children’s weight outcomes (BMI [weight/height²], BMI z-score [BMI adjusted for age and sex], and growth index [weight/height]), we analysed the ¹H-NMR data, along with 20 covariates recorded on children and mothers, using LASSO and best subset selection regression techniques. Previously characterized microbiota community information from the same stool samples was used to determine associations between selected gut metabolites and gut microbiota. Results: At age 2 years, stool butyrate concentration had a significant positive association with child BMI (p-value = 3.58 × 10⁻⁴), BMI z-score (p-value = 3.47 × 10⁻⁴), and growth index (p-value = 7.73 × 10⁻⁴). Covariates such as maternal smoking during pregnancy are important to consider. Butyrate concentration was positively associated with the abundance of the bacterial genus Faecalibacterium (p-value = 9.61 × 10⁻³). Conclusions: Stool butyrate concentration is positively associated with increased child weight outcomes and should be investigated further as a factor affecting childhood obesity.

D Nandy, SJC Craig, J Cai, Y Tian, IM Paul, JS Savage, ME Marini, EE Hohman, ML Reimherr, AD Patterson, KD Makova, F Chiaromonte

Pediatr Obes, 2022

DOI
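
Two of the weight outcomes defined in the abstract above are simple ratios: BMI is weight/height² and growth index is weight/height. A minimal sketch of both follows, assuming weight in kilograms and height in metres; the units and function names are assumptions, since the abstract gives only the ratios. The BMI z-score is omitted because it requires age- and sex-specific reference data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def growth_index(weight_kg: float, height_m: float) -> float:
    """Growth index as defined in the abstract: weight divided by height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m
```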

INSIGHT responsive parenting educational intervention for firstborns is associated with growth of second-born siblings

Objective: The aim of this study was to test whether the Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) responsive parenting (RP) intervention, delivered to parents of firstborn children, is associated with the BMI of first- and second-born siblings during infancy. Methods: Participants included 117 firstborn infants enrolled in a randomized controlled trial and their second-born siblings enrolled in an observation-only ancillary study. The RP curriculum for firstborn children included guidance on feeding, sleep, interactive play, and emotion regulation. The control curriculum focused on safety. Anthropometrics were measured in both siblings at ages 3, 16, 28, and 52 weeks. Growth curve models for BMI by child age were fit. Results: Second-born children were delivered 2.5 (SD 0.9) years after firstborns. Firstborn and second-born children whose parents received the RP intervention with their first child had BMI that was 0.44 kg/m² (95% CI: -0.82 to 0.06) and 0.36 kg/m² (95% CI: -0.75 to 0.03) lower than controls, respectively. Linear and quadratic growth rates for BMI for firstborn and second-born cohorts were similar, but second-born children had a greater average BMI at 1 year of age (difference = -0.33 [95% CI: -0.52 to -0.15]). Conclusions: An RP educational intervention for obesity prevention delivered to parents of firstborns appears to spill over to second-born siblings.

JS Savage, AK Hochgraf, E Loken, ME Marini, SJC Craig, KD Makova, LL Birch, IM Paul

Obesity, 2022

DOI

Associations between stool micro-transcriptome, gut microbiota, and infant growth

Rapid infant growth increases the risk for adult obesity. The gut microbiome is associated with early weight status; however, no study has examined how interactions between microbial and host ribonucleic acid (RNA) expression influence infant growth. We hypothesized that dynamics in infant stool micro-ribonucleic acids (miRNAs) would be associated with both microbial activity and infant growth via putative metabolic targets. Stool was collected twice from 30 full-term infants, at 1 month and again between 6 and 12 months. Stool RNA were measured with high-throughput sequencing and aligned to human and microbial databases. Infant growth was measured by weight-for-length z-score at birth and 12 months. Increased RNA transcriptional activity of Clostridia (R = 0.55; Adj p = 3.7E-2) and Burkholderia (R = -0.820, Adj p = 2.62E-3) were associated with infant growth. Of the 25 human RNAs associated with growth, 16 were miRNAs. The miRNAs demonstrated significant target enrichment (Adj p < 0.05) for four metabolic pathways. There were four associations between growth-related miRNAs and growth-related phyla. We have shown that longitudinal trends in gut microbiota activity and human miRNA levels are associated with infant growth and the metabolic targets of miRNAs suggest these molecules may regulate the biosynthetic landscape of the gut and influence microbial activity.

MC Carney, X Zhan, A Rangnekar, MZ Chroneos, SJC Craig, KD Makova, IM Paul, SD Hicks

J Dev Orig Health Dis, 2021

DOI

Inferring genes that escape X-Chromosome inactivation reveals important contribution of variable escape genes to sex-biased diseases

The X Chromosome plays an important role in human development and disease. However, functional genomic and disease association studies of X genes greatly lag behind autosomal gene studies, in part owing to the unique biology of X-Chromosome inactivation (XCI). Because of XCI, most genes are only expressed from one allele. Yet, ∼30% of X genes ‘escape’ XCI and are transcribed from both alleles, many only in a proportion of the population. Such interindividual differences are likely to be disease relevant, particularly for sex-biased disorders. To understand the functional biology for X-linked genes, we developed X-Chromosome inactivation for RNA-seq (XCIR), a novel approach to identify escape genes using bulk RNA-seq data. Our method, available as an R package, is more powerful than alternative approaches and is computationally efficient to handle large population-scale data sets. Using annotated XCI states, we examined the contribution of X-linked genes to the disease heritability in the United Kingdom Biobank data set. We show that escape and variable escape genes explain the largest proportion of X heritability, which is in large part attributable to X genes with Y homology. Finally, we investigated the role of each XCI state in sex-biased diseases and found that although XY homologous gene pairs have a larger overall effect size, enrichment for variable escape genes is significantly increased in female-biased diseases. Our results, for the first time, quantitate the importance of variable escape genes for the etiology of sex-biased disease, and our pipeline allows analysis of larger data sets for a broad range of phenotypes.

R Sauteraud, JM Stahl, J James, M Englebright, F Chen, X Zhan, L Carrel, DJ Liu

Genome Res, 2021

DOI

Functional data analysis characterizes the shapes of the first COVID-19 epidemic wave in Italy

We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and socio-demographic, infrastructural and environmental covariates. Notwithstanding limitations in accuracy and resolution of the data available from public sources, we pinpoint significant trends exploiting information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics; an “exponential” one unfolding in Lombardia and the worst hit areas of the north, and a milder, “flat(tened)” one in the rest of the country—including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, also when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and workplaces to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.

T Boschi, J Di Iorio, L Testa, MA Cremona, F Chiaromonte

Sci Rep, 2021

DOI

Selection and thermostability suggest G-quadruplexes are novel functional elements of the human genome

Approximately 1% of the human genome has the ability to fold into G-quadruplexes (G4s), noncanonical strand-specific DNA structures forming at G-rich motifs. G4s regulate several key cellular processes (e.g., transcription) and have been hypothesized to participate in others (e.g., firing of replication origins). Moreover, G4s differ in their thermostability, and this may affect their function. Yet, G4s may also hinder replication, transcription, and translation and may increase genome instability and mutation rates. Therefore, depending on their genomic location, thermostability, and functionality, G4 loci might evolve under different selective pressures, which has never been investigated. Here we conducted the first genome-wide analysis of G4 distribution, thermostability, and selection. We found an overrepresentation, high thermostability, and purifying selection for G4s within genic components in which they are expected to be functional: promoters, CpG islands, and 5’ and 3’ UTRs. A similar pattern was observed for G4s within replication origins, enhancers, eQTLs, and TAD boundary regions, strongly suggesting their functionality. In contrast, G4s on the nontranscribed strand of exons were underrepresented, were unstable, and evolved neutrally. In general, G4s on the nontranscribed strand of genic components had lower density and were less stable than those on the transcribed strand, suggesting that the former are avoided at the RNA level. Across the genome, purifying selection was stronger at stable G4s. Our results suggest that purifying selection preserves the sequences of functional G4s, whereas nonfunctional G4s are too costly to be tolerated in the genome. Thus, G4s are emerging as fundamental, functional genomic elements.

WM Guiblet, M DeGiorgio, X Cheng, F Chiaromonte, KA Eckert, YF Huang, KD Makova

Genome Res, 2021

DOI

Prothrombotic variants as modifiers of clinical phenotype in four related individuals with haemophilia A

Haemophilia A (HA) is an X-linked bleeding disorder that results from coagulation factor VIII deficiency. Residual factor VIII activity levels (FVIII:C) largely reflect F8 gene mutations and are used to classify HA as severe (<1%), moderate (1–5%) or mild (>5%). However, FVIII:C may differ among individuals carrying the same F8 mutation.1 Furthermore, bleeding phenotypes and FVIII:C can be discordant,2 which poses a particular challenge for mild/moderate individuals.3 Identification of variants in additional genes involved in haemostasis is important for improving classification and treatment guidelines for individuals with HA. Here, we describe four related males carrying F8 mutation c.494C>T (p.Pro146Leu) with moderate FVIII:C levels. However, clinical severity differs: two are mild and two are severe. The aim of this study was to identify gene variants in these individuals that may explain discordant bleeding phenotypes.

L Carrel, S Arnold-Croop, T Achtermann, F Chen, Y Cheng, D Liu, ME Eyster

Haemophilia, 2021

DOI
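
The severity classes quoted in the abstract above (severe below 1%, moderate 1–5%, mild above 5% residual FVIII:C) can be written as a small lookup. This is an illustrative sketch only; the function name is an assumption, and, as the abstract stresses, the activity-based class can be discordant with the observed bleeding phenotype.

```python
def classify_haemophilia_a(fviii_c_percent: float) -> str:
    """Classify haemophilia A by residual factor VIII activity (percent).

    Thresholds follow the abstract: severe (<1%), moderate (1-5%), mild (>5%).
    This reflects the laboratory classification only, not clinical severity.
    """
    if fviii_c_percent < 0:
        raise ValueError("FVIII:C cannot be negative")
    if fviii_c_percent < 1:
        return "severe"
    if fviii_c_percent <= 5:
        return "moderate"
    return "mild"
```

The four related individuals described in the abstract would all be labelled "moderate" by such a rule, although two present clinically as mild and two as severe, which is the discordance the study set out to explain.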

Towards complete and error-free genome assemblies of all vertebrate species

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

A Rhie, SA McCarthy, O Fedrigo, J Damas, G Formenti, S Koren, M Uliano-Silva, W Chow, A Fungtammasan, J Kim, C Lee, BJ Ko, M Chaisson, GL Gedman, LJ Cantin, F Thibaud-Nissen, L Haggerty, I Bista, M Smith, B Haase, J Mountcastle, S Winkler, S Paez, J Howard, SC Vernes, TM Lama, F Grutzner, WC Warren, CN Balakrishnan, D Burt, JM George, MT Biegler, D Iorns, A Digby, D Eason, B Robertson, T Edwards, M Wilkinson, G Turner, A Meyer, AF Kautt, P Franchini, HW Detrich 3rd, H Svardal, M Wagner, GJP Naylor, M Pippel, M Malinsky, M Mooney, M Simbirsky, BT Hannigan, T Pesout, M Houck, A Misuraca, SB Kingan, R Hall, Z Kronenberg, I Sović, C Dunn, Z Ning, A Hastie, J Lee, S Selvaraj, RE Green, NH Putnam, I Gut, J Ghurye, E Garrison, Y Sims, J Collins, S Pelan, J Torrance, A Tracey, J Wood, RE Dagnew, D Guan, SE London, DF Clayton, CV Mello, SR Friedrich, PV Lovell, E Osipova, FO Al-Ajli, S Secomandi, H Kim, C Theofanopoulou, M Hiller, Y Zhou, RS Harris, KD Makova, P Medvedev, J Hoffman, P Masterson, K Clark, F Martin, K Howe, P Flicek, BP Walenz, W Kwak, H Clawson, M Diekhans, L Nassar, B Paten, RHS Kraus, AJ Crawford, MTP Gilbert, G Zhang, B Venkatesh, RW Murphy, KP Koepfli, B Shapiro, WE Johnson, F Di Palma, T Marques-Bonet, EC Teeling, T Warnow, JM Graves, OA Ryder, D Haussler, SJ O’Brien, J Korlach, HA Lewin, K Howe, EW Myers, R Durbin, AM Phillippy, ED Jarvis

Nature, 2021

DOI

Model-based assessment of replicability for genome-wide association meta-analysis

Genome-wide association meta-analysis (GWAMA) is an effective approach to enlarge sample sizes and empower the discovery of novel associations between genotype and phenotype. Independent replication has been used as a gold-standard for validating genetic associations. However, as current GWAMA often seeks to aggregate all available datasets, it becomes impossible to find a large enough independent dataset to replicate new discoveries. Here we introduce a method, MAMBA (Meta-Analysis Model-based Assessment of replicability), for assessing the ‘posterior-probability-of-replicability’ for identified associations by leveraging the strength and consistency of association signals between contributing studies. We demonstrate using simulations that MAMBA is more powerful and robust than existing methods, and produces more accurate genetic effects estimates. We apply MAMBA to a large-scale meta-analysis of addiction phenotypes with 1.2 million individuals. In addition to accurately identifying replicable common variant associations, MAMBA also pinpoints novel replicable rare variant associations from imputation-based GWAMA and hence greatly expands the set of analyzable variants.

D McGuire, Y Jiang, M Liu, JD Weissenkampen, S Eckert, L Yang, F Chen, A Berg, S Vrieze, B Jiang, Q Li, DJ Liu

Nat Commun, 2021

DOI

Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome

Approximately 13% of the human genome can fold into non-canonical (non-B) DNA structures (e.g. G-quadruplexes, Z-DNA, etc.), which have been implicated in vital cellular processes. Non-B DNA also hinders replication, increasing errors and facilitating mutagenesis, yet its contribution to genome-wide variation in mutation rates remains unexplored. Here, we conducted a comprehensive analysis of nucleotide substitution frequencies at non-B DNA loci within noncoding, non-repetitive genome regions, their ±2 kb flanking regions, and 1-Megabase windows, using human-orangutan divergence and human single-nucleotide polymorphisms. Functional data analysis at single-base resolution demonstrated that substitution frequencies are usually elevated at non-B DNA, with patterns specific to each non-B DNA type. Mirror, direct and inverted repeats have higher substitution frequencies in spacers than in repeat arms, whereas G-quadruplexes, particularly stable ones, have higher substitution frequencies in loops than in stems. Several non-B DNA types also affect substitution frequencies in their flanking regions. Finally, non-B DNA explains more variation than any other predictor in multiple regression models for diversity or divergence at 1-Megabase scale. Thus, non-B DNA substantially contributes to variation in substitution frequencies at small and large scales. Our results highlight the role of non-B DNA in germline mutagenesis with implications to evolution and genetic diseases.

WM Guiblet, MA Cremona, RS Harris, D Chen, KA Eckert, F Chiaromonte, YF Huang, KD Makova

Nucleic Acids Res, 2021

DOI

Selective synthetic augmentation with HistoGAN for improved histopathology image classification

Histopathological analysis is the present gold standard for precancerous lesion diagnosis. The goal of automated histopathological classification from digital images requires supervised training, which requires a large number of expert annotations that can be expensive and time-consuming to collect. Meanwhile, accurate classification of image patches cropped from whole-slide images is essential for standard sliding window based histopathology slide classification methods. To mitigate these issues, we propose a carefully designed conditional GAN model, namely HistoGAN, for synthesizing realistic histopathology image patches conditioned on class labels. We also investigate a novel synthetic augmentation framework that selectively adds new synthetic image patches generated by our proposed HistoGAN, rather than expanding directly the training set with synthetic images. By selecting synthetic images based on the confidence of their assigned labels and their feature similarity to real labeled images, our framework provides quality assurance to synthetic augmentation. Our models are evaluated on two datasets: a cervical histopathology image dataset with limited annotations, and another dataset of lymph node histopathology images with metastatic cancer. Here, we show that leveraging HistoGAN generated images with selective augmentation results in significant and consistent improvements of classification performance (higher accuracy) for both the cervical histopathology and metastatic cancer datasets.

Y Xue, J Ye, Q Zhou, LR Long, S Antani, Z Xue, C Cornwell, R Zaino, KC Cheng, X Huang

MED IMAGE ANAL, 2021

DOI

Human L1 Transposition Dynamics Unraveled with Functional Data Analysis

Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features—proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.—in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection—depleted of genes and noncoding most conserved elements. Intriguingly, our results suggest that L1 insertions modify local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.

D Chen, MA Cremona, Z Qi, RD Mitra, F Chiaromonte, KD Makova

Mol Biol Evol, 2020

DOI

Dynamic evolution of great ape Y chromosomes

The mammalian male-specific Y chromosome plays a critical role in sex determination and male fertility. However, because of its repetitive and haploid nature, it is frequently absent from genome assemblies and remains enigmatic. The Y chromosomes of great apes represent a particular puzzle: their gene content is more similar between human and gorilla than between human and chimpanzee, even though human and chimpanzee share a more recent common ancestor. To solve this puzzle, here we constructed a dataset including Ys from all extant great ape genera. We generated assemblies of bonobo and orangutan Ys from short and long sequencing reads and aligned them with the publicly available human, chimpanzee, and gorilla Y assemblies. Analyzing this dataset, we found that the genus Pan, which includes chimpanzee and bonobo, experienced accelerated substitution rates. Pan also exhibited elevated gene death rates. These observations are consistent with high levels of sperm competition in Pan. Furthermore, we inferred that the great ape common ancestor already possessed multicopy sequences homologous to most human and chimpanzee palindromes. Nonetheless, each species also acquired distinct ampliconic sequences. We also detected increased chromatin contacts between and within palindromes (from Hi-C data), likely facilitating gene conversion and structural rearrangements. Our results highlight the dynamic mode of Y chromosome evolution and open avenues for studies of male-specific dispersal in endangered great ape species.

M Cechova, R Vegesna, M Tomaszkiewicz, RS Harris, D Chen, S Rangavittal, P Medvedev, KD Makova

Proc Natl Acad Sci, 2020

DOI

Investigation of discordant phenotype in mild Hemophilia A using whole exome sequencing

Keywords: Factor V; Hemophilia A; Prothrombotic gene variants; Rebalanced hemostasis; von Willebrand factor.

PH Cygan, SE Arnold-Croop, EA Weidman, F Chen, DJ Liu, ME Eyster, L Carrel

Thromb Res, 2020

DOI

Age-related accumulation of de novo mitochondrial mutations in mammalian oocytes and somatic tissues

Mutations create genetic variation for other evolutionary forces to operate on and cause numerous genetic diseases. Nevertheless, how de novo mutations arise remains poorly understood. Progress in the area is hindered by the fact that error rates of conventional sequencing technologies (1 in 100 or 1,000 base pairs) are several orders of magnitude higher than de novo mutation rates (1 in 10,000,000 or 100,000,000 base pairs per generation). Moreover, previous analyses of germline de novo mutations examined pedigrees (and not germ cells) and thus were likely affected by selection. Here, we applied highly accurate duplex sequencing to detect low-frequency, de novo mutations in mitochondrial DNA (mtDNA) directly from oocytes and from somatic tissues (brain and muscle) of 36 mice from two independent pedigrees. We found mtDNA mutation frequencies 2- to 3-fold higher in 10-month-old than in 1-month-old mice, demonstrating mutation accumulation during the period of only 9 mo. Mutation frequencies and patterns differed between germline and somatic tissues and among mtDNA regions, suggestive of distinct mutagenesis mechanisms. Additionally, we discovered a more pronounced genetic drift of mitochondrial genetic variants in the germline of older versus younger mice, arguing for mtDNA turnover during oocyte meiotic arrest. Our study deciphered for the first time the intricacies of germline de novo mutagenesis using duplex sequencing directly in oocytes, which provided unprecedented resolution and minimized selection effects present in pedigree studies. Moreover, our work provides important information about the origins and accumulation of mutations with aging/maturation and has implications for delayed reproduction in modern human societies. Furthermore, the duplex sequencing method we optimized for single cells opens avenues for investigating low-frequency mutations in other studies.

B Arbeithuber, J Hester, MA Cremona, N Stoler, A Zaidi, B Higgins, K Anthony, F Chiaromonte, FJ Diaz, KD Makova

PLoS Biol, 2020

DOI

Ampliconic Genes on the Great Ape Y Chromosomes: Rapid Evolution of Copy Number but Conservation of Expression Levels

Multicopy ampliconic gene families on the Y chromosome play an important role in spermatogenesis. Thus, studying their genetic variation in endangered great ape species is critical. We estimated the sizes (copy number) of nine Y ampliconic gene families in population samples of chimpanzee, bonobo, and orangutan with droplet digital polymerase chain reaction, combined these estimates with published data for human and gorilla, and produced genome-wide testis gene expression data for great apes. Analyzing this comprehensive data set within an evolutionary framework, we, first, found high inter- and intraspecific variation in gene family size, with larger families exhibiting higher variation as compared with smaller families, a pattern consistent with random genetic drift. Second, for four gene families, we observed significant interspecific size differences, sometimes even between sister species—chimpanzee and bonobo. Third, despite substantial variation in copy number, Y ampliconic gene families’ expression levels did not differ significantly among species, suggesting dosage regulation. Fourth, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression. Our results indicate high variability in size but conservation in gene expression levels in Y ampliconic gene families, significantly advancing our understanding of Y-chromosome evolution in great apes.

R Vegesna, M Tomaszkiewicz, OA Ryder, R Campos-Sánchez, P Medvedev, M DeGiorgio, KD Makova

GENOME BIOL EVOL, 2020

DOI

Pronounced somatic bottleneck in mitochondrial DNA of human hair

Heteroplasmy is the presence of variable mitochondrial DNA (mtDNA) within the same individual. The dynamics of heteroplasmy allele frequency among tissues of the human body is not well understood. Here, we measured allele frequency at heteroplasmic sites in two to eight hairs from each of 11 humans using next-generation sequencing. We observed a high variance in heteroplasmic allele frequency among separate hairs from the same individual—much higher than that for blood and cheek tissues. Our population genetic modelling estimated the somatic bottleneck during embryonic follicle development of separate hairs to be only 11.06 (95% confidence interval 0.6–34.0) mtDNA segregating units. This bottleneck is much more drastic than somatic bottlenecks for blood and cheek tissues (136 and 458 units, respectively), as well as more drastic than, or comparable to, the germline bottleneck (equal to 25–32 or 7–10 units, depending on the study). We demonstrated that hair undergoes additional genetic drift before and after the divergence of mtDNA lineages of individual hair follicles. Additionally, we showed a positive correlation between donor’s age and variance in heteroplasmy allele frequency in hair. These findings have important implications for forensics and for our understanding of mtDNA dynamics in the human body. This article is part of the theme issue ‘Linking the mitochondrial genotype to phenotype: a complex endeavour’.

A Barrett, B Arbeithuber, A Zaidi, P Wilton, IM Paul, R Nielsen, KD Makova

PHILOS T R SOC B, 2020

DOI

On the bias of H-scores for comparing biclusters, and how to correct it

The H-score (or Mean Squared Residue score) underlies Cheng and Church’s (2000) biclustering algorithm, one of the best-known and most widely employed algorithms in bioinformatics and computational biology, and many subsequent algorithms (e.g. FLOC, Yang et al., 2005 and CBEB, Huang et al., 2012). Cheng and Church’s algorithm has ∼2600 citations to date, 650 since 2015 and 230 in 2018–2019 alone. It was the first to be applied to gene microarray data, and it is one of the main tools available in biclustering packages (e.g. the ‘biclust’ R library) and in gene expression data analysis packages (e.g. IRIS-EDA, Monier et al., 2019). In addition, it is widely used as a benchmark: almost all published biclustering algorithms include a comparison with it. Squared residue measures such as H-scores have a double role in biclustering methods. On the one hand, they are employed by many algorithms as merit functions to guide the discovery of biclusters (see e.g. the reviews in Madeira and Oliveira, 2004; Pontes et al., 2015). On the other hand, they are used to assess solutions—in particular, H-scores are used to assess the ‘homogeneity’ of the discovered biclusters. Both uses involve the comparisons of biclusters which may have different numbers of rows and columns. Our findings document a bias that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster—even in the ‘ideal’ (and simplest) case of a single bicluster generated by a constant model plus a white noise. This biases the H-score, and hence all H-score based algorithms, toward small biclusters. Importantly, our analytical proof provides a straightforward way to correct this bias.

J Di Iorio, F Chiaromonte, MA Cremona

BIOINFORMATICS, 2020

DOI
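The bias described in the abstract above can be illustrated with a short simulation. The sketch below is not the authors' code: it computes the standard mean squared residue of a matrix and averages it over simulated biclusters drawn from a constant plus Gaussian white noise. The function names, the constant 5.0, and the simulation settings are illustrative assumptions. The closed form in `expected_h_score` follows from a degrees-of-freedom argument for a two-way additive fit and is offered as an assumption, not as the correction derived in the paper.

```python
import random


def h_score(matrix):
    """Mean squared residue of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Residue of cell (i, j) after removing row, column and overall effects.
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def mean_h_score(n_rows, n_cols, sigma=1.0, n_sim=2000, seed=0):
    """Average H-score over simulated constant-plus-white-noise biclusters."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sim):
        # Hypothetical constant level 5.0; any constant gives the same residues.
        sim = [[5.0 + rng.gauss(0.0, sigma) for _ in range(n_cols)]
               for _ in range(n_rows)]
        scores.append(h_score(sim))
    return sum(scores) / n_sim


def expected_h_score(n_rows, n_cols, sigma=1.0):
    """Assumed closed form: sigma^2 * (rows - 1) * (cols - 1) / (rows * cols)."""
    return sigma ** 2 * (n_rows - 1) * (n_cols - 1) / (n_rows * n_cols)


if __name__ == "__main__":
    for shape in [(2, 2), (5, 5), (10, 10)]:
        print(shape, round(mean_h_score(*shape), 3), round(expected_h_score(*shape), 3))
```

Under these assumptions the simulated average grows with the bicluster dimensions even though the noise level is fixed, which is the sense in which raw H-scores favor small biclusters.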

Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes

Author summary

The human genome harbors two sex chromosomes—X and Y. Among them, the Y chromosome is present only in males. Deletions of portions of this chromosome have been linked to male infertility, however exactly why the loss of these genes leads to this condition is not well understood. Here we study a group of Y chromosome genes called ampliconic genes, which are expressed in testis and are frequently deleted in males with infertility. These genes are organized in nine gene families, each of which harbors multiple copies of genes highly similar in sequence. In this study, we aimed to establish a baseline of their variation in copy number and in gene expression—one measure of genes’ functional output—by studying 149 healthy men. We found that testis tolerates a wide range of copy number and expression variation of Y ampliconic genes. Additionally, we demonstrated that gene expression within most Y ampliconic gene families depends on the expression levels of gene family members located outside of the Y chromosome, i.e. they undergo dosage regulation.

R Vegesna, M Tomaszkiewicz, P Medvedev, KD Makova

PLOS GENET, 2019

DOI

A High-Resolution View of Adaptive Event Dynamics in a Plasmid

Coadaptation between bacterial hosts and plasmids frequently results in adaptive changes restricted exclusively to host genome leaving plasmids unchanged. To better understand this remarkable stability, we transformed naïve Escherichia coli cells with a plasmid carrying an antibiotic-resistance gene and forced them to adapt in a turbidostat environment. We then drew population samples at regular intervals and subjected them to duplex sequencing—a technique specifically designed for identification of low-frequency mutations. Variants at ten sites implicated in plasmid copy number control emerged almost immediately, tracked consistently across the experiment’s time points, and faded below detectable frequencies toward the end. This variation crash coincided with the emergence of mutations on the host chromosome. Mathematical modeling of trajectories for adaptive changes affecting plasmid copy number showed that such mutations cannot readily fix or even reach appreciable frequencies. We conclude that there is a strong selection against alterations of copy number even if it can provide a degree of growth advantage. This incentive is likely rooted in the complex interplay between mutated and wild-type plasmids constrained within a single cell and underscores the importance of understanding of intracellular plasmid variability.

H Mei, B Arbeithuber, MA Cremona, M DeGiorgio, A Nekrutenko

GENOME BIOL EVOL, 2019

DOI

DiscoverY: a classifier for identifying Y chromosome sequences in male assemblies.

S Rangavittal, N Stopa, M Tomaszkiewicz, K Sahlin, KD Makova, P Medvedev

BMC GENOMICS, 2019

DOI

High Satellite Repeat Turnover in Great Apes Studied with Short- and Long-Read Technologies

Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.

M Cechova, RS Harris, M Tomaszkiewicz, B Arbeithuber, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2019

DOI

Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA

The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that the mechanism must be designed with respect to a specific base measure over the output space, such as a Gaussian process. We provide a positive result that establishes a Central Limit Theorem for the exponential mechanism quite broadly. We also provide a negative result, showing that the magnitude of noise introduced for privacy is asymptotically non-negligible relative to the statistical estimation error. We develop an ε-DP mechanism for functional principal component analysis, applicable in separable Hilbert spaces, and demonstrate its performance via simulations and applications to two datasets.

J Awan, A Kenney, M Reimherr, A Slavković

PMLR, 2019

DOI
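For readers unfamiliar with the exponential mechanism named in the abstract above, here is a minimal sketch of its textbook form on a finite candidate set, where each candidate is drawn with probability proportional to its base weight times exp(ε · utility / (2 · sensitivity)). It does not implement the paper's infinite-dimensional or functional PCA mechanism. The function names and the optional `base_weights` argument (a finite stand-in for a base measure) are assumptions made for illustration.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity, base_weights=None):
    """Selection probabilities proportional to base * exp(eps * u / (2 * sensitivity))."""
    if base_weights is None:
        # Uniform base measure over the finite candidate set.
        base_weights = [1.0] * len(utilities)
    # Subtracting the maximum utility avoids overflow and leaves the ratios unchanged.
    max_u = max(utilities)
    weights = [
        b * math.exp(epsilon * (u - max_u) / (2.0 * sensitivity))
        for u, b in zip(utilities, base_weights)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism_sample(candidates, utilities, epsilon, sensitivity, rng=None):
    """Draw one candidate according to the exponential mechanism probabilities."""
    rng = rng or random.Random()
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical candidates and utilities, for illustration only.
    print(exponential_mechanism_probs([0.0, 1.0, 2.0], epsilon=1.0, sensitivity=1.0))
    print(exponential_mechanism_sample(["a", "b", "c"], [0.0, 1.0, 2.0], 1.0, 1.0,
                                       rng=random.Random(0)))
```

A smaller ε flattens the probabilities toward the base weights (more privacy, less utility), while a larger ε concentrates them on the highest-utility candidates.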

Functional data analysis for computational biology

MA Cremona, H Xu, KD Makova, M Reimherr, F Chiaromonte, P Madrigal

BIOINFORMATICS, 2019

DOI

Bottleneck and selection in the germline and maternal age influence transmission of mitochondrial DNA in human pedigrees

Mitochondria frequently carry different DNA—a state called heteroplasmy. Heteroplasmic mutations can cause mitochondrial diseases and are involved in cancer and aging, but they are also common in healthy people. Here, we study heteroplasmy in 96 multigenerational healthy families. We show that mothers effectively transmit very few mitochondrial DNA to their offspring. Because of this bottleneck, which intensifies with increasing maternal age at childbirth, mutation frequencies can change dramatically between a mother and her child. Thus, a child might inherit a disease-causing mutation at high frequency from an asymptomatic carrier mother and might develop a disease. We also demonstrate that natural selection acts against disease-causing mutations during germline development. Our study has important implications for genetic counseling of mitochondrial diseases.

Heteroplasmy—the presence of multiple mitochondrial DNA (mtDNA) haplotypes in an individual—can lead to numerous mitochondrial diseases. The presentation of such diseases depends on the frequency of the heteroplasmic variant in tissues, which, in turn, depends on the dynamics of mtDNA transmissions during germline and somatic development. Thus, understanding and predicting these dynamics between generations and within individuals is medically relevant. Here, we study patterns of heteroplasmy in 2 tissues from each of 345 humans in 96 multigenerational families, each with, at least, 2 siblings (a total of 249 mother–child transmissions). This experimental design has allowed us to estimate the timing of mtDNA mutations, drift, and selection with unprecedented precision. Our results are remarkably concordant between 2 complementary population-genetic approaches. We find evidence for a severe germline bottleneck (7–10 mtDNA segregating units) that occurs independently in different oocyte lineages from the same mother, while somatic bottlenecks are less severe. We demonstrate that divergence between mother and offspring increases with the mother’s age at childbirth, likely due to continued drift of heteroplasmy frequencies in oocytes under meiotic arrest. We show that this period is also accompanied by mutation accumulation leading to more de novo mutations in children born to older mothers. We show that heteroplasmic variants at intermediate frequencies can segregate for many generations in the human population, despite the strong germline bottleneck. We show that selection acts during germline development to keep the frequency of putatively deleterious variants from rising. Our findings have important applications for clinical genetics and genetic counseling.

AA Zaidi, PR Wilton, MS-W Su, IM Paul, B Arbeithuber, K Anthony, A Nekrutenko, R Nielsen, KD Makova

P NATL ACAD SCI USA, 2019

DOI

Correcting palindromes in long reads after whole-genome amplification.

S Warris, E Schijlen, H van de Geest, R Vegesna, T Hesselink, B Te Lintel Hekkert, G Sanchez Perez, P Medvedev, KD Makova, D de Ridder

BMC GENOMICS, 2018

DOI

Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon.

K Sahlin, M Tomaszkiewicz, KD Makova, P Medvedev

NAT COMMUN, 2018

DOI

Child Weight Gain Trajectories Linked To Oral Microbiota Composition.

SJC Craig, D Blankenberg, ACL Parodi, IM Paul, LL Birch, JS Savage, ME Marini, JL Stokes, A Nekrutenko, M Reimherr, F Chiaromonte, KD Makova

SCI REP-UK, 2018

DOI

High Levels of Copy Number Variation of Ampliconic Genes across Major Human Y Haplogroups

Because of its highly repetitive nature, the human male-specific Y chromosome remains understudied. It is important to investigate variation on the Y chromosome to understand its evolution and contribution to phenotypic variation, including infertility. Approximately 20% of the human Y chromosome consists of ampliconic regions which include nine multi-copy gene families. These gene families are expressed exclusively in testes and usually implicated in spermatogenesis. Here, to gain a better understanding of the role of the Y chromosome in human evolution and in determining sexually dimorphic traits, we studied ampliconic gene copy number variation in 100 males representing ten major Y haplogroups world-wide. Copy number was estimated with droplet digital PCR. In contrast to low nucleotide diversity observed on the Y in previous studies, here we show that ampliconic gene copy number diversity is very high. A total of 98 copy-number-based haplotypes were observed among 100 individuals, and haplotypes were sometimes shared by males from very different haplogroups, suggesting homoplasies. The resulting haplotypes did not cluster according to major Y haplogroups. Overall, only two gene families (RBMY and TSPY) showed significant differences in copy number among major Y haplogroups, and the haplogroup of a male could not be predicted based on his ampliconic gene copy numbers. Finally, we did not find significant correlations either between copy number variation and individual’s height, or between the former and facial masculinity/femininity. Our results suggest rapid evolution of ampliconic gene copy numbers on the human Y, and we discuss its causes.

D Ye, AA Zaidi, M Tomaszkiewicz, K Anthony, C Liebowitz, M Degiorgio, MD Shriver, KD Makova

GENOME BIOL EVOL, 2018

DOI

IWTomics: testing high-resolution sequence-based ‘Omics’ data at multiple locations and scales

Summary

With increased generation of high-resolution sequence-based ‘Omics’ data, detecting statistically significant effects at different genomic locations and scales has become key to addressing several scientific questions. IWTomics is an R/Bioconductor package (integrated in Galaxy) that, exploiting sophisticated Functional Data Analysis techniques (i.e. statistical techniques that deal with the analysis of curves), allows users to pre-process, visualize and test these data at multiple locations and scales. The package provides a friendly, flexible and complete workflow that can be employed in many genomic and epigenomic applications.

Availability and implementation

IWTomics is freely available at the Bioconductor website (http://bioconductor.org/packages/IWTomics) and on the main Galaxy instance (https://usegalaxy.org/).

Supplementary information

Supplementary data are available at Bioinformatics online.

MA Cremona, A Pini, F Cumbo, KD Makova, F Chiaromonte, S Vantini

BIOINFORMATICS, 2018

DOI

Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

DNA conformation may deviate from the classical B-form in ∼13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here, we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule Real-Time (SMRT) technology. We show that polymerization speed differs between non-B and B-DNA: It decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates), and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.

WM Guiblet, MA Cremona, M Cechova, RS Harris, I Kejnovská, E Kejnovsky, KA Eckert, F Chiaromonte, KD Makova

GENOME RES, 2018

DOI

RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly

Motivation

The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.

Results

We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.

Availability and implementation

Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Supplementary information

Supplementary data are available at Bioinformatics online.

S Rangavittal, RS Harris, M Cechova, M Tomaszkiewicz, R Chikhi, KD Makova, P Medvedev

BIOINFORMATICS, 2017

DOI
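The general idea of selecting reads by k-mer abundance, as described in the RecoverY entry above, can be sketched as follows. This is not RecoverY's implementation and it does not perform the automatic threshold selection that is the paper's contribution. The function names, the fixed `min_abundance` threshold, and the `min_fraction` cutoff are hypothetical parameters chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Abundance of every k-mer across the whole read set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def select_reads(reads, k, min_abundance, min_fraction=0.5):
    """Keep reads in which enough k-mers reach the abundance threshold.

    In an enriched library, k-mers from the target chromosome are expected to be
    more abundant than the rest; here the threshold is supplied by the caller.
    """
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        abundant = sum(1 for km in read_kmers if counts[km] >= min_abundance)
        if abundant / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy reads: three copies of an "enriched" sequence and one low-abundance read.
    toy_reads = ["ACGTACGT"] * 3 + ["TTTTGGGG"]
    print(select_reads(toy_reads, k=4, min_abundance=3))
```

With real data the k-mer table would be built with a dedicated counter rather than an in-memory `Counter`, and the threshold would need to be chosen from the abundance distribution rather than fixed by hand.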

Metabolism-related microRNAs in maternal breast milk are influenced by premature delivery

Background

Maternal breast milk (MBM) is enriched in microRNAs, factors that regulate protein translation throughout the human body. MBM from mothers of term and preterm infants differs in nutrient, hormone, and bioactive-factor composition, but the microRNA differences between these groups have not been compared. We hypothesized that gestational age at delivery influences microRNA in MBM, particularly microRNAs involved in immunologic and metabolic regulation.

Methods

MBM from mothers of premature infants (pMBM) obtained 3-4 weeks post delivery was compared with MBM from mothers of term infants obtained at birth (tColostrum) and 3-4 weeks post delivery (tMBM). The microRNA profile in lipid and skim fractions of each sample was evaluated with high-throughput sequencing.

Results

The expression profiles of nine microRNAs in lipid and skim pMBM differed from those in tMBM. Gene targets of these microRNAs were functionally related to elemental metabolism and lipid biosynthesis. The microRNA profile of tColostrum was also distinct from that of pMBM, but it clustered closely with tMBM. Twenty-one microRNAs correlated with gestational age demonstrated limited relationships with method of delivery, but not other maternal-infant factors.

Conclusion

Premature delivery results in a unique MBM microRNA profile with metabolic targets. This suggests that preterm milk may have adaptive functions for growth in premature infants.

MC Carney, A Tarasiuk, SL Diangelo, P Silveyra, A Podany, LL Birch, IM Paul, S Kelleher, SD Hicks

PEDIATR RES, 2017

DOI

Y and W Chromosome Assemblies: Approaches and Discoveries

Hundreds of vertebrate genomes have been sequenced and assembled to date. However, most sequencing projects have ignored the sex chromosomes unique to the heterogametic sex – Y and W – that are known as sex-limited chromosomes (SLCs). Indeed, haploid and repetitive Y chromosomes in species with male heterogamety (XY), and W chromosomes in species with female heterogamety (ZW), are difficult to sequence and assemble. Nevertheless, obtaining their sequences is important for understanding the intricacies of vertebrate genome function and evolution. Recent progress has been made towards the adaptation of next-generation sequencing (NGS) techniques to deciphering SLC sequences. We review here currently available methodology and results with regard to SLC sequencing and assembly. We focus on vertebrates, but bring in some examples from other taxa.

M Tomaszkiewicz, P Medvedev, KD Makova

TRENDS GENET, 2017

DOI

Reverse Transcription Errors and RNA-DNA Differences at Short Tandem Repeats

Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. It has only been evaluated for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately 1 order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.

A Fungtammasan, M Tomaszkiewicz, R Campos-Sánchez, KA Eckert, M Degiorgio, KD Makova

MOL BIOL EVOL, 2016

DOI

Corrigendum: A genome-wide analysis of common fragile sites: What features determine chromosomal instability in the human genome?

A Fungtammasan, E Walsh, F Chiaromonte, KA Eckert, KD Makova

GENOME RES, 2016

DOI

Integration and Fixation Preferences of Human and Mouse Endogenous Retroviruses Uncovered with Functional Data Analysis

Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs’ integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.

R Campos-Sánchez, MA Cremona, A Pini, F Chiaromonte, KD Makova

PLOS COMPUT BIOL, 2016

DOI

A time- and cost-effective strategy to sequence mammalian Y chromosomes: An application to the de novo assembly of gorilla Y

The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.

M Tomaszkiewicz, S Rangavittal, M Cechova, R Campos-Sánchez, HW Fescemyer, R Harris, D Ye, PCM O’Brien, R Chikhi, OA Ryder, MA Ferguson-Smith, P Medvedev, KD Makova

GENOME RES, 2016

DOI

Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data

Background

Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3′ half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.

Results

Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies - and quite effectively reproduces empirically observed distributions. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. All together, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.

Conclusions

Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.

Y Liu, F Chiaromonte, H Ross, R Malhotra, D Elleder, M Poss

BMC BIOINFORMATICS, 2015

DOI

Improving the power of structural variation detection by augmenting the reference

The uses of the Genome Reference Consortium’s human reference sequence can be roughly categorized into three related but distinct categories: as a representative species genome, as a coordinate system for identifying variants, and as an alignment reference for variation detection algorithms. However, the use of this reference sequence as simultaneously a representative species genome and as an alignment reference leads to unnecessary artifacts for structural variation detection algorithms and limits their accuracy. We show how decoupling these two references and developing a separate alignment reference can significantly improve the accuracy of structural variation detection, lead to improved genotyping of disease related genes, and decrease the cost of studying polymorphism in a population.

J Schröder, S Girirajan, AT Papenfuss, P Medvedev

PLOS ONE, 2015

DOI

Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

A Fungtammasan, G Ananda, SE Hile, MSW Su, C Sun, R Harris, P Medvedev, K Eckert, KD Makova

GENOME RES, 2015

DOI

Using Statistics to Shed Light on the Dynamics of the Human Genome: A Review

In this article we review a number of recent studies in which information derived from genomic alignments and data concerning composition, location and biochemical features of the nuclear DNA are used to investigate salient properties and determinants of change (mutations) in the human genome. The studies under review, all conducted by an interdisciplinary group of investigators at The Pennsylvania State University, required the use of a range of statistical techniques—from regression, to multivariate analysis, to the modeling of latent structures.

F Chiaromonte, KD Makova

CONTRIB STAT, 2015

DOI

The effects of chromatin organization on variation in mutation rates in the genome

The variation in local rates of mutations can affect both the evolution of genes and their function in normal and cancer cells. Deciphering the molecular determinants of this variation will be aided by the elucidation of distinct types of mutations, as they differ in regional preferences and in associations with genomic features. Chromatin organization contributes to regional variation in mutation rates, but its contribution differs among mutation types. In both germline and somatic mutations, base substitutions are more abundant in regions of closed chromatin, perhaps reflecting error accumulation late in replication. By contrast, a distinctive mutational state with very high levels of insertions and deletions (indels) and substitutions is enriched in regions of open chromatin. These associations indicate an intricate interplay between the nucleotide sequence of DNA and its dynamic packaging into chromatin, and have important implications for current biomedical research. This Review focuses on recent studies showing associations between chromatin state and mutation rates, including pairwise and multivariate investigations of germline and somatic (particularly cancer) mutations.

KD Makova, RC Hardison

NAT REV GENET, 2015

DOI

Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA

The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000x per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the effective size of the germline mtDNA bottleneck at only ∼30-35 (interquartile range from 9 to 141). Accounting for heteroplasmies, we estimated the mtDNA germ-line mutation rate at 1.3 × 10^-8 (interquartile range from 4.2 × 10^-9 to 4.1 × 10^-8) mutations per site per year, an order of magnitude higher than for nuclear DNA. Notably, we found a positive association between the number of heteroplasmies in a child and maternal age at fertilization, likely attributable to oocyte aging. This study also took advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novo mutation. Our results can be used to predict the transmission of disease-causing mtDNA variants and illuminate evolutionary dynamics of the mitochondrial genome.

B Rebolledo-Jaramillo, MSW Su, N Stoler, JA McElhoe, B Dickins, D Blankenberg, TS Korneliussen, F Chiaromonte, R Nielsen, MM Holland, IM Paul, A Nekrutenko, KD Makova

P NATL ACAD SCI USA, 2014

DOI

The Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) study

Background

Because early life growth has long-lasting metabolic and behavioral consequences, intervention during this period of developmental plasticity may alter long-term obesity risk. While modifiable factors during infancy have been identified, until recently, preventive interventions had not been tested. The Intervention Nurses Starting Infants Growing on Healthy Trajectories (INSIGHT) Study is a longitudinal, randomized, controlled trial evaluating a responsive parenting intervention designed for the primary prevention of obesity. This “parenting” intervention is being compared with a home safety control among first-born infants and their parents. INSIGHT’s central hypothesis is that responsive parenting and specifically responsive feeding promotes self-regulation and shared parent-child responsibility for feeding, reducing subsequent risk for overeating and overweight.

Methods/Design

316 first-time mothers and their full-term newborns were enrolled from one maternity ward. Two weeks following delivery, dyads were randomly assigned to the “parenting” or “safety” groups. Subsequently, research nurses conduct study visits for both groups consisting of home visits at infant age 3-4, 16, 28, and 40 weeks, followed by annual clinic-based visits at 1, 2, and 3 years. Both groups receive intervention components framed around four behavior states: Sleeping, Fussy, Alert and Calm, and Drowsy. The main study outcome is BMI z-score at age 3 years; additional outcomes include those related to patterns of infant weight gain, infant sleep hygiene and duration, maternal responsiveness and soothing strategies for infant/toddler distress and fussiness, maternal feeding style and infant dietary content and physical activity. Maternal outcomes related to weight status, diet, mental health, and parenting sense of competence are being collected. Infant temperament will be explored as a moderator of parenting effects, and blood is collected to obtain genetic predictors of weight status. Finally, second-born siblings of INSIGHT participants will be enrolled in an observation-only study to explore parenting differences between siblings, their effect on weight outcomes, and carryover effects of INSIGHT interventions to subsequent siblings.

Discussion

With increasing evidence suggesting the importance of early life experiences on long-term health trajectories, the INSIGHT trial has the ability to inform future obesity prevention efforts in clinical settings.

Trial registration

NCT01167270. Registered 21 July 2010.

IM Paul, JS Williams, S Anzman-Frasca, JS Beiler, KD Makova, ME Marini, LB Hess, SE Rzucidlo, N Verdiglione, JA Mindell, LL Birch

BMC PEDIATR, 2014

DOI

Microsatellite Interruptions Stabilize Primate Genomes and Exist as Population-Specific Single Nucleotide Polymorphisms within Individual Human Genomes

Interruptions of microsatellite sequences impact genome evolution and can alter disease manifestation. However, human polymorphism levels at interrupted microsatellites (iMSs) are not known at a genome-wide scale, and the pathways for gaining interruptions are poorly understood. Using the 1000 Genomes Phase-1 variant call set, we interrogated mono-, di-, tri-, and tetranucleotide repeats up to 10 units in length. We detected ~26,000-40,000 iMSs within each of four human population groups (African, European, East Asian, and American). We identified population-specific iMSs within exonic regions, and discovered that known disease-associated iMSs contain alleles present at differing frequencies among the populations. By analyzing longer microsatellites in primate genomes, we demonstrate that single interruptions result in a genome-wide average two- to six-fold reduction in microsatellite mutability, as compared with perfect microsatellites. Centrally located interruptions lowered mutability dramatically, by two to three orders of magnitude. Using a biochemical approach, we tested directly whether the mutability of a specific iMS is lower because of decreased DNA polymerase strand slippage errors. Modeling the adenomatous polyposis coli tumor suppressor gene sequence, we observed that a single base substitution interruption reduced strand slippage error rates five- to 50-fold, relative to a perfect repeat, during synthesis by DNA polymerases α, β, or η. Computationally, we demonstrate that iMSs arise primarily by base substitution mutations within individual human genomes. Our biochemical survey of human DNA polymerase α, β, δ, κ, and η error rates within certain microsatellites suggests that interruptions are created most frequently by low fidelity polymerases. Our combined computational and biochemical results demonstrate that iMSs are abundant in human genomes and are sources of population-specific genetic variation that may affect genome stability. 
The genome-wide identification of iMSs in human populations presented here has important implications for current models describing the impact of microsatellite polymorphisms on gene expression.

G Ananda, SE Hile, A Breski, Y Wang, Y Kelkar, KD Makova, KA Eckert

PLOS GENET, 2014

DOI

Genomic landscape of human, bat, and ex vivo DNA transposon integrations

The integration and fixation preferences of DNA transposons, one of the major classes of eukaryotic transposable elements, have never been evaluated comprehensively on a genome-wide scale. Here, we present a detailed study of the distribution of DNA transposons in the human and bat genomes. We studied three groups of DNA transposons that integrated at different evolutionary times: 1) ancient (>40 My) and currently inactive human elements, 2) younger (<40 My) bat elements, and 3) ex vivo integrations of piggyBat and Sleeping Beauty elements in HeLa cells. Although the distribution of ex vivo elements reflected integration preferences, the distribution of human and (to a lesser extent) bat elements was also affected by selection. We used regression techniques (linear, negative binomial, and logistic regression models with multiple predictors) applied to 20-kb and 1-Mb windows to investigate how the genomic landscape in the vicinity of DNA transposons contributes to their integration and fixation. Our models indicate that genomic landscape explains 16-79% of variability in DNA transposon genome-wide distribution. Importantly, we not only confirmed previously identified predictors (e.g., DNA conformation and recombination hotspots) but also identified several novel predictors (e.g., signatures of double-strand breaks and telomere hexamer). Ex vivo integrations showed a bias toward actively transcribed regions. Older DNA transposons were located in genomic regions scarce in most conserved elements - likely reflecting purifying selection. Our study highlights how DNA transposons are integral to the evolution of bat and human genomes, and has implications for the development of DNA transposon assays for gene therapy and mutagenesis applications.

R Campos-Sánchez, A Kapusta, C Feschotte, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2014

DOI

Development and assessment of an optimized next-generation DNA sequencing approach for the mtgenome using the Illumina MiSeq

The development of molecular tools to detect and report mitochondrial DNA (mtDNA) heteroplasmy will increase the discrimination potential of the testing method when applied to forensic cases. The inherent limitations of the current state-of-the-art, Sanger-based sequencing, including constrictions in speed, throughput, and resolution, have hindered progress in this area. With the advent of next-generation sequencing (NGS) approaches, it is now possible to clearly identify heteroplasmic variants, and at a much lower level than previously possible. However, in order to bring these approaches into forensic laboratories and subsequently as accepted scientific information in a court of law, validated methods will be required to produce and analyze NGS data. We report here on the development of an optimized approach to NGS analysis for the mtDNA genome (mtgenome) using the Illumina MiSeq instrument. This optimized protocol allows for the production of more than 5 gigabases of mtDNA sequence per run, sufficient for detection and reliable reporting of minor heteroplasmic variants down to approximately 0.5-1.0% when multiplexing twelve samples. Depending on sample throughput needs, sequence coverage rates can be set at various levels, but were optimized here for at least 5000 reads. In addition, analysis parameters are provided for a commercially available software package that identify the highest quality sequencing reads and effectively filter out sequencing-based noise. With this method it will be possible to measure the rates of low-level heteroplasmy across the mtgenome, evaluate the transmission of heteroplasmy between the generations of maternal lineages, and assess the drift of variant sequences between different tissue types within an individual.

JA McElhoe, MM Holland, KD Makova, MSW Su, IM Paul, CH Baker, SA Faith, B Young

FORENSIC SCI INT-GEN, 2014

DOI

Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach

Polymorphism discovery is a routine application of next-generation sequencing technology where multiple samples are sent to a service provider for library preparation, subsequent sequencing, and bioinformatic analyses. The decreasing cost and advances in multiplexing approaches have made it possible to analyze hundreds of samples at a reasonable cost. However, because of the manual steps involved in the initial processing of samples and handling of sequencing equipment, cross-contamination remains a significant challenge. It is especially problematic in cases where polymorphism frequencies do not adhere to diploid expectation, for example, heterogeneous tumor samples, organellar genomes, as well as during bacterial and viral sequencing. In these instances, low levels of contamination may be readily mistaken for polymorphisms, leading to false results. Here we describe practical steps designed to reliably detect contamination and uncover its origin, and also provide new, Galaxy-based, readily accessible computational tools and workflows for quality control. All results described in this report can be reproduced interactively on the web as described at http://usegalaxy.org/contamination.

B Dickins, B Rebolledo-Jaramillo, MSW Su, IM Paul, D Blankenberg, N Stoler, KD Makova, A Nekrutenko

BIOTECHNIQUES, 2014

DOI

Segmenting the human genome based on states of neutral genetic divergence

Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human-orangutan alignments of neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states, each uniquely characterized by specific combinations of divergence levels. We then parsed the mutagenic contributions of various biochemical processes associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes. We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants (including those associated with cancer and other diseases) and to improve computational predictions of noncoding functional elements.

P Kuruppumullage Don, G Ananda, F Chiaromonte, KD Makova

P NATL ACAD SCI USA, 2013

DOI

Mature microsatellites: Mechanisms underlying dinucleotide microsatellite mutational biases in human cells

Dinucleotide microsatellites are dynamic DNA sequences that affect genome stability. Here, we focused on mature microsatellites, defined as pure repeats of lengths above the threshold and unlikely to mutate below it in a single mutational event. We investigated the prevalence and mutational behavior of these sequences by using human genome sequence data, human cells in culture, and purified DNA polymerases. Mature dinucleotides (≥10 units) are present within exonic sequences of >350 genes, resulting in vulnerability to cellular genetic integrity. Mature dinucleotide mutagenesis was examined experimentally using ex vivo and in vitro approaches. We observe an expansion bias for dinucleotide microsatellites up to 20 units in length in somatic human cells, in agreement with previous computational analyses of germline biases. Using purified DNA polymerases and human cell lines deficient for mismatch repair (MMR), we show that the expansion bias is caused by functional MMR and is not due to DNA polymerase error biases. Specifically, we observe that the MutSα and MutLα complexes protect against expansion mutations. Our data support a model wherein different MMR complexes shift the balance of mutations toward deletion or expansion. Finally, we show that replication fork progression is stalled within long dinucleotides, suggesting that mutational mechanisms within long repeats may be distinct from shorter lengths, depending on the biochemistry of fork resolution. Our work combines computational and experimental approaches to explain the complex mutational behavior of dinucleotide microsatellites in humans.

BA Baptiste, G Ananda, N Strubczewski, A Lutzkanin, SJ Khoo, A Srikanth, N Kim, KD Makova, MM Krasilnikova, KA Eckert

G3-GENES GENOM GENET, 2013

DOI

Distinct mutational behaviors differentiate short tandem repeats from microsatellites in the human genome

A tandem repeat’s (TR) propensity to mutate increases with repeat number, and can become very pronounced beyond a critical boundary, transforming it into a microsatellite (MS). However, a clear understanding of the mutational behavior of different TR classes and motifs and related mechanisms is lacking, as is a consensus on the existence of a boundary separating short TRs (STRs) from MSs. This hinders our understanding of MSs’ mutational properties and their effective use as genetic markers. Using indel calls for 179 individuals from 1000 Genomes Pilot-1 Project, we determined polymorphism incidence for four major TR classes, and formalized its varying relationship with repeat number using segmented regression. We observed a biphasic regime with a transition from a faster to a slower exponential growth at 9, 5, 4, and 4 repeats for mono-, di-, tri-, and tetranucleotide TRs, respectively. We used an in vitro mutagenesis assay to evaluate the contribution of strand slippage errors to mutability. STRs and MSs differ in their absolute polymorphism levels, but more importantly in their rates of mutability growth. Although strand slippage is a major factor driving mononucleotide polymorphism incidence, dinucleotide polymorphism incidence is greater than that expected due to strand slippage alone, indicating that additional cellular factors might be driving dinucleotide mutability in the human genome. Leveraging on hundreds of human genomes, we present the first comprehensive, genome-wide analysis of TR mutational behavior, encompassing several motif sizes and compositions.

G Ananda, E Walsh, KD Jacob, M Krasilnikova, KA Eckert, F Chiaromonte, KD Makova

GENOME BIOL EVOL, 2013

DOI

Rescuing Alu: Recovery of New Inserts Shows LINE-1 Preserves Alu Activity through A-Tail Expansion

Alu elements are trans-mobilized by the autonomous non-LTR retroelement, LINE-1 (L1). Alu-induced insertion mutagenesis contributes to about 0.1% human genetic disease and is responsible for the majority of the documented instances of human retroelement insertion-induced disease. Here we introduce a SINE recovery method that provides a complementary approach for comprehensive analysis of the impact and biological mechanisms of Alu retrotransposition. Using this approach, we recovered 226 de novo tagged Alu inserts in HeLa cells. Our analysis reveals that in human cells marked Alu inserts driven by either exogenously supplied full length L1 or ORF2 protein are indistinguishable. Four percent of de novo Alu inserts were associated with genomic deletions and rearrangements and lacked the hallmarks of retrotransposition. In contrast to L1 inserts, 5′ truncations of Alu inserts are rare, as most of the recovered inserts (96.5%) are full length. De novo Alus show a random pattern of insertion across chromosomes, but further characterization revealed an Alu insertion bias exists favoring insertion near other SINEs, highly conserved elements, with almost 60% landing within genes. De novo Alu inserts show no evidence of RNA editing. Priming for reverse transcription rarely occurred within the first 20 bp (most 5′) of the A-tail. The A-tails of recovered inserts show significant expansion, with many at least doubling in length. Sequence manipulation of the construct led to the demonstration that the A-tail expansion likely occurs during insertion due to slippage by the L1 ORF2 protein. We postulate that the A-tail expansion directly impacts Alu evolution by reintroducing new active source elements to counteract the natural loss of active Alus and minimizing Alu extinction.

BJ Wagstaff, DJ Hedges, RS Derbes, R Campos-Sánchez, F Chiaromonte, KD Makova, AM Roy-Engel

PLOS GENET, 2012

DOI

A genome-wide analysis of common fragile sites: What features determine chromosomal instability in the human genome?

Chromosomal common fragile sites (CFSs) are unstable genomic regions that break under replication stress and are involved in structural variation. They frequently are sites of chromosomal rearrangements in cancer and of viral integration. However, CFSs are undercharacterized at the molecular level and thus difficult to predict computationally. Newly available genome-wide profiling studies provide us with an unprecedented opportunity to associate CFSs with features of their local genomic contexts. Here, we contrasted the genomic landscape of cytogenetically defined aphidicolin-induced CFSs (aCFSs) to that of nonfragile sites, using multiple logistic regression. We also analyzed aCFS breakage frequencies as a function of their genomic landscape, using standard multiple regression. We show that local genomic features are effective predictors both of regions harboring aCFSs (explaining ∼77% of the deviance in logistic regression models) and of aCFS breakage frequencies (explaining ∼45% of the variance in standard regression models). In our optimal models (having highest explanatory power), aCFSs are predominantly located in G-negative chromosomal bands and away from centromeres, are enriched in Alu repeats, and have high DNA flexibility. In alternative models, CpG island density, transcription start site density, H3K4me1 coverage, and mononucleotide microsatellite coverage are significant predictors. Also, aCFSs have high fragility when colocated with evolutionarily conserved chromosomal breakpoints. Our models are predictive of the fragility of aCFSs mapped at a higher resolution. Importantly, the genomic features we identified here as significant predictors of fragility allow us to draw valuable inferences on the molecular mechanisms underlying aCFSs.

A Fungtammasan, E Walsh, F Chiaromonte, KA Eckert, KD Makova

GENOME RES, 2012

DOI

A matter of life or death: How microsatellites emerge in and vanish from the human genome

Microsatellites (tandem repeats of short DNA motifs) are abundant in the human genome and have high mutation rates. While microsatellite instability is implicated in numerous genetic diseases, the molecular processes involved in their emergence and disappearance are still not well understood. Microsatellites are hypothesized to follow a life cycle, wherein they are born and expand into adulthood, until their degradation and death. Here we identified microsatellite births/deaths in human, chimpanzee, and orangutan genomes, using macaque and marmoset as outgroups. We inferred mutations causing births/deaths based on parsimony, and investigated local genomic environments affecting them. We also studied birth/death patterns within transposable elements (Alus and L1s), coding regions, and disease-associated loci. We observed that substitutions were the predominant cause for births of short microsatellites, while insertions and deletions were important for births of longer microsatellites. Substitutions were the cause for deaths of microsatellites of virtually all lengths. AT-rich L1 sequences exhibited elevated frequency of births/deaths over their entire length, while GC-rich Alus only in their 3′ poly(A) tails and middle A-stretches, with differences depending on transposable element integration timing. Births/deaths were strongly selected against in coding regions. Births/deaths occurred in genomic regions with high substitution rates, protomicrosatellite content, and L1 density, but low GC content and Alu density. The majority of the 17 disease-associated microsatellites examined are evolutionarily ancient (were acquired by the common ancestor of simians). Our genome-wide investigation of microsatellite life cycle has fundamental applications for predicting the susceptibility of birth/death of microsatellites, including many disease-causing loci.

YD Kelkar, KA Eckert, F Chiaromonte, KD Makova

GENOME RES, 2011

DOI

Harnessing cloud computing with Galaxy Cloud

E Afgan, D Baker, N Coraor, H Goto, IM Paul, KD Makova, A Nekrutenko, J Taylor

NAT BIOTECHNOL, 2011

DOI

Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study

Background

Originally believed to be a rare phenomenon, heteroplasmy - the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual - is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission.

Results

Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, we devised a set of criteria for detecting polymorphic sites in heterogeneous genetic samples that is resistant to the noise originating from massively parallel sequencing technologies. Application of these criteria to nine human mtDNA samples revealed four heteroplasmic sites.

Conclusions

Our results suggest that the incidence of heteroplasmy may be lower than estimated in some other recent re-sequencing studies, and that mtDNA allelic frequencies differ significantly both between tissues of the same individual and between a mother and her offspring. We designed our study in such a way that the complete analysis described here can be repeated by anyone either at our site or directly on the Amazon Cloud. Our computational pipeline can be easily modified to accommodate other applications, such as viral re-sequencing.

H Goto, B Dickins, E Afgan, IM Paul, J Taylor, KD Makova, A Nekrutenko

GENOME BIOL, 2011

DOI

A genome-wide view of mutation rate co-variation using multivariate analyses

Background

While the abundance of available sequenced genomes has led to many studies of regional heterogeneity in mutation rates, the co-variation among rates of different mutation types remains largely unexplored, hindering a deeper understanding of mutagenesis and genome dynamics. Here, utilizing primate and rodent genomic alignments, we apply two multivariate analysis techniques (principal components and canonical correlations) to investigate the structure of rate co-variation for four mutation types and simultaneously explore the associations with multiple genomic features at different genomic scales and phylogenetic distances.

Results

We observe a consistent, largely linear co-variation among rates of nucleotide substitutions, small insertions and small deletions, with some non-linear associations detected among these rates on chromosome X and near autosomal telomeres. This co-variation appears to be shaped by a common set of genomic features, some previously investigated and some novel to this study (nuclear lamina binding sites, methylated non-CpG sites and nucleosome-free regions). Strong non-linear relationships are also detected among genomic features near the centromeres of large chromosomes. Microsatellite mutability co-varies with other mutation rates at finer scales, but not at 1 Mb, and shows varying degrees of association with genomic features at different scales.

Conclusions

Our results allow us to speculate about the role of different molecular mechanisms, such as replication, recombination, repair and local chromatin environment, in mutagenesis. The software tools developed for our analyses are available through Galaxy, an open-source genomics portal, to facilitate the use of multivariate techniques in future large-scale genomics studies.

G Ananda, F Chiaromonte, KD Makova

GENOME BIOL, 2011

DOI

Exploratory spatial analysis of in vitro respiratory syncytial virus co-infections

The cell response to virus infection and virus perturbation of that response is dynamic and is reflected by changes in cell susceptibility to infection. In this study, we evaluated the response of human epithelial cells to sequential infections with human respiratory syncytial virus strains A2 and B to determine if a primary infection with one strain will impact the ability of cells to be infected with the second as a function of virus strain and time elapsed between the two exposures. Infected cells were visualized with fluorescent markers, and the locations of all cells in the tissue culture well were identified using imaging software. We employed tools from spatial statistics to investigate the likelihood of a cell being infected given its proximity to a cell infected with either the homologous or heterologous virus. We used point processes, K-functions, and simulation procedures designed to account for specific features of our data when assessing spatial associations. Our results suggest that intrinsic cell properties increase susceptibility of cells to infection, more so for RSV-B than for RSV-A. Further, we provide evidence that the primary infection can decrease susceptibility of cells to the heterologous challenge virus but only at the 16 h time point evaluated in this study. Our research effort highlights the merits of integrating empirical and statistical approaches to gain greater insight on in vitro dynamics of virus-host interactions.

I Simeonov, X Gong, O Kim, M Poss, F Chiaromonte, J Fricks

VIRUSES, 2010

DOI

Strong purifying selection at genes escaping X chromosome inactivation

To achieve dosage balance of X-linked genes between mammalian males and females, one female X chromosome becomes inactivated. However, approximately 15% of genes on this inactivated chromosome escape X chromosome inactivation (XCI). Here, using a chromosome-wide analysis of primate X-linked orthologs, we test a hypothesis that such genes evolve under a unique selective pressure. We find that escape genes are subject to stronger purifying selection than inactivated genes and that positive selection does not significantly affect the evolution of these genes. The strength of selection does not differ between escape genes with similar versus different expression levels in males versus females. Intriguingly, escape genes possessing Y homologs evolve under the strongest purifying selection. We also found evidence of stronger conservation in gene expression levels in escape than inactivated genes. We hypothesize that divergence in function and expression between X and Y gametologs is driving such strong purifying selection for escape genes.

C Park, L Carrel, KD Makova

MOL BIOL EVOL, 2010

DOI

Complete Khoisan and Bantu genomes from southern Africa

The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial and small sets of nuclear markers have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans. However, until now, fully sequenced human genomes have been limited to recently diverged populations. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.

SC Schuster, W Miller, A Ratan, LP Tomsho, B Giardine, LR Kasson, RS Harris, DC Petersen, F Zhao, J Qi, C Alkan, JM Kidd, Y Sun, DI Drautz, P Bouffard, DM Muzny, JG Reid, LV Nazareth, Q Wang, R Burhans, C Riemer, NE Wittekindt, P Moorjani, EA Tindall, CG Danko, WS Teo, AM Buboltz, Z Zhang, Q Ma, A Oosthuysen, AW Steenkamp, H Oostuisen, P Venter, J Gajewski, Y Zhang, BF Pugh, KD Makova, A Nekrutenko, ER Mardis, N Patterson, TH Pringle, F Chiaromonte, JC Mullikin, EE Eichler, RC Hardison, RA Gibbs, TT Harkins, VM Hayes

NATURE, 2010

DOI

What is a microsatellite: A computational and experimental definition based upon repeat mutational behavior at A/T and GT/AC repeats

Microsatellites are abundant in eukaryotic genomes and have high rates of strand slippage-induced repeat number alterations. They are popular genetic markers, and their mutations are associated with numerous neurological diseases. However, the minimal number of repeats required to constitute a microsatellite has been debated, and a definition of a microsatellite that considers its mutational behavior has been lacking. To define a microsatellite, we investigated slippage dynamics for a range of repeat sizes, utilizing two approaches. Computationally, we assessed length polymorphism at repeat loci in ten ENCODE regions resequenced in four human populations, assuming that the occurrence of polymorphism reflects strand slippage rates. Experimentally, we determined the in vitro DNA polymerase-mediated strand slippage error rates as a function of repeat number. In both approaches, we compared strand slippage rates at tandem repeats with the background slippage rates. We observed two distinct modes of mutational behavior. At small repeat numbers, slippage rates were low and indistinguishable from background measurements. A marked transition in mutability was observed as the repeat array lengthened, such that slippage rates at large repeat numbers were significantly higher than the background rates. For both mononucleotide and dinucleotide microsatellites studied, the transition length corresponded to a similar number of nucleotides (approximately 10). Thus, microsatellite threshold is determined not by the presence/absence of strand slippage at repeats but by an abrupt alteration in slippage rates relative to background. These findings have implications for understanding microsatellite mutagenesis, standardization of genome-wide microsatellite analyses, and predicting polymorphism levels of individual microsatellite loci.

YD Kelkar, N Strubczewski, SE Hile, F Chiaromonte, KA Eckert, KD Makova

GENOME BIOL EVOL, 2010

DOI

Multivariate statistical analyses demonstrate unique host immune responses to single and dual lentiviral infection

Background

Feline immunodeficiency virus (FIV) and human immunodeficiency virus (HIV) are recently identified lentiviruses that cause progressive immune decline and ultimately death in infected cats and humans. It is of great interest to understand how to prevent immune system collapse caused by these lentiviruses. We recently described that disease caused by a virulent FIV strain in cats can be attenuated if animals are first infected with a feline immunodeficiency virus derived from a wild cougar. The detailed temporal tracking of cat immunological parameters in response to two viral infections resulted in high-dimensional datasets containing variables that exhibit strong co-variation. Initial analyses of these complex data using univariate statistical techniques did not account for interactions among immunological response variables and therefore potentially obscured significant effects between infection state and immunological parameters.

Methodology and Principal Findings

Here, we apply a suite of multivariate statistical tools, including Principal Component Analysis, MANOVA and Linear Discriminant Analysis, to temporal immunological data resulting from FIV superinfection in domestic cats. We investigated the co-variation among immunological responses, the differences in immune parameters among four groups of five cats each (uninfected, single and dual infected animals), and the “immune profiles” that discriminate among them over the first four weeks following superinfection. Dual infected cats mount an immune response by 24 days post superinfection that is characterized by elevated levels of CD8 and CD25 cells and increased expression of IL4 and IFNγ, and FAS. This profile discriminates dual infected cats from cats infected with FIV alone, which show high IL-10 and lower numbers of CD8 and CD25 cells.

Conclusions

Multivariate statistical analyses demonstrate both the dynamic nature of the immune response to FIV single and dual infection and the development of a unique immunological profile in dual infected cats, which are protected from immune decline.

S Roy, J Lavine, F Chiaromonte, J Terwee, S VandeWoude, O Bjornstad, M Poss

PLOS ONE, 2009

DOI

Ride the wavelet: A multiscale analysis of genomic contexts flanking small insertions and deletions

Recent studies have revealed that insertions and deletions (indels) are more different in their formation than previously assumed. What remains enigmatic is how the local DNA sequence context contributes to these differences. To investigate the relative impact of various molecular mechanisms to indel formation, we analyzed sequence contexts of indels in the non protein- or RNA-coding, nonrepetitive (NCNR) portion of the human genome. We considered small (≤30-bp) indels occurring in the human lineage since its divergence from chimpanzee and used wavelet techniques to study, simultaneously for multiple scales, the spatial patterns of short sequence motifs associated with indel mutagenesis. In particular, we focused on motifs associated with DNA polymerase activity, topoisomerase cleavage, double-strand breaks (DSBs), and their repair. We came to the following conclusions. First, many motifs are characterized by unique enrichment profiles in the vicinity of indels vs. indel-free portions of the genome, verifying the importance of sequence context in indel mutagenesis. Second, only limited similarity in motif frequency profiles is evident flanking insertions vs. deletions, confirming differences in their mutagenesis. Third, substantial similarity in frequency profiles exists between pairs of individual motifs flanking insertions (and separately deletions), suggesting “cooperation” among motifs, and thus molecular mechanisms, during indel formation. Fourth, the wavelet analyses demonstrate that all these patterns are highly dependent on scale (the size of an interval considered). Finally, our results depict a model of indel mutagenesis comprising both replication and recombination (via repair of paused replication forks and site-specific recombination).

EM Kvikstad, F Chiaromonte, KD Makova

GENOME RES, 2009

DOI

Human-macaque comparisons illuminate variation in neutral substitution rates

Background

The evolutionary distance between human and macaque is particularly attractive for investigating local variation in neutral substitution rates, because substitutions can be inferred more reliably than in comparisons with rodents and are less influenced by the effects of current and ancient diversity than in comparisons with closer primates. Here we investigate the human-macaque neutral substitution rate as a function of a number of genomic parameters.

Results

Using regression analyses we find that male mutation bias, male (but not female) recombination rate, distance to telomeres and substitution rates computed from orthologous regions in mouse-rat and dog-cow comparisons are prominent predictors of the neutral rate. Additionally, we demonstrate that the previously observed biphasic relationship between neutral rate and GC content can be accounted for by properly combining rates at CpG and non-CpG sites. Finally, we find the neutral rate to be negatively correlated with the densities of several classes of computationally predicted functional elements, and less so with the densities of certain classes of experimentally verified functional elements.

Conclusion

Our results suggest that while female recombination may be mainly responsible for driving evolution in GC content, male recombination may be mutagenic, and that other mutagenic mechanisms acting near telomeres, and mechanisms whose effects are shared across mammalian genomes, play significant roles. We also have evidence that the nonlinear increase in rates at high GC levels may be largely due to hyper-mutability of CpG dinucleotides. Finally, our results suggest that the performance of conservation-based prediction methods can be improved by accounting for neutral rates.

S Tyekucheva, KD Makova, JE Karro, RC Hardison, W Miller, F Chiaromonte

GENOME BIOL, 2008

DOI

The genome-wide determinants of human and chimpanzee microsatellite evolution

Mutation rates of microsatellites vary greatly among loci. The causes of this heterogeneity remain largely enigmatic yet are crucial for understanding numerous human neurological diseases and genetic instability in cancer. In this first genome-wide study, the relative contributions of intrinsic features and regional genomic factors to the variation in mutability among orthologous human-chimpanzee microsatellites are investigated with resampling and regression techniques. As a result, we uncover the intricacies of microsatellite mutagenesis as follows. First, intrinsic features (repeat number, length, and motif size), which all influence the probability and rate of slippage, are the strongest predictors of mutability. Second, mutability increases nonuniformly with length, suggesting that processes additional to slippage, such as faulty repair, contribute to mutations. Third, mutability varies among microsatellites with different motif composition likely due to dissimilarities in secondary DNA structure formed by their slippage intermediates. Fourth, mutability of mononucleotide microsatellites is impacted by their location on sex chromosomes vs. autosomes and inside vs. outside of Alu repeats, the former confirming the importance of replication and the latter suggesting a role for gene conversion. Fifth, transcription status and location in a particular isochore do not influence microsatellite mutability. Sixth, compared with intrinsic features, regional genomic factors have only minor effects. Finally, our regression models explain ∼90% of variation in microsatellite mutability and can generate useful predictions for the studies of human diseases, forensics, and conservation genetics.

YD Kelkar, S Tyekucheva, F Chiaromonte, KD Makova

GENOME RES, 2008

DOI

A macaque's-eye view of human insertions and deletions: Differences in mechanisms

Insertions and deletions (indels) cause numerous genetic diseases and lead to pronounced evolutionary differences among genomes. The macaque sequences provide an opportunity to gain insights into the mechanisms generating these mutations on a genome-wide scale by establishing the polarity of indels occurring in the human lineage since its divergence from the chimpanzee. Here we apply novel regression techniques and multiscale analyses to demonstrate an extensive regional indel rate variation stemming from local fluctuations in divergence, GC content, male and female recombination rates, proximity to telomeres, and other genomic factors. We find that both replication and, surprisingly, recombination are significantly associated with the occurrence of small indels. Intriguingly, the relative inputs of replication versus recombination differ between insertions and deletions, thus the two types of mutations are likely guided in part by distinct mechanisms. Namely, insertions are more strongly associated with factors linked to recombination, while deletions are mostly associated with replication-related features. Indel as a term misleadingly groups the two types of mutations together by their effect on a sequence alignment. However, here we establish that the correct identification of a small gap as an insertion or a deletion (by use of an outgroup) is crucial to determining its mechanism of origin. In addition to providing novel insights into insertion and deletion mutagenesis, these results will assist in gap penalty modeling and eventually lead to more reliable genomic alignments.

EM Kvikstad, S Tyekucheva, F Chiaromonte, KD Makova

PLOS COMPUT BIOL, 2007

DOI

Genomic environment predicts expression patterns on the human inactive X chromosome

What genomic landmarks render most genes silent while leaving others expressed on the inactive X chromosome in mammalian females? To date, signals determining expression status of genes on the inactive X remain enigmatic despite the availability of complete genomic sequences. Long interspersed repeats (L1s), particularly abundant on the X, are hypothesized to spread the inactivation signal and are enriched in the vicinity of inactive genes. However, both L1s and inactive genes are also more prevalent in ancient evolutionary strata. Did L1s accumulate there because of their role in inactivation or simply because they spent more time on the rarely recombining X? Here we utilize an experimentally derived inactivation profile of the entire human X chromosome to uncover sequences important for its inactivation, and to predict expression status of individual genes. Focusing on Xp22, where both inactive and active genes reside within evolutionarily young strata, we compare neighborhoods of genes with different inactivation states to identify enriched oligomers. Occurrences of such oligomers are then used as features to train a linear discriminant analysis classifier. Remarkably, expression status is correctly predicted for 84% and 91% of active and inactive genes, respectively, on the entire X, suggesting that oligomers enriched in Xp22 capture most of the genomic signal determining inactivation. To our surprise, the majority of oligomers associated with inactivated genes fall within L1 elements, even though L1 frequency in Xp22 is low. Moreover, these oligomers are enriched in parts of L1 sequences that are usually underrepresented in the genome. Thus, our results strongly support the role of L1s in X inactivation, yet indicate that a chromatin microenvironment composed of multiple genomic sequence elements determines expression status of X chromosome genes.

L Carrel, C Park, S Tyekucheva, J Dunn, F Chiaromonte, KD Makova

PLOS GENET, 2006

DOI

Strong and weak male mutation bias at different sites in the primate genomes: Insights from the human-chimpanzee comparison

Male mutation bias is a higher mutation rate in males than in females thought to result from the greater number of germ line cell divisions in males. If errors in DNA replication cause most mutations, then the magnitude of male mutation bias, measured as the male-to-female mutation rate ratio (α), should reflect the relative excess of male versus female germ line cell divisions. Evolutionary rates averaged among all sites in a sequence and compared between mammalian sex chromosomes were shown to be indeed higher in males than in females. However, it is presently unknown whether individual classes of substitutions exhibit such bias. To address this issue, we investigated male mutation bias separately at non-CpG and CpG sites using human-chimpanzee whole-genome alignments. We observed strong male mutation bias at non-CpG sites: α in the X-autosome comparison was ∼6-7, which was similar to the male-to-female ratio in the number of germ line cell divisions. In contrast, mutations at CpG sites exhibited weak male mutation bias: α in the X-autosome comparison was only ∼2-3. This is consistent with the methylation-induced and replication-independent mechanism of CpG transitions, which constitute the majority of mutations at CpG sites. Interestingly, our study also indicated weak male mutation bias for transversions at CpG sites, implying a spontaneous mechanism largely not associated with replication. Male mutation bias was equally strong at CpG and non-CpG sites located within unmethylated “CpG islands,” suggesting the replication-dependent origin of these mutations. Thus, we found that the strength of male mutation bias is nonuniform in the primate genomes. Importantly, we discovered that male mutation bias depends on the proportion of CpG sites in the loci compared. This might explain the differences in the magnitude of primate male mutation bias observed among studies.

J Taylor, S Tyekucheva, M Zody, F Chiaromonte, KD Makova

MOL BIOL EVOL, 2006

DOI

Insertions and deletions are male biased too: A whole-genome analysis in rodents

It is presently accepted that, in mammals, due to the greater number of cell divisions in the male germline than in the female germline, nucleotide substitutions occur more frequently in males. The data on mutation bias in insertions and deletions (indels) are contradictory, with some studies indicating no sex bias and others indicating either female or male bias. The sequenced rat and mouse genomes provide a unique opportunity to investigate a potential sex bias for different types of mutations. Indeed, mutation rates can be accurately estimated from a large number of orthologous loci in organisms similar in generation time and in the number of germline cell divisions. Here we compare the mutation rates between chromosome X and autosomes for likely neutral sites in eutherian ancestral interspersed repetitive elements present at orthologous locations in the rat and mouse genomes. We find that small indels are male biased: The male-to-female mutation rate ratio (α) for indels in rodents is ∼2. Similarly, our whole-genome analysis in rodents indicates an approximately twofold excess of nucleotide substitutions originating in males over that in females. This is the same as the male-to-female ratio of the number of germline cell divisions in rat and mouse. Thus, this is consistent with nucleotide substitutions and small indels occurring primarily during DNA replication.

KD Makova, S Yang, F Chiaromonte

GENOME RES, 2004

DOI