Refereed journal papers
(My students in red italics)
- Xia, Xuhua. 2024. How Trustworthy Are the Genomic Sequences of SARS-CoV-2 in GenBank? Microorganisms 12, no. 11: 2187.
Highlight: Many SARS-CoV-2 genomes deposited in GenBank and GISAID are fake genomes.
Abstract: Well-annotated gene and genomic sequences serve as a foundation for making inferences in molecular biology and evolution and can directly impact public health. The first SARS-CoV-2 genome was submitted to the GenBank database hosted by the U.S. National Center for Biotechnology Information and used to develop the two successful vaccines. Conserved protein domains are often chosen as targets for developing antiviral medicines or vaccines. Mutation and substitution patterns provide crucial information not only on functional motifs and genome/protein interactions but also for characterizing phylogenetic relationships among viral strains. These patterns, together with the collection time of viral samples, serve as the basis for addressing the question of when and where the host-switching event occurred. Unfortunately, viral genomic sequences submitted to GenBank undergo little quality control, and critical information in the annotation is frequently changed without being recorded. Researchers often have no choice but to hold blind faith in the authenticity of the sequences. There have been reports of incorrect genome annotation but no report that casts doubt on the genomic sequences themselves because it seems theoretically impossible to identify genomic sequences that may not be authentic. This paper takes an innovative approach to show that some SARS-CoV-2 genomes submitted to GenBank cannot possibly be authentic. Specifically, some SARS-CoV-2 genomic sequences deposited in GenBank with collection times in 2023 and 2024, isolated from saliva, nasopharyngeal, sewage, and stool, are identical to the reference genome of SARS-CoV-2 (NC_045512). The probability of such occurrence is effectively 0. I also compile SARS-CoV-2 genomes with changed sample collection times. One may be led astray in bioinformatic analysis without being aware of errors in sequences and sequence annotation.
- Xia, Xuhua. 2024. Phylogeographic Analysis for Understanding Origin, Speciation, and Biogeographic Expansion of Invasive Asian Hornet, Vespa velutina Lepeletier, 1836 (Hymenoptera, Vespidae). Life 14, no. 10: 1293.
Abstract: The Asian hornet, Vespa velutina, is an invasive species that has not only expanded its range in Asia but has also invaded European countries, and it incurs significant costs on local apiculture. This phylogeographic study aims to trace the evolutionary trajectory of V. velutina and its close relatives; it aims to identify features that characterize an invasive species. The last successful invasion of Vespa velutina into France occurred in late May, 2002, and into South Korea in early October, 2002, which were estimated by fitting a logistic equation to the number of observations over time. The instantaneous rate of increase is 1.3667 for V. velutina in France and 0.2812 in South Korea, which are consistent with the interpretation of little competition in France and strong competition from local hornet species in South Korea. The invasive potential of two sister lineages can be compared by their distribution area when proper statistical adjustments are made to account for differences in sample size. V. velutina has a greater invasive potential than its sister lineage. The ancestor of V. velutina split into two lineages, one found in Indonesia/Malaysia and the other colonizing the Asian continent. The second lineage split into a sedentary clade inhabiting Pakistan and India and an invasive lineage colonizing much of Southeast Asia. This latter lineage gave rise to the subspecies V. v. nigrithorax, which invaded France, South Korea, and Japan. My software PGT version 1.5, which generates geophylogenies and computes geographic areas for individual taxa, is useful for understanding biogeography in general and invasive species in particular. I discussed the conceptual formulation of an index of invasiveness for a comparison between sister lineages.
- Askari Rad, M.; Kruglikov, A.; Xia, X. 2024. Three-Way Alignment Improves Multiple Sequence Alignment of Highly Diverged Sequences. Algorithms 17, 205.
Abstract: The standard approach for constructing a phylogenetic tree from a set of sequences consists of two key stages. First, a multiple sequence alignment (MSA) of the sequences is computed. The aligned data are then used to reconstruct the phylogenetic tree. The accuracy of the resulting tree heavily relies on the quality of the MSA. The quality of the popularly used progressive sequence alignment depends on a guide tree, which determines the order of aligning sequences. Most MSA methods use pairwise comparisons to generate a distance matrix and reconstruct the guide tree. However, when dealing with highly diverged sequences, constructing a good guide tree is challenging. In this work, we propose an alternative approach using three-way dynamic programming alignment to generate the distance matrix and the guide tree. This three-way alignment incorporates information from additional sequences to compute evolutionary distances more accurately. Using simulated datasets on two symmetric and asymmetric trees, we compared MAFFT with its default guide tree with MAFFT with a guide tree produced using the three-way alignment. We found that (1) the three-way alignment can reconstruct better guide trees than those from the most accurate options of MAFFT, and (2) the better guide tree, on average, leads to more accurate phylogenetic reconstruction. However, the improvement over the L-INS-i option of MAFFT is small, attesting to the excellence of the alignment quality of MAFFT. Surprisingly, the two criteria for choosing the best MSA (phylogenetic accuracy and sum-of-pair score) conflict with each other.
- Farookhi H, Xia X. 2024. Differential Selection for Translation Efficiency Shapes Translation Machineries in Bacterial Species. Microorganisms 12(4):768.
Abstract: Different bacterial species have dramatically different generation times, from 20–30 min in Escherichia coli to about two weeks in Mycobacterium leprae. The translation machinery in a cell needs to synthesize all proteins for a new cell in each generation. The three subprocesses of translation, i.e., initiation, elongation, and termination, are expected to be under stronger selection pressure to optimize in short-generation bacteria (SGB) such as Vibrio natriegens than in the long-generation Mycobacterium leprae. The initiation efficiency depends on the start codon decoded by the initiation tRNA, the optimal Shine–Dalgarno (SD) decoded by the anti-SD (aSD) sequence on small subunit rRNA, and the secondary structure that may embed the initiation signals and prevent them from being decoded. The elongation efficiency depends on the tRNA pool and codon usage. The termination efficiency in bacteria depends mainly on the nature of the stop codon and the nucleotide immediately downstream of the stop codon. By contrasting SGB with long-generation bacteria (LGB), we predict (1) SGB to have more ribosome RNA operons to produce ribosomes, and more tRNA genes for carrying amino acids to ribosomes, (2) SGB to have a higher percentage of genes using AUG as the start codon and UAA as the stop codon than LGB, (3) SGB to exhibit better codon and anticodon adaptation than LGB, and (4) SGB to have a weaker secondary structure near the translation initiation signals than LGB. These differences between SGB and LGB should be more pronounced in highly expressed genes than the rest of the genes. We present empirical evidence in support of these predictions.
- Nasser F, Gaudreau A, Lubega S, Zaker A, Xia X, Mer AS, D’Costa VM. 2024. Characterization of the diversity of type IV secretion system-encoding plasmids in Acinetobacter. Emerg Microbes Infect 13(1):2320929
Abstract: The multi-drug resistant pathogen Acinetobacter baumannii has gained global attention as an important clinical challenge. Owing to its ability to survive on surfaces, its capacity for horizontal gene transfer, and its resistance to front-line antibiotics, A. baumannii has established itself as a successful pathogen. Bacterial conjugation is a central mechanism for pathogen evolution. The epidemic multidrug-resistant A. baumannii ACICU harbours a plasmid encoding a Type IV Secretion System (T4SS) with homology to the E. coli F-plasmid, and plasmids with homologous gene clusters have been identified in several A. baumannii sequence types. However, the genetic and host strain diversity, global distribution, and functional ability of this group of plasmids are not fully understood. Using systematic analysis, we show that pACICU2 belongs to a group of almost 120 T4SS-encoding plasmids within four different species of Acinetobacter and one strain of Klebsiella pneumoniae from human and environmental origin, and globally distributed across 20 countries spanning 4 continents. Genetic diversity was observed both outside and within the T4SS-encoding cluster, and 47% of plasmids harboured resistance determinants, with two plasmids harbouring eleven. Conjugation studies with an extensively drug-resistant (XDR) strain showed that the XDR plasmid could be successfully transferred to a more divergent A. baumannii, and transconjugants exhibited the resistance phenotype of the plasmid. Collectively, this demonstrates that these T4SS-encoding plasmids are globally distributed and more widespread among Acinetobacter than previously thought, and that they represent an important potential reservoir for future clinical concern.
- Freeman A; Xia X. 2024. Phylogeographic Reconstruction to Trace the Source Population of Asian Giant Hornet Caught in Nanaimo in Canada and Blaine in the USA. Life 14(3), 283.
Abstract: The Asian giant hornet, Vespa mandarinia, is an invasive species that could potentially destroy the local honeybee industry in North America. It has been observed to nest in the coastal regions of British Columbia in Canada and Washington State in the USA. What is the source population of the immigrant hornets? The identification of the source population can shed light not only on the route of immigration but also on the similarity between the native habitat and the potential new habitat in the Pacific Northwest. We analyzed mitochondrial COX1 sequences of specimens sampled from multiple populations in China, the Republic of Korea, Japan, and the Russian Far East. V. mandarinia exhibits phylogeographic patterns, forming monophyletic clades for 16 specimens from China, six specimens from the Republic of Korea, and two specimens from Japan. The two mitochondrial COX1 sequences from Nanaimo, British Columbia, are identical to the two sequences from Japan. The COX1 sequence from Blaine, Washington State, clustered with those from the Republic of Korea and is identical to one sequence from the Republic of Korea. Our geophylogeny, which allows visualization of genetic variation over time and space, provides evolutionary insights on the evolution and speciation of three closely related vespine species (V. tropica, V. soror, and V. mandarinia), with the speciation events associated with the expansion of the distribution to the north.
- Kruglikov, A.; Xia, X. 2024. Mesophiles vs. Thermophiles: Untangling the Hot Mess of Intrinsically Disordered Proteins and Growth Temperature of Bacteria. Int. J. Mol. Sci. 25, 2000.
Abstract: The dynamic structures and varying functions of intrinsically disordered proteins (IDPs) have made them fascinating subjects in molecular biology. Investigating IDP abundance in different bacterial species is crucial for understanding adaptive strategies in diverse environments. Notably, thermophilic bacteria have lower IDP abundance than mesophiles, and a negative correlation with optimal growth temperature (OGT) has been observed. However, the factors driving these trends are yet to be fully understood. We examined the types of IDPs present in both mesophiles and thermophiles alongside those unique to just mesophiles. The shared group of IDPs exhibits similar disorder levels in the two groups of species, suggesting that certain IDPs unique to mesophiles may contribute to the observed decrease in IDP abundance as OGT increases. Subsequently, we used quasi-independent contrasts to explore the relationship between OGT and IDP abundance evolution. Interestingly, we found no significant relationship between OGT and IDP abundance contrasts, suggesting that the evolution of lower IDP abundance in thermophiles may not be solely linked to OGT. This study provides a foundation for future research into the intricate relationship between IDP evolution and environmental adaptation. Our findings support further research on the adaptive significance of intrinsic disorder in bacterial species.
- Aris, P.; Mohamadzadeh, M.; Zarei, M.; Xia, X. 2024. Computational Design of Novel Griseofulvin Derivatives Demonstrating Potential Antibacterial Activity: Insights from Molecular Docking and Molecular Dynamics Simulation. Int. J. Mol. Sci. 25, 1039.
Abstract: In response to the urgent demand for innovative antibiotics, theoretical investigations have been employed to design novel analogs. Because griseofulvin is a potential antibacterial agent, we have designed novel derivatives of griseofulvin to enhance its antibacterial efficacy and to evaluate their interactions with bacterial targets using in silico analysis. The results of this study reveal that the newly designed derivatives displayed the most robust binding affinities towards PBP2, tyrosine phosphatase, and FtsZ proteins. Additionally, molecular dynamics (MD) simulations underscored the notable stability of these derivatives when engaged with the FtsZ protein, as evidenced by root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), and solvent-accessible surface area (SASA). Importantly, this observation aligns with expectations, considering that griseofulvin primarily targets microtubules in eukaryotic cells, and FtsZ functions as the prokaryotic counterpart to microtubules. These findings collectively suggest the promising potential of griseofulvin and its designed derivatives as effective antibacterial agents, particularly concerning their interaction with the FtsZ protein. This research contributes to the ongoing exploration of novel antibiotics and may serve as a foundation for future drug development efforts.
- Xia, X. 2023. Horizontal Gene Transfer and Drug Resistance Involving Mycobacterium tuberculosis. Antibiotics 12, 1367
Abstract: Mycobacterium tuberculosis (Mtb) acquires drug resistance at a rate comparable to that of bacterial pathogens that replicate much faster and have a higher mutation rate. One explanation for this rapid acquisition of drug resistance in Mtb is that drug resistance may evolve in other fast-replicating mycobacteria and then be transferred to Mtb through horizontal gene transfer (HGT). This paper aims to address three questions. First, does HGT occur between Mtb and other mycobacterial species? Second, what genes after HGT tend to survive in the recipient genome? Third, does HGT contribute to antibiotic resistance in Mtb? I present a conceptual framework for detecting HGT and analyze 39 ribosomal protein genes, 23S and 16S ribosomal RNA genes, as well as several genes targeted by antibiotics against Mtb, from 43 genomes representing all major groups within Mycobacterium. I also included mgtC and the insertion sequence IS6110 that were previously reported to be involved in HGT. The insertion sequence IS6110 shows clearly that the Mtb complex participates in HGT. However, the horizontal transferability of genes depends on gene function, as was previously hypothesized. HGT is not observed in functionally important genes such as ribosomal protein genes, rRNA genes, and other genes chosen as drug targets. This pattern can be explained by differential selection against functionally important and unimportant genes after HGT. Functionally unimportant genes such as IS6110 are not strongly selected against, so HGT events involving such genes are visible. For functionally important genes, a horizontally transferred diverged homologue from a different species may not work as well as the native counterpart, so the HGT event involving such genes is strongly selected against and eliminated, rendering them invisible to us. 
In short, while HGT involving the Mtb complex occurs, antibiotic resistance in the Mtb complex arose from mutations in those drug-targeted genes within the Mtb complex and was not gained through HGT.
- Xia, X. 2023. Identification of host receptors for viral entry and beyond: a perspective from the spike of SARS-CoV-2. Frontiers in Microbiology-Virology 14:1188249
Abstract: Identification of the interaction between the host membrane receptor and viral receptor-binding domain (RBD) represents a crucial step for understanding viral pathophysiology and for developing drugs against pathogenic viruses. While all membrane receptors and carbohydrate chains could potentially be used as receptors for viruses, prioritized searches focus typically on membrane receptors that are known to have been used by the relatives of the pathogenic virus, e.g., ACE2 used as a receptor for SARS-CoV is a prioritized candidate receptor for SARS-CoV-2. An ideal receptor protein from a viral perspective is one that is highly expressed on the epithelial cell surface of mammalian respiratory or digestive tracts, strongly conserved in evolution so many mammalian species can serve as potential hosts, and functionally important so that its expression cannot be readily downregulated by the host in response to the infection. Experimental confirmation of host receptors includes (1) infection studies with cell cultures/tissues/organs with or without candidate receptor expression, (2) experimental determination of protein structure of the complex between the putative viral RBD and the candidate host receptor, and (3) experiments with mutant candidate receptor or homologues of the candidate receptor in other species. Successful identification of the host receptor opens the door for mechanism-based development of candidate drugs and vaccines and facilitates the inference of what other animal species are vulnerable to the viral pathogen. I illustrate these approaches with research on identification of the receptor and co-factors for SARS-CoV-2.
- Xia, X. Optimizing Protein Production in Therapeutic Phages against a Bacterial Pathogen, Mycobacterium abscessus. Drugs Drug Candidates 2023, 2, 189-209
Abstract: Therapeutic phages against pathogenic bacteria should kill the bacteria efficiently before the latter evolve resistance against the phages. While many factors contribute to phage efficiency in killing bacteria, such as phage attachment to host, delivery of phage genome into the host, phage mechanisms against host defense, phage biosynthesis rate, and phage life cycle, this paper focuses only on the optimization of phage mRNA for efficient translation. Phage mRNA may not be adapted to its host translation machinery for three reasons: (1) mutation disrupting adaptation, (2) a recent host switch leaving no time for adaptation, and (3) multiple hosts with different translation machineries so that adaptation to one host implies suboptimal adaptation to another host. It is therefore important to optimize phage mRNAs in therapeutic phages. Theoretical and practical principles based on many experiments were developed and applied to phages engineered against a drug-resistant Mycobacterium abscessus that infected a young cystic fibrosis patient. I provide a detailed genomic evaluation of the three therapeutic phages with respect to translation initiation, elongation, and termination, by making use of both experimental results and highly expressed genes in the host. For optimizing phage genes against M. abscessus, the start codon should be AUG. The DtoStart distance from base-pairing between the Shine-Dalgarno (SD) sequence and the anti-SD sequence should be 14–16. The stop codon should be UAA. If UAG or UGA is used as a stop codon, they should be followed by nucleotide U. Start codon, SD, or stop codon should not be embedded in a secondary structure that may obscure the signals and interfere with their decoding. The optimization framework should be generally applicable to developing therapeutic phages against bacterial pathogens.
- Xia, X. 2023. Rooting and Dating Large SARS-CoV-2 Trees by Modeling Evolutionary Rate as a Function of Time. Viruses 15, 684.
Abstract: Almost all published rooting and dating studies on SARS-CoV-2 assumed that (1) evolutionary rate does not change over time although different lineages can have different evolutionary rates (uncorrelated relaxed clock), and (2) a zoonotic transmission occurred in Wuhan and the culprit was immediately captured, so that only the SARS-CoV-2 genomes obtained in 2019 and the first few months of 2020 (resulting from the first wave of the global expansion from Wuhan) are sufficient for dating the common ancestor. Empirical data contradict the first assumption. The second assumption is not warranted because mounting evidence suggests the presence of early SARS-CoV-2 lineages cocirculating with the Wuhan strains. Large trees with SARS-CoV-2 genomes beyond the first few months are needed to increase the likelihood of finding SARS-CoV-2 lineages that might have originated at the same time as (or even before) those early Wuhan strains. I extended a previously published rapid rooting method to model evolutionary rate as a linear function instead of a constant. This substantially improves the dating of the common ancestor of sampled SARS-CoV-2 genomes. Based on two large trees with 83,688 and 970,777 high-quality and full-length SARS-CoV-2 genomes that contain complete sample collection dates, the common ancestor was dated to 12 June 2019 and 7 July 2019 with the two trees, respectively. The two data sets would give dramatically different or even absurd estimates if the rate was treated as a constant. The large trees were also crucial for overcoming the high rate-heterogeneity among different viral lineages. The improved method was implemented in the software TRAD.
- Aris, P.; Mohamadzadeh, M.; Kruglikov, A.; Askari Rad, M.; Xia, X. In Silico Exploration of Microtubule Agent Griseofulvin and Its Derivatives Interactions with Different Human β-Tubulin Isotypes. Molecules 2023, 28, 2384.
Abstract: Tubulin isotypes are known to regulate microtubule stability and dynamics, as well as to play a role in the development of resistance to microtubule-targeted cancer drugs. Griseofulvin is known to disrupt cell microtubule dynamics and cause cell death in cancer cells through binding to tubulin protein at the taxol site. However, the detailed binding mode, the molecular interactions involved, and the binding affinities with different human β-tubulin isotypes are not well understood. Here, the binding affinities of human β-tubulin isotypes with griseofulvin and its derivatives were investigated using molecular docking, molecular dynamics simulation, and binding energy calculations. Multiple sequence analysis shows that the amino acid sequences are different in the griseofulvin binding pocket of βI isotypes. However, no differences were observed at the griseofulvin binding pocket of other β-tubulin isotypes. Our molecular docking results show the favorable interaction and significant affinity of griseofulvin and its derivatives toward human β-tubulin isotypes. Further, molecular dynamics simulation results show the structural stability of most β-tubulin isotypes upon binding to the G1 derivative. Taxol is an effective drug in breast cancer, but resistance to it is known. Modern anticancer treatments use a combination of multiple drugs to alleviate the problem of cancer cell resistance to chemotherapy. Our study provides a significant understanding of the involved molecular interactions of griseofulvin and its derivatives with β-tubulin isotypes, which may help to design potent griseofulvin analogues for specific tubulin isotypes in multidrug-resistant cancer cells in the future.
- Aris, P.; Wei, Y.; Mohamadzadeh, M.; Xia, X. Griseofulvin: An Updated Overview of Old and Current Knowledge. Molecules 2022, 27, 7034.
Abstract: Griseofulvin is an antifungal polyketide metabolite produced mainly by ascomycetes. Since it was commercially introduced in 1959, griseofulvin has been used in treating dermatophyte infections. This fungistatic has gained increasing interest for multifunctional applications in the last decades due to its potential to disrupt mitosis and cell division in human cancer cells and arrest hepatitis C virus replication. In addition to these inhibitory effects, we and others found griseofulvin may enhance ACE2 function, contribute to vascular vasodilation, and improve capillary blood flow. Furthermore, molecular docking analysis revealed that griseofulvin and its derivatives have good binding potential with SARS-CoV-2 main protease, RNA-dependent RNA polymerase (RdRp), and spike protein receptor-binding domain (RBD), suggesting its inhibitory effects on SARS-CoV-2 entry and viral replication. These findings imply the repurposing potential of the FDA-approved drug griseofulvin in designing and developing novel therapeutic interventions. In this review, we have summarized the available information from its discovery to recent progress in this growing field. Additionally, we explore the possible mechanism leading to rare hepatitis induced by griseofulvin. We found that griseofulvin and its metabolites, including 6-desmethylgriseofulvin (6-DMG) and 4-desmethylgriseofulvin (4-DMG), have favorable interactions with cytokeratin intermediate filament proteins (K8 and K18), ranging from −3.34 to −5.61 kcal mol−1. Therefore, they could be responsible for liver injury and Mallory body (MB) formation in hepatocytes of human, mouse, and rat treated with griseofulvin. Moreover, the stronger binding of griseofulvin to K18 in rodents than in human may explain the observed difference in the severity of hepatitis between rodents and human.
- Kruglikov A, Wei Y, Xia X 2022. Proteins from Thermophilic Thermus thermophilus Often Do Not Fold Correctly in a Mesophilic Expression System Such as Escherichia coli. ACS Omega 7:37797–37806.
Abstract: The majority of protein structure studies use Escherichia coli (E. coli) and other model organisms as expression systems for other species’ genes. However, protein folding depends on cellular environment factors, such as chaperone proteins, cytoplasmic pH, temperature, and ionic concentrations. Because of differences in these factors, especially temperature and chaperones, native proteins in organisms such as extremophiles may fold improperly when they are expressed in mesophilic model organisms. Here we present a methodology for assessing the effects of using E. coli as the expression system on protein structures. We compared these effects between eight mesophilic bacteria and Thermus thermophilus (T. thermophilus), a thermophile, and found that differences are significantly larger for T. thermophilus. More specifically, helical secondary structures in T. thermophilus proteins are often replaced by coil structures in E. coli. Our results show unique directionality in misfolding when proteins in thermophiles are expressed in mesophiles. This indicates that extremophiles, such as thermophiles, require unique protein expression systems in protein folding studies.
- Xia, X. 2022. Multiple regulatory mechanisms for pH homeostasis in the gastric pathogen, Helicobacter pylori. Advances in Genetics 109:39-69.
Abstract: Acid resistance in the gastric pathogen Helicobacter pylori requires the coordination of four essential processes to regulate urease activity. Firstly, urease expression above a base level needs to be finely tuned at different ambient pH. Secondly, as nickel is needed to activate urease, nickel homeostasis needs to be maintained by proteins that import and export nickel ions, and sequester, store and release nickel when needed. Thirdly, urease accessory proteins that activate urease activity by nickel insertion need to be expressed. Finally, a reliable source of urea needs to be maintained by both intrinsic and extrinsic sources of urea. Two-component systems (arsRS and flgRS), as well as a nickel response regulator (NikR), sense the change in pH and act on a variety of genes to accomplish the function of acid resistance without causing cellular overalkalization and nickel toxicity. Nickel storage proteins also feature built-in switches to store nickel at neutral pH and release nickel at low pH. This review summarizes the current status of H. pylori research and highlights a number of hypotheses that need to be tested.
- Jia, B.; Conner, R.L.; Penner, W.C.; Zheng, C.; Cloutier, S.; Hou, A.; Xia, X.; You, F.M. Quantitative Trait Locus Mapping of Marsh Spot Disease Resistance in Cranberry Common Bean (Phaseolus vulgaris L.). Int. J. Mol. Sci. 2022, 23, 7639.
Abstract: Common bean (Phaseolus vulgaris L.) is a food crop that
is an important source of dietary proteins and carbohydrates. Marsh spot is a physiological
disorder that diminishes seed quality in beans. Prior research suggested that this
disease is likely caused by manganese (Mn) deficiency during seed development and
that marsh spot resistance is controlled by at least four genes. In this study,
genetic mapping was performed to identify quantitative trait loci (QTL) and the
potential candidate genes associated with marsh spot resistance. All 138 recombinant
inbred lines (RILs) from a bi-parental population were evaluated for marsh spot
resistance during five years from 2015 to 2019 in sandy and heavy clay soils in
Morden, Manitoba, Canada. The RILs were sequenced using a genotyping by sequencing
approach. A total of 52,676 single nucleotide polymorphisms (SNPs) were identified
and filtered to generate a high-quality set of 2066 SNPs for QTL mapping. A genetic
map based on 1273 SNP markers distributed on 11 chromosomes and covering 1599 cM
was constructed. A total of 12 stable and 4 environment-specific QTL were identified
using additive effect models, and an additional two epistatic QTL interacting with
two of the 16 QTL were identified using an epistasis model. Genome-wide scans of
the candidate genes identified 13 metal transport-related candidate genes co-locating
within six QTL regions. In particular, two QTL (QTL.3.1 and QTL.3.2) with the highest
R² values (21.8% and 24.5%, respectively) harbored several metal transport genes
Phvul.003G086300, Phvul.003G092500, Phvul.003G104900, Phvul.003G099700, and Phvul.003G108900
in a large genomic region of 16.8–27.5 Mb on chromosome 3. These results advance
the current understanding of the genetic mechanisms of marsh spot resistance in
cranberry common bean and provide new genomic resources for use in genomics-assisted
breeding and for candidate gene isolation and functional characterization.
- Rakesh, M., Aris-Brosou, S. & Xia, X. 2022. Testing alternative hypotheses on the origin and speciation of Hawaiian katydids. BMC Ecol Evo 22, 83.
Abstract: The Hawaiian Islands offer a unique and dynamic evolutionary
theatre for studying origin and speciation, as the islands themselves were sequentially
formed by erupting undersea volcanos, which would subsequently become dormant and
extinct. Such dynamics have not been used to resolve the controversy surrounding
the origin and speciation of Hawaiian katydids in the genus Banza, whose ancestor
could be from either the Old-World genera Ruspolia and Euconocephalus, or the New
World Neoconocephalus. To address this question, we performed a chronophylogeographic
analysis of Banza species together with close relatives from the Old and New Worlds.
Based on extensive dated phylogeographic analyses of two mitochondrial genes (COX1
and CYTB), we show that our data are consistent with the interpretation that extant
Banza species resulted from two colonization events, both by katydids from the Old
World rather than from the New World. The first event was by an ancestral lineage
of Euconocephalus about 6 million years ago (mya) after the formation of Nihoa about
7.3 mya, giving rise to B. nihoa. The second colonization event was by a sister
lineage of Ruspolia dubia. The dating result suggests that this ancestral lineage
first colonized an older island in the Hawaiian–Emperor seamount chain before the
emergence of the Hawaiian Islands, but colonized Kauai after its emergence 5.8 mya.
This second colonization gave rise to the rest of the Banza species in two major
lineages, one on the older northwestern islands, and the other on the newer southwestern
islands. Chronophylogeographic analyses with well-sampled taxa proved crucial for
resolving phylogeographic controversies on the origin and evolution of species colonizing
a new environment.
- Aris, P.; Mohamadzadeh, M.; Wei, Y.; Xia, X. 2022 In Silico Molecular Dynamics of
Griseofulvin and Its Derivatives Revealed Potential Therapeutic Applications for
COVID-19. Int. J. Mol.
Sci. 23, 6889
Abstract: Treatment options for Coronavirus Disease 2019 (COVID-19)
remain limited, and the option of repurposing approved drugs with promising medicinal
properties is of increasing interest in therapeutic approaches to COVID-19. Using
computational approaches, we examined griseofulvin and its derivatives against four
key anti-SARS-CoV-2 targets: main protease, RdRp, spike protein receptor-binding
domain (RBD), and human host angiotensin-converting enzyme 2 (ACE2). Molecular docking
analysis revealed that griseofulvin (CID 441140) has the highest docking score (–6.8
kcal/mol) with main protease of SARS-CoV-2. Moreover, griseofulvin derivative M9
(CID 144564153) proved the most potent inhibitor with −9.49 kcal/mol, followed by
A3 (CID 46844082) with −8.44 kcal/mol against M protease and ACE2, respectively.
Additionally, H bond analysis revealed that compound A3 formed the highest number
of hydrogen bonds, indicating the strongest inhibitory efficacy against ACE2. Further,
molecular dynamics (MD) simulation analysis revealed that griseofulvin and these
derivatives are structurally stable. These findings suggest that griseofulvin and
its derivatives may be considered when designing future therapeutic options for
SARS-CoV-2 infection.
- Jia B, Conner RL, Khan N, Hou A, Xia X, You FM.
2022. Inheritance of marsh spot disease resistance in cranberry common bean (Phaseolus
vulgaris L.). The Crop
Journal 10(2):456-467
Abstract: Common bean (Phaseolus vulgaris) is an annual legume
crop that is grown worldwide for its edible dry seeds and tender pods. Marsh spot
(MS) of the seeds is a physio-genic stress disease affecting seed quality in beans.
Studies have suggested that this disease involves a nutritional disorder caused
by manganese deficiency, but the inheritance of resistance to this disease has not
been reported. A biparental genetic population composed of 138 recombinant inbred
lines (RILs) was developed from a cross between an MS resistant cultivar ‘Cran09’
and an MS susceptible cultivar ‘Messina’. The 138 RILs and their two parents were
evaluated for MS resistance during five consecutive years from 2015 to 2019 in sandy
and heavy clay soils in Morden, Manitoba, Canada. The MS incidence (MSI) and the
MS resistance index (MSRI) representing disease severity were shown to be both highly
correlated heritable traits that had high broad-sense heritability values (H²) of
86.5% and 83.2%, respectively. No significant differences for MSI and MSRI were
observed between the two soil types in all five- (MSI) or four-year (MSRI) data
collection, but significant correlations among years were observed although MS resistance
was moderately affected by year. The MSIs and MSRIs displayed a right-skewed distribution,
indicating a mixed genetic model involving a few major genes and polygenes. Using
the joint segregation analysis method, the same four major genes with additive-epistasis
effects showed the best fit for both traits, explaining 84.4% and 85.3% of the phenotypic
variance for MSI and MSRI, respectively. For both traits, the M1, M2, M3 and m4
acted as the favorable (resistant) alleles for the four genes where M and m represent
two alleles of each gene. However, due to epistatic effects, only the individuals
of the M1M2M3M4 haplotype appeared to be highly resistant, whereas those of the
m1m2m3M4 haplotype were the most susceptible. The m4 allele significantly suppressed
the additive effects of M1M2M3 on resistance, but decreased susceptibility due to
the additive effects of m1m2m3. Further quantitative trait locus (QTL) mapping is
warranted to identify and validate individual genes and develop molecular markers
for marker-assisted selection of resistant cultivars.
- Parisa Aris, Lihong Yan, Yulong Wei, Ying Chang, Bihong Shi, Xuhua Xia, 2022. Conservation
of griseofulvin genes in the gsf gene cluster among fungal genomes.
G3 Genes|Genomes|Genetics 12(2):jkab399
Abstract: The polyketide griseofulvin is a natural antifungal compound
and research in griseofulvin has been key in establishing our current understanding
of polyketide biosynthesis. Nevertheless, the griseofulvin gsf biosynthetic gene
cluster (BGC) remains poorly understood in most fungal species, including Penicillium
griseofulvum where griseofulvin was first isolated. To elucidate essential genes
involved in griseofulvin biosynthesis, we performed third-generation sequencing
to obtain the genome of P. griseofulvum strain D-756. Furthermore, we gathered publicly
available genomes of 11 other fungal species in which the gsf gene cluster was identified.
In a comparative genome analysis, we annotated and compared the gsf BGC of all 12
fungal genomes. Our findings show no gene rearrangements at the gsf BGC. Furthermore,
seven gsf genes are conserved by most genomes surveyed whereas the remaining six
were poorly conserved. This study provides new insights into differences between
gsf BGC and suggests that seven gsf genes are essential in griseofulvin production.
- Xia, X. 2021 Post-Alignment Adjustment and Its Automation. Genes,
12, 1809. https://doi.org/10.3390/genes12111809
Abstract: Multiple sequence alignment (MSA) is the basis for almost
all sequence comparison and molecular phylogenetic inferences. Large-scale genomic
analyses are typically associated with automated progressive MSA without subsequent
manual adjustment, which itself is often error-prone because of the lack of a consistent
and explicit criterion. Here, I outlined several commonly encountered alignment
errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and
codon sequences. Methods that could be automated to fix such alignment errors were
then presented. I emphasized the utility of position weight matrix as a new tool
for MSA refinement and illustrated its usage by refining the MSA of nucleotide and
amino acid sequences. The main advantages of the position weight matrix approach
include (1) its use of information from all sequences, in contrast to other commonly
used methods based on pairwise alignment scores and inconsistency measures, and
(2) its speedy computation, making it suitable for a large number of long viral
genomic sequences.
- Xia X. 2021. Dating the Common Ancestor from an NCBI Tree of 83688 High-Quality
and Full-Length SARS-CoV-2 Genomes. Viruses 13(9),1790 https://www.mdpi.com/1262774
Abstract: All dating studies involving SARS-CoV-2 are problematic.
Previous studies have dated the most recent common ancestor (MRCA) between SARS-CoV-2
and its close relatives from bats and pangolins. However, the evolutionary rate
thus derived is expected to differ from the rate estimated from sequence divergence
of SARS-CoV-2 lineages. Here, I present dating results for the first time from a
large phylogenetic tree with 86,582 high-quality full-length SARS-CoV-2 genomes.
The tree contains 83,688 genomes with full specification of collection time. Such
a large tree spanning a period of about 1.5 years offers an excellent opportunity
for dating the MRCA of the sampled SARS-CoV-2 genomes. The MRCA is dated 16 August
2019, with the evolutionary rate estimated to be 0.05526 mutations/genome/day. The
Pearson correlation coefficient (r) between the root-to-tip distance (D) and the
collection time (T) is 0.86295. The NCBI tree also includes 10 SARS-CoV-2 genomes
isolated from cats, collected over roughly the same time span as human COVID-19
infection. The MRCA from these cat-derived SARS-CoV-2 is dated 30 July 2019, with
r = 0.98464. While the dating method is well known, I have included detailed illustrations
so that anyone can repeat the analysis and obtain the same dating results. With
16 August 2019 as the date of the MRCA of sampled SARS-CoV-2 genomes, archived samples
from respiratory or digestive tracts collected around or before 16 August 2019,
or those that are not descendants of the existing SARS-CoV-2 lineages, should be
particularly valuable for tracing the origin of SARS-CoV-2.
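The root-to-tip dating summarized in this abstract can be sketched as a simple linear regression of root-to-tip distance (D) on collection time (T): the slope estimates the evolutionary rate, and the time at which the fitted line reaches D = 0 estimates the date of the MRCA. The code below is only a minimal illustration under that assumption; the function name `date_mrca` is hypothetical, and the abstract does not confirm that plain least squares is the exact procedure used in the paper.

```python
from datetime import date, timedelta
from math import sqrt


def date_mrca(collection_dates, root_to_tip):
    """Regress root-to-tip distance D on collection time T (in days).

    Returns (rate per day, estimated MRCA date, Pearson r).
    Illustrative sketch only; ordinary least squares is an assumption.
    """
    origin = min(collection_dates)
    # Express collection times as days since the earliest sample.
    t = [(d - origin).days for d in collection_dates]
    n = len(t)
    mean_t = sum(t) / n
    mean_d = sum(root_to_tip) / n
    s_tt = sum((x - mean_t) ** 2 for x in t)
    s_dd = sum((y - mean_d) ** 2 for y in root_to_tip)
    s_td = sum((x - mean_t) * (y - mean_d) for x, y in zip(t, root_to_tip))
    slope = s_td / s_tt                    # evolutionary rate per day
    intercept = mean_d - slope * mean_t    # fitted D at the earliest sample
    r = s_td / sqrt(s_tt * s_dd)           # Pearson correlation between D and T
    t_zero = -intercept / slope            # day offset at which fitted D = 0
    mrca = origin + timedelta(days=round(t_zero))
    return slope, mrca, r
```

Calling `date_mrca(dates, distances)` with a list of `datetime.date` objects and the matching root-to-tip distances returns the three quantities reported in the abstract: the rate, the MRCA date, and r.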
- Xia, X. 2021. Detailed Dissection and Critical Evaluation of the Pfizer/BioNTech
and Moderna mRNA Vaccines.
Vaccines (Basel) 9(7), 734.
Abstract: The design of Pfizer/BioNTech and Moderna mRNA vaccines
involves many different types of optimizations. Proper optimization of vaccine mRNA
can reduce dosage required for each injection leading to more efficient immunization
programs. The mRNA components of the vaccine need to have a 5’-UTR to load ribosomes
efficiently onto the mRNA for translation initiation, optimized codon usage for
efficient translation elongation, and optimal stop codon for efficient translation
termination. Both 5’-UTR and the downstream 3’-UTR should be optimized for mRNA
stability. The replacement of uridine by N1-methylpseudouridine (Ψ) complicates
some of these optimization processes because Ψ is more versatile in wobbling than
U. Different optimizations can conflict with each other, and compromises would need
to be made. I highlight the similarities and differences between Pfizer/BioNTech
and Moderna mRNA vaccines and discuss the advantage and disadvantage of each to
facilitate future vaccine improvement. In particular, I point out a few optimizations
in the design of the two mRNA vaccines that have not been performed properly.
- Jia, B., Waldo, P., Conner, R., Moumen, I., Khan,
N., Xia, X., Hou, A., You, F. 2021. Marsh Spot Disease and Its Causal Factor, Manganese
Deficiency in Plants: A Historical and Prospective Review. Agricultural Sciences,
12, 928-948 doi: 10.4236/as.2021.129060
Abstract: This review provides an examination of the marsh spot
disease in beans and the roles played by its causal factor, manganese (Mn) deficiency.
The discovery of the marsh spot disease, its relation with Mn deficiency, and how
it can be treated are discussed. Mn serves as a cofactor and a catalyst in various
metabolic processes in different cell compartments, such as the oxygen-evolving
complex of photosystem II (PSII) or reactive oxygen species scavenging. Some major
quantitative trait loci (QTL) and putative candidate genes associated with Mn content
in plants, especially in plant seeds, have been identified. Marsh spot disease in
cranberry common bean is controlled by several major genes with significant additive
and epistatic effects. They provide valuable clues for QTL candidate gene prediction
and an improved understanding of the genetic mechanisms responsible for marsh spot
resistance in plants.
- Tehfe, A.; Roseshter, T.; Wei, Y.; Xia, X. Does
Saccharomyces cerevisiae Require Specific Post-Translational Silencing against Leaky
Translation of Hac1up? Microorganisms
2021, 9, 620.
Abstract: HAC1 encodes a key transcription factor that transmits
the unfolded protein response (UPR) from the endoplasmic reticulum (ER) to the nucleus
and regulates downstream UPR genes in Saccharomyces cerevisiae. In response to the
accumulation of unfolded proteins in the ER, Ire1p oligomers splice HAC1 pre-mRNA
(HAC1u) via a non-conventional process and allow the spliced HAC1 (HAC1i) to be
translated efficiently. However, leaky splicing and translation of HAC1u may occur
in non-UPR cells to induce undesirable UPR. To control accidental UPR activation,
multiple fail-safe mechanisms have been proposed to prevent leaky HAC1 splicing
and translation and to facilitate rapid degradation of translated Hac1up and Hac1ip.
Among proposed regulatory mechanisms is a degron sequence encoded at the 5′ end
of the HAC1 intron that silences Hac1up expression. To investigate the necessity
of an intron-encoded degron sequence that specifically targets Hac1up for degradation,
we employed publicly available transcriptomic data to quantify leaky HAC1 splicing
and translation in UPR-induced and non-UPR cells. As expected, we found that HAC1u
is only efficiently spliced into HAC1i and efficiently translated into Hac1ip in
UPR-induced cells. However, our analysis of ribosome profiling data confirmed frequent
occurrence of leaky translation of HAC1u regardless of UPR induction, demonstrating
the inability of translation fail-safe to completely inhibit Hac1up production.
Additionally, among 32 yeast HAC1 surveyed, the degron sequence is highly conserved
by Saccharomyces yeast but is poorly conserved by all other yeast species. Nevertheless,
the degron sequence is the most conserved HAC1 intron segment in yeasts. These results
suggest that the degron sequence may indeed play an important role in mitigating
the accumulation of Hac1up to prevent accidental UPR activation in the Saccharomyces
yeast.
- Kruglikov A, Rakesh M, Wei Y, Xia X. 2021. Applications
of Protein Secondary Structure Algorithms in SARS-CoV-2 Research.
J Proteome Res 20:1457-1463
Abstract: Since the outset of COVID-19, the pandemic has prompted
immediate global efforts to sequence SARS-CoV-2, and over 450 000 complete genomes
have been publicly deposited over the course of 12 months. Despite this, comparative
nucleotide and amino acid sequence analyses often fall short in answering key questions
in vaccine design. For example, the binding affinity between different ACE2 receptors
and SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity
at ACE2 contact sites because protein structure similarities are not fully reflected
by amino acid sequence similarities. To comprehensively compare protein homology,
secondary structure (SS) analysis is required. While protein structure is slow and
difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS
structure may serve as a viable proxy to gain biological insight. Here we review
algorithms and information used in predicting protein SS to highlight its potential
application in pandemics research. We also showed examples of how SS predictions
can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses.
As computational tools are much faster than wet-lab experiments, these applications
can be important for research especially in times when quickly obtained biological
insights can help in speeding up response to pandemics.
- Wei Y, Aris P, Farookhi H & Xia X. 2021 Predicting
mammalian species at risk of being infected by SARS‑CoV‑2 from an ACE2 perspective.
Scientific Reports
11:1702
Abstract: SARS‑CoV‑2 can transmit efficiently in humans, but it
is less clear which other mammals are at risk of being infected. SARS‑CoV‑2 encodes
a Spike (S) protein that binds to human ACE2 receptor to mediate cell entry. A species
with a human‑like ACE2 receptor could therefore be at risk of being infected by
SARS‑CoV‑2. We compared between 132 mammalian ACE2 genes and between 17 coronavirus
S proteins. We showed that while global similarities reflected by whole ACE2 gene
alignments are poor predictors of high‑risk mammals, local similarities at key S
protein‑binding sites highlight several high‑risk mammals that share good ACE2 homology
with human. Bats are likely reservoirs of SARS‑CoV‑2, but there are other high‑risk
mammals that share better ACE2 homologies with human. Both SARS‑CoV‑2 and SARS‑CoV
are closely related to bat coronavirus. Yet, among host‑specific coronaviruses infecting
high‑risk mammals, key ACE2‑binding sites on S proteins share highest similarities
between SARS‑CoV‑2 and Pangolin‑CoV and between SARS‑CoV and Civet‑CoV. These results
suggest that direct coronavirus transmission from bat to human is unlikely, and
that rapid adaptation of a bat SARS‑like coronavirus in different high‑risk intermediate
hosts could have allowed it to acquire distinct high binding potential between S
protein and human‑like ACE2 receptors.
- Xia, X. 2021. Domains and Functions of Spike Protein in SARS-Cov-2 in the Context
of Vaccine Design. Viruses
13(1), 109
Abstract: The spike protein in SARS-CoV-2 (SARS-2-S) interacts
with the human ACE2 receptor to gain entry into a cell to initiate infection. Both
Pfizer/BioNTech’s BNT162b2 and Moderna’s mRNA-1273 vaccine candidates are based
on stabilized mRNA encoding prefusion SARS-2-S that can be produced after the mRNA
is delivered into the human cell and translated. SARS-2-S is cleaved into S1 and
S2 subunits, with S1 serving the function of receptor-binding and S2 serving the
function of membrane fusion. Here, I dissect in detail the various domains of SARS-2-S
and their functions discovered through a variety of different experimental and theoretical
approaches to build a foundation for a comprehensive mechanistic understanding of
how SARS-2-S works to achieve its function of mediating cell entry and subsequent
cell-to-cell transmission. The integration of structure and function of SARS-2-S
in this review should enhance our understanding of the dynamic processes involving
receptor binding, multiple cleavage events, membrane fusion, viral entry, as well
as the emergence of new viral variants. I highlighted the relevance of structural
domains and dynamics to vaccine development, and discussed reasons for the spike
protein to be frequently featured in the conspiracy theory claiming that SARS-CoV-2
is artificially created.
- Wei Y, Silke JR, Aris P, Xia X. 2020. Coronavirus
genomes carry the signatures of their habitats.
PLoS ONE 15(12): e0244025
Abstract: Coronaviruses such as SARS-CoV-2 regularly infect host
tissues that express antiviral proteins (AVPs) in abundance. Understanding how they
evolve to adapt or evade host immune responses is important in the effort to control
the spread of infection. Two AVPs that may shape viral genomes are the zinc finger
antiviral protein (ZAP) and the apolipoprotein B mRNA editing enzyme-catalytic polypeptide-like
3 (APOBEC3). The former binds to CpG dinucleotides to facilitate the degradation
of viral transcripts while the latter frequently deaminates C into U residues which
could generate notable viral sequence variations. We tested the hypothesis that
both APOBEC3 and ZAP impose selective pressures that shape the genome of an infecting
coronavirus. Our investigation considered a comprehensive number of publicly available
genomes for seven coronaviruses (SARS-CoV-2, SARS-CoV, and MERS infecting Homo sapiens,
Bovine CoV infecting Bos taurus, MHV infecting Mus musculus, HEV infecting Sus scrofa,
and CRCoV infecting Canis lupus familiaris). We show that coronaviruses that regularly
infect tissues with abundant AVPs have CpG-deficient and U-rich genomes; whereas
those that do not infect tissues with abundant AVPs do not share these sequence
hallmarks. Among the coronaviruses surveyed herein, CpG is most deficient in SARS-CoV-2
and a temporal analysis showed a marked increase in C to U mutations over four months
of SARS-CoV-2 genome evolution. Furthermore, the preferred motifs in which these
C to U mutations occur are the same as those subjected to APOBEC3 editing in HIV-1.
These results suggest that both ZAP and APOBEC3 shape the SARS-CoV-2 genome: ZAP
imposes a strong CpG avoidance, and APOBEC3 constantly edits C to U. Evolutionary
pressures exerted by host immune systems onto viral genomes may motivate novel strategies
for SARS-CoV-2 vaccine development.
- Xia, X. 2020 Beyond Trees: Regulons and Regulatory Motif Characterization. Genes
11, 995
Abstract: Trees and their seeds regulate their germination, growth,
and reproduction in response to environmental stimuli. These stimuli, through signal
transduction, trigger transcription factors that alter the expression of various
genes leading to the unfolding of the genetic program. A regulon is conceptually
defined as a set of target genes regulated by a transcription factor by physically
binding to regulatory motifs to accomplish a specific biological function, such
as the CO-FT regulon for flowering timing and fall growth cessation in trees. Only
with a clear characterization of regulatory motifs, can candidate target genes be
experimentally validated, but motif characterization represents the weakest feature
of regulon research, especially in tree genetics. I review here relevant experimental
and bioinformatics approaches in characterizing transcription factors and their
binding sites, outline problems in tree regulon research, and demonstrate how transcription
factor databases can be effectively used to aid the characterization of tree regulons.
- Xia, X. 2020 Improving Phylogenetic Signals of Mitochondrial Genes Using a New Method
of Codon Degeneration.
Life 10, 171.
Abstract: Recovering deep phylogeny is challenging with animal
mitochondrial genes because of their rapid evolution. Codon degeneration decreases
the phylogenetic noise and bias by aiming to achieve two objectives: (1) alleviate
the bias associated with nucleotide composition, which may lead to homoplasy and
long-branch attraction, and (2) reduce differences in the phylogenetic results between
nucleotide-based and amino acid (AA)-based analyses. The discrepancy between nucleotide-based
analysis and AA-based analysis is partially caused by some synonymous codons that
differ more from each other at the nucleotide level than from some nonsynonymous
codons, e.g., Leu codon TTR in the standard genetic code is more similar to Phe
codon TTY than to synonymous CTN codons. Thus, nucleotide similarity conflicts with
AA similarity. There are many such examples involving other codon families in various
mitochondrial genetic codes. Proper codon degeneration will make synonymous codons
more similar to each other at the nucleotide level than they are to nonsynonymous
codons. Here, I illustrate a “principled” codon degeneration method that achieves
these objectives. The method was applied to resolving the mammalian basal lineage
and phylogenetic position of rheas among ratites. The codon degeneration method
was implemented in the user-friendly and freely available DAMBE software for all
known genetic codes (genetic codes 1 to 33).
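The TTR/TTY/CTN example in this abstract can be checked with a small calculation. The sketch below uses the mean number of nucleotide differences between all pairs of codons from two families as a crude similarity measure; that measure and the helper names are assumptions made for illustration, not the method implemented in DAMBE.

```python
from itertools import product

# IUPAC ambiguity codes needed for the codon families in the example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def expand(family):
    """List the concrete codons represented by a degenerate codon such as 'TTR'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in family))]


def hamming(a, b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_distance(family_a, family_b):
    """Mean nucleotide difference over all codon pairs from two families."""
    pairs = [(a, b) for a in expand(family_a) for b in expand(family_b)]
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


leu_vs_phe = mean_distance("TTR", "TTY")   # nonsynonymous comparison
leu_vs_leu = mean_distance("TTR", "CTN")   # synonymous comparison
```

Under this measure the Leu TTR codons are on average closer to the Phe TTY codons than to the synonymous CTN codons, which is the conflict between nucleotide and amino acid similarity that the abstract describes.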
- Xia, X. 2020 Drug efficacy and toxicity prediction: an innovative application of
transcriptomic data.
Cell Biology and Toxicology 36(6):591-602
Abstract: Drug toxicity and efficacy are difficult to predict partly
because they are both poorly defined, which I aim to remedy here from a transcriptomic
perspective. There are two major categories of drugs: (1) restorative drugs aiming
to restore an abnormal cell, tissue, or organ to normal function (e.g., restoring
normal membrane function of epithelial cells in cystic fibrosis), and (2) disruptive
drugs aiming to kill pathogens or malignant cells. These two types of drugs require
different definitions of efficacy and toxicity. I outlined rationales for defining
transcriptomic efficacy and toxicity and illustrated numerically their application
with two sets of transcriptomic data, one for restorative drugs (treating cystic
fibrosis with lumacaftor/ivacaftor aiming to restore the cellular function of epithelial
cells) and the other for disruptive drugs (treating acute myeloid leukemia with
prexasertib). The conceptual framework presented will help and sensitize researchers
to collect data required for determining drug toxicity.
- Xia, X. 2020 Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral
defense.
Molecular Biology and Evolution 37:2699–2705.
Abstract: Wild mammalian species, including bats, constitute the
natural reservoir of Betacoronavirus (including SARS, MERS, and the deadly SARS-CoV-2).
Different hosts or host tissues provide different cellular environments, especially
different antiviral and RNA modification activities that can alter RNA modification
signatures observed in the viral RNA genome. The zinc finger antiviral protein (ZAP)
binds specifically to CpG dinucleotides and recruits other proteins to degrade a
variety of viral RNA genomes. Many mammalian RNA viruses have evolved CpG deficiency.
Increasing CpG dinucleotides in these low-CpG viral genomes in the presence of ZAP
consistently leads to decreased viral replication and virulence. Because ZAP exhibits
tissue-specific expression, viruses infecting different tissues are expected to
have different CpG signatures, suggesting a means to identify viral tissue-switching
events. I show that SARS-CoV-2 has the most extreme CpG deficiency in all known
Betacoronavirus genomes. This suggests that SARS-CoV-2 may have evolved in a new
host (or new host tissue) with high ZAP expression. A survey of CpG deficiency in
viral genomes identified a virulent canine coronavirus (Alphacoronavirus) as possessing
the most extreme CpG deficiency, comparable to that observed in SARS-CoV-2. This
suggests that the canine tissue infected by the canine coronavirus may provide a
cellular environment strongly selecting against CpG. Thus, viral surveys focused
on decreasing CpG in viral RNA genomes may provide important clues about the selective
environments and viral defenses in the original hosts.
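CpG deficiency is conventionally measured as the ratio of the observed CpG dinucleotide frequency to the frequency expected from the C and G content. The abstract does not spell out the index used, so the sketch below assumes that conventional observed/expected ratio; the function name `cpg_index` is hypothetical. Values well below 1 indicate CpG deficiency.

```python
def cpg_index(seq):
    """Observed/expected CpG ratio: P(CG) / (P(C) * P(G)).

    Assumes the conventional definition; the paper's exact index
    is not given in the abstract.
    """
    s = seq.upper()
    if len(s) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    n_c = s.count("C")
    n_g = s.count("G")
    if n_c == 0 or n_g == 0:
        raise ValueError("sequence must contain both C and G")
    n_cg = s.count("CG")           # 'CG' cannot overlap itself, so count() is exact
    p_cg = n_cg / (len(s) - 1)     # there are len - 1 dinucleotide positions
    p_c = n_c / len(s)
    p_g = n_g / len(s)
    return p_cg / (p_c * p_g)
```

Applied to whole viral genomes, such an index allows genomes from different hosts or tissues to be compared on the same scale.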
- Katherine E Noah, Jiasheng Hao, Luyan Li, Xiaoyan Sun, Brian Foley, Qun Yang and
Xuhua Xia. 2020 Major Revisions in Arthropod Phylogeny Through Improved Supermatrix,
With Support for Two Possible Waves of Land Invasion by Chelicerates.
Evolutionary Bioinformatics 16:1:12
Abstract: Deep phylogeny involving arthropod lineages is difficult
to recover because the erosion of phylogenetic signals over time leads to unreliable
multiple sequence alignment (MSA) and subsequent phylogenetic reconstruction. One
way to alleviate the problem is to assemble a large number of gene sequences to
compensate for the weakness in each individual gene. Such an approach has led to
many robustly supported but contradictory phylogenies. A close examination shows
that the supermatrix approach often suffers from two shortcomings. The first is
that MSA is rarely checked for reliability and, as will be illustrated, can be poor.
The second is that, to alleviate the problem of homoplasy at the third codon position
of protein-coding genes due to convergent evolution of nucleotide frequencies, phylogeneticists
may remove or degenerate the third codon position but may do it improperly and introduce
new biases. We performed extensive reanalysis of one of such “big data” sets to
highlight these two problems, and demonstrated the power and benefits of correcting
or alleviating these problems. Our results support a new group with Xiphosura and
Arachnopulmonata (Tetrapulmonata + Scorpiones) as sister taxa. This favors a new
hypothesis in which the ancestor of Xiphosura and the extinct Eurypterida (sea scorpions,
of which many later forms lived in brackish or freshwater) returned to the sea after
the initial chelicerate invasion of land. Our phylogeny is supported even with the
original data but processed with a new “principled” codon degeneration. We also
show that removing the 1673 codon sites with both AGN and UCN codons (encoding serine)
in our alignment can partially reconcile discrepancies between nucleotide-based
and AA-based tree, partly because two sequences, one with AGN and the other with
UCN, would be identical at the amino acid level but quite different at the nucleotide
level.
- Xia X, Moriyama EN, Gu X. 2020. Editorial for the special issue “RNA-Seq: Methods
and applications”
Methods 176:1-3.
Abstract: RNA-Seq is a powerful tool in molecular and evolutionary
biology. A well-built tool extends our vision, just like a microscope or a telescope,
so that we can see patterns of nature that would otherwise be hidden from us [1,
p. xiii]. With proper experimental design, RNA-Seq allows us to see the dynamics
of cellular processes at nucleotide resolution......
- Xia, X. 2020. RNA-Seq approach for accurate characterization of splicing efficiency
of yeast introns. Methods
176:25-33
Abstract: Introns in different genes, or even different introns
within the same gene, often have different splice sites and differ in splicing efficiency
(SE). One expects mass-transcribed genes to have introns with higher SE than weakly
transcribed genes. However, such a simple expectation cannot be tested directly
because variable SE for these genes is often not measured. Mechanistically, SE should
depend on signal strength at key splice sites (SS) such as 5'SS, 3'SS and branchpoint
site (BPS), i.e., SE = F(5'SS, 3'SS, BPS). However, without SE, we again cannot
model how these splice sites contribute to SE. Here I present an RNA-Seq approach
to quantify SE for each of the 304 introns in yeast (Saccharomyces cerevisiae)
genes, including 24 in the 5'UTR, by measuring 1) number of reads mapped to exon-exon
junctions (N_EE) as a proxy for the abundance of spliced form, and 2) number of reads
mapped to exon-intron junctions (N_EI5 and N_EI3 at 5' and 3' ends of intron) as a proxy
for the abundance of unspliced form. The total mRNA is N_Total = N_EE + p*N_EI5 + (1-p)*N_EI3,
with the simplest p = 0.5 but statistical methods were presented to estimate p from
data. An estimated p is needed because N_EI5 is expected to be smaller
than N_EI3 due to 1) step 1 splicing occurs before step 2 so EI5 is broken
before EI3, 2) enrichment of poly(A) mRNA by oligo-dT, and 3) 5' degradation. SE
is defined as the proportion (N_EE/N_Total). Application of
the method shows that ribosomal protein messages are efficiently and mostly cotranscriptionally
spliced. Yeast genes with long introns are also spliced efficiently. HAC1/YFL031W
is poorly spliced partly because its splicing involves a nonspliceosome mechanism
and partly because Ire1p, which participates in splicing HAC1, is hardly expressed.
Many putative yeast genes have low SE, and some splice sites are incorrectly annotated.
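The definition of splicing efficiency given in this abstract translates directly into a short calculation from the three junction read counts. The sketch below follows the stated formula with the default p = 0.5; the function and parameter names are hypothetical, and the statistical estimation of p mentioned in the abstract is not reproduced here.

```python
def splicing_efficiency(n_ee, n_ei5, n_ei3, p=0.5):
    """SE = N_EE / N_Total, with N_Total = N_EE + p*N_EI5 + (1-p)*N_EI3.

    n_ee  : reads mapped to the exon-exon junction (spliced form)
    n_ei5 : reads mapped to the 5' exon-intron junction (unspliced form)
    n_ei3 : reads mapped to the 3' exon-intron junction (unspliced form)
    p     : weight on the 5' junction count; 0.5 is the simplest choice
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    n_unspliced = p * n_ei5 + (1 - p) * n_ei3
    n_total = n_ee + n_unspliced
    if n_total == 0:
        raise ValueError("no reads mapped to any junction")
    return n_ee / n_total
```

For example, an intron with 90 exon-exon reads, 5 reads at the 5' junction and 15 at the 3' junction has 10 weighted unspliced reads at p = 0.5, so 90 of 100 transcripts are counted as spliced.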
- Wei, Y. and X. Xia (2019). "Unique Shine-Dalgarno sequences in Cyanobacteria and
chloroplasts reveal evolutionary differences in their translation initiation." Genome Biology
and Evolution 11(11):3194-3206.
Abstract: Microorganisms require efficient translation to grow
and replicate rapidly, and translation is often rate-limited by initiation. A prominent
feature that facilitates translation initiation in bacteria is the Shine-Dalgarno
(SD) sequence. However, there is much debate over its conservation in Cyanobacteria
and in chloroplasts which presumably originated from endosymbiosis of ancient Cyanobacteria.
Elucidating the utilization of SD sequences in Cyanobacteria and in chloroplasts
is therefore important to understand whether 1) SD role in Cyanobacterial translation
has been reduced prior to chloroplast endosymbiosis or 2) translation in Cyanobacteria
and in plastid has been subjected to different evolutionary pressures. To test these
alternatives, we employed genomic, proteomic, and transcriptomic data to trace differences
in SD usage between Synechocystis species, Microcystis aeruginosa, cyanophages,
Nicotiana tabacum chloroplast, and Arabidopsis thaliana chloroplast. We corrected
their mis-annotated 16S rRNA 3’ terminus using an RNA-Seq-based approach to determine
their SD/anti-SD locational constraints using an improved measurement DtoStart.
We found that cyanophages closely mimic Cyanobacteria in SD usage because both have
been under the same selection pressure for SD-mediated initiation, whereas chloroplasts
lost this similarity because the need for SD-facilitated initiation has been reduced
in plastids having much reduced genome size and different ribosomal proteins as
a result of host-symbiont co-evolution. Consequently, SD sequence significantly
increases protein expression in Cyanobacteria but not in chloroplasts, and only
Cyanobacterial genes compensate for a lack of SD sequence by having weaker secondary
structures at the 5’ UTR. Our results suggest different evolutionary pressures operate
on translation initiation in Cyanobacteria and in chloroplasts.
- Xia, X. (2019). Starless bias and parameter-estimation bias in the likelihood-based
phylogenetic method.
AIMS Genetics 5(4):212-223.
Abstract: I analyzed various site pattern combinations in a 4-OTU
case to identify sources of starless bias and parameter-estimation bias in likelihood-based
phylogenetic methods, and reported three significant contributions. First, the likelihood
method is counterintuitive in that it may not generate a star tree with sequences
that are equidistant from each other. This behaviour, dubbed starless bias, happens
in a 4-OTU tree when there is an excess (i.e., more than expected from a star tree
and a substitution model) of conflicting phylogenetic signals supporting the three
resolved topologies equally. Special site pattern combinations leading to rejection
of a star tree, when sequences are equidistant from each other, were identified.
Second, fitting gamma distribution to model rate heterogeneity over sites is strongly
confounded with tree topology, especially in conjunction with the starless bias.
I present examples to show dramatic differences in the estimated shape parameter
α between a star tree and a resolved tree. There may be no rate heterogeneity over
sites (with the estimated α > 10000) when a star tree is imposed, but α <
1 (suggesting strong rate heterogeneity over sites) when an (incorrect) resolved
tree is imposed. Thus, the dependence of “rate heterogeneity” on tree topology
implies that “rate heterogeneity” is not a sequence-specific feature, cautioning
against interpreting a small α to mean that some sites are under strong purifying
selection and others not. Third, because there is no existing (and working) likelihood
method for evaluating a star tree with continuous gamma-distributed rate, I have
implemented the method for JC69 in a self-contained R script for a four-OTU tree
(star or resolved), in addition to another R script assuming a constant rate over
sites. These R scripts should be useful for teaching and exploring likelihood methods
in phylogenetics.
- Xia, X. 2019. Translation Control of HAC1 by Regulation of Splicing in Saccharomyces
cerevisiae. Int.
J. Mol. Sci. 20(12), 2860
Abstract: Hac1p is a key transcription factor regulating the unfolded
protein response (UPR) induced by abnormal accumulation of unfolded/misfolded proteins
in the endoplasmic reticulum (ER) in Saccharomyces cerevisiae. The accumulation
of unfolded/misfolded proteins is sensed by protein Ire1p, which then undergoes
trans-autophosphorylation and oligomerization into discrete foci on the ER membrane.
HAC1 pre-mRNA, which is exported to the cytoplasm but is blocked from translation
by its intron sequence looping back to its 5’UTR to form base-pair interaction,
is transported to the Ire1p foci to be spliced, guided by a cis-acting bipartite
element at its 3’UTR (3’BE). Spliced HAC1 mRNA can be efficiently translated. The
resulting Hac1p enters the nucleus and activates, together with coactivators, a
large number of genes encoding proteins such as protein chaperones to restore and
maintain ER homeostasis and secretory protein quality control. This review details
the translation regulation of Hac1p production, mediated by the nonconventional
splicing, in the broad context of translation control and summarizes the evolution
and diversification of the UPR signaling pathway among fungal, metazoan and plant
lineages.
- Xia, X. (2019). Optimizing Phage Translation Initiation.
OBM Genetics 3(4):16.
Abstract: Phage as an anti-bacterial agent must be efficient in
killing bacteria, and consequently needs to replicate efficiently. Protein production
is a limiting step in replication in almost all forms of life, including phages.
Efficient protein production depends on the efficiency of translation initiation,
elongation and termination, with translation initiation often being rate limiting.
Initiation signals such as Shine-Dalgarno (SD) sequences and start codon are decoded
by anti-SD sequences and initiation tRNA, respectively. While the decoding machinery
cannot be readily modified, the signals can be engineered to increase the efficiency
of their decoding. Here I review our understanding of the translation machinery
to facilitate the engineering of optimal translation initiation signals for facilitating
the design of phage protein-coding genes, including 1) accurate characterization
of the 3' end of 16S rRNA by using RNA-Seq data, 2) identification of the optimal
SD/aSD interaction, and 3) reduction of secondary structure in sequences flanking
the start codon.
- Xia, X. 2019. PGT: Visualizing temporal and spatial biogeographic patterns. Global
Ecology & Biogeography 28:1195-1199
Aim: A geophylogeny, generated by mapping a phylogeny onto geographic
regions, graphically summarizes large-scale genetic variation over space and time,
and is consequently crucial for conceptual understanding and visualization of global
biogeographic patterns. The rapidly expanding DNA barcoding data with geographic
coordinates associated with each specimen have dramatically increased the number
of global phylogeographic studies that would benefit from software generating geophylogenies.
A number of software programs have been developed, some with advanced features,
but they either require additional software or lack in quality, especially in geographic
resolution. Innovation: PGT (Phylogeographic Tree), freely available
at http://dambe.bio.uottawa.ca/PGT/PGT.aspx,
combines the highest map quality and user-friendliness. It accesses Microsoft Bing
Maps and Google Maps seamlessly and generates geophylogenies on high-resolution
regular or terrain maps. Only a few mouse clicks are needed from PGT installation
to the generation of high-resolution geophylogenies, making PGT perfect for both
teaching and research in global ecology and biogeography. The input tree can be
in NEXUS or Newick format, and the geographic data with latitude and longitude values
can be in tab-delimited or comma-delimited format as those exported from spreadsheet
programs. A Quick-Start guide is included in the built-in help system. Main
conclusions: PGT is simpler, more elegant, and of much higher quality
than alternatives for plotting phylogenetic trees over geographic regions for visualizing
distribution of biodiversity over space and time.
- Wei Y, Silke JR, Xia X. 2019. An improved estimation
of tRNA expression to better elucidate the coevolution between tRNA abundance and
codon usage in bacteria. Scientific Reports
9:3184
Abstract: The degree to which codon usage can be explained by tRNA
abundance in bacterial species is often inadequate, partly because differential
tRNA abundance is often approximated by tRNA copy numbers. To better understand
the coevolution between tRNA abundance and codon usage, we provide a better estimate
of tRNA abundance by profiling tRNA mapped reads (tRNA tpm) using publicly available
RNA Sequencing data. To emphasize the feasibility of our approach, we demonstrate
that tRNA tpm is consistent with tRNA abundances derived from RNA fingerprinting
experiments in Escherichia coli, Bacillus subtilis, and Salmonella enterica. Furthermore,
we do not observe an appreciable reduction in tRNA sequencing efficiency due to
post-transcriptional methylations in the seven bacteria studied. To determine translationally
optimal codons, we calculate codon usage in highly and lowly expressed genes determined
by protein per transcript. We found that tRNA tpm identifies more translationally
optimal codons than gene copy number and early tRNA fingerprinting abundances. Additionally,
tRNA tpm improves the predictive power of tRNA adaptation index over codon preference.
Our results suggest that dependence of codon usage on tRNA availability is not always
associated with species growth-rate. Conversely, tRNA availability is better optimized
to codon usage in fast-growing than slow-growing species.
- Xia, X. 2019. Is there a mutation gradient along vertebrate mitochondrial genome
mediated by genome replication? Mitochondrion 46:30-40
Abstract: There is a long-held belief that a mutation gradient
exists along vertebrate mtDNA, mediated by mitochondrial replication that leaves
different parts of the H-strand exposed in single-stranded state for different durations
(DssH). However, the predicted mutation gradient and its tests suffer
from both conceptual and empirical problems. I assembled representative mammalian,
avian and crocodilian mtDNA to test this prediction. I measured substitution rates
at codon positions 1 and 2 (S12) and at codon position 3 (S3), as well as synonymous
and nonsynonymous substitution rates, and checked their change along the hypothetical
gradient. Mammalian species do not support the predicted mutation gradient, although
they should according to the model. Crocodilian species exhibit a pattern closest
to the prediction, although they should not because their OL, if present, is not
at a fixed position. Correlation between S3 and DssH is much weaker than
that between S12 and DssH (contrary to the prediction). This is not due
to substitution saturation but is instead due to differential gene conservation,
e.g., COX1 is far more conserved than ND6 in all metazoans no matter where they
are located along mtDNA. In vertebrates, conserved genes such as COX1 happen to
have small DssH and variable genes such as ND6 happen to have large
DssH. The observed “mutation gradient” is driven by nonsynonymous substitutions,
with synonymous substitutions associated with a much weaker “mutation gradient”
likely caused by differential codon re-adaptation after nonsynonymous substitutions.
The mammalian and avian results are also confirmed by a much larger compilation
and analysis of 691 mammalian and 462 avian mtDNAs. The results, however, do not
reject the strand-displacement model (SDM) of mtDNA replication
because a mutation gradient is not a necessary consequence of SDM.
- Silke JR, Wei Y, Xia X.
2018. RNA-Seq-Based Analysis Reveals Heterogeneity in Mature 16S rRNA 3' Termini
and Extended Anti-Shine-Dalgarno Motifs in Bacterial Species. G3: Genes|Genomes|Genetics
8(12):3973-3979
Abstract: We present an RNA-Seq based approach to map 3′ end sequences
of mature 16S rRNA (3′ TAIL) in bacteria with single-base specificity. Our results
show that 3′ TAILs are heterogeneous among species; they contain the core CCUCC
anti-Shine-Dalgarno motif, but vary in downstream lengths. Importantly, our findings
rectify the mis-annotated 16S rRNAs in 11 out of 13 bacterial species studied herein
(covering Cyanobacteria, Deinococcus-Thermus, Firmicutes, Proteobacteria, Tenericutes,
and Spirochaetes). Furthermore, our results show that species-specific 3′ TAIL boundaries
are retained due to their high complementarity with preferred Shine-Dalgarno sequences,
suggesting that 3′ TAIL bases downstream of the canonical CCUCC motif play a more
important role in translation initiation than previously reported.
- Xia X. (2018) Imputing missing distances in molecular phylogenetics. PeerJ 6:e5321
Abstract: Missing data are frequently encountered in molecular
phylogenetics, but there has been no accurate distance imputation method available
for distance-based phylogenetic reconstruction. The general framework for distance
imputation is to explore tree space and distance values to find an optimal combination
of output tree and imputed distances. Here I develop a least-square method coupled
with multivariate optimization to impute multiple missing distances in a distance
matrix or from a set of aligned sequences with missing genes so that some sequences
share no homologous sites (whose distances therefore need to be imputed). I show
that phylogenetic trees can be inferred from distance matrices with about 10% of
distances missing, and the accuracy of the resulting phylogenetic tree is almost
as good as the tree from full information. The new method has the advantage over
a recently published one in that it does not assume a molecular clock and is more
accurate (comparable to maximum likelihood method based on simulated sequences).
I have implemented the function in DAMBE software, which is freely available at
http://dambe.bio.uottawa.ca.
- Xia, X. 2018. DAMBE7: New and improved tools for data analysis in molecular biology
and evolution. Molecular Biology and Evolution 35:1550–1552.
Abstract: DAMBE is a comprehensive software package for genomic
and phylogenetic data analysis on Windows, Linux and Macintosh computers. New functions
include imputing missing distances and phylogeny simultaneously (paving the way
to build large phage and transposon trees), new bootstrapping/jackknifing methods
for PhyPA (phylogenetics from pairwise alignments), and an improved function for
fast and accurate estimation of the shape parameter of the gamma distribution for
fitting rate heterogeneity over sites. The previous method corrects multiple hits for
each site independently. DAMBE’s new method uses all sites simultaneously for correction.
DAMBE, featuring a user-friendly graphic interface, is freely available from http://dambe.bio.uottawa.ca.
- Wei Y, Silke JR, Xia X.
2017. Elucidating the 16S rRNA 3′ boundaries and defining optimal SD/aSD pairing
in Escherichia coli and Bacillus subtilis using RNA-Seq data.
Scientific Reports 7:17639
Abstract: Bacterial translation initiation is influenced by base
pairing between the Shine-Dalgarno (SD) sequence in the 5′ UTR of mRNA and the anti-SD
(aSD) sequence at the free 3′ end of the 16S rRNA (3′ TAIL) due to: 1) the SD/aSD
sequence binding location and 2) SD/aSD binding affinity. In order to understand
what makes an SD/aSD interaction optimal, we must define: 1) terminus of the 3′
TAIL and 2) extent of the core aSD sequence within the 3′ TAIL. Our approach to
characterize these components in Escherichia coli and Bacillus subtilis involves
1) mapping the 3′ boundary of the mature 16S rRNA using high-throughput RNA sequencing
(RNA-Seq), and 2) identifying the segment within the 3′ TAIL that is strongly preferred
in SD/aSD pairing. Using RNA-Seq data, we resolve previous discrepancies in the
reported 3′ TAIL in B. subtilis and recovered the established 3′ TAIL in E. coli.
Furthermore, we extend previous studies to suggest that both highly and lowly expressed
genes favor SD sequences with intermediate binding affinity, but this trend is exclusive
to SD sequences that complement the core aSD sequences defined herein.
- Xia X (2017) ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic
Data. G3: Genes|Genomes|Genetics 7:3839-3848
Abstract: Two major stumbling blocks exist in high-throughput sequencing
(HTS) data analysis. The first is the sheer file size typically in gigabytes when
uncompressed, causing problems in storage, transmission and analysis. However, these
files do not need to be so large and can be reduced without loss of information.
Each HTS file, either in compressed .SRA or plain text .fastq format, contains numerous
identical reads stored as separate entries. For example, among 44603541 forward
reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study)
deposited at NCBI's SRA database, one read has 497027 identical copies. Instead
of storing them as separate entries, one can and should store them as a single entry
with the SeqID_NumCopy format (which I dub as FASTA+ format). The second is the
proper allocation of reads that map equally well to paralogous genes. I illustrate
in detail a new method for such allocation. I have developed ARSDA software that
implements these new approaches. A number of HTS files for model species are in the
process of being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate
that this approach not only saves a huge amount of storage space and transmission
bandwidth, but also dramatically reduces time in downstream data analysis. Instead
of matching the 497027 identical reads separately against the Bacillus subtilis
genome, one only needs to match it once. ARSDA includes functions to take advantage
of HTS data in the new sequence format for downstream data analysis such as gene
expression characterization. I contrasted gene expression results between ARSDA
and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely
available for Windows, Linux and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
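The read-collapsing idea described in this abstract can be illustrated with a minimal Python sketch. It assumes reads are already available as plain strings; the function names and the `UniqueSeq` ID prefix are hypothetical, and only the `SeqID_NumCopy` header layout comes from the abstract. It is not ARSDA's implementation.

```python
from collections import Counter


def collapse_reads(reads):
    """Return (sequence, copy_count) pairs, most abundant first.

    Identical reads are counted once instead of being kept as
    separate entries.
    """
    return Counter(reads).most_common()


def to_fasta_plus(reads, prefix="UniqueSeq"):
    """Format collapsed reads with SeqID_NumCopy headers.

    `prefix` is a hypothetical naming choice for this illustration.
    """
    lines = []
    for index, (seq, copies) in enumerate(collapse_reads(reads), start=1):
        lines.append(f">{prefix}{index}_{copies}")
        lines.append(seq)
    return "\n".join(lines)


if __name__ == "__main__":
    demo_reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
    print(to_fasta_plus(demo_reads))
```

With reads stored this way, a downstream mapper needs to align each unique sequence only once and can weight the result by the copy number in the header.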
- Abolbaghaei A, Silke JR, Xia X. 2017 How Changes
in Anti-SD Sequences Would Affect SD Sequences in Escherichia coli and Bacillus
subtilis. G3: Genes|Genomes|Genetics 7(5):1607–1615
Abstract: The 3' end of the small ribosomal RNAs (ssu rRNA) in
bacteria is directly involved in the selection and binding of mRNA transcripts during
translation initiation via well-documented interactions between a Shine-Dalgarno
(SD) sequence located upstream of the initiation codon and an anti-SD (aSD) sequence
at the 3' end of the ssu rRNA. Consequently, the 3' end of ssu rRNA (3'TAIL) is
strongly conserved among bacterial species because a change in the region may impact
the translation of many protein-coding genes. Escherichia coli and Bacillus subtilis
differ in their 3' ends of ssu rRNA, being GAUCACCUCCUUA3' in E. coli and
GAUCACCUCCUUUCU3' or GAUCACCUCCUUUCUA3' in B. subtilis. Such differences
in 3'TAIL lead to species-specific SDs (designated SDEc for E. coli and SDBs for
B. subtilis) that can form strong and well-positioned SD/aSD pairing in one species
but not in the other. Selection mediated by the species-specific 3'TAIL is expected
to favour SDBs against SDEc in B. subtilis but favour SDEc against SDBs in E. coli.
Among well-positioned SDs, SDEc is used more in E. coli than in B. subtilis, and
SDBs more in B. subtilis than in E. coli. Highly expressed genes and genes of high
translation efficiency tend to have longer SDs than lowly expressed genes and genes
with low translation efficiency in both species, but more so in B. subtilis than
in E. coli. Both species overuse SDs matching the bolded part of 3'TAIL shown above.
The 3'TAIL difference contributes to host-specificity of phages.
- Xia X. 2017. Self-Organizing Map for Characterizing Heterogeneous Nucleotide and
Amino Acid Sequence Motifs. Computation 5(4):43
Abstract: A self-organizing map (SOM) is an artificial neural network
algorithm that can learn from the training data consisting of objects expressed
as vectors and perform non-hierarchical clustering to represent input vectors into
discretized clusters, with vectors assigned to the same cluster sharing similar
numeric or alphanumeric features. SOM has been used widely in transcriptomics to
identify co-expressed genes as candidates for co-regulated genes. I envision SOM
to have great potential in characterizing heterogeneous sequence motifs, and aim
to illustrate this potential by a parallel presentation of SOM with a set of numerical
vectors and a set of equal-length sequence motifs. While there are numerous biological
applications of SOM involving numerical vectors, few studies have used SOM for heterogeneous
sequence motif characterization. This paper is intended to encourage (1) researchers
to study SOM in this new domain and (2) computer programmers to develop user-friendly
motif-characterization SOM tools for biologists.
- Xia X. 2017. Bioinformatics and Drug Discovery. Current Topics in Medicinal
Chemistry 17(15):1709-1726
Abstract: Bioinformatic analysis can not only accelerate drug target
identification and drug candidate screening and refinement, but also facilitate
characterization of side effects and predict drug resistance. High-throughput data
such as genomic, epigenetic, genome architecture, cistromic, transcriptomic, proteomic,
and ribosome profiling data have all made significant contribution to mechanism-based
drug discovery and drug repurposing. Accumulation of protein and RNA structures,
as well as development of homology modeling and protein structure simulation, coupled
with large structure databases of small molecules and metabolites, paved the way
for more realistic protein-ligand docking experiments and more informative virtual
screening. I present the conceptual framework that drives the collection of these
high-throughput data, summarize the utility and potential of mining these data in
drug discovery, outline a few inherent limitations in data and software mining these
data, point out new ways to refine analysis of these diverse types of data, and
highlight commonly used software and databases relevant to drug discovery.
- Wei Y, Xia X 2017 The Role of +4U as an Extended
Translation Termination Signal in Bacteria. Genetics 205:539–549
Abstract: Termination efficiency of stop codons depends on the first
3’ flanking (+4) base in bacteria and eukaryotes. In both Escherichia coli and Saccharomyces
cerevisiae, termination read-through is reduced in the presence of +4U; however,
the molecular mechanism underlying +4U function is poorly understood. Here, we perform
comparative genomics analysis on 25 bacterial species (covering Actinobacteria,
Bacteroidetes, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Proteobacteria and
Spirochaetae) with bioinformatics approaches to examine the influence of +4U in
bacterial translation termination by contrasting between highly and lowly expressed
genes (HEGs and LEGs). We estimated gene expression using the recently formulated
Index of Translation Elongation, ITE, and identified stop codon near-cognate tRNAs
from well annotated genomes. We show that +4U was consistently over-represented
in UAA-ending HEGs relative to LEGs. The result is consistent with the interpretation
that +4U enhances termination mainly for UAA. Usage of +4U decreases in GC-rich
species where most stop codons are UGA and UAG, with few UAA-ending genes, which
is expected if UAA usage in HEGs drives up +4U usage. In highly expressed genes,
+4U usage increases significantly with abundance of UAA nc_tRNAs (near-cognate tRNAs
which decode codons differing from UAA by a single nucleotide), particularly those
with a mismatch at the first stop codon site. UAA is always the preferred stop codon
in highly expressed genes, and our results suggest that UAAU is the most efficient
translation termination signal in bacteria.
- Vlasschaert C, Cook D, Xia X, Gray DA. 2017. The
evolution and functional diversification of the deubiquitinating enzyme superfamily.
Genome Biology and Evolution 9:558-573
Abstract: Ubiquitin and ubiquitin-like molecules are attached to
and removed from cellular proteins in a dynamic and highly regulated manner. Deubiquitinating
enzymes are critical to this process, and the genetic catalogue of deubiquitinating
enzymes expanded greatly over the course of evolution. Extensive functional redundancy
has been noted among the 93 members of the human deubiquitinating enzyme (DUB) superfamily.
This is especially true of genes that were generated by duplication (termed paralogs)
as they often retain considerable sequence similarity. Since complete redundancy
in systems should be eliminated by selective pressure we theorized that many overlapping
DUBs must have significant and unique spatiotemporal roles that can be evaluated
in an evolutionary context. We have determined the evolutionary history of the entire
class of deubiquitinating enzymes, including the sequence and means of duplication
for all paralogous pairs. To establish their uniqueness, we have investigated cell-type
specificity in developmental and adult contexts, and have investigated the co-emergence
of substrates from the same duplication events. Our analysis has revealed examples
of DUB gene subfunctionalization, neofunctionalization, and nonfunctionalization.
- Xia X 2017. Deriving Transition Probabilities and Evolutionary Distances from Substitution
Rate Matrix by Probability Reasoning. J Genet Genome Res 4:031.
Abstract: Substitution rate matrices are used to correct multiple
hits at the same sites, which requires the derivation of transition probabilities
and evolutionary distances from substitution rate matrices. The derivation is essential
in molecular phylogenetics and phylogenomics, and represents the only statistically
sound way for developing scoring matrices used in sequence alignment and local string
matching (e.g., BLAST and FASTA). Three different approaches are frequently used
for deriving transition probabilities and evolutionary distances: 1) The probability
reasoning, 2) Solving partial differential equations, and 3) Matrix exponential
and logarithm. The first approach demands the least amount of mathematical skills
but offers the best way for conceptual understanding, and can often generate nice
mathematical expressions of transition probabilities and evolutionary distances.
This review represents the most systematic and comprehensive numerical illustration
of the first approach.
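As a small numerical illustration of what "transition probabilities and evolutionary distances" mean here, the sketch below uses the JC69 model, the simplest substitution model. The choice of JC69 and the function names are assumptions made for illustration; the abstract does not say which models the review works through.

```python
import math


def jc69_transition_probs(d):
    """Return (p_same, p_diff) for JC69 at distance d.

    d is the expected number of substitutions per site. p_same is the
    probability that a site shows the same nucleotide after distance d;
    p_diff is the probability of each one of the three other nucleotides.
    """
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc69_distance(p):
    """Return the JC69 distance from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p_same, p_diff = jc69_transition_probs(0.3)
    print(p_same, p_diff, jc69_distance(3.0 * p_diff))
```

The two functions are inverses of each other in the sense that feeding the expected proportion of differing sites (3 × p_diff) back into `jc69_distance` recovers the original distance, which is the multiple-hit correction the abstract refers to.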
- Xia X. 2017. DAMBE6: New tools for microbial genomics, phylogenetics and molecular
evolution. Journal of Heredity 108(4):431-437.
Abstract: DAMBE is a comprehensive software workbench for data analysis
in molecular biology, phylogenetics and evolution. Several important new functions
have been added since version 5 of DAMBE: 1) comprehensive genomic profiling of
translation initiation efficiency of different genes in different prokaryotic species,
2) a new index of translation elongation (ITE) that takes into account both tRNA-mediated
selection and background mutation on codon-anticodon adaptation, 3) a new and accurate
phylogenetic approach based on pairwise alignment only, which is useful for highly
divergent sequences from which a reliable multiple sequence alignment is difficult
to obtain. Many other functions have been updated and improved including PWM for
motif characterization, Gibbs sampler for de novo motif discovery, hidden Markov
models for protein secondary structure prediction, self-organizing map for non-linear
clustering of transcriptomic data, comprehensive sequence alignment and phylogenetic
functions. DAMBE features a graphic, user-friendly and intuitive interface, and
is freely available from http://dambe.bio.uottawa.ca.
- Xia X. 2016. PhyPA: phylogenetic method with pairwise sequence alignment outperforms
likelihood methods in phylogenetics involving highly diverged sequences. Molecular
Phylogenetics and Evolution 102:331–343
Abstract: While pairwise sequence alignment (PSA) by dynamic programming
is guaranteed to generate one of the optimal alignments, multiple sequence alignment
(MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing
all subsequent phylogenetic analysis. One way to avoid this problem is to use only
PSA to reconstruct phylogenetic trees, which can only be done with distance-based
methods. I compared the accuracy of this new computational approach (named PhyPA
for phylogenetics by pairwise alignment) against the maximum likelihood method using
MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated
with different topologies and tree lengths. I present a surprising discovery that
the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly
diverged sequences even when all optimization options were turned on for the ML+MSA
approach. Only when sequences are not highly diverged (i.e., when a reliable MSA
can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies
are always recovered by ML with the true alignment from the simulation. However,
with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered
topology consistently has higher likelihood than that for the true topology. Thus,
the failure to recover the true topology by the ML+MSA is not because of insufficient
search of tree space, but because of the distortion of phylogenetic signal by MSA methods.
I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data
sets to derive phylogenetic support for subtrees equivalent to resampling techniques
such as bootstrapping and jackknifing.
- Wei, Y., Wang, J., Xia, X. 2016. Coevolution between
stop codon usage and release factors in bacterial species. Molecular Biology
and Evolution 33:2357-2367.
Abstract: Three stop codons in bacteria represent different translation
termination signals, and their usage is expected to depend on their differences
in translation termination efficiency, mutation bias, and relative abundance of
release factors (RF1 decoding UAA and UAG, and RF2 decoding UAA and UGA). In 14
bacterial species (covering Proteobacteria, Firmicutes, Cyanobacteria, Actinobacteria
and Spirochetes) with cellular RF1 and RF2 quantified, UAA is consistently over-represented
in highly expressed genes (HEGs) relative to lowly expressed genes (LEGs), whereas
UGA usage is the opposite even in species where RF2 is far more abundant than RF1.
UGA usage relative to UAG increases significantly with PRF2 [=RF2/(RF1+RF2)]
as expected from adaptation between stop codons and their decoders. PRF2
is greater than 0.5 over a wide range of AT content (measured by PAT3
as the proportion of AT at third codon sites), but decreases rapidly towards zero
at the high range of PAT3. This explains why bacterial lineages with
high PAT3 often have UGA reassigned because of low RF2. There is no indication
that UAG is a minor stop codon in bacteria as claimed in a recent publication. The
claim is invalid because of the failure to apply the two key criteria in identifying
a minor codon: 1) it is least preferred by HEGs (or most preferred by LEGs) and
2) it corresponds to the least abundant decoder. Our results suggest a more plausible
explanation for why UAA usage increases, and UGA usage decreases, with PAT3,
but UAG usage remains low over the entire PAT3 range.
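The two quantities defined in this abstract, PRF2 = RF2/(RF1+RF2) and PAT3 (the proportion of A or T at third codon sites), can be computed as in the short sketch below. The function names and the toy input values are hypothetical; the sketch assumes an in-frame coding sequence given as a string.

```python
def p_rf2(rf1, rf2):
    """Proportion of release factor 2 among RF1 + RF2 abundances."""
    total = rf1 + rf2
    if total <= 0:
        raise ValueError("RF1 + RF2 must be positive")
    return rf2 / total


def p_at3(cds):
    """Proportion of A or T (U in RNA) at third codon positions.

    Only complete codons contribute a third site, so a trailing
    partial codon is ignored.
    """
    thirds = cds.upper()[2::3]
    if not thirds:
        raise ValueError("sequence has no complete codon")
    return sum(base in "ATU" for base in thirds) / len(thirds)


if __name__ == "__main__":
    print(p_rf2(1.0, 3.0))      # toy abundances, not measured values
    print(p_at3("ATGAAATAA"))   # third sites are G, A, A
```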
- Sun X, Xia X, Yang Q. 2016. Dating the origin of the major lineages of Branchiopoda.
Palaeoworld 25 (2), 303-317
Abstract: Despite the well-established phylogeny and good fossil
record of branchiopods, a consistent macro-evolutionary timescale for the group
remains elusive. This study focuses on the early branchiopod divergence dates where
fossil record is extremely fragmentary or missing. On the basis of a large genomic
dataset and carefully evaluated fossil calibration points, we assess the quality
of the branchiopod fossil record by calibrating the tree against well-established
first occurrences, providing paleontological estimates of divergence times and completeness
of their fossil record. The maximum age constraints were set using a quantitative
approach of Marshall (2008). We tested the alternative placements of Yicaris and
Wujicaris in the referred arthropod tree via the likelihood checkpoints method.
Divergence dates were calculated using Bayesian relaxed molecular clock and penalized
likelihood methods. Our results show that the stem group of Branchiopoda is rooted
in the late Neoproterozoic (563 ± 7 Ma); the crown-Branchiopoda diverged during
middle Cambrian to Early Ordovician (478–512 Ma), likely representing the origin
of the freshwater biota; the Phyllopoda clade diverged during Ordovician (448–480
Ma) and Diplostraca during Late Ordovician to early Silurian (430–457 Ma). By evaluating
the congruence between the observed times of appearance of clades in the fossil record
and the results derived from molecular data, we found that the uncorrelated rate
model gave more congruent results for shallower divergence events whereas the auto-correlated
rate model gave more congruent results for deeper events.
- Vlasschaert, C., Xia, X., Gray, D.A. 2016. Selection
preserves Ubiquitin Specific Protease 4 alternative exon skipping in therian mammals.
Scientific Reports 6:20039.
Abstract Ubiquitin specific protease 4 (USP4) is a highly networked
deubiquitinating enzyme with reported roles in cancer, innate immunity and RNA splicing.
In mammals it has two dominant isoforms arising from inclusion or skipping of exon
7 (E7). We evaluated two plausible mechanisms for the generation of these isoforms:
(A) E7 skipping due to a long upstream intron and (B) E7 skipping due to inefficient
5′ splice sites (5′SS) and/or branchpoint sites (BPS). We then assessed whether
E7 alternative splicing is maintained by selective pressure or arose from genetic
drift. Both transcript variants were generated from a USP4-E7 minigene construct
with short flanking introns, an observation consistent with the second mechanism
whereby differential splice signal strengths are the basis of E7 skipping. Optimization
of the downstream 5′SS eliminated E7 skipping. Experimental validation of the correlation
between 5′SS identity and exon skipping in vertebrates pinpointed the +6 site as
the key splicing determinant. Therian mammals invariably display a 5′SS configuration
favouring alternative splicing and the resulting isoforms have distinct subcellular
localizations. We conclude that alternative splicing of mammalian USP4 is under
selective maintenance and that long and short USP4 isoforms may target substrates
in various cellular compartments.
- Vlasschaert, C., Xia, X., Coulombe, J., Gray, D.A.
2015. Evolution of the highly networked deubiquitinating enzymes USP4, USP15 and
USP11. BMC Evolutionary Biology 15:230.
Background: USP4, USP15 and USP11 are paralogous deubiquitinating
enzymes as evidenced by structural organization and sequence similarity. Based on
known interactions and substrates it would appear that they have partially redundant
roles in pathways vital to cell proliferation, development and innate immunity,
and elevated expression of all three has been reported in various human malignancies.
The nature and order of duplication events that gave rise to these extant genes
has not been determined, nor has their functional redundancy been established experimentally
at the organismal level. Methods: We have employed phylogenetic
and syntenic reconstruction methods to determine the chronology of the duplication
events that generated the three paralogs and have performed genetic crosses to evaluate
redundancy in mice. Results: Our analyses indicate that USP4 and
USP15 arose from whole genome duplication prior to the emergence of jawed vertebrates.
Despite having lower sequence identity USP11 was generated later in vertebrate evolution
by small-scale duplication of the USP4-encoding region. While USP11 was subsequently
lost in many vertebrate species, all available genomes retain a functional copy
of either USP4 or USP15, and through genetic crosses of mice with inactivating mutations
we have confirmed that viability is contingent on a functional copy of USP4 or USP15.
Loss of ubiquitin-exchange regulation, constitutive skipping of the seventh exon
and neural-specific expression patterns are derived states of USP11. Post-translational
modification sites differ between USP4, USP15 and USP11 throughout evolution.
Conclusions: In isolation, sequence alignments can generate erroneous
USP gene phylogenies. Through a combination of methodologies the gene duplication
events that gave rise to USP4, USP15, and USP11 have been established. Although
it operates in the same molecular pathways as the other USPs, the rapid divergence
of the more recently generated USP11 enzyme precludes its functional interchangeability
with USP4 and USP15. Given their multiplicity of substrates the emergence (and in
some cases subsequent loss) of these USP paralogs would be expected to alter the
dynamics of the networks in which they are embedded.
- Prabhakaran, R., Chithambaram, S., Xia, X.
2015. Escherichia coli and Staphylococcus phages: Effect of translation
initiation efficiency on differential codon adaptation mediated by virulent and
temperate lifestyles. Journal of General Virology 96:1169-1179.
Abstract Rapid biosynthesis is key to the success of bacteria and
viruses. Highly expressed genes in bacteria exhibit strong codon bias corresponding
to differential availability of tRNAs. However, a large clade of lambdoid coliphages
exhibit relatively poor codon adaptation to the host translation machinery, in contrast
to other coliphages that exhibit strong codon adaptation to the host. Three possible
explanations were previously proposed but dismissed: 1) the phage-borne tRNA genes
that reduce the dependence of phage translation on host tRNAs, 2) lack of time needed
for evolving codon adaptation due to recent host switching, and 3) strong strand
asymmetry with biased mutation disrupting codon adaptation. Here we examine the
possibility that phages with relatively poor codon adaptation have poor translation
initiation which would weaken the selection on codon adaptation. We measure translation
initiation by 1) the strength and position of the Shine-Dalgarno (SD) sequence
and 2) the stability of secondary structure of sequences flanking SD and start codon
known to affect accessibility of SD and start codon. Phage genes with strong codon
adaptation have significantly stronger SD sequences than those with poor codon adaptation.
The former also have significantly weaker secondary structure in sequences flanking
SD and start codon than the latter. Thus, lambdoid phages do not exhibit strong
codon adaptation because they have relatively inefficient translation initiation
and would benefit little from increased elongation efficiency. We also provide evidence
suggesting that phage lifestyle (virulent versus temperate) affects selection intensity
on the efficiency of translation initiation and elongation.
- Xia X. 2015. A major controversy in codon-anticodon adaptation resolved by
a new codon usage index. Genetics 199:573-579
Abstract Two alternative hypotheses attribute different benefits
to codon-anticodon adaptation. The first assumes that protein production is rate-limited
by both initiation and elongation, and codon-anticodon adaptation would result in
higher elongation efficiency and more efficient and accurate protein production,
especially for highly expressed genes. The second claims that protein production
is rate-limited only by initiation efficiency, but improved codon adaptation and
consequently increased elongation efficiency have the benefit of increasing ribosomal
availability for global translation. To test these hypotheses, a recent study engineered
a synthetic library of 154 genes, all encoding the same protein but differing in
degrees of codon adaptation, to quantify the effect of differential codon adaptation
on protein production in Escherichia coli. The surprising conclusion that “codon
bias did not correlate with gene expression” and that “translation initiation, not
elongation, is rate-limiting for gene expression” contradicts the conclusion reached
by many other empirical studies. Here I resolve the contradiction by reanalyzing
the data from the 154 sequences. I demonstrate that translation elongation accounts
for about 17% of total variation in protein production and that the previous conclusion
is due to the use of CAI (codon adaptation index) which does not account for the
mutation bias in characterizing codon adaptation. The effect of translation elongation
becomes undetectable only when translation initiation is unrealistically slow. A
new index of translation elongation (ITE) is formulated to facilitate
studies on the efficiency and evolution of the translation machinery.
- Nikbakht, H., Xia, X., Hickey, D. 2014. The evolution of genomic GC content undergoes
a rapid reversal within the genus Plasmodium. Genome 57:507-511
Abstract The genome of the malarial parasite, Plasmodium falciparum,
is extremely AT-rich. This bias toward a low GC content is a characteristic of several
- but not all - species within the genus Plasmodium. We compared 4283 orthologous
pairs of protein-coding sequences between P. falciparum and the less AT-biased P.
vivax. Our results indicate that the common ancestor of these two species was also
extremely AT-rich. This means that, although there was a strong bias toward A+T
during the early evolution of the ancestral Plasmodium lineage, there was a subsequent
reversal of this trend during the more recent evolution of some species, such as
P. vivax. Moreover, we show that not only is the P. vivax genome losing its AT richness,
it is actually gaining a very significant degree of GC richness. This example illustrates
the potential volatility of nucleotide content during the course of molecular evolution.
Such reversible fluxes in nucleotide content within lineages could have important
implications for phylogenetic reconstruction based on molecular sequence data.
- Chithambaram S, Prabhakaran R, Xia X. 2014.
Differential codon adaptation between dsDNA and ssDNA phages in E. coli. Molecular
Biology and Evolution 31:1606-1617
Abstract Because phages use their host translation machinery, their
codon usage should evolve towards that of highly expressed host genes. We used two
indices to measure codon adaptation of phages to their host, rRSCU (the correlation
in RSCU between phages and their host) and CAI computed with highly expressed host
genes as the reference set (because phage translation depends on host translation
machinery). These indices are appropriate for this purpose only when hosts
exhibit little mutation bias, so only phages parasitizing Escherichia coli were
included in the analysis. For double-stranded (dsDNA) phages, both rRSCU and CAI
decrease with increasing number of tRNA genes encoded by the phage genome. rRSCU
is greater for dsDNA phages than for ssDNA phages, and the low rRSCU values are
mainly due to poor concordance in RSCU values for Y-ending codons between ssDNA
phages and the E. coli host, consistent with the predicted effect of C→T mutation
bias in the ssDNA phages. Strong C→T mutation bias would improve codon adaptation
in codon families (e.g., Gly) where U-ending codons are favored over C-ending codons
(“U-friendly” codon families) by highly expressed host genes, but decrease codon
adaptation in other codon families where highly expressed host genes favor C-ending
codons against U-ending codons (“U-hostile” codon families). It is remarkable that
ssDNA phages with increasing C→T mutation bias also increased the usage of
codons in the “U-friendly” codon families, thereby achieving CAI values almost as
large as those of dsDNA phages. This represents a new type of codon adaptation.
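Illustration (not taken from the paper): a minimal Python sketch of RSCU and of rRSCU as described in the abstract above, that is, the Pearson correlation of RSCU values between a phage and its host. The two-family codon table and the names `FAMILIES`, `rscu`, `pearson`, and `r_rscu` are assumptions made for this example; a real analysis would cover every synonymous codon family.

```python
from collections import Counter

# Toy synonymous codon families (hypothetical subset of the genetic code).
FAMILIES = {
    "Gly": ["GGU", "GGC", "GGA", "GGG"],
    "Lys": ["AAA", "AAG"],
}

def rscu(codons):
    """Return {codon: RSCU}, where RSCU = observed count / mean count in its family."""
    counts = Counter(codons)
    result = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not used: RSCU is undefined for this family
        expected = total / len(family)
        for c in family:
            result[c] = counts[c] / expected
    return result

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def r_rscu(phage_codons, host_codons):
    """Correlation of RSCU between phage and host over codons scored in both."""
    p, h = rscu(phage_codons), rscu(host_codons)
    shared = sorted(set(p) & set(h))
    return pearson([p[c] for c in shared], [h[c] for c in shared])

# Hypothetical codon lists standing in for concatenated coding sequences.
host = ["GGU"] * 6 + ["GGC"] * 2 + ["AAA"] * 3 + ["AAG"]
phage = ["GGU"] * 3 + ["GGC"] * 3 + ["GGA"] * 2 + ["AAA"] * 2 + ["AAG"] * 2
print(round(r_rscu(phage, host), 3))
```

A value near 1 would indicate that the phage uses synonymous codons in the same proportions as the host; the abstract uses highly expressed host genes for CAI, and the choice of reference genes here is left to the reader.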
- Prabhakaran R, Chithambaram S, Xia X. 2014. Aeromonas
phages encode tRNAs for their overused codons. Int. J. Computational Biology and
Drug Design 7:168-183.
Abstract The GC-rich bacterial species, Aeromonas salmonicida,
is parasitised by both GC-rich phages (Aeromonas phages phiAS7 and vB_AsaM-56)
and GC-poor phages (Aeromonas phages 25, 31, 44RR2.8t, 65, Aes508, phiAS4 and
phiAS5). The GC-rich Aeromonas phages phiAS7 and vB_AsaM-56
have nearly the same codon usage bias as their host, whereas the remaining seven
GC-poor Aeromonas phages differ dramatically in codon usage from their GC-rich host.
Here, we investigated whether tRNAs encoded in the genomes of Aeromonas phages facilitate
the translation of phage proteins. We found that tRNAs encoded in the phage genome
correspond to synonymous codons overused in the phage genes but not in the host
genes.
- Chithambaram S, Prabhakaran R, Xia X. 2014.
The effects of mutation and selection on codon adaptation in E. coli bacteriophage.
Genetics 197:301-315
Abstract Studying phage codon adaptation is important not only
for understanding the process of translation elongation, but also for re-engineering
phages for medical and industrial purposes. To evaluate the effect of mutation and
selection on phage codon usage, we developed an index to measure selection imposed
by host translation machinery, based on the difference in codon usage between all
host genes and highly expressed host genes. We developed linear and nonlinear models
to estimate the C→T mutation bias in different phage lineages and to evaluate
the relative effect of mutation and host selection on phage codon usage. C→T
biased mutations occur more frequently in ssDNA phages than in dsDNA phages, and
affect not only synonymous codon usage, but also nonsynonymous substitutions at
second codon positions, especially in ssDNA phages. The host translation machinery
affects codon adaptation in both dsDNA and ssDNA phages, with stronger effect on
dsDNA phages than on ssDNA phages. Strand asymmetry with the associated local variation
in mutation bias can significantly interfere with codon adaptation in both dsDNA
and ssDNA phages.
- Xia, X. 2013. DAMBE5: A comprehensive software package for data analysis in molecular
biology and evolution. Molecular Biology and Evolution 30:1720-1728.
Abstract Since its first release in 2001 as mainly a software package
for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE)
has gained many new functions that may be classified into six categories: 1) sequence
retrieval, editing, manipulation, and conversion among more than 20 standard sequence
formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability,
2) motif characterization and discovery functions such as position weight matrix
and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions
of codon adaptation index, effective number of codons, protein isoelectric point
profiling, RNA and protein secondary structure prediction and calculation of minimum
folding energy, and genomic skew plots with optimized window size, 4) molecular
phylogenetics including sequence alignment, testing substitution saturation, distance-based,
maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing
the molecular clock hypothesis with either a phylogeny or with relative-rate tests,
dating gene duplication and speciation events, choosing the best-fit substitution
models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative
methods for continuous and discrete variables, and 6) graphic functions including
secondary structure display, optimized skew plot, hydrophobicity plot, and many
other plots of amino acid properties along a protein sequence, tree display and
drawing by dragging nodes to each other, and visual searching of the maximum parsimony
tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely
available from http://dambe.bio.uottawa.ca
- Sun, X. Y., Yang, Q., Xia, X. 2013. An Improved Implementation of
Effective Number of Codons (Nc). Molecular Biology and Evolution
30:191-196.
Abstract The effective number of codons (Nc) is a widely used index
for characterizing codon usage bias because it does not require a set of reference
genes as does codon adaptation index (CAI) and because of the freely available computational
tools such as CodonW. However, Nc, as originally formulated, has many problems. For
example, it can have values far greater than the number of sense codons; it treats
a 6-fold compound codon family as a single-codon family although it is made of a
2-fold and a 4-fold codon family that can be under dramatically different selection
for codon usage bias; the existing implementations do not handle all different genetic
codes; it is often biased by codon families with a small number of codons. We developed
a new Nc that has a number of advantages over the original Nc. Its maximum value
equals the number of sense codons when all synonymous codons are used equally, and
its minimum value equals the number of codon families when exactly one codon is
used in each synonymous codon family. It handles all known genetic codes. It breaks
the compound codon families (e.g., those involving amino acids coded by six synonymous
codons) into 2-fold and 4-fold codon families. It reduces the effect of codon families
with few codons by introducing pseudocounts and weighted averages. The new Nc
correlates significantly better with CAI than the original Nc from CodonW based
on protein-coding genes from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila
melanogaster, Escherichia coli, Bacillus subtilis, Micrococcus luteus, and Mycoplasma
genitalium. It also correlates better with protein abundance data from the yeast
than the original Nc.
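Illustration (not the published algorithm): the bounds stated in the abstract above can be checked with a deliberately simplified effective-number calculation, in which each codon family contributes 1 / sum(p_i^2). This sketch omits the pseudocounts and weighted averages of the improved Nc, and the function name `simple_nc` and the two toy families are assumptions made for the example.

```python
def simple_nc(codon_counts, families):
    """Sum over codon families of 1 / sum(p_i^2), with p_i the within-family frequency.

    Simplified on purpose: no pseudocounts, no weighting across families.
    """
    nc = 0.0
    for family in families:
        n = sum(codon_counts.get(c, 0) for c in family)
        if n == 0:
            continue  # unused family; a full implementation must handle this case
        homozygosity = sum((codon_counts.get(c, 0) / n) ** 2 for c in family)
        nc += 1.0 / homozygosity
    return nc

# Two toy families: one 4-fold and one 2-fold (6 sense codons in total).
families = [["GGU", "GGC", "GGA", "GGG"], ["AAA", "AAG"]]

equal_usage = {c: 5 for fam in families for c in fam}
one_codon_each = {"GGU": 10, "AAA": 10}

print(simple_nc(equal_usage, families))     # equals the number of codons (6)
print(simple_nc(one_codon_each, families))  # equals the number of families (2)
```

With equal usage the value reaches the number of codons, and with a single codon per family it falls to the number of families, matching the maximum and minimum described for the new Nc.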
- Xia, X. 2012. Position Weight Matrix, Gibbs Sampler, and the Associated
Significance Tests in Motif Characterization and Prediction. Scientifica,
vol. 2012, Article ID 917540. doi:10.6064/2012/917540.
Abstract Position weight matrix (PWM) is not only one of the most
widely used bioinformatic methods, but also a key component in more advanced computational
algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide
or amino acid sequences. However, few generally applicable statistical tests are
available for evaluating the significance of site patterns, PWM, and PWM scores
(PWMS) of putative motifs. Statistical significance tests of the PWM output, that
is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and
have never been collected in a single paper, with the consequence that many implementations
of PWM do not include any significance test. Here I review PWM-based methods used
in motif characterization and prediction (including a detailed illustration of the
Gibbs sampler for de novo motif discovery), present statistical and probabilistic
rationales behind statistical significance tests relevant to PWM, and illustrate
their application with real data. The multiple comparison problem associated with
the test of site-specific frequencies is best handled by false discovery rate methods.
The test of PWM, due to the use of pseudocounts, is best done by resampling methods.
The test of individual PWMS for each sequence segment should be based on the extreme
value distribution.
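Illustration (not from the paper): a minimal Python sketch of a position weight matrix with pseudocounts and of the PWM score (PWMS) of a sequence segment. The aligned sites, the uniform background frequencies, the pseudocount of 1, and the names `build_pwm` and `pwm_score` are assumptions made for this example; the significance tests reviewed in the paper are not implemented here.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Return a list of {symbol: log2(p_ij / background_i)} dicts, one per site position."""
    background = background or {b: 1.0 / len(alphabet) for b in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in alphabet
        })
    return pwm

def pwm_score(pwm, segment):
    """PWMS: sum of the position-specific log-odds for each symbol in the segment."""
    return sum(column[b] for column, b in zip(pwm, segment))

# Hypothetical aligned motif instances (equal length).
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)

print(round(pwm_score(pwm, "TATAAT"), 3))  # consensus-like segment
print(round(pwm_score(pwm, "GCGCGC"), 3))  # segment unlike the motif
```

The consensus-like segment receives the higher score; deciding whether such a score is significant is the subject of the tests discussed in the abstract.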
- Vos, R. A., Balhoff, J. P., Caravas, J. A., Holder, M. T., Lapp, H., Maddison, W.
P., Midford, P. E., Priyam, A., Sukumaran, S., Xia, X., Stoltzfus, A. 2012. NeXML:
rich, extensible, and verifiable representation of comparative data and metadata.
Systematic Biology 61(4):675–689
Abstract In scientific research, integration and synthesis require
a common understanding of where data come from, how much they can be trusted, and
what they may be used for. To make such an understanding computer-accessible requires
standards for exchanging richly annotated data. The challenges of conveying reusable
data are particularly acute in regard to evolutionary comparative analysis, which
comprises an ever-expanding list of data types, methods, research aims, and subdisciplines.
To facilitate interoperability in evolutionary comparative analysis, we present
NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange
of richly annotated comparative data. NeXML defines syntax for operational taxonomic
units, character-state matrices, and phylogenetic trees and networks. Documents
can be validated unambiguously. Importantly, any data element can be annotated,
to an arbitrary degree of richness, using a system that is both flexible and rigorous.
We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies
user needs that cannot be satisfied with other available file formats. By relying
on XML Schema Definition, the design of NeXML facilitates the development and deployment
of software for processing, transforming, and querying documents. The adoption of
NeXML for practical use is facilitated by the availability of (1) an online manual
with code samples and a reference to all defined elements and attributes, (2) programming
toolkits in most of the languages used commonly in evolutionary informatics, and
(3) input-output support in several widely used software applications. An active,
open, community-based development process enables future revision and expansion
of NeXML.
- Xia, X. 2012. DNA Replication and Strand Asymmetry in Prokaryotic
and Mitochondrial Genomes. Current Genomics 13, 16-27
Abstract Different patterns of strand asymmetry have been documented
in a variety of prokaryotic genomes as well as mitochondrial genomes. Because different
replication mechanisms often lead to different patterns of strand asymmetry, much
can be learned of replication mechanisms by examining strand asymmetry. Here I summarize
the diverse patterns of strand asymmetry among different taxonomic groups to suggest
that (1) the single-origin replication may not be universal among bacterial species
as the endosymbionts Wigglesworthia glossinidia, Wolbachia species, cyanobacterium
Synechocystis 6803 and Mycoplasma pulmonis genomes all exhibit strand asymmetry
patterns consistent with the multiple origins of replication, (2) different replication
origins in some archaeal genomes leave quite different patterns of strand asymmetry,
suggesting that different replication origins in the same genome may be differentially
used, (3) mitochondrial genomes from representative vertebrate species share one
strand asymmetry pattern consistent with the strand-displacement replication documented
in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may
be shared among all vertebrate species, and (4) mitochondrial genomes from primitive
forms of metazoans such as the sponge and hydra (representing Porifera and Cnidaria,
respectively), as well as those from plants, have strand asymmetry patterns similar
to single-origin or multi-origin replications observed in prokaryotes and are drastically
different from mitochondrial genomes from other metazoans. This may explain why
sponge and hydra mitochondrial genomes, as well as plant mitochondrial genomes,
evolve much more slowly than those from other metazoans.
- Xia, X., MacKay, V., Yao, X., Wu, J., Miura, F., Ito, T., Morris,
D. R. 2011. Translation initiation: a regulatory role for poly(A) tracts in front
of the AUG codon in Saccharomyces cerevisiae. Genetics 189:469-478
Abstract The 5'-UTR serves as the loading dock for ribosomes during
translation initiation and is the key site for translation regulation. Many genes
in the yeast Saccharomyces cerevisiae contain poly(A) tracts in their 5'-UTRs. We
studied these pre-AUG poly(A) tracts in a set of 3274 recently identified 5'-UTRs
in the yeast to characterize their effect on in vivo protein abundance, ribosomal
density, and protein synthesis rate in the yeast. The protein abundance and the
protein synthesis rate increase with the length of the poly(A), but exhibit a dramatic
decrease when the poly(A) length is ≥12. The ribosomal density also reaches
the lowest level when the poly(A) length is ≥12. This supports the hypothesis
that a pre-AUG poly(A) tract can bind to translation initiation factors to enhance
translation initiation, but a long (≥12) pre-AUG poly(A) tract will bind to
Pab1p, whose binding size is 12 consecutive A residues in yeast, resulting in repression
of translation. The hypothesis explains why a long pre-AUG poly(A) leads to more
efficient translation initiation than a short one when PABP is absent, and why pre-AUG
poly(A) is short in the early genes but long in the late genes of vaccinia virus.
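Illustration (not from the paper): a minimal Python sketch of how pre-AUG poly(A) tracts could be measured in 5'-UTR sequences and split at the length of 12 mentioned in the abstract above. The function names `longest_poly_a` and `is_long_tract` and the example sequences are assumptions made for this example.

```python
import re

def longest_poly_a(utr):
    """Length of the longest run of consecutive A residues in a 5'-UTR sequence."""
    runs = re.findall(r"A+", utr.upper())
    return max((len(run) for run in runs), default=0)

def is_long_tract(utr, threshold=12):
    """True when the longest poly(A) run reaches the threshold (12 in the abstract)."""
    return longest_poly_a(utr) >= threshold

# Hypothetical 5'-UTR sequences (the start codon itself is not included).
short_utr = "GCUAAAAACGUAAC"
long_utr = "GC" + "A" * 13 + "CU"

print(longest_poly_a(short_utr), is_long_tract(short_utr))
print(longest_poly_a(long_utr), is_long_tract(long_utr))
```

Grouping genes by this length and comparing protein abundance or ribosomal density across groups is the kind of analysis the abstract describes; the grouping step only is sketched here.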
- Ma, P., Xia, X. 2011. Factors
affecting splicing strength of yeast genes. Comparative and Functional Genomics.
Article ID 212146, 13 pages
Abstract Accurate and efficient splicing is of crucial importance
for highly-transcribed intron-containing genes (ICGs) in rapidly replicating unicellular
eukaryotes such as the budding yeast Saccharomyces cerevisiae. We characterize the
5' and 3' splice sites (ss) by position weight matrix scores (PWMSs), which are
highest for the consensus sequence and lowest for splice sites differing most
from the consensus sequence, and use PWMS as a proxy for splicing strength. HAC1,
which is known to be spliced by a nonspliceosomal mechanism, has the most negative
PWMS for both its 5' ss and 3' ss. Several genes under strong splicing regulation
and requiring additional splicing factors for their splicing also have small or
negative PWMS values. Splicing strength is higher for highly transcribed ICGs than
for lowly transcribed ICGs and higher for transcripts that bind strongly to spliceosomes
than those that bind weakly. The 3' splice site features a prominent poly-U tract
before the 3'AG. Our results suggest the potential of using PWMS as a screening
tool for ICGs that are either spliced by a nonspliceosome mechanism or under strong
splicing regulation in yeast and other fungal species.
- Xia, X., Yang, Q. 2011. A Distance-based Least-square Method for
Dating Speciation Events. Molecular Phylogenetics and Evolution
59:342-353.
Abstract Distance-based phylogenetic methods are widely used in
biomedical research. However, there has been little development of rigorous statistical
methods and software for dating speciation and gene duplication events by using
evolutionary distances. Here we present a simple, fast and accurate dating method
based on the least-squares (LS) method that has already been widely used in molecular
phylogenetic reconstruction. Dating methods with a global clock or two different
local clocks are presented. Single or multiple fossil calibration points can be
used, and multiple data sets can be integrated in a combined analysis. Variation
of the estimated divergence time is estimated by resampling methods such as bootstrapping
or jackknifing. Application of the method to dating the divergence time among seven
ape species or among 35 mammalian species including major mammalian orders shows
that the estimated divergence times with the LS criterion are nearly identical to
those obtained by the likelihood method or Bayesian inference.
- van Weringh, A, M. Ragonnet-Cronin, E. Pranckeviciene,
M. Pavon-Eternod, L. Kleiman, X. Xia. 2011. HIV-1 modulates the
tRNA pool to improve translation efficiency. Molecular Biology and Evolution
28:1827-1834
Abstract Despite its poorly adapted codon usage, HIV-1 replicates
and is expressed extremely well in human host cells. HIV-1 has recently been shown
to package non-lysyl transfer RNAs (tRNAs) in addition to the tRNA(Lys) needed for
priming reverse transcription and integration of the HIV-1 genome. By comparing
the codon usage of HIV-1 genes with that of its human host, we found that tRNAs
decoding codons that are highly used by HIV-1 but avoided by its host are overrepresented
in HIV-1 virions. In particular, tRNAs decoding A-ending codons, required for the
expression of HIV's A-rich genome, are highly enriched. Because the affinity of
Gag-Pol for all tRNAs is nonspecific, HIV packaging is most likely passive and reflects
the tRNA pool at the time of viral particle formation. Codon usage of HIV-1 early
genes is similar to that of highly expressed host genes, but codon usage of HIV-1
late genes was better adapted to the selectively enriched tRNA pool, suggesting
that alterations in the tRNA pool are induced late in viral infection. If HIV-1
genes are adapting to an altered tRNA pool, codon adaptation of HIV-1 may be better
than previously thought.
- Palidwor GA, Perkins TJ, Xia X.
2010. A General Model of Codon Bias Due to GC Mutational Bias. PLoS ONE
5(10): e13431.
BACKGROUND: In spite of extensive research on the effect of mutation and
selection on codon usage, a general model of codon usage bias due to mutational
bias has been lacking. Because most amino acids allow synonymous GC content changing
substitutions in the third codon position, the overall GC bias of a genome or genomic
region is highly correlated with GC3, a measure of third position GC content. For
individual amino acids as well, G/C ending codons usage generally increases with
increasing GC bias and decreases with increasing AT bias. Arginine and leucine,
amino acids that allow GC-changing synonymous substitutions in the first and third
codon positions, have codons which may be expected to show different usage patterns. PRINCIPAL
FINDINGS: In analyzing codon usage bias in hundreds of prokaryotic and plant
genomes and in human genes, we find that two G-ending codons, AGG (arginine) and
TTG (leucine), unlike all other G/C-ending codons, show overall usage that decreases
with increasing GC bias, contrary to the usual expectation that G/C-ending codon
usage should increase with increasing genomic GC bias. Moreover, the usage of some
codons appears nonlinear, even nonmonotone, as a function of GC bias. To explain
these observations, we propose a continuous-time Markov chain model of GC-biased
synonymous substitution. This model correctly predicts the qualitative usage patterns
of all codons, including nonlinear codon usage in isoleucine, arginine and leucine.
The model accounts for 72%, 64% and 52% of the observed variability of codon usage
in prokaryotes, plants and human respectively. When codons are grouped based on
common GC content, 87%, 80% and 68% of the variation in usage is explained for prokaryotes,
plants and human respectively. CONCLUSIONS: The model clarifies the sometimes-counterintuitive
effects that GC mutational bias can have on codon usage, quantifies the influence
of GC mutational bias and provides a natural null model relative to which other
influences on codon bias may be measured.
- Jiang, J.-Y., H. Xiong, M. Cao, X. Xia,
M.-A. Sirard, B Tsang. 2010. Mural granulosa cell gene expression associated with
oocyte developmental competence. Journal of Ovarian Research 2010,
3:6.
BACKGROUND: Ovarian follicle development is a complex process. Paracrine
interactions between somatic and germ cells are critical for normal follicular development
and oocyte maturation. Studies have suggested that the health and function of the
granulosa and cumulus cells may be reflective of the health status of the enclosed
oocyte. The objective of the present study is to assess, using an in vivo immature
rat model, gene expression profile in granulosa cells, which may be linked to the
developmental competence of the oocyte. We hypothesized that expression of specific
genes in granulosa cells may be correlated with the developmental competence of
the oocyte. METHODS: Immature rats were injected with eCG and 24 h thereafter
with anti-eCG antibody to induce follicular atresia or with pre-immune serum to
stimulate follicle development. A high percentage (30-50%, normal developmental
competence, NDC) of oocytes from eCG/pre-immune serum group developed to term after
embryo transfer compared to those from eCG/anti-eCG (0%, poor developmental competence,
PDC). Gene expression profiles of mural granulosa cells from the above oocyte-collected
follicles were assessed by Affymetrix rat whole genome array. RESULTS: The
results showed that twelve genes were up-regulated, while one gene was down-regulated
more than 1.5-fold in the NDC group compared with those in the PDC group. Gene
ontology classification showed that the up-regulated genes included lysyl oxidase
(Lox) and nerve growth factor receptor associated protein 1 (Ngfrap1), which are
important in the regulation of protein-lysine 6-oxidase activity, and in apoptosis
induction, respectively. The down-regulated genes included glycoprotein-4-beta galactosyltransferase
2 (Ggbt2), which is involved in the regulation of extracellular matrix organization
and biogenesis. CONCLUSIONS: The data in the present study demonstrate a close
association between specific gene expression in mural granulosa cells and the developmental
competence of oocytes. This finding suggests that the most differentially expressed
gene, lysyl oxidase, may be a candidate biomarker of oocyte health and useful for
the selection of good quality oocytes for assisted reproduction.
- Zhang, D., J. T. Popesku, C. J. Martyniuk, H. Xiong,
P. Duarte-Guterman, L. Yao, Xia, X., and V. L. Trudeau. 2009. Profiling
neuroendocrine gene expression changes following fadrozole-induced estrogen decline
in the female goldfish. Physiol. Genomics 38:351-361.
Abstract Teleost fish represent unique models to study the role
of neuroestrogens because of the extremely high activity of brain aromatase (AroB;
the product of cyp19a1b). Aromatase respectively converts androstenedione and testosterone
to estrone and 17beta-estradiol (E2). Specific inhibition of aromatase activity
by fadrozole has been shown to impair estrogen production and influence neuroendocrine
and reproductive functions in fish, amphibians, and rodents. However, very few studies
have identified the global transcriptomic response to fadrozole-induced decline
of estrogens in a physiological context. In our study, sexually mature prespawning
female goldfish were exposed to fadrozole (50 microg/l) in March and April when
goldfish have the highest AroB activity and maximal gonadal size. Fadrozole treatment
significantly decreased serum E2 levels (4.7 times lower; P = 0.027) and depressed
AroB mRNA expression threefold in both the telencephalon (P = 0.021) and the hypothalamus
(P = 0.006). Microarray expression profiling of the telencephalon identified 98
differentially expressed genes after fadrozole treatment (q value <0.05). Some of
these genes have been shown previously to be estrogen responsive in either fish or other
species, including rat, mouse, and human. Gene ontology analysis together with functional
annotations revealed several regulatory themes for physiological estrogen action
in fish brain that include the regulation of calcium signaling pathway and autoregulation
of estrogen receptor action. Real-time PCR verified microarray data for decreased
(activin-betaA) or increased (calmodulin, ornithine decarboxylase 1) mRNA expression.
These data have implications for our understanding of estrogen actions in the adult
vertebrate brain.
- Li, H., G. Liu, and X. Xia. 2009. Correlations between recombination
rate and intron distributions along chromosomes of C. elegans. Progress in Natural
Science 19:517.
- Xia, X. 2009. Information-theoretic indices and an approximate
significance test for testing the molecular clock hypothesis with genetic distances.
Molecular Phylogenetics and Evolution 52:665-676.
Abstract Distance-based phylogenetic methods are widely used in
biomedical research. However, distance-based dating of speciation events and the
test of the molecular clock hypothesis are relatively underdeveloped. Here I develop
an approximate test of the molecular clock hypothesis for distance-based trees,
as well as information-theoretic indices that have been used frequently in model
selection, for use with distance matrices. The results are in good agreement with
the conventional sequence-based likelihood ratio test. Among the information-theoretic
indices, AICu is the most consistent with the sequence-based likelihood ratio test.
The confidence in model selection by the indices can be evaluated by bootstrapping.
I illustrate the usage of the indices and the approximate significance test with
both empirical and simulated sequences. The tests show that distance matrices from
protein gel electrophoresis and from genome rearrangement events do not violate
the molecular clock hypothesis, and that the evolution of the third codon position
conforms to the molecular clock hypothesis better than the second codon position
in vertebrate mitochondrial genes. I outline evolutionary distances that are appropriate
for phylogenetic reconstruction and dating.
- Xia, X., Holcik, M., 2009. Strong Eukaryotic IRESs Have Weak Secondary
Structure. PLoS ONE 4, e4136.
BACKGROUND: The objective of this work was to investigate the hypothesis
that eukaryotic Internal Ribosome Entry Sites (IRES) lack secondary structure and
to examine the generality of the hypothesis. METHODOLOGY/PRINCIPAL FINDINGS:
IRESs of the yeast and the fruit fly are located in the 5'UTR immediately upstream
of the initiation codon. The minimum folding energy (MFE) of 60 nt RNA segments
immediately upstream of the initiation codons was calculated as a proxy of secondary
structure stability. MFE of the reverse complements of these 60 nt segments was
also calculated. The relationship between MFE and empirically determined IRES activity
was investigated to test the hypothesis that strong IRES activity is associated
with weak secondary structure. We show that IRES activity in the yeast and the fruit
fly correlates strongly with the structural stability, with highest IRES activity
found in RNA segments that exhibit the weakest secondary structure. CONCLUSIONS:
We found that a subset of eukaryotic IRESs exhibits very low secondary structure
in the 5'-UTR sequences immediately upstream of the initiation codon. The consistency
in results between the yeast and the fruit fly suggests a possible shared mechanism
of cap-independent translation initiation that relies on an unstructured RNA segment.
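The sequence-handling step described in this abstract (taking the 60 nt immediately upstream of the initiation codon, plus its reverse complement as a control) can be sketched as below. This is an illustrative sketch only, not the authors' code: the function names and the 0-based `start_codon_pos` argument are assumptions, and the MFE calculation itself is left to an RNA folding program that is not shown here.

```python
def upstream_segment(rna: str, start_codon_pos: int, length: int = 60) -> str:
    """Return the `length` nt immediately upstream of the initiation codon.

    `start_codon_pos` is the 0-based index of the A in AUG (an assumed
    convention for this sketch).
    """
    if start_codon_pos < length:
        raise ValueError("5'UTR is shorter than the requested segment")
    return rna[start_codon_pos - length:start_codon_pos]


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string written with A, C, G, U."""
    table = str.maketrans("ACGU", "UGCA")
    return rna.upper().translate(table)[::-1]


# Toy transcript: 10 nt of leader, 60 nt of upstream segment, then AUG...
toy = "A" * 10 + "C" * 60 + "AUGGCC"
segment = upstream_segment(toy, start_codon_pos=70)
control = reverse_complement(segment)
```

Each of `segment` and `control` would then be passed to a folding program to obtain its MFE.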
- Cong, P., X. Xia, and Q. Yang. 2009. Monophyly of the ring-forming
group in Diplopoda (Myriapoda, Arthropoda) based on SSU and LSU ribosomal RNA sequences.
Progress in Natural Science 19:1297-1303.
- Zhang, D., H. Xiong, J. A. Mennigen, J. T. Popesku,
V. L. Marlatt, C. J. Martyniuk, K. Crump, A. R. Cossins, X. Xia,
and V. L. Trudeau. 2009. Defining Global Neuroendocrine Gene Expression Patterns
Associated with Reproductive Seasonality in Fish. PLoS ONE 4:e5816.
BACKGROUND: Many vertebrates, including the goldfish, exhibit seasonal reproductive
rhythms, which are a result of interactions between external environmental stimuli
and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While
it has long been believed that differential expression of neuroendocrine genes contributes
to establishing seasonal reproductive rhythms, no systems-level investigation has
yet been conducted. METHODOLOGY/PRINCIPAL FINDINGS: In the present study,
by analyzing multiple female goldfish brain microarray datasets, we have characterized
global gene expression patterns for a seasonal cycle. A core set of genes (873 genes)
in the hypothalamus were identified to be differentially expressed between May,
August and December, which correspond to physiologically distinct stages that are
sexually mature (prespawning), sexual regression, and early gonadal redevelopment,
respectively. Expression changes of these genes are also shared by another brain
region, the telencephalon, as revealed by multivariate analysis. More importantly,
by examining one dataset obtained from fish in October that were kept under long-daylength
photoperiod (16 h) typical of the springtime breeding season (May), we observed
that the expression of identified genes appears regulated by photoperiod, a major
factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed
that hormone genes and genes functionally involved in G-protein coupled receptor
signaling pathway and transmission of nerve impulses are significantly enriched
in an expression pattern, whose transition is located between prespawning and sexually
regressed stages. The existence of seasonal expression patterns was verified for
several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin,
and aromatase b by independent samplings of goldfish brains from six seasonal time
points and real-time PCR assays. CONCLUSIONS/SIGNIFICANCE: Using both theoretical
and experimental strategies, we report for the first time global gene expression
patterns throughout a breeding season which may account for dynamic neuroendocrine
regulation of seasonal reproductive development.
- Zhang, D., H. Xiong, J. Shan, X. Xia,
and V. Trudeau. 2008. Functional insight into Maelstrom in the germline piRNA pathway:
a unique domain homologous to the DnaQ-H 3'-5' exonuclease, its lineage-specific
expansion/loss and evolutionarily active site switch. Biology Direct
3:48.
Abstract Maelstrom (MAEL) plays a crucial role in a recently-discovered
piRNA pathway; however its specific function remains unknown. Here a novel MAEL-specific
domain characterized by a set of conserved residues (Glu-His-His-Cys-His-Cys, EHHCHC)
was identified in a broad range of species including vertebrates, sea squirts, insects,
nematodes, and protists. It exhibits ancient lineage-specific expansions in several
species; however, it appears to be lost in all examined teleost fish species. Functional
involvement of MAEL domains in DNA- and RNA-related processes was further revealed
by its association with HMG, SR-25-like and HDAC_interact domains. A distant similarity
to the DnaQ-H 3'-5' exonuclease family with the RNase H fold was discovered based
on the evidence that all MAEL domains adopt the canonical RNase H fold; and several
protist MAEL domains contain the conserved 3'-5' exonuclease active site residues
(Asp-Glu-Asp-His-Asp, DEDHD). This evolutionary link together with structural examinations
leads to a hypothesis that MAEL domains may have a potential nuclease activity or
RNA-binding ability that may be implicated in piRNA biogenesis. The observed transition
of two sets of characteristic residues between the ancestral DnaQ-H and the descendent
MAEL domains may suggest a new mode for protein function evolution called "active
site switch", in which the protist MAEL homologues are the likely evolutionary intermediates
due to harboring the specific characteristics of both 3'-5' exonuclease and MAEL
domains.
- Popesku, J. T., C. J. Martyniuk, J. Mennigen, H. Xiong,
D. Zhang, X. Xia, A. R. Cossins, and V. L. Trudeau. 2008. The goldfish
(Carassius auratus) as a model for neuroendocrine signaling. Mol Cell Endocrinol.
293(1-2):43-56.
Abstract Goldfish (Carassius auratus) are excellent model organisms
for the neuroendocrine signaling and the regulation of reproduction in vertebrates.
Goldfish also serve as useful model organisms in numerous other fields. In contrast
to mammals, teleost fish do not have a median eminence; the anterior pituitary is
innervated by numerous neuronal cell types and thus, pituitary hormone release is
directly regulated. Here we briefly describe the neuroendocrine control of luteinizing
hormone. Stimulation by gonadotropin-releasing hormone and a multitude of classical
neurotransmitters and neuropeptides is opposed by the potent inhibitory actions
of dopamine. The stimulatory actions of gamma-aminobutyric acid and serotonin are
also discussed. We will focus on the development of a cDNA microarray composed of
carp and goldfish sequences which has allowed us to examine neurotransmitter-regulated
gene expression in the neuroendocrine brain and to investigate potential genomic
interactions between these key neurotransmitter systems. We observed that isotocin
(fish homologue of oxytocin) and activins are regulated by multiple neurotransmitters,
which is discussed in light of their roles in reproduction in other species. We
have also found that many novel and uncharacterized goldfish expressed sequence
tags in the brain are also regulated by neurotransmitters. Their sites of production
and whether they play a role in neuroendocrine signaling and control of reproduction
remain to be determined. The transcriptomic tools developed to study reproduction
could also be used to advance our understanding of neuroendocrine-immune interactions
and the relationship between growth and food intake in fish.
- Vinci, G., X. Xia, and R. A. Veitia. 2008. Preservation of Genes
Involved in Sterol Metabolism in Cholesterol Auxotrophs: Facts and Hypotheses.
PLoS ONE 3:e2883.
BACKGROUND: It is known that primary sequences of enzymes involved in sterol
biosynthesis are well conserved in organisms that produce sterols de novo. However,
we provide evidence for a preservation of the corresponding genes in two animals
unable to synthesize cholesterol (auxotrophs): Drosophila melanogaster and Caenorhabditis
elegans. Principal Findings: We have been able to detect bona fide orthologs
of several ERG genes in both organisms using a series of complementary approaches.
We have detected strong sequence divergence between the orthologs of the nematode
and of the fruitfly; they are also very divergent with respect to the orthologs
in organisms able to synthesize sterols de novo (prototrophs). Interestingly, the
orthologs in both the nematode and the fruitfly are still under selective pressure.
It is possible that these genes, which are not involved in cholesterol synthesis
anymore, have been recruited to perform different new functions. We propose a more
parsimonious way to explain their accelerated evolution and subsequent stabilization.
The products of ERG genes in prototrophs might be involved in several biological
roles, in addition to sterol synthesis. In the case of the nematode and the fruitfly,
the relevant genes would have lost their ancestral function in cholesterogenesis
but would have retained the other function(s), which keep them under pressure. Conclusions:
By exploiting microarray data we have noticed a strong expressional correlation
between the orthologs of ERG24 and ERG25 in D. melanogaster and genes encoding factors
involved in intracellular protein trafficking and folding and with Start1 involved
in ecdysteroid synthesis. These potential functional connections are worth being
explored not only in Drosophila, but also in Caenorhabditis as well as in sterol
prototrophs.
- Xia, X. 2008. The cost of wobble translation in fungal mitochondrial
genomes: integration of two traditional hypotheses. BMC Evolutionary Biology
8:211.
BACKGROUND: Fungal and animal mitochondrial genomes typically have one tRNA
for each synonymous codon family. The codon-anticodon adaptation hypothesis predicts
that the wobble nucleotide of a tRNA anticodon should evolve towards maximizing
Watson-Crick base pairing with the most frequently used codon within each synonymous
codon family, whereas the wobble versatility hypothesis argues that the nucleotide
at the wobble site should be occupied by a nucleotide most versatile in wobble pairing,
i.e., the tRNA wobble nucleotide should be G for NNY codon families, and U for NNR
and NNN codon families (where Y stands for C or U, R for A or G and N for any nucleotide).
RESULTS: We here integrate these two traditional hypotheses on tRNA anticodons
into a unified model based on an analysis of the wobble costs associated with different
wobble base pairs. This novel approach allows the relative cost of wobble pairing
to be qualitatively evaluated. A comprehensive study of 36 fungal genomes suggests
very different costs between two kinds of U:G wobble pairs, i.e., (1) between a
G at the wobble site of a tRNA anticodon and a U at the third codon position (designated
MU3:G) and (2) between a U at the wobble site of a tRNA anticodon and a G at the
third codon position (designated MG3:U). CONCLUSION: In general, MU3:G is
much smaller than MG3:U, suggesting no selection against U-ending codons in NNY
codon families with a wobble G in the tRNA anticodon but strong selection against
G-ending codons in NNR codon families with a wobble U at the tRNA anticodon. This
finding resolves several puzzling observations in fungal genomics and corroborates
previous studies showing that U3:G wobble is energetically more favorable than G3:U
wobble.
- Mennigen, J. A., C. J. Martyniuk, K. Crump, H. Xiong,
E. Zhao, J. Popesku, H. Anisman, A. R. Cossins, X. Xia, and V.
L. Trudeau. 2008. Effects of fluoxetine on the reproductive axis of female goldfish
(Carassius auratus). Physiol. Genomics 35:273-282.
Abstract We investigated the effects of fluoxetine, a selective
serotonin reuptake inhibitor, on neuroendocrine function and the reproductive axis
in female goldfish. Fish were given intraperitoneal injections of fluoxetine twice
a week for 14 days, resulting in five injections of 5 microg fluoxetine/g body wt.
We measured the monoamine neurotransmitters serotonin, dopamine, and norepinephrine
in addition to their metabolites with HPLC. Homovanillic acid, a metabolite in the
dopaminergic pathway, increased significantly in the hypothalamus. Plasma estradiol
levels were measured by radioimmunoassay and were significantly reduced approximately
threefold after fluoxetine treatment. We found that fluoxetine also significantly
reduced the expression of estrogen receptor (ER)beta1 mRNA by 4-fold in both the
hypothalamus and the telencephalon and ERalpha mRNA by 1.7-fold in the telencephalon.
Fluoxetine had no effect on the expression of ERbeta2 mRNA in the hypothalamus or
telencephalon. Microarray analysis identified isotocin, a neuropeptide that stimulates
reproductive behavior in fish, as a candidate gene affected by fluoxetine treatment.
Real-time RT-PCR verified that isotocin mRNA was downregulated approximately sixfold
in the hypothalamus and fivefold in the telencephalon. Intraperitoneal injection
of isotocin (1 microg/g) increased plasma estradiol, providing a potential link
between changes in isotocin gene expression and decreased circulating estrogen in
fluoxetine-injected fish. Our results reveal targets of serotonergic modulation
in the neuroendocrine brain and indicate that fluoxetine has the potential to affect
sex hormones and modulate genes involved in reproductive function and behavior in
the brain of female goldfish. We discuss these findings in the context of endocrine
disruption because fluoxetine has been detected in the environment.
- Marin, A. and Xia, X. 2008. GC skew in protein-coding genes between
the leading and lagging strands in bacterial genomes: new substitution models incorporating
strand-bias. Journal of Theoretical Biology 253(3):508-513.
Abstract The DNA strands in most prokaryotic genomes experience
strand-biased spontaneous mutation, especially C→T mutations produced by deamination
that occur preferentially in the leading strand. This has often been invoked to
account for the asymmetry in nucleotide composition, typically measured by GC skew,
between the leading and the lagging strand. Casting such strand asymmetry in the
framework of a nucleotide substitution model is important for understanding genomic
evolution and phylogenetic reconstruction. We present a substitution model showing
that the increased C→T mutation will lead to positive GC skew in one strand
but negative GC skew in the other, with greater C→T mutation pressure associated
with greater differences in GC skew between the leading and the lagging strand.
However, the model based on mutation bias alone does not predict any positive correlation
in GC skew between the leading and lagging strands. We computed GC skew for coding
sequences collinear with the leading and lagging strands across 339 prokaryotic
genomes and found a strong and positive correlation in GC skew between the two strands.
We show that the observed positive correlation can be satisfactorily explained by
an improved substitution model with one additional parameter incorporating a general
trend of C avoidance.
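For readers unfamiliar with the measure, GC skew can be computed as in the sketch below. The abstract does not state its formula, so the conventional definition (G - C) / (G + C) is assumed here, and the function names are hypothetical. The example shows why a sequence and its reverse complement have skews of opposite sign, which is the asymmetry that a mutation-bias-only model predicts between the two strands.

```python
def gc_skew(seq: str) -> float:
    """Conventional GC skew, (G - C) / (G + C); 0.0 if the sequence has no G or C."""
    s = seq.upper()
    g = s.count("G")
    c = s.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string written with A, C, G, T."""
    table = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(table)[::-1]


# Toy sequence with an excess of G over C on one strand.
toy = "GGGC"
skew_one_strand = gc_skew(toy)                        # (3 - 1) / 4
skew_other_strand = gc_skew(reverse_complement(toy))  # (1 - 3) / 4
```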
- Aris-Brosou, S. and Xia, X. 2008. Phylogenetic analyses: a toolbox
expanding towards Bayesian methods. International Journal of Plant Genomics Article
ID 683509.
Abstract The reconstruction of phylogenies is becoming an increasingly
simple activity. This is mainly due to two reasons: the democratization of computing
power and the increased availability of sophisticated yet user-friendly software.
This review describes some of the latest additions to the phylogenetic toolbox,
along with some of their theoretical and practical limitations. It is shown that
Bayesian methods are under heavy development, as they offer the possibility to solve
a number of long-standing issues and to integrate several steps of the phylogenetic
analyses into a single framework. Specific topics include not only phylogenetic
reconstruction, but also the comparison of phylogenies, the detection of adaptive
evolution, and the estimation of divergence times between species.
- Carullo, M. and Xia, X. 2008. An
extensive study of mutation and selection on the wobble nucleotide in tRNA anticodons
in fungal mitochondrial genomes. Journal of Molecular Evolution 66:484-493.
Abstract Two alternative hypotheses aim to predict the wobble nucleotide
of tRNA anticodons in mitochondria. The codon-anticodon adaptation hypothesis predicts
that the wobble nucleotide of tRNA anticodon should evolve toward maximizing the
Watson-Crick base pairing with the most frequently used codon within each synonymous
codon family. In contrast, the wobble versatility hypothesis argues that the nucleotide
at the wobble site should be occupied by a nucleotide most versatile in wobble pairing,
i.e., the wobble site of the tRNA anticodon should be G for NNY codon families and
U for NNR and NNN codon families (where Y stands for C or U, R for A or G, and N
for any nucleotide). We examined codon usage and anticodon wobble sites in 36 fungal
genomes to evaluate these two alternative hypotheses and identify exceptional cases
that deserve new explanations. While the wobble versatility hypothesis is generally
supported, there are interesting exceptions involving tRNA(Arg) translating the
CGN codon family, tRNA(Trp) translating the UGR codon family, and tRNA(Met) translating
the AUR codon family. Our results suggest that the potential to suppress stop codons,
the historical inertia, and the conflict between translation initiation and elongation
can all contribute to determining the wobble nucleotide of tRNA anticodons.
- Marlatt, V.L., Martyniuk, C. J., Zhang, D., Xiong, H.,
Watt, J., Xia, X., Moon, T., Trudeau, V.L. 2008. Auto-regulation
of estrogen receptor subtypes and gene expression profiling of 17beta-estradiol
action in the neuroendocrine axis of male goldfish. Molecular and Cellular Endocrinology
283:38-48.
Abstract Auto-regulation of the three goldfish estrogen receptor
(ER) subtypes was examined simultaneously in multiple tissues, in relation to mRNA
levels of liver vitellogenin (VTG) and brain transcripts. Male goldfish were implanted
with a silastic implant containing either no steroid or 17beta-estradiol (E2) (100
microg/g body mass) for one and seven days. Liver transcript levels of ERalpha were
the most highly up-regulated of the ERs, and a parallel induction of liver VTG was
observed. In the testes (7d) and telencephalon (7d), E2 induced ERalpha. In the
liver (1d) and hypothalamus (7d) ERbeta1 was down-regulated, while ERbeta2 remained
unchanged under all conditions. Although aromatase B levels increased in the brain,
the majority of candidate genes identified by microarray in the hypothalamus (1d)
decreased. These results demonstrate that ER subtypes are differentially regulated
by E2, and several brain transcripts decrease upon short-term elevation of circulating
E2 levels.
- Xiong, H., Zhang D., Martyniuk, C.J., Trudeau, V.L.,
Xia, X. 2008. Using Generalized Procrustes Analysis (GPA) for
normalization of cDNA microarray data. BMC Bioinformatics 9:25.
BACKGROUND: Normalization is essential in dual-labelled microarray data analysis
to remove non-biological variations and systematic biases. Many normalization methods
have been used to remove such biases within slides (Global, Lowess) and across slides
(Scale, Quantile and VSN). However, all these popular approaches have critical assumptions
about data distribution, which is often not valid in practice. RESULTS: In
this study, we propose a novel assumption-free normalization method based on the
Generalized Procrustes Analysis (GPA) algorithm. Using experimental and simulated
normal microarray data and boutique array data, we systemically evaluate the ability
of the GPA method in normalization compared with six other popular normalization
methods including Global, Lowess, Scale, Quantile, VSN, and one boutique array-specific
housekeeping gene method. The assessment of these methods is based on three different
empirical criteria: across-slide variability, the Kolmogorov-Smirnov (K-S) statistic
and the mean square error (MSE). Compared with other methods, the GPA method performs
effectively and consistently better in reducing across-slide variability and removing
systematic bias. CONCLUSION: The GPA method is an effective normalization
approach for microarray data analysis. In particular, it is free from the statistical
and biological assumptions inherent in other normalization methods that are often
difficult to validate. Therefore, the GPA method has a major advantage in that it
can be applied to diverse types of array sets, especially to the boutique array
where the majority of genes may be differentially expressed.
- Khalouei, S., and Xia, X. 2008. Selective
pressure against AUG triplets in the 5' untranslated region of human immunodeficiency
virus type 1 supports cap-dependent translation initiation mechanism. Retrovirology:
Research and Treatment 2:1-8.
- Xia, X. 2007. An Improved Implementation of Codon Adaptation Index.
Evolutionary Bioinformatics 3:53–58.
Abstract Codon adaptation index is a widely used index for characterizing
gene expression in general and translation efficiency in particular. Current computational
implementations have a number of problems leading to various systematic biases.
I illustrate these problems and provide a better computer implementation to solve
these problems. The improved CAI can predict protein production better than CAI
from other commonly used implementations.
Correction: In discussing the problem arising when a codon
is not used in the reference set of highly expressed genes, which would yield w=0,
I stated that Sharp & Li (1987) suggested using w=0.5 in that situation. Sharp &
Li (1987) actually suggested using Xij=0.5. Michael Bulmer (1988, J.Evol.Biol.)
suggested an alternative modification, which is to set the minimum value of w to
be 0.01.
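The quantities named in the correction (the codon count Xij and the relative adaptiveness w) can be illustrated with a minimal sketch of the classic Sharp and Li (1987) formulation: w is a codon's count divided by the largest count in its synonymous family, and CAI is the geometric mean of w over a gene's codons. This is not the improved implementation described in the paper. The function names, the toy two-family codon table, and the choice to skip codons outside the table are all assumptions made for illustration.

```python
import math
from collections import Counter


def codons(seq: str) -> list:
    """Split a coding sequence into complete codons, dropping any trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, families, zero_count=0.5):
    """Compute w = Xij / max(Xij) within each synonymous codon family.

    A codon absent from the reference set gets Xij = zero_count (0.5,
    the Sharp & Li suggestion quoted in the correction).
    """
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    w = {}
    for family in families:
        x = {c: (counts[c] if counts[c] > 0 else zero_count) for c in family}
        x_max = max(x.values())
        for c in family:
            w[c] = x[c] / x_max
    return w


def cai(gene: str, w: dict) -> float:
    """Geometric mean of w over the gene's codons that appear in the w table."""
    logs = [math.log(w[c]) for c in codons(gene) if c in w]
    if not logs:
        raise ValueError("gene has no codons covered by the w table")
    return math.exp(sum(logs) / len(logs))


# Toy data: two synonymous families (Ala: GCT/GCC, Lys: AAA/AAG), one reference gene.
toy_families = [("GCT", "GCC"), ("AAA", "AAG")]
toy_w = relative_adaptiveness(["GCTGCTGCCAAA"], toy_families)
```

In this toy reference set AAG is never used, so it receives Xij = 0.5 and hence w = 0.5 relative to AAA.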
- Xia, X., Huang, H., Carullo,
M., Betran, E., Moriyama, E. 2007. Conflict between translation initiation
and elongation in vertebrate mitochondrial genomes. PLoS ONE 2(2): e227.
Abstract The strand-biased mutation spectrum in vertebrate mitochondrial
genomes results in an AC-rich L-strand and a GT-rich H-strand. Because the L-strand
is the sense strand of 12 protein-coding genes out of the 13, the third codon position
is overall strongly AC-biased. The wobble site of the anticodon of the 22 mitochondrial
tRNAs is either U or G to pair with the most abundant synonymous codon, with only
one exception. The wobble site of Met-tRNA is C instead of U, forming the Watson-Crick
match with AUG instead of AUA, the latter being much more frequent than the former.
This has been attributed to a compromise between translation initiation and elongation;
i.e., AUG is not only a methionine codon, but also an initiation codon, and an anticodon
matching AUG will increase the initiation rate. However, such an anticodon would
impose selection against the use of AUA codons because AUA needs to be wobble-translated.
According to this translation conflict hypothesis, AUA should be used relatively
less frequently compared to UUA in the UUR codon family. A comprehensive analysis
of mitochondrial genomes from a variety of vertebrate species revealed a general
deficiency of AUA codons relative to UUA codons. In contrast, urochordate mitochondrial
genomes with two tRNA(Met) genes with CAU and UAU anticodons exhibit increased AUA
codon usage. Furthermore, six bivalve mitochondrial genomes with both of their tRNA-Met
genes with a CAU anticodon have reduced AUA usage relative to three other bivalve
mitochondrial genomes with one of their two tRNA-Met genes having a CAU anticodon
and the other having a UAU anticodon. We conclude that the translation conflict
hypothesis is empirically supported, and our results highlight the fine details
of selection in shaping molecular evolution.
- Xia, X. 2007. The +4G site in Kozak consensus is not related to
the efficiency of translation initiation. PLoS ONE 2(2):e188.
Abstract The optimal context for translation initiation in mammalian
species is GCCRCCaugG (where R = purine and "aug" is the initiation codon), with
the -3R and +4G being particularly important. The presence of +4G has been interpreted
as necessary for efficient translation initiation. Accumulated experimental and
bioinformatic evidence has suggested an alternative explanation based on amino acid
constraint on the second codon, i.e., amino acid Ala or Gly are needed as the second
amino acid in the nascent peptide for the cleavage of the initiator Met, and the
consequent overuse of Ala and Gly codons (GCN and GGN) leads to the +4G consensus.
I performed a critical test of these alternative hypotheses on +4G based on 34169
human protein-coding genes and published gene expression data. The result shows
that the prevalence of +4G is not related to translation initiation. Among the five
G-starting codons, only alanine codons (GCN), and glycine codons (GGN) to a much
smaller extent, are overrepresented at the second codon, whereas the other three
codons are not overrepresented. While highly expressed genes have more +4G than
lowly expressed genes, the difference is caused by GCN and GGN codons at the second
codon. These results are inconsistent with +4G being needed for efficient translation
initiation, but consistent with the proposal of amino acid constraint hypothesis.
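The tally behind this comparison can be sketched as follows: for each coding sequence, take the codon after the initiation codon, and if it starts with G (that is, the gene has +4G), record which amino acid it encodes. This is a hypothetical illustration under the standard genetic code, not the analysis pipeline used in the paper; the function names and category labels are assumptions.

```python
from collections import Counter

# First two nucleotides determine the amino acid for GCN, GGN and GTN codons.
TWO_LETTER = {"GC": "Ala", "GG": "Gly", "GT": "Val"}


def second_codon_class(cds: str):
    """Classify the second codon of a CDS that starts with ATG.

    Returns an amino-acid label for G-starting codons, "non-G" otherwise,
    or None if the sequence cannot be classified.
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) < 6 or cds[:3] != "ATG":
        return None
    codon = cds[3:6]
    if codon[0] != "G":
        return "non-G"
    if codon[:2] in TWO_LETTER:
        return TWO_LETTER[codon[:2]]
    # Remaining G-starting codons are GAN: GAT/GAC = Asp, GAA/GAG = Glu.
    if codon[2] in "TC":
        return "Asp"
    if codon[2] in "AG":
        return "Glu"
    return None


def tally_second_codons(cds_list) -> Counter:
    """Count second-codon classes over a collection of coding sequences."""
    classes = (second_codon_class(cds) for cds in cds_list)
    return Counter(c for c in classes if c is not None)


toy_tally = tally_second_codons(["ATGGCCAAA", "ATGGCTTGA", "ATGGGTTAA", "ATGAAATAA"])
```

Comparing such counts against the codon frequencies expected from the rest of the coding sequences would indicate which G-starting codons are overrepresented at the second position.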
- Khalouei, S., X. Yao,
J. Mennigen, M. Carullo, P. Ma,
Z. Song, H. Xiong,
and Xia, X. 2007. Bioinformatic Approach to Identify Penultimate
Amino Acids Efficient for N-Terminal Methionine Excision. Pp. 386-389. Bioinformatics
and Biomedical Engineering, 2007, IEEE. The 1st International Conference on Bioinformatics
and Biomedical Engineering (ICBBE2007).
- Martyniuk, C. J., Xiong H., Crump, K., Chiu, S.,
Sardana, R., Nadler, A., Gerrie, E. R., Xia, X., Trudeau, V. L.
2006. Gene expression profiling in the neuroendocrine brain of male goldfish (Carassius
auratus) exposed to 17-alpha-ethinylestradiol. Physiol. Genomics 27(3):328-336.
Abstract 17-alpha ethinylestradiol (EE2), a pharmaceutical estrogen,
is detectable in water systems worldwide. Although studies report on the effects
of xenoestrogens in tissues such as liver and gonad, few studies to date have investigated
the effects of EE2 in the vertebrate brain at a large scale. The purpose of this
study was to develop a goldfish brain-enriched cDNA array and use this in conjunction
with a mixed tissue carp microarray to study the genomic response to EE2 in the
brain. Gonad-intact male goldfish were exposed to nominal concentrations of 0.1
nM (29.6 ng/l) and 1.0 nM (296 ng/l) EE2 for 15 days. Male goldfish treated with
the higher dose of EE2 had significantly smaller gonads compared with controls.
Males also had a significantly reduced level of circulating testosterone (T) and
17beta-estradiol (E2) in both treatment groups. Candidate genes identified by microarray
analysis fall into functional categories that include neuropeptides, cell metabolism,
and transcription/translation factors. Differentially expressed genes verified by
real-time RT-PCR included brain aromatase, secretogranin-III, and interferon-related
developmental regulator 1. Our results suggest that the expression of genes in the
sexually mature adult brain appears to be resistant to low EE2 exposure but is affected
significantly at higher doses of EE2. This study demonstrates that microarray technology
is a useful tool to study the effects of endocrine disrupting chemicals on neuroendocrine
function and suggests that exposure to EE2 may have significant effects on localized
E2 synthesis in the brain by affecting transcription of brain aromatase.
- Xia, X. 2006. Topological Bias in Distance-Based Phylogenetic Methods:
Problems with Over- and Underestimated Genetic Distances. Evolutionary Bioinformatics
2:375–387.
Abstract I show several types of topological biases in distance-based
methods that use the least-squares method to evaluate branch lengths and the minimum
evolution (ME) or the Fitch-Margoliash (FM) criterion to choose the best tree. For
a 6-species tree, there are two tree shapes, one with three cherries (a cherry is
a pair of adjacent leaves descending from the most recent common ancestor), and
the other with two. When genetic distances are underestimated, the 3-cherry tree
shape is favored with either the ME or FM criterion. When the genetic distances
are overestimated, the ME criterion favors the 2-cherry tree, but the direction
of bias with the FM criterion depends on whether negative branches are allowed,
i.e. allowing negative branches favors the 3-cherry tree shape but disallowing negative
branches favors the 2-cherry tree shape. The extent of the bias is explored by computer
simulation of sequence evolution.
- Wang, H. C., Xia, X., D. Hickey. 2006. Thermal adaptation of small
ribosomal RNA genes: a comparative study. Journal of Molecular Evolution
63(1):120-126
Abstract We carried out a comprehensive survey of small subunit
ribosomal RNA sequences from archaeal, bacterial, and eukaryotic lineages in order
to understand the general patterns of thermal adaptation in the rRNA genes. Within
each lineage, we compared sequences from mesophilic, moderately thermophilic, and
hyperthermophilic species. We carried out a more detailed study of the archaea,
because of the wide range of growth temperatures within this group. Our results
confirmed that there is a clear correlation between the GC content of the paired
stem regions of the 16S rRNA genes and the optimal growth temperature, and we show
that this correlation cannot be explained simply by phylogenetic relatedness among
the thermophilic archaeal species. In addition, we found a significant, positive
relationship between rRNA stem length and growth temperature. These correlations
are found in both bacterial and archaeal rRNA genes. Finally, we compared rRNA sequences
from warm-blooded and cold-blooded vertebrates. We found that, while rRNA sequences
from the warm-blooded vertebrates have a higher overall GC content than those from
the cold-blooded vertebrates, this difference is not concentrated in the paired
regions of the molecule, suggesting that thermal adaptation is not the cause of
the nucleotide differences between the vertebrate lineages.
- Xia, X., Wang, H. C., Z. Xie,
M. Carullo, H. Huang,
and D. Hickey. 2006. Cytosine usage modulates the correlation between CDS length
and CG content in prokaryotic genomes. Molecular Biology and Evolution 23:1450-1454
Abstract Previous studies have argued that, given the AT-rich nature
of stop codons, the length and CG% of coding sequences (CDSs) should be positively
correlated. This prediction is generally supported empirically by prokaryotic genomes.
However, the correlation is weak for a number of species, with 4 species showing
a negative correlation. Here we formulate a more general hypothesis incorporating
selection against cytosine (C) usage to explain the lack of strong positive correlation
between the length and GC% of CDSs. Two factors contribute to the selection against
C usage in long CDSs. First, C is the least abundant nucleotide in the cell, and
a long CDS should have fewer Cs to increase transcription efficiency. Second, C
is prone to mutation to U/T and selection for increased reliability should reduce
C usage in long CDSs. Empirical data from prokaryotic genomes lend strong support
for this new hypothesis.
- Cai, J. J., Smith,
D. K., Xia, X., Yuen, K. Y. 2006. MBEToolbox 2.0: An enhanced
version of a MATLAB toolbox for Molecular Biology and Evolution. Evolutionary Bioinformatics
2:187-190.
Abstract MBEToolbox is an extensible MATLAB-based software package
for analysis of DNA and protein sequences. MBEToolbox version 2.0 includes enhanced
functions for phylogenetic analyses by the maximum likelihood method. For example,
it is capable of estimating the synonymous and nonsynonymous substitution rates
using a novel or several known codon substitution models. MBEToolbox 2.0 introduces
new functions for estimating site-specific evolutionary rates by using a maximum
likelihood method or an empirical Bayesian method. It also incorporates several
different methods for recombination detection. Multi-platform versions of the software
are freely available at http://www.bioinformatics.org/mbetoolbox/.
- Xia, X. and G. Palidwor. 2005.
Genomic Adaptation to Acidic Environment: Evidence from Helicobacter pylori.
American Naturalist 166:776-784
Abstract The origin of new functions is fundamental in understanding
evolution, and three processes known as adaptation, preadaptation, and exaptation
have been proposed as possible evolutionary pathways leading to the origin of new
functions. Here we examine the origin of an acid resistance mechanism in the mammalian
gastric pathogen Helicobacter pylori, with reference to these three evolutionary
pathways. The mechanism involved is that H. pylori, when exposed to the acidic environment
in mammalian stomach, restricts the acute proton entry across its membrane by its
increased usage of positively charged amino acids in the inner and outer membrane
proteins. The results of our comparative genomic analysis between H. pylori, the
two closely related species Helicobacter hepaticus and Campylobacter jejuni, and
other relevant proteobacterial species are incompatible with the hypotheses invoking
preadaptation or exaptation. The acid resistance mechanism most likely arose by
selection favoring an increased usage of positively charged lysine in membrane proteins.
- Shi, B., and X. Xia. 2005. Genetic
variation in clones of Pseudomonas pseudoalcaligenes after ten months of
selection in different thermal environments in the laboratory. Curr Microbiol
50:238-45.
Abstract The random amplification of polymorphic DNA (RAPD) method
was used to examine genetic variation in experimental clones of Pseudomonas pseudoalcaligenes
in two experimental groups, as well as their common ancestor. Six clones derived
from a single colony of P. pseudoalcaligenes were cultured in two different thermal
regimes for 10 months. Three clones in the Control group were cultured at constant
temperature of 35 degrees C and another three clones in the High Temperature (HT)
group were propagated at incremental temperature ranging from 41 to 47 degrees C
for 10 months. A total of 45 RAPD primers generated 146 polymorphic markers. Analysis
of molecular variance (AMOVA) revealed mild (11%) but significant (P < 0.001) genetic
difference between the Control and the HT clones. Phylogenetic analysis based on
pairwise genetic distances showed that the HT clones were more divergent from the
ancestor and from each other than the Control clones, implying that the HT clones
of P. pseudoalcaligenes may have evolved faster than the Control clones.
- Xia, X. and K. Y. Yuen. 2005. Differential selection and mutation
between dsDNA and ssDNA phages shape the evolution of their genomic AT percentage.
BMC Genetics 6:20.
BACKGROUND: Bacterial genomes differ dramatically in AT%. We have developed
a model to show that the genomic AT% in rapidly replicating bacterial species can
be used as an index of the availability of nucleotides A and T for DNA replication
in cellular medium. This index is then used to (1) study the evolution and adaptation
of the bacteriophage genomic AT% in response to the differential nucleotide availability
of the host and (2) test the prediction that double-stranded DNA (dsDNA) phage should
exhibit better adaptation than single-stranded DNA (ssDNA) phage because the rate
of spontaneous deamination, which leads to C→T or C→U mutations depending
on whether C is methylated or not, is about 100-fold greater in ssDNA than in dsDNA.
RESULTS: We retrieved 79 dsDNA phage and 27 ssDNA phage genomes together
with their host genomic sequences. The dsDNA phages have their genomic AT% better
adapted to the host genomic AT% than ssDNA phage. The poorer adaptation of the ssDNA
phage can be partially accounted for by the C→T(U) mutations mediated by the
spontaneous deamination. For ssDNA phage, the genomic A% is more strongly correlated
with their host genomic AT% than the genomic T%. CONCLUSION: A significant
fraction of variation in the genomic AT% in the dsDNA phage, and that in the genomic
A% and T% of the ssDNA phage, can be explained by the difference in selection and
mutation between them.
- Cai, J., Smith, D., X. Xia, and
K. Y. Yuen. 2005. MBEToolbox: a Matlab toolbox for sequence data analysis of molecular
biology and evolution. BMC Bioinformatics 6:64.
BACKGROUND: MATLAB is a high-performance language for technical computing,
integrating computation, visualization, and programming in an easy-to-use environment.
It has been widely used in many areas, such as mathematics and computation, algorithm
development, data acquisition, modeling, simulation, and scientific and engineering
graphics. However, few functions are freely available in MATLAB to perform the sequence
data analyses specifically required for molecular biology and evolution. RESULTS:
We have developed a MATLAB toolbox, called MBEToolbox, aimed at filling this
gap by offering efficient implementations of the most needed functions in molecular
biology and evolution. It can be used to manipulate aligned sequences, calculate
evolutionary distances, estimate synonymous and nonsynonymous substitution rates,
and infer phylogenetic trees. Moreover, it provides an extensible, functional framework
for users with more specialized requirements to explore and analyze aligned nucleotide
or protein sequences from an evolutionary perspective. The full functions in the
toolbox are accessible through the command-line for seasoned MATLAB users. A graphical
user interface, that may be especially useful for non-specialist end users, is also
provided. CONCLUSION: MBEToolbox is a useful tool that can aid in the exploration,
interpretation and visualization of data in molecular biology and evolution. The
software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/ and http://bioinformatics.org/project/?group_id=454
- Xia, X. 2005. Mutation and Selection on the Anticodon of tRNA Genes
in Vertebrate Mitochondrial Genomes. Gene 345:13-20.
Abstract The H-strand of vertebrate mitochondrial DNA is left single-stranded
for hours during the slow DNA replication. This facilitates C→U mutations
on the H-strand (and consequently G→A mutations on the L-strand) via spontaneous
deamination which occurs much more frequently on single-stranded than on double-stranded
DNA. For the 12 coding sequences (CDS) collinear with the L-strand, NNY synonymous
codon families (where N stands for any of the four nucleotides and Y stands for
either C or U) end mostly with C, and NNR and NNN codon families (where R stands
for either A or G) end mostly with A. For the lone ND6 gene on the other strand,
the codon bias is the opposite, with NNY codon families ending mostly with U and
NNR and NNN codon families ending mostly with G. These patterns are consistent with
the strand-specific mutation bias. The codon usage biased towards C-ending and A-ending
in the 12 CDS sequences affects the codon-anticodon adaptation. The wobble site
of the anticodon is always G for NNY codon families dominated by C-ending codons
and U for NNR and NNN codon families dominated by A-ending codons. The only, but
consistent, exception is the anticodon of tRNA-Met which consistently has a 5'-CAU-3'
anticodon base-pairing with the AUG codon (the translation initiation codon) instead
of the more frequent AUA. The observed CAU anticodon (matching AUG) would increase
the rate of translation initiation but would reduce the rate of peptide elongation
because most methionine codons are AUA, whereas the unobserved UAU anticodon (matching
AUA) would increase the elongation rate at the cost of translation initiation rate.
The consistent CAU anticodon in tRNA-Met suggests the importance of maximizing the
rate of translation initiation.
- Baron, D., J. Cocquet, X. Xia, M. Fellous, Y. Guiguen, and R. A.
Veitia. 2004. An evolutionary and functional analysis of FoxL2 in rainbow trout
gonad differentiation. J. Mol. Endocrinol. 33:705-715.
Abstract FOXL2 is a forkhead transcription factor involved in ovarian
development and function. Here, we have studied the evolution and pattern of expression
of the FOXL2 gene and its paralogs in fish. We found well conserved FoxL2 sequences
(FoxL2a) and divergent genes, whose forkhead domains belonged to the class L2 and
were shown to be paralogs of the FoxL2a sequences (named FoxL2b). In the rainbow
trout, FoxL2a and FoxL2b were specifically expressed in the ovary, but displayed
different temporal patterns of expression. FoxL2a expression correlated with the
level of aromatase, the key enzyme in estrogen production, and an estrogen treatment
used to feminize genetically male individuals elicited the up-regulation of both
paralogs. Conversely, androgens or an aromatase inhibitor down-regulated FoxL2a
and FoxL2b in females. We speculate that there is a direct link between estrogens
and FoxL2 expression in fish, at least during the period where the identity of the
gonad is sensitive to hormonal treatments.
- Xia, X. 2004. A peculiar codon usage pattern revealed after removing
the effect of DNA methylation. Proceedings of the 4th International Conference on
Bioinformatics of Genome Regulation and Structure 1:216-220.
- Cocquet, J., E. De Baere, M. Gareil, M. Pannetier, X. Xia, M. Fellous,
R. Veitia. 2003. Structure, evolution and expression of the FOXL2 transcription
unit. Cytogenetic Genome Res 101:206-211.
Abstract FOXL2 is a putative transcription factor involved in ovarian
development and function. Its mutations in humans are responsible for the blepharophimosis
syndrome, characterized by eyelid malformations and premature ovarian failure (POF).
Here we have performed a comparative sequence analysis of FOXL2 sequences of ten
vertebrate species. We demonstrate that the entire open reading frame (ORF) is under
purifying selection leading to strong protein conservation. We also review recent
data on FOXL2 transcript and protein expression. FOXL2 has been shown 1) to be the
earliest known sex dimorphic marker of ovarian determination/differentiation in
vertebrates, 2) to have, at least in mammals, an ovarian expression persisting until
adulthood. The conservation of its sequence and pattern of expression suggests that
FOXL2 might be a key factor in the early development of the vertebrate female gonad
and involved later in adult ovarian function. Finally, we provide arguments for
the existence of an alternative transcript in rodents, that may arise from a differential
polyadenylation. Although it has only been demonstrated in rodents, its presence/absence
in other species deserves further investigation.
- X. Xia. 2003. DNA methylation and Mycoplasma genomes. Journal
of Molecular Evolution 57:S21-S28.
Abstract DNA methylation is one of the many hypotheses proposed
to explain the observed deficiency in CpG dinucleotides in a variety of genomes
covering a wide taxonomic distribution. Recent studies challenged the methylation
hypothesis on empirical grounds. First, it cannot explain why the Mycoplasma genitalium
genome exhibits strong CpG deficiency without DNA methylation. Second, it cannot
explain the great variation in CpG deficiency between M. genitalium and M. pneumoniae
that also does not have CpG-specific methyltransferase genes. I analyzed the genomic
sequences of these Mycoplasma species together with the recently sequenced genomes
of M. pulmonis, Ureaplasma urealyticum, and Staphylococcus aureus, and found the
results fully compatible with the methylation hypothesis. In particular, I present
compelling empirical evidence to support the following scenario. The common ancestor
of the three Mycoplasma species has CpG-specific methyltransferases, and has evolved
strong CpG deficiency as a result of the specific DNA methylation. Subsequently,
this ancestral genome diverged into M. pulmonis and the common ancestor of M. pneumoniae
and M. genitalium. M. pulmonis has retained methyltransferases and exhibits the
strongest CpG deficiency. The common ancestor lost the methyltransferase gene and
then diverged into M. genitalium and M. pneumoniae. M. genitalium and M. pneumoniae,
after losing methylation activities, began to regain CpG dinucleotides through random
mutation. M. genitalium evolved more slowly than M. pneumoniae, gained relatively
fewer CpG dinucleotides, and is more CpG-deficient.
- Shi, B., X. Xia. 2003. Changes
in growth parameters of Pseudomonas pseudoalcaligenes after ten months culturing
at increasing temperature in the laboratory. FEMS Microbiology Ecology 45:127-134.
Abstract In this paper, we report the thermal adaptation of Pseudomonas
pseudoalcaligenes, characterized as changes in growth parameters. Six clones derived
from a single colony of P. pseudoalcaligenes were cultured in two different temperature
regimes for 10 months, with three clones forming the control group, cultured at
a constant temperature, and another three clones forming the high-temperature (HT)
group, cultured at increasing temperature (from 41 to 47 degrees C). Three growth
parameters were measured: the lag time (lambda), which is the period between the
time of transfer to a new medium and the time when the cell replication starts;
the maximum growth rate (mu(m)); and the maximum yield (A). These three parameters
are major components of bacterial fitness. The Gompertz and logistic models were
used to estimate these three parameters. The two models gave almost identical estimates,
but the Gompertz model had R(2) values consistently larger than the logistic model.
The HT clones had significantly shorter lambda, but higher mu(m) and A than the
control clones when both were grown at the originally stressful temperature of 45
degrees C, suggesting significant thermal adaptation. Interestingly, the HT clones
grew equally well as the control clones at 35 degrees C, i.e. improved performance
at 45 degrees C was not associated with a reduced performance at 35 degrees C.
- Xia, X., Z. Xie, K. Kjer. 2003.
18S rRNA and Tetrapod Phylogeny. Systematic Biology 52(3):283-295 (Editor's
choice)
Abstract Previous phylogenetic analyses of tetrapod 18S ribosomal
RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular
data, and morphological and paleontological data favor the grouping of birds with
crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive
evidence of different genes providing significantly different estimates of phylogeny
in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158).
Our research indicates that the previous discrepancy of phylogenetic results between
the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the
sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence
quality. When the sequences are aligned with the aid of the secondary structure
of the 18S rRNA molecule and when the frequency parameters are estimated either
from all sites or from the variable domains where substitutions have occurred, the
18S rRNA sequences no longer support the grouping of the avian species with the
mammalian species.
- Xia, X., Z. Xie, W. H. Li. 2003.
Effects of GC Content and Mutational Pressure on the Lengths of Exons and Coding
Sequences. Journal of Molecular Evolution 56:362-370.
Abstract It has been hypothesized that the length of an exon tends
to increase with the GC content because stop codons are AT-rich and should occur
less frequently in GC-rich exons. This prediction assumes that mutation pressure
plays a significant role in the occurrence and distribution of stop codons. However,
the prediction is applicable not to all exons, but only to the last coding exon
of a gene and to single-exon CDS sequences. We classified exons in multiexon genes
in eight eukaryotic species into three groups (the first exon, the internal exons, and
the last exon) and computed the Spearman correlation between the exon length and
the percentage GC (%GC) for each of the three groups. In only five of the species
studied is the correlation for the last coding exon greater than that for the first
or internal exons. For the single-exon CDS sequences, the correlation between CDS
length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted
relationship between exon length and %GC. In prokaryotic genomes, CDS length and
%GC are positively correlated in each of the 68 completely sequenced prokaryotic
genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the
wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover,
the average CDS length and the genomic GC content are also positively correlated.
After correcting for genome size, the partial correlation between the average CDS
length and the genomic GC content is 0.3217 (p < 0.025).
- Shi, B., X. Xia. 2003. Morphological
changes of Pseudomonas pseudoalcaligenes in response to temperature selection. Current
Microbiology 46:120-123.
Abstract Adaptation to novel environments usually entails morphological
changes. The cell morphology of six experimental populations of Pseudomonas pseudoalcaligenes
and their common ancestor were examined with scanning electron microscopy (SEM).
The six experimental populations were propagated under different temperatures for
10 months: three of them cultured at constant normal temperature (35 degrees C)
forming the control group, and the other three cultured at incremental higher temperatures
(from 41 degrees to 47 degrees C) as the HT group. SEM showed the deformed and elongated
cells in the 6-h cultures of both ancestral and control populations at 45 degrees
C, indicating that 45 degrees C is stressful for the ancestral and the control populations.
In contrast, the HT populations retained normal cell shape in the 6-h cultures at
both 35 degrees C and 45 degrees C. The mean cell volumes of control and HT populations
increased 29% and 34%, respectively, relative to the ancestor at their respective
thermal regimens, suggesting that the culturing conditions might favor larger cells.
- Xia, X., Z. Xie, M. Salemi, L. Chen, Y. Wang. 2003.
An index of substitution saturation and its application. Molecular Phylogenetics
and Evolution 26:1-7.
Abstract We introduce a new index to measure substitution saturation
in a set of aligned nucleotide sequences. The index is based on the notion of entropy
in information theory. We derive the critical values of the index based on computer
simulation with different sequence lengths, different number of OTUs and different
topologies. The critical value enables researchers to quickly judge whether a set
of aligned sequences is useful in phylogenetics. We illustrate the index by applying
it to an analysis of the aligned sequences of the elongation factor-1alpha gene
originally used to resolve the deep phylogeny of major arthropod groups. The method
has been implemented in DAMBE.
- Cocquet, J., E. Pailhoux, F. Jaubert, N. Servel, X. Xia, M. Pannetier,
E. De Baere, L. Messiaen, C. Cotinot, M. Fellous, R. Veitia. 2002. Evolution and
expression of FOXL2. Journal of Medical Genetics 39:916-921.
Abstract Mutations in FOXL2, a forkhead transcription factor gene,
have recently been shown to cause the blepharophimosis-ptosis-epicanthus inversus
syndrome (BPES). This rare genetic disorder leads to a complex eyelid malformation
associated or not with premature ovarian failure (BPES type I or II, respectively).
We performed a comparative analysis of the FOXL2 sequence in several species (human,
goat, mouse, and pufferfish) showing that the FOXL2 coding region is highly conserved
in these species. The FOXL2 protein contains a polyalanine tract whose role has
not yet been elucidated. Recurrent mutations leading to its expansion result in
BPES type II and account for 30% of the deleterious alterations detected in the
open reading frame (ORF) of FOXL2. We showed that the number of alanine residues
is strictly conserved among the mammals studied, suggesting the existence of strong
functional or structural constraints. We provide immunohistochemical evidence indicating
that FOXL2 is a nuclear protein specifically expressed in eyelids and in fetal and
adult ovarian follicular cells. It does not undergo any major post-translational
maturation. FOXL2 is the earliest known marker of ovarian differentiation in mammals
and may play a role in ovarian somatic cell differentiation and in further follicle
development and/or maintenance.
- Xia, X., T. Wei,
Z. Xie and A. Danchin. 2002. Genomic changes in nucleotide and di-nucleotide
frequencies in Pasteurella multocida cultured under high temperature. Genetics
161:1385-94.
Abstract We used 94 RAPD primers of different nucleotide composition
to probe the genomic differences between a highly virulent P. multocida strain and
an attenuated vaccine strain derived from the virulent strain after culturing the
latter under increasing temperature for approximately 14,400 generations. The GC
content of the vaccine strain is significantly (P < 0.05) lower than that of the
virulent strain, contrary to the popular hypothesis of covariation between the GC
content and temperature. The frequencies of AA, TA, and TT dinucleotides were higher,
and those of AT, GC, and CG dinucleotides were lower, in the vaccine strain than
in the virulent strain. A statistic called genomic RAPD entropy is formulated to
measure the randomness of the genome, and the genome of the vaccine strain is more
random than that of the virulent strain. These differences between the virulent
and vaccine strains are interpreted in terms of mutation and selection under increased
culturing temperature. A method for estimating substitution rates is developed in
the appendix.
- Xia, X. and Z. Xie. 2002. Protein
Structure, Neighbor Effect, and a New Index of Amino Acid Dissimilarities. Molecular
Biology and Evolution 19:58-67.
Abstract Amino acids interact with each other, especially with
neighboring amino acids, to generate protein structures. We studied the pattern
of association and repulsion of amino acids based on 24,748 protein-coding genes
from human, 11,321 from mouse, and 15,028 from Escherichia coli, and documented
the pattern of neighbor preference of amino acids. All amino acids have different
preferences for neighbors. We have also analyzed 7,342 proteins with known secondary
structure and estimated the propensity of the 20 amino acids occurring in three
of the major secondary structures, i.e., helices, sheets, and turns. Much of the
neighbor preference can be explained by the propensity of the amino acids in forming
different secondary structures, but there are also a number of intriguing association
and repulsion patterns. The similarity in neighbor preference among amino acids
is significantly correlated with the number of amino acid substitutions in both
mitochondrial and nuclear genes, with amino acids having similar sets of neighbors
replacing each other more frequently than those having very different sets of neighbors.
This similarity in neighbor preference is incorporated into a new index of amino
acid dissimilarities that can predict nonsynonymous codon substitutions better than
the two existing indices of amino acid dissimilarities, i.e., Grantham's and Miyata's
distances.
- Xia, X. and Z. Xie. 2001. AMADA:
Analysis of microarray data. Bioinformatics 17:569-570.
Abstract AMADA is a Windows program for identifying co-expressed
genes from microarray data. It performs data transformation, principal component
analysis, a variety of cluster analyses and extensive graphic functions for visualizing
expression profiles.
- Xia, X. and Z. Xie. 2001. DAMBE:
Data analysis in molecular biology and evolution. Journal of Heredity 92:371-373.
Abstract DAMBE (data analysis in molecular biology and evolution)
is an integrated software package for converting, manipulating, statistically and
graphically describing, and analyzing molecular sequence data with a user-friendly
Windows 95/98/2000/NT interface. DAMBE is free and can be downloaded from http://web.hku.hk/~xxia/software/software.htm.
- Chen, B., and X. Xia. 2001. The
genus Schevodera Borchmann: Phylogeny, historic biogeography and new Chinese records,
with description of a new species (Coleoptera: Tenebrionidae: Lagriinae). Oriental
Insects 35: 3-27.
Abstract Schevodera Borchmann belongs to the subfamily Lagriinae
and its members are phytophagous. A new species, S. glabricollis is described from
China. Redescriptions of the genus and two known species, S. gracilicornis and S.
inflata with new records for China are given. A key to Chinese species is given.
The phylogeny of the nine known species and one subspecies is cladistically analysed
based on 21 morphological characters from adults. The confidence of the phylogram
obtained from the cladistic analysis and its monophylies are examined with PTP and
T-PTP tests. The ancestral distribution of the genus is also reconstructed based
on the dispersal-vicariance analysis. The results suggest that the genus would be
monophyletic. In the late Permian — late Triassic period around 255–220 million
years ago, it is hypothesized to have originated from a Lagria-like ancestral species
between western Yunnan, China and Burma in the Shan-Thai terrain. It dispersed from
western Yunnan and northern Burma to Sumatra and Java, and then northward through
Borneo to Palawan, Luzon and finally Mindanao. Based on phylogeny and historic biogeography,
the genus is divided into three species groups: Yunnan, Indonesia and Philippines
groups. The Yunnan group is the most primitive, consisting of S. inflata, S. glabricollis
and S. gracilicornis, and is mainly distributed in Yunnan and Burma. The Indonesia
group includes S. hirticollis and S. hirticollis salvazai, S. curticollis and S.
dohrni, and occurs primarily in Indonesia but also reaches into Burma and the Philippines.
The S. hirticollis salvazai has dispersed from Burma to Laos. The group originated
from the ancestor of Yunnan group after Eocene, i.e. no longer than 50 million
years ago. The monophyletic Philippines group is composed of three endemic species:
S. setosa, S. spoliata and S. insularis. It originated from the ancestor of the
Indonesian group after the Miocene around 20 million years ago and dispersed from
Palawan to Luzon and then Mindanao. The synapomorphies between these groups, interspecific
phylogenetic relationships, time and place of origin and potential distribution
of each species are also discussed in detail.
- Xia, X. 2000. Phylogenetic Relationship among Horseshoe Crab Species:
The Effect of Substitution Models on Phylogenetic Analyses. Systematic Biology
49:87-100.
Abstract The horseshoe crabs, known as living fossils, have maintained
their morphology almost unchanged for the past 150 million years. The little morphological
differentiation among horseshoe crab lineages has resulted in substantial controversy
concerning the phylogenetic relationship among the extant species of horseshoe crabs,
especially among the three species in the Indo-Pacific region. Previous studies
suggest that the three species constitute a phylogenetically unresolvable trichotomy,
the result of a cladogenetic process leading to the formation of all three Indo-Pacific
species in a short geological time. Data from two mitochondrial genes (for 16S ribosomal
RNA and cytochrome oxidase subunit I) and one nuclear gene (for coagulogen) in
the four species of horseshoe crabs and outgroup species were used in a phylogenetic
analysis with various substitution models. All three genes yield the same tree topology,
with Tachypleus-gigas and Carcinoscorpius-rotundicauda grouped together as a monophyletic
taxon. This topology is significantly better than all the alternatives when evaluated
with the RELL (resampling estimated log-likelihood) method.
- Xia, X. and W.-H. Li 1998. What amino acid properties affect protein
evolution? Journal of Molecular Evolution 47:557-564.
Abstract We studied 10 protein-coding mitochondrial genes from
19 mammalian species to evaluate the effects of 10 amino acid properties on the
evolution of the genetic code, the amino acid composition of proteins, and the pattern
of nonsynonymous substitutions. The 10 amino acid properties studied are the chemical
composition of the side chain, two polarity measures, hydropathy, isoelectric point,
volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation. The genetic
code appears to have evolved toward minimizing polarity and hydropathy but not the
other seven properties. This can be explained by our finding that the presumably
primitive amino acids differed much only in polarity and hydropathy, but little
in the other properties. Only the chemical composition (C) and isoelectric point
(IE) appear to have affected the amino acid composition of the proteins studied,
that is, these proteins tend to have more amino acids with typical C and IE values,
so that nonsynonymous mutations tend to result in small differences in C and IE.
All properties, except for hydroxythiolation, affect the rate of nonsynonymous substitution,
with the observed amino acid changes having only small differences in these properties,
relative to the spectrum of all possible nonsynonymous mutations.
- Xia, X. 1998. How optimized is the translational machinery in E.
coli, S. typhimurium, and S. cerevisiae? Genetics 149: 37-44.
Abstract The optimization of the translational machinery in cells
requires the mutual adaptation of codon usage and tRNA concentration, and the adaptation
of tRNA concentration to amino acid usage. Two predictions were derived based on
a simple deterministic model of translation which assumes that elongation of the
peptide chain is rate-limiting. The highest translational efficiency is achieved
when the codon recognized by the most abundant tRNA reaches the maximum frequency.
For each codon family, the tRNA concentration is optimally adapted to codon usage
when the concentration of different tRNA species matches the square-root of the
frequency of their corresponding synonymous codons. When tRNA concentration and
codon usage are well adapted to each other, the optimal content of all tRNA species
carrying the same amino acid should match the square-root of the frequency of the
amino acid. These predictions are examined against empirical data from Escherichia
coli, Salmonella typhimurium, and Saccharomyces cerevisiae.
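The square-root rule stated in this abstract can be illustrated with a small sketch. It assumes that "matches the square-root" means tRNA shares within a codon family are proportional to the square root of codon frequency; the function name and the two-codon example are hypothetical and not taken from the paper.

```python
import math


def optimal_trna_shares(codon_freqs):
    """Return tRNA shares proportional to sqrt(codon frequency).

    codon_freqs maps each synonymous codon in one family to its
    relative usage. Normalising the shares to sum to 1 is an
    assumption made here for readability.
    """
    roots = {codon: math.sqrt(freq) for codon, freq in codon_freqs.items()}
    total = sum(roots.values())
    return {codon: root / total for codon, root in roots.items()}


# Hypothetical codon family: one codon used 80% of the time, the other 20%.
shares = optimal_trna_shares({"codon_A": 0.8, "codon_B": 0.2})
print(shares)
```

Under this reading, a 4:1 difference in codon usage corresponds to only a 2:1 difference in the predicted tRNA shares.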
- Xia, X. 1998. The rate heterogeneity of nonsynonymous substitutions
in mammalian mitochondrial genes. Molecular Biology and Evolution 15:336-344.
Abstract Substitution rates at the three codon positions (r1, r2,
and r3) of mammalian mitochondrial genes are in the order of r3 > r1 > r2, and the
rate heterogeneity at the three positions, as measured by the shape parameter of
the gamma distribution (alpha 1, alpha 2, and alpha 3), is in the order of alpha
3 > alpha 1 > alpha 2. The causes for the rate heterogeneity at the three codon
positions remain unclear and, in particular, there has been no satisfactory explanation
for the observation of alpha 1 > alpha 2. I attempted to dissect the causes of rate
heterogeneity by studying the pattern of nonsynonymous substitutions with respect
to codon positions in 10 mitochondrial genes from 19 mammalian species. Nonsynonymous
substitutions involve more different amino acid replacements at the second than
at the first codon position, which results in r1 > r2. The difference between r1
and r2 increases with the intensity of purifying selection, and so does the rate
heterogeneity in nonsynonymous substitutions among sites at the same codon position.
All mitochondrial genes appear to have functionally important and unimportant codons,
with the latter having all three codon positions prone to nonsynonymous substitutions.
Within the functionally important codons, the second codon position is much more
conservative than the first codon position. This explains why alpha 1 > alpha 2. The result
suggests that overweighting of the second codon position in phylogenetic analysis
may be a misguided practice.
- Xia, X. 1996. Maximising transcription efficiency causes codon
usage bias. Genetics 144:1309-1320.
Abstract The rate of protein synthesis depends on both the rate
of initiation of translation and the rate of elongation of the peptide chain. The
rate of initiation depends on the encountering rate between ribosomes and mRNA;
this rate in turn depends on the concentration of ribosomes and mRNA. Thus, patterns
of codon usage that increase transcriptional efficiency should increase mRNA concentration,
which in turn would increase the initiation rate and the rate of protein synthesis.
An optimality model of the transcriptional process is presented with the prediction
that the most frequently used ribonucleotide at the third codon sites in mRNA molecules
should be the same as the most abundant ribonucleotide in the cellular matrix where
mRNA is transcribed. This prediction is supported by four kinds of evidence. First,
A-ending codons are the most frequently used synonymous codons in mitochondria,
where ATP is much more abundant than the three other ribonucleotides. Second,
A-ending codons are more frequently used in mitochondrial genes than in nuclear
genes. Third, protein genes from organisms with a high metabolic rate use more A-ending
codons and have higher A content in their introns than those from organisms with
a low metabolic rate.
- Xia, X., Hafner, M. S. and P. D. Sudman. 1996. On transition bias
in mitochondrial genes of pocket gophers. Journal of Molecular Evolution
43:32-40.
Abstract The relative contribution of mutation and purifying selection
to transition bias has not been quantitatively assessed in mitochondrial protein
genes. The observed transition/transversion (s/v) ratio is (μs Ps)/(μv Pv),
where μs and μv denote mutation rate of transitions and transversions, respectively,
and Ps and Pv denote fixation probabilities of transitions and transversions, respectively.
Because selection against synonymous transitions can be assumed to be roughly equal
to that against synonymous transversions, Ps/Pv ≈ 1 at fourfold degenerate
sites, so that the s/v ratio at fourfold degenerate sites is approximately μs/μv,
which is a measure of mutational contribution to transition bias. Similarly, the
s/v ratio at nondegenerate sites is also an estimate of μs/μv if we assume
that selection against nonsynonymous transitions is roughly equal to that against
nonsynonymous transversions. In two mitochondrial genes, cytochrome oxidase subunit
I (COI) and cytochrome b (cyt-b) in pocket gophers, the s/v ratio is about two at
nondegenerate and fourfold degenerate sites for both the COI and the cyt-b genes.
This implies that mutation contribution to transition bias is relatively small.
In contrast, the s/v ratio is much greater at twofold degenerate sites, being 48
for COI and 40 for cyt-b. Given that the μs/μv ratio is about 2, the Ps/Pv
ratio at twofold degenerate sites must be on the order of 20 or greater. This suggests
a great effect of purifying selection on transition bias in mitochondrial protein
genes because transitions are synonymous and transversions are nonsynonymous at
twofold degenerate sites in mammalian mitochondrial genes. We also found that nonsynonymous
mutations at twofold degenerate sites are more neutral than nonsynonymous mutations
at nondegenerate sites, and that the COI gene is subject to stronger purifying selection
than is the cyt-b gene. A model is presented to integrate the effect of purifying
selection, codon bias, DNA repair and GC content on s/v ratio of protein-coding
genes.
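The arithmetic behind the "20 or greater" figure in this abstract can be written out as a short sketch. It takes the observed s/v ratio to be the product of a mutational ratio and a fixation-probability ratio, as the abstract states, and uses the rounded values quoted there; the variable and function names are illustrative only.

```python
def fixation_ratio(observed_sv, mutational_sv):
    """Solve observed s/v = mutational s/v * fixation ratio for the fixation ratio."""
    return observed_sv / mutational_sv


# Values quoted in the abstract (rounded).
mutational_sv = 2.0                         # s/v at fourfold and nondegenerate sites
twofold_sv = {"COI": 48.0, "cyt-b": 40.0}   # s/v at twofold degenerate sites

ps_pv = {gene: fixation_ratio(sv, mutational_sv) for gene, sv in twofold_sv.items()}
print(ps_pv)
```

With these inputs the implied fixation ratio is 24 for COI and 20 for cyt-b.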
- Xia, X. 1995. Body temperature, rate of biosynthesis, and evolution
of genome size. Molecular Biology and Evolution 12:834-842.
Abstract An optimality model relating the rate of biosynthesis
to body temperature and gene duplication is presented to account for several observed
patterns of genome size variation. The model predicts (1) that poikilotherms living
in a warm climate should have a smaller genome than poikilotherms living in a cold
climate, (2) that homeotherms should have a small genome as well as a small variation
in genome size relative to their poikilothermic ancestors, (3) that cold geological
periods should favor the evolution of poikilotherms with a large genome and that
warm geological periods should do the opposite, and (4) that poikilotherms with
a small genome should be more sensitive to changes in temperature than poikilotherms
with a large genome. The model also offers two explanations for the empirically
documented trend that organisms with a large cell volume have larger genomes than
those with a small cell volume. Relevant empirical evidence is summarized to support
these predictions.
- Xia, X. 1995. Revisiting Hamilton's rule. American Naturalist 145:483-492.
Abstract The expectation of fitness gain for helping offspring is the same as that for helping siblings. However, the variance of the fitness gain differs between the two (i.e., helping offspring and helping siblings). All offspring are equally related to parents, but siblings are not equally related to each other. I present population genetic models to show that helping offspring is evolutionarily more profitable than helping siblings. The paper also clarifies some misunderstandings related to a previous publication.
- Xia, X. 1993. A full sibling is not as valuable as an offspring: on Hamilton's rule. American Naturalist 142:174-185.
Abstract The coefficient of relationship (r) between parents and offspring is the same as that between full siblings. However, the variance of r_(sib_i,sib_j) is greater than that of r_(parent,offspring). I present a population genetic model to show that helping offspring is evolutionarily more profitable than helping siblings.
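The contrast in variance can be seen in a minimal single-locus sketch. It assumes an outbred diploid population and one autosomal locus, where a parent and its offspring always share exactly one allele identical by descent, while full siblings share zero, one, or two with probabilities 1/4, 1/2, and 1/4. This is a textbook illustration and not the model developed in the paper; all names are illustrative.

```python
from fractions import Fraction


def mean_and_variance(distribution):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(value * prob for value, prob in distribution.items())
    variance = sum(prob * (value - mean) ** 2 for value, prob in distribution.items())
    return mean, variance


# Fraction of alleles shared identical by descent at one locus.
parent_offspring = {Fraction(1, 2): Fraction(1)}
full_sibs = {
    Fraction(0): Fraction(1, 4),
    Fraction(1, 2): Fraction(1, 2),
    Fraction(1): Fraction(1, 4),
}

po_mean, po_var = mean_and_variance(parent_offspring)
sib_mean, sib_var = mean_and_variance(full_sibs)
print(po_mean, po_var, sib_mean, sib_var)
```

Both means are 1/2, but the single-locus variance is 0 for parent and offspring and 1/8 for full siblings.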
- Boonstra, R., Xia, X. and L. Pavone. 1992. Mating system of the meadow vole, Microtus pennsylvanicus. Behavioral Ecology 4:83-89.
Abstract Previous studies on parental and spacing behavior of Microtus pennsylvanicus suggest a promiscuous mating
system, but attempts to find multiple paternity in single litters have been unsuccessful. In this paper we
present evidence of multiple paternity in single litters conceived in the wild early in the breeding season.
The proportion of litters sired by multiple males was estimated, by a conservative method, to be 33.1%.
We argue that the presence of promiscuity, rather than polygyny, in M. pennsylvanicus is the result of two
factors. First, overwintered breeding males are similar in age and size, resulting in small variation in
competitive ability among males. This reduced variation in competitive ability reduces the possibility that
some males defend several females and others defend none. Second, the habitat structure of the meadow
vole makes it difficult for a male to detect other males nearby, and this reduces the possibility that one
male excludes others from mating when a female comes into estrus. Key words: competitive ability, mating
systems, meadow voles, Microtus pennsylvanicus, multiple paternity.
- Xia, X. and R. Boonstra. 1992. Measuring temporal variation in population density: a critique. American Naturalist 140:883-892.
Abstract Many authors have recently compared temporal variation in population density among different taxa or among different populations within the same taxon. There are many problems associated with this approach, and most generalizations made in these articles are not valid. In this note, we analyze in detail one specific example and highlight inherent problems associated with such an approach. We also present new methods for quantifying the spatial distribution and population density of natural populations from trapping data.
- Xia, X. 1992. Uncertainty of paternity can select against paternal
care. American Naturalist 139:1126-1129.
- Xia, X. and J. S. Millar. 1991. Genetic evidence of promiscuity in Peromyscus leucopus. Behavioral Ecology and Sociobiology 28:171-178.
Abstract We collected pregnant female Peromyscus leucopus from natural populations during the summer of
1987 and 1988 and allowed these females to give birth
to their field-conceived young in the laboratory. Blood
samples were taken by suborbital puncture and phenotypes of five genetic loci (esterase-1, transferrin, hemoglobin, albumin and 6-phosphogluconate dehydrogenase)
were studied using horizontal starch-gel electrophoresis
to detect multiple paternity in single litters. Only esterase-1 was found to be highly polymorphic, with four
alleles in samples of both years. One litter out of 29
in 1987 and 6 litters out of 32 in 1988 contained three
different paternal alleles and served as direct evidence
of multiple paternity in the field. The proportion of females engaging in multiple matings in natural populations of P. leucopus, assuming that all males were involved in every multiple mating, is 25%-100% (mean
68%). Because it is unlikely that all males are involved
in every multiple mating, the actual proportion of females engaging in multiple matings should be greater.
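The inference from three paternal alleles to multiple paternity rests on simple counting: a diploid male carries at most two alleles at a locus, so he can contribute at most two distinct paternal alleles to a litter. A minimal sketch of that lower bound follows; the function name `min_sires` is illustrative and not from the paper.

```python
def min_sires(distinct_paternal_alleles):
    """Smallest number of diploid sires that can account for the
    given number of distinct paternal alleles at a single locus."""
    if distinct_paternal_alleles < 0:
        raise ValueError("allele count cannot be negative")
    # Each sire supplies at most two alleles, so round up count / 2.
    return -(-distinct_paternal_alleles // 2)


print(min_sires(3))
```

A litter with three distinct paternal alleles therefore requires at least two sires, whereas one or two alleles are compatible with a single sire and cannot reveal multiple paternity at that locus.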
- Millar, J. S., Xia, X. and M. B. Norrie. 1991. Relationship among
reproductive status, nutritional status and food characteristics in a natural population
of Peromyscus maniculatus. Canadian Journal of Zoology 69:555-559.
- Xia, X. and J. S. Millar. 1990. Infestation of wild Peromyscus leucopus by bot fly larvae. Journal of Mammalogy 71:255-258.
Abstract The infestation of white-footed mice, Peromyscus leucopus, by the botfly, Cuterebra fontinella, is by physical contact. Botflies lay eggs in environs frequented by P. leucopus. The body temperature of the mouse triggers quick hatching of the botfly eggs and the larvae infest the mouse through the nose, eyes or wounds. The requirement of physical contact for infestation implies that mobile mice should encounter more botfly eggs and be infested more often than sedentary mice. Our previous investigation of the white-footed mouse showed that adult males are far more mobile than adult females, and therefore should be more infested. We recorded infested and uninfested mice by age and sex, and found that the rate of infestation differs highly significantly by age and sex. While adult males tend to be the most infested, males in general are more infested than females and adults are more infested than juveniles.
- Xia, X. and J. S. Millar. 1989. Dispersion of adult males in relation to female reproductive status in Peromyscus leucopus. Canadian Journal of Zoology 67:1047-1052.
Abstract We studied dispersion of adult male Peromyscus leucopus in relation to the stage of pregnancy of adult females in natural
populations monitored with Longworth live traps. Because postpartum mating is common in P. leucopus and days to parturition measures how far a female is from her next mating, we predicted that a female in early pregnancy (many days to parturition) would have fewer adult males in her neighbourhood than a female in late pregnancy (few days to parturition). Number of
adult males caught within 30 m of each adult female was recorded and number of days to parturition for each female was
obtained by bringing females back to the laboratory and allowing them to give birth. A negative relationship was found
between number of adult males in a female's neighbourhood and days to parturition of the female (r = -0.419, p < 0.01),
with the latter accounting for 8.8% of the variance in the former. These results support the hypothesis of a promiscuous mating
system in P. leucopus.
- El-Haddad, M., J. S. Millar and X. Xia. 1989. Offspring recognition
by male Peromyscus maniculatus. Journal of Mammalogy 69:811-813.
- Xia, X. and J. S. Millar. 1988. Paternal behaviour by Peromyscus leucopus in enclosures. Canadian Journal of Zoology 66:1184-1187.
Abstract Male Peromyscus leucopus are known to exhibit well-developed paternal behavior in confined cages, but electrophoresis
indicates promiscuity in this species. One explanation for this paradox is that the documented paternal behavioral patterns are
laboratory artifacts. We made nocturnal observations of parental behavior in 14 families of P. leucopus in large enclosures and
observed no paternal care. Males rarely entered the natal nest and when they did, remained in the nest for less than 2 min.
Thus, we consider direct paternal care such as licking, retrieving, and huddling unlikely. We also failed to observe any indirect
paternal investment such as nest building or food caching. The female in each of five pairs was very aggressive towards the
male, continuously chasing him throughout most of the observation periods. Another three females actively prevented their
mates from entering the natal nest. Paternal care probably does not contribute to the growth and survivorship of the young
under natural conditions.
- Xia, X. and J. S. Millar. 1987. Morphological variation in deer mice in relation to sex and habitat. Canadian Journal of Zoology 65:527-533.
Abstract Peromyscus maniculatus borealis were collected in two habitats with contrasting physiognomic features in the Kananaskis
Valley, Alberta, in the summer of 1983. We tested for differences between sexes and habitats using 4 body measurements (body
length, tail length, hind foot length, and ear length) and 10 cranial (including mandibular) measurements of 222 and 192 adult P.
m. borealis, respectively. Body measurements of 132 juveniles and five cranial (including mandibular) measurements from 124
juvenile skulls were analyzed similarly. When differences in body length were controlled, adult males had significantly longer
hind feet than adult females. The mandible was also significantly longer in adult males than in adult females. We interpreted the
longer hind foot length in adult males as an adaptation to provide greater mobility, and the differences in mandibular morphology
as a consequence of differential habitat use between the two sexes. No significant differences were found between juvenile males
and females. Sexual dimorphism appeared to be age dependent rather than size dependent when adults and juveniles of similar
body size were analyzed.
- Xia, X. and J. S. Millar. 1986. Sex-related dispersion in Peromyscus maniculatus. Canadian Journal of Zoology 64:933-936.
Abstract Snap trapping of small mammals in the Kananaskis Valley, Alberta, during the breeding seasons of 1982 and 1983 provided
data used to analyze sex-related dispersion patterns of adult Peromyscus maniculatus. A dispersion pattern of regular alternation
of males and females, within-sex avoidance, and strong between-sex association was found. Within-sex exclusion was better
exhibited by females than by males. These data are consistent with what would be expected for a promiscuous mating system.
Intraspecific resource partitioning between different sexes may occur through adjustments in spatial relationships.
|