XiaLab at University of Ottawa


Refereed journal papers

(Links to my books and my book chapters)

(My students in red italics)

  1. Freeman A; Xia X. 2024 Phylogeographic Reconstruction to Trace the Source Population of Asian Giant Hornet Caught in Nanaimo in Canada and Blaine in the USA. Life 14(3), 283.
  2. Abstract: The Asian giant hornet, Vespa mandarinia, is an invasive species that could potentially destroy the local honeybee industry in North America. It has been observed to nest in the coastal regions of British Columbia in Canada and Washington State in the USA. What is the source population of the immigrant hornets? The identification of the source population can shed light not only on the route of immigration but also on the similarity between the native habitat and the potential new habitat in the Pacific Northwest. We analyzed mitochondrial COX1 sequences of specimens sampled from multiple populations in China, the Republic of Korea, Japan, and the Russian Far East. V. mandarinia exhibits phylogeographic patterns, forming monophyletic clades for 16 specimens from China, six specimens from the Republic of Korea, and two specimens from Japan. The two mitochondrial COX1 sequences from Nanaimo, British Columbia, are identical to the two sequences from Japan. The COX1 sequence from Blaine, Washington State, clustered with those from the Republic of Korea and is identical to one sequence from the Republic of Korea. Our geophylogeny, which allows visualization of genetic variation over time and space, provides evolutionary insights on the evolution and speciation of three closely related vespine species (V. tropica, V. soror, and V. mandarinia), with the speciation events associated with the expansion of the distribution to the north.

  3. Kruglikov, A.; Xia, X. 2024 Mesophiles vs. Thermophiles: Untangling the Hot Mess of Intrinsically Disordered Proteins and Growth Temperature of Bacteria. Int. J. Mol. Sci. 25, 2000.
  4. Abstract: The dynamic structures and varying functions of intrinsically disordered proteins (IDPs) have made them fascinating subjects in molecular biology. Investigating IDP abundance in different bacterial species is crucial for understanding adaptive strategies in diverse environments. Notably, thermophilic bacteria have lower IDP abundance than mesophiles, and a negative correlation with optimal growth temperature (OGT) has been observed. However, the factors driving these trends are yet to be fully understood. We examined the types of IDPs present in both mesophiles and thermophiles alongside those unique to just mesophiles. The shared group of IDPs exhibits similar disorder levels in the two groups of species, suggesting that certain IDPs unique to mesophiles may contribute to the observed decrease in IDP abundance as OGT increases. Subsequently, we used quasi-independent contrasts to explore the relationship between OGT and IDP abundance evolution. Interestingly, we found no significant relationship between OGT and IDP abundance contrasts, suggesting that the evolution of lower IDP abundance in thermophiles may not be solely linked to OGT. This study provides a foundation for future research into the intricate relationship between IDP evolution and environmental adaptation. Our findings support further research on the adaptive significance of intrinsic disorder in bacterial species.

  5. Aris, P.; Mohamadzadeh, M.; Zarei, M.; Xia, X. 2024 Computational Design of Novel Griseofulvin Derivatives Demonstrating Potential Antibacterial Activity: Insights from Molecular Docking and Molecular Dynamics Simulation. Int. J. Mol. Sci. 25, 1039.
  6. Abstract: In response to the urgent demand for innovative antibiotics, theoretical investigations have been employed to design novel analogs. Because griseofulvin is a potential antibacterial agent, we have designed novel derivatives of griseofulvin to enhance its antibacterial efficacy and to evaluate their interactions with bacterial targets using in silico analysis. The results of this study reveal that the newly designed derivatives displayed the most robust binding affinities towards PBP2, tyrosine phosphatase, and FtsZ proteins. Additionally, molecular dynamics (MD) simulations underscored the notable stability of these derivatives when engaged with the FtsZ protein, as evidenced by root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), and solvent-accessible surface area (SASA). Importantly, this observation aligns with expectations, considering that griseofulvin primarily targets microtubules in eukaryotic cells, and FtsZ functions as the prokaryotic counterpart to microtubules. These findings collectively suggest the promising potential of griseofulvin and its designed derivatives as effective antibacterial agents, particularly concerning their interaction with the FtsZ protein. This research contributes to the ongoing exploration of novel antibiotics and may serve as a foundation for future drug development efforts.

  7. Xia, X. 2023. Horizontal Gene Transfer and Drug Resistance Involving Mycobacterium tuberculosis. Antibiotics 12, 1367
  8. Abstract: Mycobacterium tuberculosis (Mtb) acquires drug resistance at a rate comparable to that of bacterial pathogens that replicate much faster and have a higher mutation rate. One explanation for this rapid acquisition of drug resistance in Mtb is that drug resistance may evolve in other fast-replicating mycobacteria and then be transferred to Mtb through horizontal gene transfer (HGT). This paper aims to address three questions. First, does HGT occur between Mtb and other mycobacterial species? Second, what genes after HGT tend to survive in the recipient genome? Third, does HGT contribute to antibiotic resistance in Mtb? I present a conceptual framework for detecting HGT and analyze 39 ribosomal protein genes, 23S and 16S ribosomal RNA genes, as well as several genes targeted by antibiotics against Mtb, from 43 genomes representing all major groups within Mycobacterium. I also included mgtC and the insertion sequence IS6110 that were previously reported to be involved in HGT. The insertion sequence IS6110 shows clearly that the Mtb complex participates in HGT. However, the horizontal transferability of genes depends on gene function, as was previously hypothesized. HGT is not observed in functionally important genes such as ribosomal protein genes, rRNA genes, and other genes chosen as drug targets. This pattern can be explained by differential selection against functionally important and unimportant genes after HGT. Functionally unimportant genes such as IS6110 are not strongly selected against, so HGT events involving such genes are visible. For functionally important genes, a horizontally transferred diverged homologue from a different species may not work as well as the native counterpart, so the HGT event involving such genes is strongly selected against and eliminated, rendering them invisible to us. 
In short, while HGT involving the Mtb complex occurs, antibiotic resistance in the Mtb complex arose from mutations in those drug-targeted genes within the Mtb complex and was not gained through HGT.

  9. Xia, X. 2023. Identification of host receptors for viral entry and beyond: a perspective from the spike of SARS-CoV-2. Frontiers in Microbiology-Virology 14:1188249
  10. Abstract: Identification of the interaction between the host membrane receptor and viral receptor-binding domain (RBD) represents a crucial step for understanding viral pathophysiology and for developing drugs against pathogenic viruses. While all membrane receptors and carbohydrate chains could potentially be used as receptors for viruses, prioritized searches focus typically on membrane receptors that are known to have been used by the relatives of the pathogenic virus, e.g., ACE2 used as a receptor for SARS-CoV is a prioritized candidate receptor for SARS-CoV-2. An ideal receptor protein from a viral perspective is one that is highly expressed on the epithelial cell surface of mammalian respiratory or digestive tracts, strongly conserved in evolution so many mammalian species can serve as potential hosts, and functionally important so that its expression cannot be readily downregulated by the host in response to the infection. Experimental confirmation of host receptors includes (1) infection studies with cell cultures/tissues/organs with or without candidate receptor expression, (2) experimental determination of protein structure of the complex between the putative viral RBD and the candidate host receptor, and (3) experiments with mutant candidate receptor or homologues of the candidate receptor in other species. Successful identification of the host receptor opens the door for mechanism-based development of candidate drugs and vaccines and facilitates the inference of what other animal species are vulnerable to the viral pathogen. I illustrate these approaches with research on identification of the receptor and co-factors for SARS-CoV-2.

  11. Xia, X. Optimizing Protein Production in Therapeutic Phages against a Bacterial Pathogen, Mycobacterium abscessus. Drugs Drug Candidates 2023, 2, 189-209
  12. Abstract: Therapeutic phages against pathogenic bacteria should kill the bacteria efficiently before the latter evolve resistance against the phages. While many factors contribute to phage efficiency in killing bacteria, such as phage attachment to host, delivery of phage genome into the host, phage mechanisms against host defense, phage biosynthesis rate, and phage life cycle, this paper focuses only on the optimization of phage mRNA for efficient translation. Phage mRNA may not be adapted to its host translation machinery for three reasons: (1) mutation disrupting adaptation, (2) a recent host switch leaving no time for adaptation, and (3) multiple hosts with different translation machineries so that adaptation to one host implies suboptimal adaptation to another host. It is therefore important to optimize phage mRNAs in therapeutic phages. Theoretical and practical principles based on many experiments were developed and applied to phages engineered against a drug-resistant Mycobacterium abscessus that infected a young cystic fibrosis patient. I provide a detailed genomic evaluation of the three therapeutic phages with respect to translation initiation, elongation, and termination, by making use of both experimental results and highly expressed genes in the host. For optimizing phage genes against M. abscessus, the start codon should be AUG. The DtoStart distance from base-pairing between the Shine-Dalgarno (SD) sequence and the anti-SD sequence should be 14–16. The stop codon should be UAA. If UAG or UGA is used as a stop codon, they should be followed by nucleotide U. Start codon, SD, or stop codon should not be embedded in a secondary structure that may obscure the signals and interfere with their decoding. The optimization framework should be generally applicable to developing therapeutic phages against bacterial pathogens.
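
The start and stop codon rules stated in this abstract can be sketched as a small checker. This is only an illustration of the rules as summarized above, not the author's software: the function name `check_start_stop` and its parameters are hypothetical, and the DtoStart and secondary-structure criteria are deliberately left out because the abstract does not give enough detail to implement them.

```python
def check_start_stop(cds: str, next_base: str = "") -> list[str]:
    """Check a coding sequence against the start/stop rules in the abstract.

    cds       : sequence from start codon through stop codon (DNA or RNA letters)
    next_base : the nucleotide immediately after the stop codon, if known
    Returns a list of human-readable issues (empty list means no issue found).
    """
    seq = cds.upper().replace("T", "U")      # work in the RNA alphabet
    nxt = next_base.upper().replace("T", "U")
    issues = []

    if len(seq) < 6 or len(seq) % 3 != 0:
        issues.append("length is not a multiple of 3 (or too short)")
        return issues

    # Rule 1: the start codon should be AUG.
    if seq[:3] != "AUG":
        issues.append(f"start codon is {seq[:3]}, not AUG")

    # Rule 2: UAA is preferred; UAG or UGA should be followed by U.
    stop = seq[-3:]
    if stop not in {"UAA", "UAG", "UGA"}:
        issues.append(f"last codon {stop} is not a stop codon")
    elif stop != "UAA" and nxt != "U":
        issues.append(f"stop codon {stop} is not followed by U")

    return issues


if __name__ == "__main__":
    print(check_start_stop("ATGAAATAA"))        # toy sequence, no issues expected
    print(check_start_stop("GTGAAATGA", "C"))   # toy sequence with two issues
```

A sequence that passes this check with UAG or UGA followed by U still departs from the stated preference for UAA, so a stricter variant could flag that case as a warning.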

  13. Xia, X. 2023 Rooting and Dating Large SARS-CoV-2 Trees by Modeling Evolutionary Rate as a Function of Time. Viruses 15, 684.
  14. Abstract: Almost all published rooting and dating studies on SARS-CoV-2 assumed that (1) evolutionary rate does not change over time although different lineages can have different evolutionary rates (uncorrelated relaxed clock), and (2) a zoonotic transmission occurred in Wuhan and the culprit was immediately captured, so that only the SARS-CoV-2 genomes obtained in 2019 and the first few months of 2020 (resulting from the first wave of the global expansion from Wuhan) are sufficient for dating the common ancestor. Empirical data contradict the first assumption. The second assumption is not warranted because mounting evidence suggests the presence of early SARS-CoV-2 lineages cocirculating with the Wuhan strains. Large trees with SARS-CoV-2 genomes beyond the first few months are needed to increase the likelihood of finding SARS-CoV-2 lineages that might have originated at the same time as (or even before) those early Wuhan strains. I extended a previously published rapid rooting method to model evolutionary rate as a linear function instead of a constant. This substantially improves the dating of the common ancestor of sampled SARS-CoV-2 genomes. Based on two large trees with 83,688 and 970,777 high-quality and full-length SARS-CoV-2 genomes that contain complete sample collection dates, the common ancestor was dated to 12 June 2019 and 7 July 2019 with the two trees, respectively. The two data sets would give dramatically different or even absurd estimates if the rate was treated as a constant. The large trees were also crucial for overcoming the high rate-heterogeneity among different viral lineages. The improved method was implemented in the software TRAD.

  15. Aris, P.; Mohamadzadeh, M.; Kruglikov, A.; Askari Rad, M.; Xia, X. In Silico Exploration of Microtubule Agent Griseofulvin and Its Derivatives Interactions with Different Human β-Tubulin Isotypes. Molecules 2023, 28, 2384.
  16. Abstract: Tubulin isotypes are known to regulate microtubule stability and dynamics, as well as to play a role in the development of resistance to microtubule-targeted cancer drugs. Griseofulvin is known to disrupt cell microtubule dynamics and cause cell death in cancer cells through binding to tubulin protein at the taxol site. However, the detailed binding mode, the molecular interactions involved, and the binding affinities with different human β-tubulin isotypes are not well understood. Here, the binding affinities of human β-tubulin isotypes with griseofulvin and its derivatives were investigated using molecular docking, molecular dynamics simulation, and binding energy calculations. Multiple sequence analysis shows that the amino acid sequences are different in the griseofulvin binding pocket of βI isotypes. However, no differences were observed at the griseofulvin binding pocket of other β-tubulin isotypes. Our molecular docking results show the favorable interaction and significant affinity of griseofulvin and its derivatives toward human β-tubulin isotypes. Further, molecular dynamics simulation results show the structural stability of most β-tubulin isotypes upon binding to the G1 derivative. Taxol is an effective drug in breast cancer, but resistance to it is known. Modern anticancer treatments use a combination of multiple drugs to alleviate the problem of cancer cell resistance to chemotherapy. Our study provides a significant understanding of the molecular interactions of griseofulvin and its derivatives with β-tubulin isotypes, which may help to design potent griseofulvin analogues for specific tubulin isotypes in multidrug-resistant cancer cells in the future.

  17. Aris, P.; Wei, Y.; Mohamadzadeh, M.; Xia, X. Griseofulvin: An Updated Overview of Old and Current Knowledge. Molecules 2022, 27, 7034.
  18. Abstract: Griseofulvin is an antifungal polyketide metabolite produced mainly by ascomycetes. Since it was commercially introduced in 1959, griseofulvin has been used in treating dermatophyte infections. This fungistatic has gained increasing interest for multifunctional applications in the last decades due to its potential to disrupt mitosis and cell division in human cancer cells and arrest hepatitis C virus replication. In addition to these inhibitory effects, we and others found griseofulvin may enhance ACE2 function, contribute to vascular vasodilation, and improve capillary blood flow. Furthermore, molecular docking analysis revealed that griseofulvin and its derivatives have good binding potential with SARS-CoV-2 main protease, RNA-dependent RNA polymerase (RdRp), and spike protein receptor-binding domain (RBD), suggesting its inhibitory effects on SARS-CoV-2 entry and viral replication. These findings imply the repurposing potentials of the FDA-approved drug griseofulvin in designing and developing novel therapeutic interventions. In this review, we have summarized the available information from its discovery to recent progress in this growing field. Additionally, explored is the possible mechanism leading to rare hepatitis induced by griseofulvin. We found that griseofulvin and its metabolites, including 6-desmethylgriseofulvin (6-DMG) and 4-desmethylgriseofulvin (4-DMG), have favorable interactions with cytokeratin intermediate filament proteins (K8 and K18), ranging from −3.34 to −5.61 kcal mol−1. Therefore, they could be responsible for liver injury and Mallory body (MB) formation in hepatocytes of human, mouse, and rat treated with griseofulvin. Moreover, the stronger binding of griseofulvin to K18 in rodents than in human may explain the observed difference in the severity of hepatitis between rodents and human.

  19. Kruglikov A, Wei Y, Xia X 2022. Proteins from Thermophilic Thermus thermophilus Often Do Not Fold Correctly in a Mesophilic Expression System Such as Escherichia coli. ACS Omega 7:37797–37806.
  20. Abstract: The majority of protein structure studies use Escherichia coli (E. coli) and other model organisms as expression systems for other species’ genes. However, protein folding depends on cellular environment factors, such as chaperone proteins, cytoplasmic pH, temperature, and ionic concentrations. Because of differences in these factors, especially temperature and chaperones, native proteins in organisms such as extremophiles may fold improperly when they are expressed in mesophilic model organisms. Here we present a methodology of assessing the effects of using E. coli as the expression system on protein structures. We compared these effects between eight mesophilic bacteria and Thermus thermophilus (T. thermophilus), a thermophile, and found that differences are significantly larger for T. thermophilus. More specifically, helical secondary structures in T. thermophilus proteins are often replaced by coil structures in E. coli. Our results show unique directionality in misfolding when proteins in thermophiles are expressed in mesophiles. This indicates that extremophiles, such as thermophiles, require unique protein expression systems in protein folding studies.

  21. Xia, X. 2022. Multiple regulatory mechanisms for pH homeostasis in the gastric pathogen, Helicobacter pylori. Advances in Genetics 109:39-69.
  22. Abstract: Acid resistance in the gastric pathogen Helicobacter pylori requires the coordination of four essential processes to regulate urease activity. Firstly, urease expression above a base level needs to be finely tuned at different ambient pH. Secondly, as nickel is needed to activate urease, nickel homeostasis needs to be maintained by proteins that import and export nickel ions, and sequester, store and release nickel when needed. Thirdly, urease accessory proteins that activate urease activity by nickel insertion need to be expressed. Finally, a reliable source of urea needs to be maintained by both intrinsic and extrinsic sources of urea. Two-component systems (arsRS and flgRS), as well as a nickel response regulator (NikR), sense the change in pH and act on a variety of genes to accomplish the function of acid resistance without causing cellular overalkalization and nickel toxicity. Nickel storage proteins also feature built-in switches to store nickel at neutral pH and release nickel at low pH. This review summarizes the current status of H. pylori research and highlights a number of hypotheses that need to be tested.

  23. Jia, B.; Conner, R.L.; Penner, W.C.; Zheng, C.; Cloutier, S.; Hou, A.; Xia, X.; You, F.M. Quantitative Trait Locus Mapping of Marsh Spot Disease Resistance in Cranberry Common Bean (Phaseolus vulgaris L.). Int. J. Mol. Sci. 2022, 23, 7639.
  24. Abstract: Common bean (Phaseolus vulgaris L.) is a food crop that is an important source of dietary proteins and carbohydrates. Marsh spot is a physiological disorder that diminishes seed quality in beans. Prior research suggested that this disease is likely caused by manganese (Mn) deficiency during seed development and that marsh spot resistance is controlled by at least four genes. In this study, genetic mapping was performed to identify quantitative trait loci (QTL) and the potential candidate genes associated with marsh spot resistance. All 138 recombinant inbred lines (RILs) from a bi-parental population were evaluated for marsh spot resistance during five years from 2015 to 2019 in sandy and heavy clay soils in Morden, Manitoba, Canada. The RILs were sequenced using a genotyping by sequencing approach. A total of 52,676 single nucleotide polymorphisms (SNPs) were identified and filtered to generate a high-quality set of 2066 SNPs for QTL mapping. A genetic map based on 1273 SNP markers distributed on 11 chromosomes and covering 1599 cM was constructed. A total of 12 stable and 4 environment-specific QTL were identified using additive effect models, and an additional two epistatic QTL interacting with two of the 16 QTL were identified using an epistasis model. Genome-wide scans of the candidate genes identified 13 metal transport-related candidate genes co-locating within six QTL regions. In particular, two QTL (QTL.3.1 and QTL.3.2) with the highest R² values (21.8% and 24.5%, respectively) harbored several metal transport genes Phvul.003G086300, Phvul.003G092500, Phvul.003G104900, Phvul.003G099700, and Phvul.003G108900 in a large genomic region of 16.8–27.5 Mb on chromosome 3. These results advance the current understanding of the genetic mechanisms of marsh spot resistance in cranberry common bean and provide new genomic resources for use in genomics-assisted breeding and for candidate gene isolation and functional characterization.

  25. Rakesh, M., Aris-Brosou, S. & Xia, X. 2022. Testing alternative hypotheses on the origin and speciation of Hawaiian katydids. BMC Ecol Evo 22, 83
  26. Abstract: The Hawaiian Islands offer a unique and dynamic evolutionary theatre for studying origin and speciation as the islands themselves sequentially formed by erupting undersea volcanos, which would subsequently become dormant and extinct. Such dynamics have not been used to resolve the controversy surrounding the origin and speciation of Hawaiian katydids in the genus Banza, whose ancestor could be from either the Old World genera Ruspolia and Euconocephalus, or the New World Neoconocephalus. To address this question, we performed a chronophylogeographic analysis of Banza species together with close relatives from the Old and New Worlds. Based on extensive dated phylogeographic analyses of two mitochondrial genes (COX1 and CYTB), we show that our data are consistent with the interpretation that extant Banza species resulted from two colonization events, both by katydids from the Old World rather than from the New World. The first event was by an ancestral lineage of Euconocephalus about 6 million years ago (mya) after the formation of Nihoa about 7.3 mya, giving rise to B. nihoa. The second colonization event was by a sister lineage of Ruspolia dubia. The dating result suggests that this ancestral lineage first colonized an older island in the Hawaiian–Emperor seamount chain before the emergence of the Hawaiian Islands, but colonized Kauai after its emergence 5.8 mya. This second colonization gave rise to the rest of the Banza species in two major lineages, one on the older northwestern islands, and the other on the newer southwestern islands. Chronophylogeographic analyses with well-sampled taxa proved crucial for resolving phylogeographic controversies on the origin and evolution of species colonizing a new environment.

  27. Aris, P.; Mohamadzadeh, M.; Wei, Y.; Xia, X. 2022 In Silico Molecular Dynamics of Griseofulvin and Its Derivatives Revealed Potential Therapeutic Applications for COVID-19. Int. J. Mol. Sci. 23, 6889
  28. Abstract: Treatment options for Coronavirus Disease 2019 (COVID-19) remain limited, and the option of repurposing approved drugs with promising medicinal properties is of increasing interest in therapeutic approaches to COVID-19. Using computational approaches, we examined griseofulvin and its derivatives against four key anti-SARS-CoV-2 targets: main protease, RdRp, spike protein receptor-binding domain (RBD), and human host angiotensin-converting enzyme 2 (ACE2). Molecular docking analysis revealed that griseofulvin (CID 441140) has the highest docking score (–6.8 kcal/mol) with main protease of SARS-CoV-2. Moreover, griseofulvin derivative M9 (CID 144564153) proved the most potent inhibitor with −9.49 kcal/mol, followed by A3 (CID 46844082) with −8.44 kcal/mol against M protease and ACE2, respectively. Additionally, H bond analysis revealed that compound A3 formed the highest number of hydrogen bonds, indicating the strongest inhibitory efficacy against ACE2. Further, molecular dynamics (MD) simulation analysis revealed that griseofulvin and these derivatives are structurally stable. These findings suggest that griseofulvin and its derivatives may be considered when designing future therapeutic options for SARS-CoV-2 infection.

  29. Jia B, Conner RL, Khan N, Hou A, Xia X, You FM. 2022. Inheritance of marsh spot disease resistance in cranberry common bean (Phaseolus vulgaris L.). The Crop Journal 10(2):456-467
  30. Abstract: Common bean (Phaseolus vulgaris) is an annual legume crop that is grown worldwide for its edible dry seeds and tender pods. Marsh spot (MS) of the seeds is a physiogenic stress disease affecting seed quality in beans. Studies have suggested that this disease involves a nutritional disorder caused by manganese deficiency, but the inheritance of resistance to this disease has not been reported. A biparental genetic population composed of 138 recombinant inbred lines (RILs) was developed from a cross between an MS resistant cultivar ‘Cran09’ and an MS susceptible cultivar ‘Messina’. The 138 RILs and their two parents were evaluated for MS resistance during five consecutive years from 2015 to 2019 in sandy and heavy clay soils in Morden, Manitoba, Canada. The MS incidence (MSI) and the MS resistance index (MSRI) representing disease severity were shown to be highly correlated heritable traits with high broad-sense heritability values (H²) of 86.5% and 83.2%, respectively. No significant differences for MSI and MSRI were observed between the two soil types in all five- (MSI) or four-year (MSRI) data collection, but significant correlations among years were observed although MS resistance was moderately affected by year. The MSIs and MSRIs displayed a right-skewed distribution, indicating a mixed genetic model involving a few major genes and polygenes. Using the joint segregation analysis method, the same four major genes with additive-epistasis effects showed the best fit for both traits, explaining 84.4% and 85.3% of the phenotypic variance for MSI and MSRI, respectively. For both traits, the M1, M2, M3 and m4 acted as the favorable (resistant) alleles for the four genes where M and m represent two alleles of each gene. However, due to epistatic effects, only the individuals of the M1M2M3M4 haplotype appeared to be highly resistant, whereas those of the m1m2m3M4 haplotype were the most susceptible.
The m4 allele significantly suppressed the additive effects of M1M2M3 on resistance, but decreased susceptibility due to the additive effects of m1m2m3. Further quantitative trait locus (QTL) mapping is warranted to identify and validate individual genes and develop molecular markers for marker-assisted selection of resistant cultivars.

  31. Parisa Aris, Lihong Yan, Yulong Wei, Ying Chang, Bihong Shi, Xuhua Xia, 2022. Conservation of griseofulvin genes in the gsf gene cluster among fungal genomes. G3 Genes|Genomes|Genetics 12(2)jkab399
  32. Abstract: The polyketide griseofulvin is a natural antifungal compound and research in griseofulvin has been key in establishing our current understanding of polyketide biosynthesis. Nevertheless, the griseofulvin gsf biosynthetic gene cluster (BGC) remains poorly understood in most fungal species, including Penicillium griseofulvum where griseofulvin was first isolated. To elucidate essential genes involved in griseofulvin biosynthesis, we performed third-generation sequencing to obtain the genome of P. griseofulvum strain D-756. Furthermore, we gathered publicly available genomes of 11 other fungal species in which the gsf gene cluster was identified. In a comparative genome analysis, we annotated and compared the gsf BGC of all 12 fungal genomes. Our findings show no gene rearrangements at the gsf BGC. Furthermore, seven gsf genes are conserved in most genomes surveyed whereas the remaining six were poorly conserved. This study provides new insights into differences between gsf BGC and suggests that seven gsf genes are essential in griseofulvin production.

  33. Xia, X. 2021 Post-Alignment Adjustment and Its Automation. Genes, 12, 1809. https://doi.org/10.3390/genes12111809
  34. Abstract: Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
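
For readers unfamiliar with position weight matrices, the general idea can be sketched as follows. This is a generic log-odds PWM with a uniform background and a pseudocount, both of which are assumptions made here for illustration; the function names `build_pwm` and `pwm_score` are hypothetical, and the abstract does not specify how the paper applies the PWM during alignment refinement.

```python
import math
from collections import Counter


def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log2-odds PWM from equal-length aligned sequences.

    Gap or ambiguous characters are ignored when counting a column.
    A uniform background frequency is assumed (an illustrative choice).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned if s[i] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pwm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pwm


def pwm_score(pwm, segment):
    """Sum the column scores of a segment; unknown characters contribute 0."""
    return sum(col[ch] for col, ch in zip(pwm, segment) if ch in col)


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGT", "ACGA"]   # made-up toy columns
    pwm = build_pwm(toy_alignment)
    print(pwm_score(pwm, "ACGT"), pwm_score(pwm, "TGCA"))
```

Because every column is estimated from all sequences at once, a candidate placement of a segment can be scored against the whole alignment in a single pass, which is consistent with the two advantages listed in the abstract.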

  35. Xia X. 2021. Dating the Common Ancestor from an NCBI Tree of 83688 High-Quality and Full-Length SARS-CoV-2 Genomes. Viruses 13(9),1790 https://www.mdpi.com/1262774
  36. Abstract: All dating studies involving SARS-CoV-2 are problematic. Previous studies have dated the most recent common ancestor (MRCA) between SARS-CoV-2 and its close relatives from bats and pangolins. However, the evolutionary rate thus derived is expected to differ from the rate estimated from sequence divergence of SARS-CoV-2 lineages. Here, I present dating results for the first time from a large phylogenetic tree with 86,582 high-quality full-length SARS-CoV-2 genomes. The tree contains 83,688 genomes with full specification of collection time. Such a large tree spanning a period of about 1.5 years offers an excellent opportunity for dating the MRCA of the sampled SARS-CoV-2 genomes. The MRCA is dated 16 August 2019, with the evolutionary rate estimated to be 0.05526 mutations/genome/day. The Pearson correlation coefficient (r) between the root-to-tip distance (D) and the collection time (T) is 0.86295. The NCBI tree also includes 10 SARS-CoV-2 genomes isolated from cats, collected over roughly the same time span as human COVID-19 infection. The MRCA from these cat-derived SARS-CoV-2 is dated 30 July 2019, with r = 0.98464. While the dating method is well known, I have included detailed illustrations so that anyone can repeat the analysis and obtain the same dating results. With 16 August 2019 as the date of the MRCA of sampled SARS-CoV-2 genomes, archived samples from respiratory or digestive tracts collected around or before 16 August 2019, or those that are not descendants of the existing SARS-CoV-2 lineages, should be particularly valuable for tracing the origin of SARS-CoV-2.
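
The relationship between root-to-tip distance (D) and collection time (T) mentioned in this abstract is commonly exploited by a simple linear regression: the slope estimates the evolutionary rate, and the time at which the fitted line reaches D = 0 estimates the date of the common ancestor. The sketch below illustrates that general technique with ordinary least squares on made-up toy data; the function name `date_mrca` is hypothetical, and nothing here reproduces the paper's actual data, rooting procedure, or software.

```python
from datetime import date, timedelta


def date_mrca(collection_dates, root_to_tip):
    """Estimate rate (distance units per day) and MRCA date by regressing D on T."""
    if len(collection_dates) != len(root_to_tip) or len(collection_dates) < 2:
        raise ValueError("need at least two (date, distance) pairs of equal length")
    origin = min(collection_dates)
    t = [(d - origin).days for d in collection_dates]   # days since earliest sample
    n = len(t)
    mean_t = sum(t) / n
    mean_d = sum(root_to_tip) / n
    sxx = sum((x - mean_t) ** 2 for x in t)
    sxy = sum((x - mean_t) * (y - mean_d) for x, y in zip(t, root_to_tip))
    if sxx == 0:
        raise ValueError("all samples share one collection date")
    rate = sxy / sxx                      # slope of D on T
    if rate <= 0:
        raise ValueError("non-positive slope: no temporal signal to date the root")
    intercept = mean_d - rate * mean_t
    t_root = -intercept / rate            # time at which the fitted D equals 0
    return rate, origin + timedelta(days=round(t_root))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    dates = [date(2020, 1, 1), date(2020, 1, 11), date(2020, 1, 21)]
    distances = [5.0, 5.5, 6.0]
    rate, mrca = date_mrca(dates, distances)
    print(rate, mrca)
```

With real data the distances would come from a rooted tree, and the choice of root itself affects the fit, so this regression is only the final step of such an analysis.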

  37. Xia, X. 2021. Detailed Dissection and Critical Evaluation of the Pfizer/BioNTech and Moderna mRNA Vaccines. Vaccines (Basel) 9(7), 734.
  38. Abstract: The design of Pfizer/BioNTech and Moderna mRNA vaccines involves many different types of optimizations. Proper optimization of vaccine mRNA can reduce dosage required for each injection leading to more efficient immunization programs. The mRNA components of the vaccine need to have a 5’-UTR to load ribosomes efficiently onto the mRNA for translation initiation, optimized codon usage for efficient translation elongation, and optimal stop codon for efficient translation termination. Both 5’-UTR and the downstream 3’-UTR should be optimized for mRNA stability. The replacement of uridine by N1-methylpseudouridine (Ψ) complicates some of these optimization processes because Ψ is more versatile in wobbling than U. Different optimizations can conflict with each other, and compromises would need to be made. I highlight the similarities and differences between Pfizer/BioNTech and Moderna mRNA vaccines and discuss the advantage and disadvantage of each to facilitate future vaccine improvement. In particular, I point out a few optimizations in the design of the two mRNA vaccines that have not been performed properly.

  39. Jia, B., Waldo, P., Conner, R., Moumen, I., Khan, N., Xia, X., Hou, A., You, F. 2021. Marsh Spot Disease and Its Causal Factor, Manganese Deficiency in Plants: A Historical and Prospective Review. Agricultural Sciences, 12, 928-948 doi: 10.4236/as.2021.129060
  40. Abstract: This review provides an examination of the marsh spot disease in beans and the roles played by its causal factor, manganese (Mn) deficiency. The discovery of the marsh spot disease, its relation with Mn deficiency, and how it can be treated are discussed. Mn serves as a cofactor and a catalyst in various metabolic processes in different cell compartments, such as the oxygen-evolving complex of photosystem II (PSII) or reactive oxygen species scavenging. Some major quantitative trait loci (QTL) and putative candidate genes associated with Mn content in plants, especially in plant seeds, have been identified. Marsh spot disease in cranberry common bean is controlled by several major genes with significant additive and epistatic effects. They provide valuable clues for QTL candidate gene prediction and an improved understanding of the genetic mechanisms responsible for marsh spot resistance in plants.

  41. Tehfe, A.; Roseshter, T.; Wei, Y.; Xia, X. Does Saccharomyces cerevisiae Require Specific Post-Translational Silencing against Leaky Translation of Hac1up? Microorganisms 2021, 9, 620.
  42. Abstract: HAC1 encodes a key transcription factor that transmits the unfolded protein response (UPR) from the endoplasmic reticulum (ER) to the nucleus and regulates downstream UPR genes in Saccharomyces cerevisiae. In response to the accumulation of unfolded proteins in the ER, Ire1p oligomers splice HAC1 pre-mRNA (HAC1u) via a non-conventional process and allow the spliced HAC1 (HAC1i) to be translated efficiently. However, leaky splicing and translation of HAC1u may occur in non-UPR cells to induce undesirable UPR. To control accidental UPR activation, multiple fail-safe mechanisms have been proposed to prevent leaky HAC1 splicing and translation and to facilitate rapid degradation of translated Hac1up and Hac1ip. Among proposed regulatory mechanisms is a degron sequence encoded at the 5′ end of the HAC1 intron that silences Hac1up expression. To investigate the necessity of an intron-encoded degron sequence that specifically targets Hac1up for degradation, we employed publicly available transcriptomic data to quantify leaky HAC1 splicing and translation in UPR-induced and non-UPR cells. As expected, we found that HAC1u is only efficiently spliced into HAC1i and efficiently translated into Hac1ip in UPR-induced cells. However, our analysis of ribosome profiling data confirmed frequent occurrence of leaky translation of HAC1u regardless of UPR induction, demonstrating the inability of translation fail-safe to completely inhibit Hac1up production. Additionally, among 32 yeast HAC1 surveyed, the degron sequence is highly conserved by Saccharomyces yeast but is poorly conserved by all other yeast species. Nevertheless, the degron sequence is the most conserved HAC1 intron segment in yeasts. These results suggest that the degron sequence may indeed play an important role in mitigating the accumulation of Hac1up to prevent accidental UPR activation in the Saccharomyces yeast.

  43. Kruglikov A, Rakesh M, Wei Y, Xia X. 2021. Applications of Protein Secondary Structure Algorithms in SARS-CoV-2 Research. J Proteome Res 20:1457-1463
  44. Abstract: Since the outset of COVID-19, the pandemic has prompted immediate global efforts to sequence SARS-CoV-2, and over 450 000 complete genomes have been publicly deposited over the course of 12 months. Despite this, comparative nucleotide and amino acid sequence analyses often fall short in answering key questions in vaccine design. For example, the binding affinity between different ACE2 receptors and SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity at ACE2 contact sites because protein structure similarities are not fully reflected by amino acid sequence similarities. To comprehensively compare protein homology, secondary structure (SS) analysis is required. While protein structure is slow and difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS structure may serve as a viable proxy to gain biological insight. Here we review algorithms and information used in predicting protein SS to highlight its potential application in pandemics research. We also showed examples of how SS predictions can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses. As computational tools are much faster than wet-lab experiments, these applications can be important for research especially in times when quickly obtained biological insights can help in speeding up response to pandemics.

  45. Wei Y, Aris P, Farookhi H & Xia X. 2021 Predicting mammalian species at risk of being infected by SARS‑CoV‑2 from an ACE2 perspective. Scientific Reports 11:1702
  46. Abstract: SARS‑CoV‑2 can transmit efficiently in humans, but it is less clear which other mammals are at risk of being infected. SARS‑CoV‑2 encodes a Spike (S) protein that binds to human ACE2 receptor to mediate cell entry. A species with a human‑like ACE2 receptor could therefore be at risk of being infected by SARS‑CoV‑2. We compared between 132 mammalian ACE2 genes and between 17 coronavirus S proteins. We showed that while global similarities reflected by whole ACE2 gene alignments are poor predictors of high‑risk mammals, local similarities at key S protein‑binding sites highlight several high‑risk mammals that share good ACE2 homology with human. Bats are likely reservoirs of SARS‑CoV‑2, but there are other high‑risk mammals that share better ACE2 homologies with human. Both SARS‑CoV‑2 and SARS‑CoV are closely related to bat coronavirus. Yet, among host‑specific coronaviruses infecting high‑risk mammals, key ACE2‑binding sites on S proteins share highest similarities between SARS‑CoV‑2 and Pangolin‑CoV and between SARS‑CoV and Civet‑CoV. These results suggest that direct coronavirus transmission from bat to human is unlikely, and that rapid adaptation of a bat SARS‑like coronavirus in different high‑risk intermediate hosts could have allowed it to acquire distinct high binding potential between S protein and human‑like ACE2 receptors.

  47. Xia, X. 2021. Domains and Functions of Spike Protein in SARS-Cov-2 in the Context of Vaccine Design. Viruses 13(1), 109
  48. Abstract: The spike protein in SARS-CoV-2 (SARS-2-S) interacts with the human ACE2 receptor to gain entry into a cell to initiate infection. Both Pfizer/BioNTech’s BNT162b2 and Moderna’s mRNA-1273 vaccine candidates are based on stabilized mRNA encoding prefusion SARS-2-S that can be produced after the mRNA is delivered into the human cell and translated. SARS-2-S is cleaved into S1 and S2 subunits, with S1 serving the function of receptor-binding and S2 serving the function of membrane fusion. Here, I dissect in detail the various domains of SARS-2-S and their functions discovered through a variety of different experimental and theoretical approaches to build a foundation for a comprehensive mechanistic understanding of how SARS-2-S works to achieve its function of mediating cell entry and subsequent cell-to-cell transmission. The integration of structure and function of SARS-2-S in this review should enhance our understanding of the dynamic processes involving receptor binding, multiple cleavage events, membrane fusion, viral entry, as well as the emergence of new viral variants. I highlighted the relevance of structural domains and dynamics to vaccine development, and discussed reasons for the spike protein to be frequently featured in the conspiracy theory claiming that SARS-CoV-2 is artificially created.

  49. Wei Y, Silke JR, Aris P, Xia X. 2020. Coronavirus genomes carry the signatures of their habitats. PLoS One 15(12): e0244025
  50. Abstract: Coronaviruses such as SARS-CoV-2 regularly infect host tissues that express antiviral proteins (AVPs) in abundance. Understanding how they evolve to adapt or evade host immune responses is important in the effort to control the spread of infection. Two AVPs that may shape viral genomes are the zinc finger antiviral protein (ZAP) and the apolipoprotein B mRNA editing enzyme-catalytic polypeptide-like 3 (APOBEC3). The former binds to CpG dinucleotides to facilitate the degradation of viral transcripts while the latter frequently deaminates C into U residues which could generate notable viral sequence variations. We tested the hypothesis that both APOBEC3 and ZAP impose selective pressures that shape the genome of an infecting coronavirus. Our investigation considered a comprehensive number of publicly available genomes for seven coronaviruses (SARS-CoV-2, SARS-CoV, and MERS infecting Homo sapiens, Bovine CoV infecting Bos taurus, MHV infecting Mus musculus, HEV infecting Sus scrofa, and CRCoV infecting Canis lupus familiaris). We show that coronaviruses that regularly infect tissues with abundant AVPs have CpG-deficient and U-rich genomes; whereas those that do not infect tissues with abundant AVPs do not share these sequence hallmarks. Among the coronaviruses surveyed herein, CpG is most deficient in SARS-CoV-2 and a temporal analysis showed a marked increase in C to U mutations over four months of SARS-CoV-2 genome evolution. Furthermore, the preferred motifs in which these C to U mutations occur are the same as those subjected to APOBEC3 editing in HIV-1. These results suggest that both ZAP and APOBEC3 shape the SARS-CoV-2 genome: ZAP imposes a strong CpG avoidance, and APOBEC3 constantly edits C to U. Evolutionary pressures exerted by host immune systems onto viral genomes may motivate novel strategies for SARS-CoV-2 vaccine development.

  51. Xia, X. 2020 Beyond Trees: Regulons and Regulatory Motif Characterization. Genes 11, 995
  52. Abstract: Trees and their seeds regulate their germination, growth, and reproduction in response to environmental stimuli. These stimuli, through signal transduction, trigger transcription factors that alter the expression of various genes leading to the unfolding of the genetic program. A regulon is conceptually defined as a set of target genes regulated by a transcription factor by physically binding to regulatory motifs to accomplish a specific biological function, such as the CO-FT regulon for flowering timing and fall growth cessation in trees. Only with a clear characterization of regulatory motifs, can candidate target genes be experimentally validated, but motif characterization represents the weakest feature of regulon research, especially in tree genetics. I review here relevant experimental and bioinformatics approaches in characterizing transcription factors and their binding sites, outline problems in tree regulon research, and demonstrate how transcription factor databases can be effectively used to aid the characterization of tree regulons.

  53. Xia, X. 2020 Improving Phylogenetic Signals of Mitochondrial Genes Using a New Method of Codon Degeneration. Life 10, 171.
  54. Abstract: Recovering deep phylogeny is challenging with animal mitochondrial genes because of their rapid evolution. Codon degeneration decreases the phylogenetic noise and bias by aiming to achieve two objectives: (1) alleviate the bias associated with nucleotide composition, which may lead to homoplasy and long-branch attraction, and (2) reduce differences in the phylogenetic results between nucleotide-based and amino acid (AA)-based analyses. The discrepancy between nucleotide-based analysis and AA-based analysis is partially caused by some synonymous codons that differ more from each other at the nucleotide level than from some nonsynonymous codons, e.g., Leu codon TTR in the standard genetic code is more similar to Phe codon TTY than to synonymous CTN codons. Thus, nucleotide similarity conflicts with AA similarity. There are many such examples involving other codon families in various mitochondrial genetic codes. Proper codon degeneration will make synonymous codons more similar to each other at the nucleotide level than they are to nonsynonymous codons. Here, I illustrate a “principled” codon degeneration method that achieves these objectives. The method was applied to resolving the mammalian basal lineage and phylogenetic position of rheas among ratites. The codon degeneration method was implemented in the user-friendly and freely available DAMBE software for all known genetic codes (genetic codes 1 to 33).

  55. Xia, X. 2020 Drug efficacy and toxicity prediction: an innovative application of transcriptomic data. Cell Biology and Toxicology 36(6):591-602
  56. Abstract: Drug toxicity and efficacy are difficult to predict partly because they are both poorly defined, which I aim to remedy here from a transcriptomic perspective. There are two major categories of drugs: (1) restorative drugs aiming to restore an abnormal cell, tissue, or organ to normal function (e.g., restoring normal membrane function of epithelial cells in cystic fibrosis), and (2) disruptive drugs aiming to kill pathogens or malignant cells. These two types of drugs require different definitions of efficacy and toxicity. I outlined rationales for defining transcriptomic efficacy and toxicity and illustrated numerically their application with two sets of transcriptomic data, one for restorative drugs (treating cystic fibrosis with lumacaftor/ivacaftor aiming to restore the cellular function of epithelial cells) and the other for disruptive drugs (treating acute myeloid leukemia with prexasertib). The conceptual framework presented will help and sensitize researchers to collect data required for determining drug toxicity.

  57. Xia, X. 2020 Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Molecular Biology and Evolution 37:2699–2705.
  58. Abstract: Wild mammalian species, including bats, constitute the natural reservoir of Betacoronavirus (including SARS, MERS, and the deadly SARS-CoV-2). Different hosts or host tissues provide different cellular environments, especially different antiviral and RNA modification activities that can alter RNA modification signatures observed in the viral RNA genome. The zinc finger antiviral protein (ZAP) binds specifically to CpG dinucleotides and recruits other proteins to degrade a variety of viral RNA genomes. Many mammalian RNA viruses have evolved CpG deficiency. Increasing CpG dinucleotides in these low-CpG viral genomes in the presence of ZAP consistently leads to decreased viral replication and virulence. Because ZAP exhibits tissue-specific expression, viruses infecting different tissues are expected to have different CpG signatures, suggesting a means to identify viral tissue-switching events. I show that SARS-CoV-2 has the most extreme CpG deficiency in all known Betacoronavirus genomes. This suggests that SARS-CoV-2 may have evolved in a new host (or new host tissue) with high ZAP expression. A survey of CpG deficiency in viral genomes identified a virulent canine coronavirus (Alphacoronavirus) as possessing the most extreme CpG deficiency, comparable to that observed in SARS-CoV-2. This suggests that the canine tissue infected by the canine coronavirus may provide a cellular environment strongly selecting against CpG. Thus, viral surveys focused on decreasing CpG in viral RNA genomes may provide important clues about the selective environments and viral defenses in the original hosts.

  59. Katherine E Noah, Jiasheng Hao, Luyan Li, Xiaoyan Sun, Brian Foley, Qun Yang and Xuhua Xia. 2020 Major Revisions in Arthropod Phylogeny Through Improved Supermatrix, With Support for Two Possible Waves of Land Invasion by Chelicerates. Evolutionary Bioinformatics 16:1-12
  60. Abstract: Deep phylogeny involving arthropod lineages is difficult to recover because the erosion of phylogenetic signals over time leads to unreliable multiple sequence alignment (MSA) and subsequent phylogenetic reconstruction. One way to alleviate the problem is to assemble a large number of gene sequences to compensate for the weakness in each individual gene. Such an approach has led to many robustly supported but contradictory phylogenies. A close examination shows that the supermatrix approach often suffers from two shortcomings. The first is that MSA is rarely checked for reliability and, as will be illustrated, can be poor. The second is that, to alleviate the problem of homoplasy at the third codon position of protein-coding genes due to convergent evolution of nucleotide frequencies, phylogeneticists may remove or degenerate the third codon position but may do it improperly and introduce new biases. We performed extensive reanalysis of one of such “big data” sets to highlight these two problems, and demonstrated the power and benefits of correcting or alleviating these problems. Our results support a new group with Xiphosura and Arachnopulmonata (Tetrapulmonata + Scorpiones) as sister taxa. This favors a new hypothesis in which the ancestor of Xiphosura and the extinct Eurypterida (sea scorpions, of which many later forms lived in brackish or freshwater) returned to the sea after the initial chelicerate invasion of land. Our phylogeny is supported even with the original data but processed with a new “principled” codon degeneration. We also show that removing the 1673 codon sites with both AGN and UCN codons (encoding serine) in our alignment can partially reconcile discrepancies between nucleotide-based and AA-based tree, partly because two sequences, one with AGN and the other with UCN, would be identical at the amino acid level but quite different at the nucleotide level.

  61. Xia X, Moriyama EN, Gu X. 2020. Editorial for the special issue “RNA-Seq: Methods and applications” Methods 176:1-3.
  62. Abstract: RNA-Seq is a powerful tool in molecular and evolutionary biology. A well-built tool extends our vision, just like a microscope or a telescope, so that we can see patterns of nature that would otherwise be hidden from us [1, p. xiii]. With proper experimental design, RNA-Seq allows us to see the dynamics of cellular processes at nucleotide resolution......

  63. Xia, X. 2020. RNA-Seq approach for accurate characterization of splicing efficiency of yeast introns. Methods 176:25-33
  64. Abstract: Introns in different genes, or even different introns within the same gene, often have different splice sites and differ in splicing efficiency (SE). One expects mass-transcribed genes to have introns with higher SE than weakly transcribed genes. However, such a simple expectation cannot be tested directly because variable SE for these genes is often not measured. Mechanistically, SE should depend on signal strength at key splice sites (SS) such as 5'SS, 3'SS and branchpoint site (BPS), i.e., SE = F(5'SS, 3'SS, BPS). However, without SE, we again cannot model how these splice sites contribute to SE. Here I present an RNA-Seq approach to quantify SE for each of the 304 introns in yeast (Saccharomyces cerevisiae) genes, including 24 in the 5'UTR, by measuring 1) number of reads mapped to exon-exon junctions (NEE) as a proxy for the abundance of spliced form, and 2) number of reads mapped to exon-intron junction (NEI5 and NEI3 at 5' and 3' ends of intron) as a proxy for the abundance of unspliced form. The total mRNA is NTotal = NEE + p*NEI5 + (1-p)*NEI3, with the simplest p = 0.5 but statistical methods were presented to estimate p from data. An estimated p is needed because NEI5 is expected to be smaller than NEI3 due to 1) step 1 splicing occurs before step 2 so EI5 is broken before EI3, 2) enrichment of poly(A) mRNA by oligo-dT, and 3) 5' degradation. SE is defined as the proportion (NEE/NTotal). Application of the method shows that ribosomal protein messages are efficiently and mostly cotranscriptionally spliced. Yeast genes with long introns are also spliced efficiently. HAC1/YFL031W is poorly spliced partly because its splicing involves a nonspliceosome mechanism and partly because Ire1p, which participates in splicing HAC1, is hardly expressed. Many putative yeast genes have low SE, and some splice sites are incorrectly annotated.
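
  The two formulas in the abstract above, NTotal = NEE + p*NEI5 + (1-p)*NEI3 and SE = NEE/NTotal, can be written out as a short sketch. The function name and the read counts are hypothetical, and p is simply fixed at the default 0.5 rather than estimated from data as the paper describes.

```python
def splicing_efficiency(n_ee, n_ei5, n_ei3, p=0.5):
    """Return SE = NEE / NTotal, with NTotal = NEE + p*NEI5 + (1-p)*NEI3.

    n_ee  : reads mapped to the exon-exon junction (spliced form)
    n_ei5 : reads mapped to the exon-intron junction at the 5' end of the intron
    n_ei3 : reads mapped to the exon-intron junction at the 3' end of the intron
    p     : weight on NEI5; 0.5 is the simplest choice named in the abstract
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    n_total = n_ee + p * n_ei5 + (1.0 - p) * n_ei3
    if n_total == 0:
        raise ValueError("no reads mapped to any junction of this intron")
    return n_ee / n_total


# Made-up counts for one intron: NTotal = 90 + 0.5*4 + 0.5*16 = 100
se = splicing_efficiency(n_ee=90, n_ei5=4, n_ei3=16)
```

  Because NEI5 is expected to be smaller than NEI3, a p other than 0.5 changes NTotal and therefore SE, which is why the abstract calls for estimating p.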

  65. Wei, Y. and X. Xia (2019). "Unique Shine-Dalgarno sequences in Cyanobacteria and chloroplasts reveal evolutionary differences in their translation initiation." Genome Biology and Evolution 11(11):3194-3206.
  66. Abstract: Microorganisms require efficient translation to grow and replicate rapidly, and translation is often rate-limited by initiation. A prominent feature that facilitates translation initiation in bacteria is the Shine-Dalgarno (SD) sequence. However, there is much debate over its conservation in Cyanobacteria and in chloroplasts which presumably originated from endosymbiosis of ancient Cyanobacteria. Elucidating the utilization of SD sequences in Cyanobacteria and in chloroplasts is therefore important to understand whether 1) SD role in Cyanobacterial translation has been reduced prior to chloroplast endosymbiosis or 2) translation in Cyanobacteria and in plastid has been subjected to different evolutionary pressures. To test these alternatives, we employed genomic, proteomic, and transcriptomic data to trace differences in SD usage between Synechocystis species, Microcystis aeruginosa, cyanophages, Nicotiana tabacum chloroplast, and Arabidopsis thaliana chloroplast. We corrected their mis-annotated 16S rRNA 3’ terminus using an RNA-Seq-based approach to determine their SD/anti-SD locational constraints using an improved measurement DtoStart. We found that cyanophages well mimic Cyanobacteria in SD usage because both have been under the same selection pressure for SD-mediated initiation. In contrast, chloroplasts lost this similarity because the need for SD-facilitated initiation has been reduced in plastids having much reduced genome size and different ribosomal proteins as a result of host-symbiont co-evolution. Consequently, SD sequence significantly increases protein expression in Cyanobacteria but not in chloroplasts, and only Cyanobacterial genes compensate for a lack of SD sequence by having weaker secondary structures at the 5’ UTR. Our results suggest different evolutionary pressures operate on translation initiation in Cyanobacteria and in chloroplast.

  67. Xia, X. (2019). Starless bias and parameter-estimation bias in the likelihood-based phylogenetic method. AIMS Genetics 5(4):212-223.
  68. Abstract: I analyzed various site pattern combinations in a 4-OTU case to identify sources of starless bias and parameter-estimation bias in likelihood-based phylogenetic methods, and reported three significant contributions. First, the likelihood method is counterintuitive in that it may not generate a star tree with sequences that are equidistant from each other. This behaviour, dubbed starless bias, happens in a 4-OTU tree when there is an excess (i.e., more than expected from a star tree and a substitution model) of conflicting phylogenetic signals supporting the three resolved topologies equally. Special site pattern combinations leading to rejection of a star tree, when sequences are equidistant from each other, were identified. Second, fitting gamma distribution to model rate heterogeneity over sites is strongly confounded with tree topology, especially in conjunction with the starless bias. I present examples to show dramatic differences in the estimated shape parameter α between a star tree and a resolved tree. There may be no rate heterogeneity over sites (with the estimated α > 10000) when a star tree is imposed, but α < 1 (suggesting strong rate heterogeneity over sites) when an (incorrect) resolved tree is imposed. Thus, the dependence of “rate heterogeneity” on tree topology implies that “rate heterogeneity” is not a sequence-specific feature, cautioning against interpreting a small α to mean that some sites are under strong purifying selection and others not. Third, because there is no existing (and working) likelihood method for evaluating a star tree with continuous gamma-distributed rate, I have implemented the method for JC69 in a self-contained R script for a four-OTU tree (star or resolved), in addition to another R script assuming a constant rate over sites. These R scripts should be useful for teaching and exploring likelihood methods in phylogenetics.

  69. Xia, X. 2019. Translation Control of HAC1 by Regulation of Splicing in Saccharomyces cerevisiae. Int. J. Mol. Sci. 20(12), 2860
  70. Abstract: Hac1p is a key transcription factor regulating the unfolded protein response (UPR) induced by abnormal accumulation of unfolded/misfolded proteins in the endoplasmic reticulum (ER) in Saccharomyces cerevisiae. The accumulation of unfolded/misfolded proteins is sensed by protein Ire1p, which then undergoes trans-autophosphorylation and oligomerization into discrete foci on the ER membrane. HAC1 pre-mRNA, which is exported to the cytoplasm but is blocked from translation by its intron sequence looping back to its 5’UTR to form base-pair interaction, is transported to the Ire1p foci to be spliced, guided by a cis-acting bipartite element at its 3’UTR (3’BE). Spliced HAC1 mRNA can be efficiently translated. The resulting Hac1p enters the nucleus and activates, together with coactivators, a large number of genes encoding proteins such as protein chaperones to restore and maintain ER homeostasis and secretory protein quality control. This review details the translation regulation of Hac1p production, mediated by the nonconventional splicing, in the broad context of translation control and summarizes the evolution and diversification of the UPR signaling pathway among fungal, metazoan and plant lineages.

  71. Xia, X. (2019). Optimizing Phage Translation Initiation. OBM Genetics 3(4):16.
  72. Abstract: Phage as an anti-bacterial agent must be efficient in killing bacteria, and consequently needs to replicate efficiently. Protein production is a limiting step in replication in almost all forms of life, including phages. Efficient protein production depends on the efficiency of translation initiation, elongation and termination, with translation initiation often being rate limiting. Initiation signals such as Shine-Dalgarno (SD) sequences and start codon are decoded by anti-SD sequences and initiation tRNA, respectively. While the decoding machinery cannot be readily modified, the signals can be engineered to increase the efficiency of their decoding. Here I review our understanding of the translation machinery to facilitate the engineering of optimal translation initiation signals for facilitating the design of phage protein-coding genes, including 1) accurate characterization of the 3' end of 16S rRNA by using RNA-Seq data, 2) identification of the optimal SD/aSD interaction, and 3) reduction of secondary structure in sequences flanking the start codon.

  73. Xia, X. 2019. PGT: Visualizing temporal and spatial biogeographic patterns. Global Ecology & Biogeography 28:1195-1199
  74. Aim: A geophylogeny, generated by mapping a phylogeny onto geographic regions, graphically summarizes large-scale genetic variation over space and time, and is consequently crucial for conceptual understanding and visualization of global biogeographic patterns. The rapidly expanding DNA barcoding data with geographic coordinates associated with each specimen have dramatically increased the number of global phylogeographic studies that would benefit from software generating geophylogenies. A number of software programs have been developed, some with advanced features, but they either require additional software or lack in quality, especially in geographic resolution. Innovation: PGT (Phylogeographic Tree), freely available at http://dambe.bio.uottawa.ca/PGT/PGT.aspx, combines the highest map quality and user-friendliness. It accesses Microsoft Bing Maps and Google Maps seamlessly and generates geophylogenies on high-resolution regular or terrain maps. Only a few mouse clicks are needed from PGT installation to the generation of high-resolution geophylogenies, making PGT perfect for both teaching and research in global ecology and biogeography. The input tree can be in NEXUS or Newick format, and the geographic data with latitude and longitude values can be in tab-delimited or comma-delimited format as those exported from spreadsheet programs. A Quick-Start guide is included in the built-in help system. Main conclusions: PGT is simpler, more elegant, and of much higher quality than alternatives for plotting phylogenetic trees over geographic regions for visualizing distribution of biodiversity over space and time.

  75. Wei Y, Silke JR, Xia X. 2019. An improved estimation of tRNA expression to better elucidate the coevolution between tRNA abundance and codon usage in bacteria. Scientific Reports 9:3184
  76. Abstract: The degree to which codon usage can be explained by tRNA abundance in bacterial species is often inadequate, partly because differential tRNA abundance is often approximated by tRNA copy numbers. To better understand the coevolution between tRNA abundance and codon usage, we provide a better estimate of tRNA abundance by profiling tRNA mapped reads (tRNA tpm) using publicly available RNA Sequencing data. To emphasize the feasibility of our approach, we demonstrate that tRNA tpm is consistent with tRNA abundances derived from RNA fingerprinting experiments in Escherichia coli, Bacillus subtilis, and Salmonella enterica. Furthermore, we do not observe an appreciable reduction in tRNA sequencing efficiency due to post-transcriptional methylations in the seven bacteria studied. To determine translationally optimal codons, we calculate codon usage in highly and lowly expressed genes determined by protein per transcript. We found that tRNA tpm identifies more translationally optimal codons than gene copy number and early tRNA fingerprinting abundances. Additionally, tRNA tpm improves the predictive power of tRNA adaptation index over codon preference. Our results suggest that dependence of codon usage on tRNA availability is not always associated with species growth-rate. Conversely, tRNA availability is better optimized to codon usage in fast-growing than slow-growing species.

  77. Xia, X. 2019. Is there a mutation gradient along vertebrate mitochondrial genome mediated by genome replication? Mitochondrion 46:30-40
  78. Abstract: There is a long-held belief that a mutation gradient exists along vertebrate mtDNA, mediated by mitochondrial replication that leaves different parts of the H-strand exposed in single-stranded state for different durations (DssH). However, the predicted mutation gradient and its tests suffer from both conceptual and empirical problems. I assembled representative mammalian, avian and crocodilian mtDNA to test this prediction. I measured substitution rates at codon positions 1 and 2 (S12) and at codon position 3 (S3), as well as synonymous and nonsynonymous substitution rates, and checked their change along the hypothetical gradient. Mammalian species do not support the predicted mutation gradient, although they should according to the model. Crocodilian species exhibit a pattern closest to the prediction, although they should not because their OL, if present, is not at a fixed position. Correlation between S3 and DssH is much weaker than that between S12 and DssH (contrary to the prediction). This is not due to substitution saturation but is instead due to differential gene conservation, e.g., COX1 is far more conserved than ND6 in all metazoans no matter where they are located along mtDNA. In vertebrates, conserved genes such as COX1 happen to have small DssH and variable genes such as ND6 happen to have large DssH. The observed “mutation gradient” is driven by nonsynonymous substitutions, with synonymous substitutions associated with a much weaker “mutation gradient” likely caused by differential codon re-adaptation after nonsynonymous substitutions. The mammalian and avian results are also confirmed by a much larger compilation and analysis of 691 mammalian and 462 avian mtDNAs. The results, however, do not reject the strand-displacement model (SDM) of mtDNA replication because a mutation gradient is not a necessary consequence of SDM.

  79. Silke JR, Wei Y, Xia X. 2018. RNA-Seq-Based Analysis Reveals Heterogeneity in Mature 16S rRNA 3' Termini and Extended Anti-Shine-Dalgarno Motifs in Bacterial Species. G3: Genes|Genomes|Genetics 8(12):3973-3979
  80. Abstract: We present an RNA-Seq based approach to map 3′ end sequences of mature 16S rRNA (3′ TAIL) in bacteria with single-base specificity. Our results show that 3′ TAILs are heterogeneous among species; they contain the core CCUCC anti-Shine-Dalgarno motif, but vary in downstream lengths. Importantly, our findings rectify the mis-annotated 16S rRNAs in 11 out of 13 bacterial species studied herein (covering Cyanobacteria, Deinococcus-Thermus, Firmicutes, Proteobacteria, Tenericutes, and Spirochaetes). Furthermore, our results show that species-specific 3′ TAIL boundaries are retained due to their high complementarity with preferred Shine-Dalgarno sequences, suggesting that 3′ TAIL bases downstream of the canonical CCUCC motif play a more important role in translation initiation than previously reported.

  81. Xia X. (2018) Imputing missing distances in molecular phylogenetics. PeerJ 6:e5321
  82. Abstract: Missing data are frequently encountered in molecular phylogenetics, but there has been no accurate distance imputation method available for distance-based phylogenetic reconstruction. The general framework for distance imputation is to explore tree space and distance values to find an optimal combination of output tree and imputed distances. Here I develop a least-squares method coupled with multivariate optimization to impute multiple missing distances in a distance matrix or from a set of aligned sequences with missing genes so that some sequences share no homologous sites (whose distances therefore need to be imputed). I show that phylogenetic trees can be inferred from distance matrices with about 10% of distances missing, and the accuracy of the resulting phylogenetic tree is almost as good as the tree from full information. The new method has the advantage over a recently published one in that it does not assume a molecular clock and is more accurate (comparable to maximum likelihood method based on simulated sequences). I have implemented the function in DAMBE software, which is freely available at http://dambe.bio.uottawa.ca.

  83. Xia, X. 2018. DAMBE7: New and improved tools for data analysis in molecular biology and evolution. Molecular Biology and Evolution 35:1550–1552.
  84. Abstract: DAMBE is a comprehensive software package for genomic and phylogenetic data analysis on Windows, Linux and Macintosh computers. New functions include imputing missing distances and phylogeny simultaneously (paving the way to build large phage and transposon trees), new bootstrapping/jackknifing methods for PhyPA (phylogenetics from pairwise alignments), and an improved function for fast and accurate estimation of the shape parameter of the gamma distribution for fitting rate heterogeneity over sites. The previous method corrects multiple hits for each site independently. DAMBE’s new method uses all sites simultaneously for correction. DAMBE, featuring a user-friendly graphic interface, is freely available from http://dambe.bio.uottawa.ca.

  85. Wei Y, Silke JR, Xia X. 2017. Elucidating the 16S rRNA 3′ boundaries and defining optimal SD/aSD pairing in Escherichia coli and Bacillus subtilis using RNA-Seq data. Scientific Reports 7:17639
  86. Abstract: Bacterial translation initiation is influenced by base pairing between the Shine-Dalgarno (SD) sequence in the 5′ UTR of mRNA and the anti-SD (aSD) sequence at the free 3′ end of the 16S rRNA (3′ TAIL) due to: 1) the SD/aSD sequence binding location and 2) SD/aSD binding affinity. In order to understand what makes an SD/aSD interaction optimal, we must define: 1) terminus of the 3′ TAIL and 2) extent of the core aSD sequence within the 3′ TAIL. Our approach to characterize these components in Escherichia coli and Bacillus subtilis involves 1) mapping the 3′ boundary of the mature 16S rRNA using high-throughput RNA sequencing (RNA-Seq), and 2) identifying the segment within the 3′ TAIL that is strongly preferred in SD/aSD pairing. Using RNA-Seq data, we resolve previous discrepancies in the reported 3′ TAIL in B. subtilis and recover the established 3′ TAIL in E. coli. Furthermore, we extend previous studies to suggest that both highly and lowly expressed genes favor SD sequences with intermediate binding affinity, but this trend is exclusive to SD sequences that complement the core aSD sequences defined herein.

  87. Xia X (2017) ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data. G3: Genes|Genomes|Genetics 7:3839-3848
  88. Abstract: Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission and analysis. However, these files do not need to be so large and can be reduced without loss of information. Each HTS file, either in compressed .SRA or plain text .fastq format, contains numerous identical reads stored as separate entries. For example, among 44603541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited at NCBI's SRA database, one read has 497027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry with the SeqID_NumCopy format (which I dub as FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed ARSDA software that implements these new approaches. A number of HTS files for model species are in the process of being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces time in downstream data analysis. Instead of matching the 497027 identical reads separately against the Bacillus subtilis genome, one only needs to match it once. ARSDA includes functions to take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.

  89. Abolbaghaei A, Silke JR, Xia X. 2017 How Changes in Anti-SD Sequences Would Affect SD Sequences in Escherichia coli and Bacillus subtilis. G3: Genes|Genomes|Genetics 7(5):1607–1615
  90. Abstract: The 3' end of the small ribosomal RNAs (ssu rRNA) in bacteria is directly involved in the selection and binding of mRNA transcripts during translation initiation via well-documented interactions between a Shine-Dalgarno (SD) sequence located upstream of the initiation codon and an anti-SD (aSD) sequence at the 3' end of the ssu rRNA. Consequently, the 3' end of ssu rRNA (3'TAIL) is strongly conserved among bacterial species because a change in the region may impact the translation of many protein-coding genes. Escherichia coli and Bacillus subtilis differ in their 3' ends of ssu rRNA, being GAUCACCUCCUUA3' in E. coli and GAUCACCUCCUUUCU3' or GAUCACCUCCUUUCUA3' in B. subtilis. Such differences in 3'TAIL lead to species-specific SDs (designated SDEc for E. coli and SDBs for B. subtilis) that can form strong and well-positioned SD/aSD pairing in one species but not in the other. Selection mediated by the species-specific 3'TAIL is expected to favour SDBs against SDEc in B. subtilis but favour SDEc against SDBs in E. coli. Among well-positioned SDs, SDEc is used more in E. coli than in B. subtilis, and SDBs more in B. subtilis than in E. coli. Highly expressed genes and genes of high translation efficiency tend to have longer SDs than lowly expressed genes and genes with low translation efficiency in both species, but more so in B. subtilis than in E. coli. Both species overuse SDs matching the bolded part of 3'TAIL shown above. The 3'TAIL difference contributes to host-specificity of phages.

  91. Xia X. 2017. Self-Organizing Map for Characterizing Heterogeneous Nucleotide and Amino Acid Sequence Motifs. Computation 5(4):43
  92. Abstract A self-organizing map (SOM) is an artificial neural network algorithm that can learn from the training data consisting of objects expressed as vectors and perform non-hierarchical clustering to represent input vectors into discretized clusters, with vectors assigned to the same cluster sharing similar numeric or alphanumeric features. SOM has been used widely in transcriptomics to identify co-expressed genes as candidates for co-regulated genes. I envision SOM to have great potential in characterizing heterogeneous sequence motifs, and aim to illustrate this potential by a parallel presentation of SOM with a set of numerical vectors and a set of equal-length sequence motifs. While there are numerous biological applications of SOM involving numerical vectors, few studies have used SOM for heterogeneous sequence motif characterization. This paper is intended to encourage (1) researchers to study SOM in this new domain and (2) computer programmers to develop user-friendly motif-characterization SOM tools for biologists.

  93. Xia X. 2017. Bioinformatics and Drug Discovery. Current Topics in Medicinal Chemistry 17(15):1709-1726
  94. Abstract Bioinformatic analysis can not only accelerate drug target identification and drug candidate screening and refinement, but also facilitate characterization of side effects and predict drug resistance. High-throughput data such as genomic, epigenetic, genome architecture, cistromic, transcriptomic, proteomic, and ribosome profiling data have all made significant contributions to mechanism-based drug discovery and drug repurposing. Accumulation of protein and RNA structures, as well as development of homology modeling and protein structure simulation, coupled with large structure databases of small molecules and metabolites, paved the way for more realistic protein-ligand docking experiments and more informative virtual screening. I present the conceptual framework that drives the collection of these high-throughput data, summarize the utility and potential of mining these data in drug discovery, outline a few inherent limitations in data and software mining these data, point out new ways to refine analysis of these diverse types of data, and highlight commonly used software and databases relevant to drug discovery.

  95. Wei Y, Xia X 2017 The Role of +4U as an Extended Translation Termination Signal in Bacteria. Genetics 205:539–549
  96. Abstract Termination efficiency of stop codons depends on the first 3’ flanking (+4) base in bacteria and eukaryotes. In both Escherichia coli and Saccharomyces cerevisiae, termination read-through is reduced in the presence of +4U; however, the molecular mechanism underlying +4U function is poorly understood. Here, we perform comparative genomics analysis on 25 bacterial species (covering Actinobacteria, Bacteroidetes, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Proteobacteria and Spirochaetae) with bioinformatics approaches to examine the influence of +4U in bacterial translation termination by contrasting between highly and lowly expressed genes (HEGs and LEGs). We estimated gene expression using the recently formulated Index of Translation Elongation, ITE, and identified stop codon near-cognate tRNAs from well annotated genomes. We show that +4U was consistently over-represented in UAA-ending HEGs relative to LEGs. The result is consistent with the interpretation that +4U enhances termination mainly for UAA. Usage of +4U decreases in GC-rich species where most stop codons are UGA and UAG, with few UAA-ending genes, which is expected if UAA usage in HEGs drives up +4U usage. In highly expressed genes, +4U usage increases significantly with abundance of UAA nc_tRNAs (near-cognate tRNAs which decode codons differing from UAA by a single nucleotide), particularly those with a mismatch at the first stop codon site. UAA is always the preferred stop codon in highly expressed genes, and our results suggest that UAAU is the most efficient translation termination signal in bacteria.

  97. Vlasschaert C, Cook D, Xia X, Gray DA. 2017. The evolution and functional diversification of the deubiquitinating enzyme superfamily. Genome Biology and Evolution 9:558-573
  98. Abstract Ubiquitin and ubiquitin-like molecules are attached to and removed from cellular proteins in a dynamic and highly regulated manner. Deubiquitinating enzymes are critical to this process, and the genetic catalogue of deubiquitinating enzymes expanded greatly over the course of evolution. Extensive functional redundancy has been noted among the 93 members of the human deubiquitinating enzyme (DUB) superfamily. This is especially true of genes that were generated by duplication (termed paralogs) as they often retain considerable sequence similarity. Since complete redundancy in systems should be eliminated by selective pressure we theorized that many overlapping DUBs must have significant and unique spatiotemporal roles that can be evaluated in an evolutionary context. We have determined the evolutionary history of the entire class of deubiquitinating enzymes, including the sequence and means of duplication for all paralogous pairs. To establish their uniqueness, we have investigated cell-type specificity in developmental and adult contexts, and have investigated the co-emergence of substrates from the same duplication events. Our analysis has revealed examples of DUB gene subfunctionalization, neofunctionalization, and nonfunctionalization.

  99. Xia X 2017. Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning. J Genet Genome Res 4:031.
  100. Abstract Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transition probabilities and evolutionary distances from substitution rate matrices. The derivation is essential in molecular phylogenetics and phylogenomics, and represents the only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., BLAST and FASTA). Three different approaches are frequently used for deriving transition probabilities and evolutionary distances: 1) The probability reasoning, 2) Solving partial differential equations, and 3) Matrix exponential and logarithm. The first approach demands the least amount of mathematical skills but offers the best way for conceptual understanding, and can often generate nice mathematical expressions of transition probabilities and evolutionary distances. This review represents the most systematic and comprehensive numerical illustration of the first approach.

  101. Xia X. 2017. DAMBE6: New tools for microbial genomics, phylogenetics and molecular evolution. Journal of Heredity 108(4):431-437.
  102. Abstract DAMBE is a comprehensive software workbench for data analysis in molecular biology, phylogenetics and evolution. Several important new functions have been added since version 5 of DAMBE: 1) comprehensive genomic profiling of translation initiation efficiency of different genes in different prokaryotic species, 2) a new index of translation elongation (ITE) that takes into account both tRNA-mediated selection and background mutation on codon-anticodon adaptation, 3) a new and accurate phylogenetic approach based on pairwise alignment only, which is useful for highly divergent sequences from which a reliable multiple sequence alignment is difficult to obtain. Many other functions have been updated and improved including PWM for motif characterization, Gibbs sampler for de novo motif discovery, hidden Markov models for protein secondary structure prediction, self-organizing map for non-linear clustering of transcriptomic data, comprehensive sequence alignment and phylogenetic functions. DAMBE features a graphic, user-friendly and intuitive interface, and is freely available from http://dambe.bio.uottawa.ca.

  103. Xia X. 2016. PhyPA: phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences. Molecular Phylogenetics and Evolution 102:331–343.
  104. Abstract While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure to recover the true topology by the ML+MSA approach is not because of insufficient search of tree space, but because of the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing.

  105. Wei, Y., Wang, J., Xia, X. 2016. Coevolution between stop codon usage and release factors in bacterial species. Molecular Biology and Evolution 33:2357-2367.
  106. Abstract Three stop codons in bacteria represent different translation termination signals, and their usage is expected to depend on their differences in translation termination efficiency, mutation bias, and relative abundance of release factors (RF1 decoding UAA and UAG, and RF2 decoding UAA and UGA). In 14 bacterial species (covering Proteobacteria, Firmicutes, Cyanobacteria, Actinobacteria and Spirochetes) with cellular RF1 and RF2 quantified, UAA is consistently over-represented in highly expressed genes (HEGs) relative to lowly expressed genes (LEGs), whereas UGA usage is the opposite even in species where RF2 is far more abundant than RF1. UGA usage relative to UAG increases significantly with PRF2 [=RF2/(RF1+RF2)] as expected from adaptation between stop codons and their decoders. PRF2 is greater than 0.5 over a wide range of AT content (measured by PAT3 as the proportion of AT at third codon sites), but decreases rapidly towards zero at the high range of PAT3. This explains why bacterial lineages with high PAT3 often have UGA reassigned because of low RF2. There is no indication that UAG is a minor stop codon in bacteria as claimed in a recent publication. The claim is invalid because of the failure to apply the two key criteria in identifying a minor codon: 1) it is least preferred by HEGs (or most preferred by LEGs) and 2) it corresponds to the least abundant decoder. Our results suggest a more plausible explanation for why UAA usage increases, and UGA usage decreases, with PAT3, but UAG usage remains low over the entire PAT3 range.

  107. Sun X, Xia X, Yang Q. 2016. Dating the origin of the major lineages of Branchiopoda. Palaeoworld 25 (2), 303-317
  108. Abstract Despite the well-established phylogeny and good fossil record of branchiopods, a consistent macro-evolutionary timescale for the group remains elusive. This study focuses on the early branchiopod divergence dates where the fossil record is extremely fragmentary or missing. On the basis of a large genomic dataset and carefully evaluated fossil calibration points, we assess the quality of the branchiopod fossil record by calibrating the tree against well-established first occurrences, providing paleontological estimates of divergence times and completeness of their fossil record. The maximum age constraints were set using a quantitative approach of Marshall (2008). We tested the alternative placements of Yicaris and Wujicaris in the referred arthropod tree via the likelihood checkpoints method. Divergence dates were calculated using Bayesian relaxed molecular clock and penalized likelihood methods. Our results show that the stem group of Branchiopoda is rooted in the late Neoproterozoic (563 ± 7 Ma); the crown-Branchiopoda diverged during middle Cambrian to Early Ordovician (478–512 Ma), likely representing the origin of the freshwater biota; the Phyllopoda clade diverged during Ordovician (448–480 Ma) and Diplostraca during Late Ordovician to early Silurian (430–457 Ma). By evaluating the congruence between the observed times of appearance of clades in the fossil record and the results derived from molecular data, we found that the uncorrelated rate model gave more congruent results for shallower divergence events whereas the auto-correlated rate model gave more congruent results for deeper events.

  109. Vlasschaert, C., Xia, X., Gray, D.A. 2016. Selection preserves Ubiquitin Specific Protease 4 alternative exon skipping in therian mammals. Scientific Reports 6:20039.
  110. Abstract Ubiquitin specific protease 4 (USP4) is a highly networked deubiquitinating enzyme with reported roles in cancer, innate immunity and RNA splicing. In mammals it has two dominant isoforms arising from inclusion or skipping of exon 7 (E7). We evaluated two plausible mechanisms for the generation of these isoforms: (A) E7 skipping due to a long upstream intron and (B) E7 skipping due to inefficient 5′ splice sites (5′SS) and/or branchpoint sites (BPS). We then assessed whether E7 alternative splicing is maintained by selective pressure or arose from genetic drift. Both transcript variants were generated from a USP4-E7 minigene construct with short flanking introns, an observation consistent with the second mechanism whereby differential splice signal strengths are the basis of E7 skipping. Optimization of the downstream 5′SS eliminated E7 skipping. Experimental validation of the correlation between 5′SS identity and exon skipping in vertebrates pinpointed the +6 site as the key splicing determinant. Therian mammals invariably display a 5′SS configuration favouring alternative splicing and the resulting isoforms have distinct subcellular localizations. We conclude that alternative splicing of mammalian USP4 is under selective maintenance and that long and short USP4 isoforms may target substrates in various cellular compartments.

  111. Vlasschaert, C., Xia, X., Coulombe, J., Gray, D.A. 2015. Evolution of the highly networked deubiquitinating enzymes USP4, USP15 and USP11. BMC Evolutionary Biology 15:230.
  112. Background: USP4, USP15 and USP11 are paralogous deubiquitinating enzymes as evidenced by structural organization and sequence similarity. Based on known interactions and substrates it would appear that they have partially redundant roles in pathways vital to cell proliferation, development and innate immunity, and elevated expression of all three has been reported in various human malignancies. The nature and order of duplication events that gave rise to these extant genes has not been determined, nor has their functional redundancy been established experimentally at the organismal level. Methods: We have employed phylogenetic and syntenic reconstruction methods to determine the chronology of the duplication events that generated the three paralogs and have performed genetic crosses to evaluate redundancy in mice. Results: Our analyses indicate that USP4 and USP15 arose from whole genome duplication prior to the emergence of jawed vertebrates. Despite having lower sequence identity USP11 was generated later in vertebrate evolution by small-scale duplication of the USP4-encoding region. While USP11 was subsequently lost in many vertebrate species, all available genomes retain a functional copy of either USP4 or USP15, and through genetic crosses of mice with inactivating mutations we have confirmed that viability is contingent on a functional copy of USP4 or USP15. Loss of ubiquitin-exchange regulation, constitutive skipping of the seventh exon and neural-specific expression patterns are derived states of USP11. Post-translational modification sites differ between USP4, USP15 and USP11 throughout evolution. Conclusions: In isolation sequence alignments can generate erroneous USP gene phylogenies. Through a combination of methodologies the gene duplication events that gave rise to USP4, USP15, and USP11 have been established. Although it operates in the same molecular pathways as the other USPs, the rapid divergence of the more recently generated USP11 enzyme precludes its functional interchangeability with USP4 and USP15. Given their multiplicity of substrates the emergence (and in some cases subsequent loss) of these USP paralogs would be expected to alter the dynamics of the networks in which they are embedded.

  113. Prabhakaran, R., Chithambaram, S., Xia, X. 2015. Escherichia coli and Staphylococcus phages: Effect of translation initiation efficiency on differential codon adaptation mediated by virulent and temperate lifestyles. Journal of General Virology 96:1169-1179.
  114. Abstract Rapid biosynthesis is key to the success of bacteria and viruses. Highly expressed genes in bacteria exhibit strong codon bias corresponding to differential availability of tRNAs. However, a large clade of lambdoid coliphages exhibit relatively poor codon adaptation to the host translation machinery, in contrast to other coliphages that exhibit strong codon adaptation to the host. Three possible explanations were previously proposed but dismissed: 1) the phage-borne tRNA genes that reduce the dependence of phage translation on host tRNAs, 2) lack of time needed for evolving codon adaptation due to recent host switching, and 3) strong strand asymmetry with biased mutation disrupting codon adaptation. Here we examine the possibility that phages with relatively poor codon adaptation have poor translation initiation which would weaken the selection on codon adaptation. We measure translation initiation by: 1) the strength and position of the Shine-Dalgarno (SD) sequence and 2) the stability of secondary structure of sequences flanking SD and start codon known to affect accessibility of SD and start codon. Phage genes with strong codon adaptation have significantly stronger SD sequences than those with poor codon adaptation. The former also have significantly weaker secondary structure in sequences flanking SD and start codon than the latter. Thus, lambdoid phages do not exhibit strong codon adaptation because they have relatively inefficient translation initiation and would benefit little from increased elongation efficiency. We also provide evidence suggesting that phage lifestyle (virulent versus temperate) affects selection intensity on the efficiency of translation initiation and elongation.

  115. Xia X. 2015. A major controversy in codon-anticodon adaptation resolved by a new codon usage index. Genetics 199:573-579 Access the recommendation on F1000Prime
  116. Abstract Two alternative hypotheses attribute different benefits to codon-anticodon adaptation. The first assumes that protein production is rate-limited by both initiation and elongation, and codon-anticodon adaptation would result in higher elongation efficiency and more efficient and accurate protein production, especially for highly expressed genes. The second claims that protein production is rate-limited only by initiation efficiency, but improved codon adaptation and consequently increased elongation efficiency have the benefit of increasing ribosomal availability for global translation. To test these hypotheses, a recent study engineered a synthetic library of 154 genes, all encoding the same protein but differing in degrees of codon adaptation, to quantify the effect of differential codon adaptation on protein production in Escherichia coli. The surprising conclusion that “codon bias did not correlate with gene expression” and that “translation initiation, not elongation, is rate-limiting for gene expression” contradicts the conclusion reached by many other empirical studies. Here I resolve the contradiction by reanalyzing the data from the 154 sequences. I demonstrate that translation elongation accounts for about 17% of total variation in protein production and that the previous conclusion is due to the use of CAI (codon adaptation index) which does not account for the mutation bias in characterizing codon adaptation. The effect of translation elongation becomes undetectable only when translation initiation is unrealistically slow. A new index of translation elongation (ITE) is formulated to facilitate studies on the efficiency and evolution of the translation machinery.

  117. Nikbakht, H., Xia, X., Hickey, D. 2014. The evolution of genomic GC content undergoes a rapid reversal within the genus Plasmodium. Genome 57:507-511
  118. Abstract The genome of the malarial parasite, Plasmodium falciparum, is extremely AT-rich. This bias toward a low GC content is a characteristic of several - but not all - species within the genus Plasmodium. We compared 4283 orthologous pairs of protein-coding sequences between P. falciparum and the less AT-biased P. vivax. Our results indicate that the common ancestor of these two species was also extremely AT-rich. This means that, although there was a strong bias toward A+T during the early evolution of the ancestral Plasmodium lineage, there was a subsequent reversal of this trend during the more recent evolution of some species, such as P. vivax. Moreover, we show that not only is the P. vivax genome losing its AT richness, it is actually gaining a very significant degree of GC richness. This example illustrates the potential volatility of nucleotide content during the course of molecular evolution. Such reversible fluxes in nucleotide content within lineages could have important implications for phylogenetic reconstruction based on molecular sequence data.

  119. Chithambaram S, Prabhakaran R, Xia X. 2014. Differential codon adaptation between dsDNA and ssDNA phages in E. coli. Molecular Biology and Evolution 31:1606-1617
  120. Abstract Because phages use their host translation machinery, their codon usage should evolve towards that of highly expressed host genes. We used two indices to measure codon adaptation of phages to their host, rRSCU (the correlation in RSCU between phages and their host) and CAI computed with highly expressed host genes as the reference set (because phage translation depends on host translation machinery). These indices used for this purpose are appropriate only when hosts exhibit little mutation bias, so only phages parasitizing Escherichia coli were included in the analysis. For double-stranded (dsDNA) phages, both rRSCU and CAI decrease with increasing number of tRNA genes encoded by the phage genome. rRSCU is greater for dsDNA phages than for ssDNA phages, and the low rRSCU values are mainly due to poor concordance in RSCU values for Y-ending codons between ssDNA phages and the E. coli host, consistent with the predicted effect of C→T mutation bias in the ssDNA phages. Strong C→T mutation bias would improve codon adaptation in codon families (e.g., Gly) where U-ending codons are favored over C-ending codons (“U-friendly” codon families) by highly expressed host genes, but decrease codon adaptation in other codon families where highly expressed host genes favor C-ending codons against U-ending codons (“U-hostile” codon families). It is remarkable that ssDNA phages with increasing C→T mutation bias also increased the usage of codons in the “U-friendly” codon families, thereby achieving CAI values almost as large as those of dsDNA phages. This represents a new type of codon adaptation.

  121. Prabhakaran R, Chithambaram S, Xia X. 2014. Aeromonas phages encode tRNAs for their overused codons. Int. J. Computational Biology and Drug Design 7:168-183.
  122. Abstract The GC-rich bacterial species, Aeromonas salmonicida, is parasitised by both GC-rich phages (Aeromonas phages phiAS7 and vB_AsaM-56) and GC-poor phages (Aeromonas phages 25, 31, 44RR2.8t, 65, Aes508, phiAS4 and phiAS5). Both the GC-rich Aeromonas phage phiAS7 and Aeromonas phage vB_AsaM-56 have nearly the same codon usage bias as their host, whereas the remaining seven GC-poor Aeromonas phages differ dramatically in codon usage from their GC-rich host. Here, we investigated whether tRNAs encoded in the genomes of Aeromonas phages facilitate the translation of phage proteins. We found that tRNAs encoded in the phage genome correspond to synonymous codons overused in the phage genes but not in the host genes.

  123. Chithambaram S, Prabhakaran R, Xia X. 2014. The effects of mutation and selection on codon adaptation in E. coli bacteriophage. Genetics 197:301-315.
  124. Abstract Studying phage codon adaptation is important not only for understanding the process of translation elongation, but also for re-engineering phages for medical and industrial purposes. To evaluate the effect of mutation and selection on phage codon usage, we developed an index to measure selection imposed by host translation machinery, based on the difference in codon usage between all host genes and highly expressed host genes. We developed linear and nonlinear models to estimate the C→T mutation bias in different phage lineages and to evaluate the relative effect of mutation and host selection on phage codon usage. C→T biased mutations occur more frequently in ssDNA phages than in dsDNA phages, and affect not only synonymous codon usage, but also nonsynonymous substitutions at second codon positions, especially in ssDNA phages. The host translation machinery affects codon adaptation in both dsDNA and ssDNA phages, with stronger effect on dsDNA phages than on ssDNA phages. Strand asymmetry with the associated local variation in mutation bias can significantly interfere with codon adaptation in both dsDNA and ssDNA phages.

  125. Xia, X. 2013. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Molecular Biology and Evolution 30:1720-1728.

    Abstract Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories: 1) sequence retrieval, editing, manipulation, and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot, and many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely available from http://dambe.bio.uottawa.ca

  126. Sun, X. Y., Yang, Q., Xia, X. 2013. An Improved Implementation of Effective Number of Codons (Nc). Molecular Biology and Evolution 30:191-196.

    Abstract The effective number of codons (Nc) is a widely used index for characterizing codon usage bias because it does not require a set of reference genes as does codon adaptation index (CAI) and because of the freely available computational tools such as CodonW. However, Nc, as originally formulated, has many problems. For example, it can have values far greater than the number of sense codons; it treats a 6-fold compound codon family as a single-codon family although it is made of a 2-fold and a 4-fold codon family that can be under dramatically different selection for codon usage bias; the existing implementations do not handle all different genetic codes; it is often biased by codon families with a small number of codons. We developed a new Nc that has a number of advantages over the original Nc. Its maximum value equals the number of sense codons when all synonymous codons are used equally, and its minimum value equals the number of codon families when exactly one codon is used in each synonymous codon family. It handles all known genetic codes. It breaks the compound codon families (e.g., those involving amino acids coded by six synonymous codons) into 2-fold and 4-fold codon families. It reduces the effect of codon families with few codons by introducing pseudocount and weighted averages. The new Nc correlates significantly better with CAI than the original Nc from CodonW based on protein-coding genes from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Bacillus subtilis, Micrococcus luteus, and Mycoplasma genitalium. It also correlates better with protein abundance data from the yeast than the original Nc.

  127. Xia, X. 2012. Position Weight Matrix, Gibbs Sampler, and the Associated Significance Tests in Motif Characterization and Prediction. Scientifica, vol. 2012, Article ID 917540. doi:10.6064/2012/917540.

    Abstract Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
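As an illustration of the position weight matrix and PWM score (PWMS) discussed in this abstract, the sketch below builds a log-odds PWM from aligned sites and scans a sequence with it. It is a generic textbook-style construction, not the paper's implementation: the uniform background frequencies, the pseudocount of 1, the log base 2, and the function names `build_pwm`, `pwm_score` and `best_hit` are all assumptions made for this example. The significance tests reviewed in the paper are not reproduced.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Log2-odds PWM from equal-length aligned sites.

    Assumes every character in `sites` belongs to `alphabet`.
    """
    if background is None:
        # Assumption: uniform background frequencies.
        background = {b: 1.0 / len(alphabet) for b in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        # Start every count at the pseudocount so no frequency is zero.
        counts = {b: pseudocount for b in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in alphabet}
        )
    return pwm


def pwm_score(pwm, segment):
    """PWMS of a segment: the sum of the per-position log-odds entries."""
    return sum(column[base] for column, base in zip(pwm, segment))


def best_hit(pwm, sequence):
    """(score, start) of the highest-scoring full-width window, or None."""
    width = len(pwm)
    scores = [
        (pwm_score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else None
```

For example, `best_hit(build_pwm(["TATAAT", "TATAAT", "TACAAT", "TATGAT"]), "GGGTATAATCCC")` reports a start position of 3, where the consensus hexamer sits.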

  128. Vos, R. A., Balhoff, J. P., Caravas, J. A., Holder, M. T., Lapp, H., Maddison, W. P., Midford, P. E., Priyam, A., Sukumaran, S., Xia, X., Stoltzfus, A. 2012. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Systematic Biology 61(4):675–689.

    Abstract In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.

  129. Xia, X. 2012. DNA Replication and Strand Asymmetry in Prokaryotic and Mitochondrial Genomes. Current Genomics 13, 16-27

    Abstract Different patterns of strand asymmetry have been documented in a variety of prokaryotic genomes as well as mitochondrial genomes. Because different replication mechanisms often lead to different patterns of strand asymmetry, much can be learned of replication mechanisms by examining strand asymmetry. Here I summarize the diverse patterns of strand asymmetry among different taxonomic groups to suggest that (1) the single-origin replication may not be universal among bacterial species as the endosymbionts Wigglesworthia glossinidia, Wolbachia species, cyanobacterium Synechocystis 6803 and Mycoplasma pulmonis genomes all exhibit strand asymmetry patterns consistent with the multiple origins of replication, (2) different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used, (3) mitochondrial genomes from representative vertebrate species share one strand asymmetry pattern consistent with the strand-displacement replication documented in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may be shared among all vertebrate species, and (4) mitochondrial genomes from primitive forms of metazoans such as the sponge and hydra (representing Porifera and Cnidaria, respectively), as well as those from plants, have strand asymmetry patterns similar to single-origin or multi-origin replications observed in prokaryotes and are drastically different from mitochondrial genomes from other metazoans. This may explain why sponge and hydra mitochondrial genomes, as well as plant mitochondrial genomes, evolve much more slowly than those from other metazoans.
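Strand asymmetry of the kind summarized above is commonly visualized with a GC skew plot. The sketch below is a generic illustration, assuming the conventional definition of GC skew as (G - C) / (G + C) in a sliding window; the abstract itself does not state a formula. The window and step sizes are arbitrary inputs, and the function names are hypothetical. DAMBE's optimized window size is not reproduced here.

```python
def gc_skew(window):
    """GC skew of one window: (G - C) / (G + C), or 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def skew_profile(seq, window, step):
    """List of (start, skew) pairs for full-length windows along the sequence."""
    seq = seq.upper()
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def cumulative_skew(profile):
    """Running sum of window skews; its turning points are the usual hints
    for where the sign of the skew switches along the genome."""
    running, out = 0.0, []
    for _, skew in profile:
        running += skew
        out.append(running)
    return out
```

On a genome replicated from a single origin, the profile is typically expected to change sign at the origin and the terminus, so several sign changes along the cumulative curve would be the sort of pattern the abstract associates with multiple origins.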

  130. Xia, X., MacKay, V., Yao, X., Wu, J., Miura, F., Ito, T., Morris, D. R. 2011. Translation initiation: a regulatory role for poly(A) tracts in front of the AUG codon in Saccharomyces cerevisiae. Genetics 189:469-478.

    Abstract The 5'-UTR serves as the loading dock for ribosomes during translation initiation and is the key site for translation regulation. Many genes in the yeast Saccharomyces cerevisiae contain poly(A) tracts in their 5'-UTRs. We studied these pre-AUG poly(A) tracts in a set of 3274 recently identified 5'-UTRs in the yeast to characterize their effect on in vivo protein abundance, ribosomal density, and protein synthesis rate in the yeast. The protein abundance and the protein synthesis rate increase with the length of the poly(A), but exhibit a dramatic decrease when the poly(A) length is ≥12. The ribosomal density also reaches the lowest level when the poly(A) length is ≥12. This supports the hypothesis that a pre-AUG poly(A) tract can bind to translation initiation factors to enhance translation initiation, but a long (≥12) pre-AUG poly(A) tract will bind to Pab1p, whose binding size is 12 consecutive A residues in yeast, resulting in repression of translation. The hypothesis explains why a long pre-AUG poly(A) leads to more efficient translation initiation than a short one when PABP is absent, and why pre-AUG poly(A) is short in the early genes but long in the late genes of vaccinia virus.

  131. Ma, P., Xia, X. 2011. Factors affecting splicing strength of yeast genes. Comparative and Functional Genomics. Article ID 212146, 13 pages.

    Abstract Accurate and efficient splicing is of crucial importance for highly-transcribed intron-containing genes (ICGs) in rapidly replicating unicellular eukaryotes such as the budding yeast Saccharomyces cerevisiae. We characterize the 5' and 3' splice sites (ss) by position weight matrix scores (PWMSs), which are highest for the consensus sequence and lowest for splice sites differing most from the consensus sequence, and use PWMS as a proxy for splicing strength. HAC1, which is known to be spliced by a nonspliceosomal mechanism, has the most negative PWMS for both its 5' ss and 3' ss. Several genes under strong splicing regulation and requiring additional splicing factors for their splicing also have small or negative PWMS values. Splicing strength is higher for highly transcribed ICGs than for lowly transcribed ICGs and higher for transcripts that bind strongly to spliceosomes than those that bind weakly. The 3' splice site features a prominent poly-U tract before the 3'AG. Our results suggest the potential of using PWMS as a screening tool for ICGs that are either spliced by a nonspliceosomal mechanism or under strong splicing regulation in yeast and other fungal species.

  132. Xia, X., Yang, Q. 2011. A Distance-based Least-square Method for Dating Speciation Events. Molecular Phylogenetics and Evolution 59:342-353.

    Abstract Distance-based phylogenetic methods are widely used in biomedical research. However, there has been little development of rigorous statistical methods and software for dating speciation and gene duplication events by using evolutionary distances. Here we present a simple, fast and accurate dating method based on the least-squares (LS) method that has already been widely used in molecular phylogenetic reconstruction. Dating methods with a global clock or two different local clocks are presented. Single or multiple fossil calibration points can be used, and multiple data sets can be integrated in a combined analysis. Variation of the estimated divergence time is estimated by resampling methods such as bootstrapping or jackknifing. Application of the method to dating the divergence time among seven ape species or among 35 mammalian species including major mammalian orders shows that the estimated divergence time with the LS criterion is nearly identical to those obtained by the likelihood method or Bayesian inference.

  133. van Weringh, A, M. Ragonnet-Cronin, E. Pranckeviciene, M. Pavon-Eternod, L. Kleiman, X. Xia. 2011. HIV-1 modulates the tRNA pool to improve translation efficiency. Molecular Biology and Evolution 28:1827-1834

    Abstract Despite its poorly adapted codon usage, HIV-1 replicates and is expressed extremely well in human host cells. HIV-1 has recently been shown to package non-lysyl transfer RNAs (tRNAs) in addition to the tRNA(Lys) needed for priming reverse transcription and integration of the HIV-1 genome. By comparing the codon usage of HIV-1 genes with that of its human host, we found that tRNAs decoding codons that are highly used by HIV-1 but avoided by its host are overrepresented in HIV-1 virions. In particular, tRNAs decoding A-ending codons, required for the expression of HIV's A-rich genome, are highly enriched. Because the affinity of Gag-Pol for all tRNAs is nonspecific, HIV packaging is most likely passive and reflects the tRNA pool at the time of viral particle formation. Codon usage of HIV-1 early genes is similar to that of highly expressed host genes, but codon usage of HIV-1 late genes was better adapted to the selectively enriched tRNA pool, suggesting that alterations in the tRNA pool are induced late in viral infection. If HIV-1 genes are adapting to an altered tRNA pool, codon adaptation of HIV-1 may be better than previously thought.

  134. Palidwor GA, Perkins TJ, Xia X. 2010. A General Model of Codon Bias Due to GC Mutational Bias. PLoS ONE 5(10): e13431.

    BACKGROUND: In spite of extensive research on the effect of mutation and selection on codon usage, a general model of codon usage bias due to mutational bias has been lacking. Because most amino acids allow synonymous GC content changing substitutions in the third codon position, the overall GC bias of a genome or genomic region is highly correlated with GC3, a measure of third position GC content. For individual amino acids as well, G/C ending codons usage generally increases with increasing GC bias and decreases with increasing AT bias. Arginine and leucine, amino acids that allow GC-changing synonymous substitutions in the first and third codon positions, have codons which may be expected to show different usage patterns. PRINCIPAL FINDINGS: In analyzing codon usage bias in hundreds of prokaryotic and plant genomes and in human genes, we find that two G-ending codons, AGG (arginine) and TTG (leucine), unlike all other G/C-ending codons, show overall usage that decreases with increasing GC bias, contrary to the usual expectation that G/C-ending codon usage should increase with increasing genomic GC bias. Moreover, the usage of some codons appears nonlinear, even nonmonotone, as a function of GC bias. To explain these observations, we propose a continuous-time Markov chain model of GC-biased synonymous substitution. This model correctly predicts the qualitative usage patterns of all codons, including nonlinear codon usage in isoleucine, arginine and leucine. The model accounts for 72%, 64% and 52% of the observed variability of codon usage in prokaryotes, plants and human respectively. When codons are grouped based on common GC content, 87%, 80% and 68% of the variation in usage is explained for prokaryotes, plants and human respectively. CONCLUSIONS: The model clarifies the sometimes-counterintuitive effects that GC mutational bias can have on codon usage, quantifies the influence of GC mutational bias and provides a natural null model relative to which other influences on codon bias may be measured.

  135. Jiang, J.-Y., H. Xiong, M. Cao, X. Xia, M.-A. Sirard, B Tsang. 2010. Mural granulosa cell gene expression associated with oocyte developmental competence. Journal of Ovarian Research 2010, 3:6.

    BACKGROUND: Ovarian follicle development is a complex process. Paracrine interactions between somatic and germ cells are critical for normal follicular development and oocyte maturation. Studies have suggested that the health and function of the granulosa and cumulus cells may be reflective of the health status of the enclosed oocyte. The objective of the present study is to assess, using an in vivo immature rat model, gene expression profile in granulosa cells, which may be linked to the developmental competence of the oocyte. We hypothesized that expression of specific genes in granulosa cells may be correlated with the developmental competence of the oocyte. METHODS: Immature rats were injected with eCG and 24 h thereafter with anti-eCG antibody to induce follicular atresia or with pre-immune serum to stimulate follicle development. A high percentage (30-50%, normal developmental competence, NDC) of oocytes from eCG/pre-immune serum group developed to term after embryo transfer compared to those from eCG/anti-eCG (0%, poor developmental competence, PDC). Gene expression profiles of mural granulosa cells from the above oocyte-collected follicles were assessed by Affymetrix rat whole genome array. RESULTS: The result showed that twelve genes were up-regulated, while one gene was down-regulated, more than 1.5-fold in the NDC group compared with those in the PDC group. Gene ontology classification showed that the up-regulated genes included lysyl oxidase (Lox) and nerve growth factor receptor associated protein 1 (Ngfrap1), which are important in the regulation of protein-lysine 6-oxidase activity, and in apoptosis induction, respectively. The down-regulated genes included glycoprotein-4-beta galactosyltransferase 2 (Ggbt2), which is involved in the regulation of extracellular matrix organization and biogenesis. CONCLUSIONS: The data in the present study demonstrate a close association between specific gene expression in mural granulosa cells and the developmental competence of oocytes. This finding suggests that the most differentially expressed gene, lysyl oxidase, may be a candidate biomarker of oocyte health and useful for the selection of good quality oocytes for assisted reproduction.

  136. Zhang, D., J. T. Popesku, C. J. Martyniuk, H. Xiong, P. Duarte-Guterman, L. Yao, X. Xia, and V. L. Trudeau. 2009. Profiling neuroendocrine gene expression changes following fadrozole-induced estrogen decline in the female goldfish. Physiol. Genomics 38:351-361.

    Abstract Teleost fish represent unique models to study the role of neuroestrogens because of the extremely high activity of brain aromatase (AroB; the product of cyp19a1b). Aromatase respectively converts androstenedione and testosterone to estrone and 17beta-estradiol (E2). Specific inhibition of aromatase activity by fadrozole has been shown to impair estrogen production and influence neuroendocrine and reproductive functions in fish, amphibians, and rodents. However, very few studies have identified the global transcriptomic response to fadrozole-induced decline of estrogens in a physiological context. In our study, sexually mature prespawning female goldfish were exposed to fadrozole (50 µg/l) in March and April when goldfish have the highest AroB activity and maximal gonadal size. Fadrozole treatment significantly decreased serum E2 levels (4.7 times lower; P = 0.027) and depressed AroB mRNA expression threefold in both the telencephalon (P = 0.021) and the hypothalamus (P = 0.006). Microarray expression profiling of the telencephalon identified 98 differentially expressed genes after fadrozole treatment (q value <0.05). Some of these genes have previously been shown to be estrogen responsive in either fish or other species, including rat, mouse, and human. Gene ontology analysis together with functional annotations revealed several regulatory themes for physiological estrogen action in fish brain that include the regulation of calcium signaling pathway and autoregulation of estrogen receptor action. Real-time PCR verified microarray data for decreased (activin-betaA) or increased (calmodulin, ornithine decarboxylase 1) mRNA expression. These data have implications for our understanding of estrogen actions in the adult vertebrate brain.

  137. Li, H., G. Liu, and X. Xia. 2009. Correlations between recombination rate and intron distributions along chromosomes of C. elegans. Progress in Natural Science 19:517.
  138. Xia, X. 2009. Information-theoretic indices and an approximate significance test for testing the molecular clock hypothesis with genetic distances. Molecular Phylogenetics and Evolution 52:665-676.

    Abstract Distance-based phylogenetic methods are widely used in biomedical research. However, distance-based dating of speciation events and the test of the molecular clock hypothesis are relatively underdeveloped. Here I develop an approximate test of the molecular clock hypothesis for distance-based trees, as well as information-theoretic indices that have been used frequently in model selection, for use with distance matrices. The results are in good agreement with the conventional sequence-based likelihood ratio test. Among the information-theoretic indices, AICu is the most consistent with the sequence-based likelihood ratio test. The confidence in model selection by the indices can be evaluated by bootstrapping. I illustrate the usage of the indices and the approximate significance test with both empirical and simulated sequences. The tests show that distance matrices from protein gel electrophoresis and from genome rearrangement events do not violate the molecular clock hypothesis, and that the evolution of the third codon position conforms to the molecular clock hypothesis better than the second codon position in vertebrate mitochondrial genes. I outlined evolutionary distances that are appropriate for phylogenetic reconstruction and dating.

  139. Xia, X., Holcik, M. 2009. Strong Eukaryotic IRESs Have Weak Secondary Structure. PLoS ONE 4, e4136.

    BACKGROUND: The objective of this work was to investigate the hypothesis that eukaryotic Internal Ribosome Entry Sites (IRES) lack secondary structure and to examine the generality of the hypothesis. METHODOLOGY/PRINCIPAL FINDINGS: IRESs of the yeast and the fruit fly are located in the 5'UTR immediately upstream of the initiation codon. The minimum folding energy (MFE) of 60 nt RNA segments immediately upstream of the initiation codons was calculated as a proxy of secondary structure stability. MFE of the reverse complements of these 60 nt segments was also calculated. The relationship between MFE and empirically determined IRES activity was investigated to test the hypothesis that strong IRES activity is associated with weak secondary structure. We show that IRES activity in the yeast and the fruit fly correlates strongly with the structural stability, with highest IRES activity found in RNA segments that exhibit the weakest secondary structure. CONCLUSIONS: We found that a subset of eukaryotic IRESs exhibits very low secondary structure in the 5'-UTR sequences immediately upstream of the initiation codon. The consistency in results between the yeast and the fruit fly suggests a possible shared mechanism of cap-independent translation initiation that relies on an unstructured RNA segment.

  140. Cong, P., X. Xia, and Q. Yang. 2009. Monophyly of the ring-forming group in Diplopoda (Myriapoda, Arthropoda) based on SSU and LSU ribosomal RNA sequences. Progress in Natural Science 19:1297-1303
  141. Zhang, D., H. Xiong, J. A. Mennigen, J. T. Popesku, V. L. Marlatt, C. J. Martyniuk, K. Crump, A. R. Cossins, X. Xia, and V. L. Trudeau. 2009. Defining Global Neuroendocrine Gene Expression Patterns Associated with Reproductive Seasonality in Fish. PLoS ONE 4:e5816.

    BACKGROUND: Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. METHODOLOGY/PRINCIPAL FINDINGS: In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. CONCLUSIONS/SIGNIFICANCE: Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.

  142. Zhang, D., H. Xiong, J. Shan, X. Xia, and V. Trudeau. 2008. Functional insight into Maelstrom in the germline piRNA pathway: a unique domain homologous to the DnaQ-H 3'-5' exonuclease, its lineage-specific expansion/loss and evolutionarily active site switch. Biology Direct 3:48.

    Abstract Maelstrom (MAEL) plays a crucial role in a recently-discovered piRNA pathway; however its specific function remains unknown. Here a novel MAEL-specific domain characterized by a set of conserved residues (Glu-His-His-Cys-His-Cys, EHHCHC) was identified in a broad range of species including vertebrates, sea squirts, insects, nematodes, and protists. It exhibits ancient lineage-specific expansions in several species, however, appears to be lost in all examined teleost fish species. Functional involvement of MAEL domains in DNA- and RNA-related processes was further revealed by its association with HMG, SR-25-like and HDAC_interact domains. A distant similarity to the DnaQ-H 3'-5' exonuclease family with the RNase H fold was discovered based on the evidence that all MAEL domains adopt the canonical RNase H fold; and several protist MAEL domains contain the conserved 3'-5' exonuclease active site residues (Asp-Glu-Asp-His-Asp, DEDHD). This evolutionary link together with structural examinations leads to a hypothesis that MAEL domains may have a potential nuclease activity or RNA-binding ability that may be implicated in piRNA biogenesis. The observed transition of two sets of characteristic residues between the ancestral DnaQ-H and the descendent MAEL domains may suggest a new mode for protein function evolution called "active site switch", in which the protist MAEL homologues are the likely evolutionary intermediates due to harboring the specific characteristics of both 3'-5' exonuclease and MAEL domains.

  143. Popesku, J. T., C. J. Martyniuk, J. Mennigen, H. Xiong, D. Zhang, X. Xia, A. R. Cossins, and V. L. Trudeau. 2008. The goldfish (Carassius auratus) as a model for neuroendocrine signaling. Mol Cell Endocrinol. 293(1-2):43-56

    Abstract Goldfish (Carassius auratus) are excellent model organisms for the neuroendocrine signaling and the regulation of reproduction in vertebrates. Goldfish also serve as useful model organisms in numerous other fields. In contrast to mammals, teleost fish do not have a median eminence; the anterior pituitary is innervated by numerous neuronal cell types and thus, pituitary hormone release is directly regulated. Here we briefly describe the neuroendocrine control of luteinizing hormone. Stimulation by gonadotropin-releasing hormone and a multitude of classical neurotransmitters and neuropeptides is opposed by the potent inhibitory actions of dopamine. The stimulatory actions of gamma-aminobutyric acid and serotonin are also discussed. We will focus on the development of a cDNA microarray composed of carp and goldfish sequences which has allowed us to examine neurotransmitter-regulated gene expression in the neuroendocrine brain and to investigate potential genomic interactions between these key neurotransmitter systems. We observed that isotocin (fish homologue of oxytocin) and activins are regulated by multiple neurotransmitters, which is discussed in light of their roles in reproduction in other species. We have also found that many novel and uncharacterized goldfish expressed sequence tags in the brain are also regulated by neurotransmitters. Their sites of production and whether they play a role in neuroendocrine signaling and control of reproduction remain to be determined. The transcriptomic tools developed to study reproduction could also be used to advance our understanding of neuroendocrine-immune interactions and the relationship between growth and food intake in fish.

  144. Vinci, G., X. Xia, and R. A. Veitia. 2008. Preservation of Genes Involved in Sterol Metabolism in Cholesterol Auxotrophs: Facts and Hypotheses. PLoS ONE 3:e2883.

    BACKGROUND: It is known that primary sequences of enzymes involved in sterol biosynthesis are well conserved in organisms that produce sterols de novo. However, we provide evidence for a preservation of the corresponding genes in two animals unable to synthesize cholesterol (auxotrophs): Drosophila melanogaster and Caenorhabditis elegans. PRINCIPAL FINDINGS: We have been able to detect bona fide orthologs of several ERG genes in both organisms using a series of complementary approaches. We have detected strong sequence divergence between the orthologs of the nematode and of the fruitfly; they are also very divergent with respect to the orthologs in organisms able to synthesize sterols de novo (prototrophs). Interestingly, the orthologs in both the nematode and the fruitfly are still under selective pressure. It is possible that these genes, which are not involved in cholesterol synthesis anymore, have been recruited to perform different new functions. We propose a more parsimonious way to explain their accelerated evolution and subsequent stabilization. The products of ERG genes in prototrophs might be involved in several biological roles, in addition to sterol synthesis. In the case of the nematode and the fruitfly, the relevant genes would have lost their ancestral function in cholesterogenesis but would have retained the other function(s), which keep them under pressure. CONCLUSIONS: By exploiting microarray data we have noticed a strong expressional correlation between the orthologs of ERG24 and ERG25 in D. melanogaster and genes encoding factors involved in intracellular protein trafficking and folding and with Start1 involved in ecdysteroid synthesis. These potential functional connections are worth being explored not only in Drosophila, but also in Caenorhabditis as well as in sterol prototrophs.

  145. Xia, X. 2008. The cost of wobble translation in fungal mitochondrial genomes: integration of two traditional hypotheses. BMC Evolutionary Biology 8:211.

    BACKGROUND: Fungal and animal mitochondrial genomes typically have one tRNA for each synonymous codon family. The codon-anticodon adaptation hypothesis predicts that the wobble nucleotide of a tRNA anticodon should evolve towards maximizing Watson-Crick base pairing with the most frequently used codon within each synonymous codon family, whereas the wobble versatility hypothesis argues that the nucleotide at the wobble site should be occupied by a nucleotide most versatile in wobble pairing, i.e., the tRNA wobble nucleotide should be G for NNY codon families, and U for NNR and NNN codon families (where Y stands for C or U, R for A or G and N for any nucleotide). RESULTS: We here integrate these two traditional hypotheses on tRNA anticodons into a unified model based on an analysis of the wobble costs associated with different wobble base pairs. This novel approach allows the relative cost of wobble pairing to be qualitatively evaluated. A comprehensive study of 36 fungal genomes suggests very different costs between two kinds of U:G wobble pairs, i.e., (1) between a G at the wobble site of a tRNA anticodon and a U at the third codon position (designated MU3:G) and (2) between a U at the wobble site of a tRNA anticodon and a G at the third codon position (designated MG3:U). CONCLUSION: In general, MU3:G is much smaller than MG3:U, suggesting no selection against U-ending codons in NNY codon families with a wobble G in the tRNA anticodon but strong selection against G-ending codons in NNR codon families with a wobble U at the tRNA anticodon. This finding resolves several puzzling observations in fungal genomics and corroborates previous studies showing that U3:G wobble is energetically more favorable than G3:U wobble.

  146. Mennigen, J. A., C. J. Martyniuk, K. Crump, H. Xiong, E. Zhao, J. Popesku, H. Anisman, A. R. Cossins, X. Xia, and V. L. Trudeau. 2008. Effects of fluoxetine on the reproductive axis of female goldfish (Carassius auratus). Physiol. Genomics 35:273-282.

    Abstract We investigated the effects of fluoxetine, a selective serotonin reuptake inhibitor, on neuroendocrine function and the reproductive axis in female goldfish. Fish were given intraperitoneal injections of fluoxetine twice a week for 14 days, resulting in five injections of 5 microg fluoxetine/g body wt. We measured the monoamine neurotransmitters serotonin, dopamine, and norepinephrine in addition to their metabolites with HPLC. Homovanillic acid, a metabolite in the dopaminergic pathway, increased significantly in the hypothalamus. Plasma estradiol levels were measured by radioimmunoassay and were significantly reduced approximately threefold after fluoxetine treatment. We found that fluoxetine also significantly reduced the expression of estrogen receptor (ER)beta1 mRNA by 4-fold in both the hypothalamus and the telencephalon and ERalpha mRNA by 1.7-fold in the telencephalon. Fluoxetine had no effect on the expression of ERbeta2 mRNA in the hypothalamus or telencephalon. Microarray analysis identified isotocin, a neuropeptide that stimulates reproductive behavior in fish, as a candidate gene affected by fluoxetine treatment. Real-time RT-PCR verified that isotocin mRNA was downregulated approximately sixfold in the hypothalamus and fivefold in the telencephalon. Intraperitoneal injection of isotocin (1 microg/g) increased plasma estradiol, providing a potential link between changes in isotocin gene expression and decreased circulating estrogen in fluoxetine-injected fish. Our results reveal targets of serotonergic modulation in the neuroendocrine brain and indicate that fluoxetine has the potential to affect sex hormones and modulate genes involved in reproductive function and behavior in the brain of female goldfish. We discuss these findings in the context of endocrine disruption because fluoxetine has been detected in the environment.

  147. Marin, A. and Xia, X. 2008. GC skew in protein-coding genes between the leading and lagging strands in bacterial genomes: new substitution models incorporating strand-bias. Journal of Theoretical Biology 253(3):508-513.

    Abstract The DNA strands in most prokaryotic genomes experience strand-biased spontaneous mutation, especially C→T mutations produced by deamination that occur preferentially in the leading strand. This has often been invoked to account for the asymmetry in nucleotide composition, typically measured by GC skew, between the leading and the lagging strand. Casting such strand asymmetry in the framework of a nucleotide substitution model is important for understanding genomic evolution and phylogenetic reconstruction. We present a substitution model showing that the increased C→T mutation will lead to positive GC skew in one strand but negative GC skew in the other, with greater C→T mutation pressure associated with greater differences in GC skew between the leading and the lagging strand. However, the model based on mutation bias alone does not predict any positive correlation in GC skew between the leading and lagging strands. We computed GC skew for coding sequences collinear with the leading and lagging strands across 339 prokaryotic genomes and found a strong and positive correlation in GC skew between the two strands. We show that the observed positive correlation can be satisfactorily explained by an improved substitution model with one additional parameter incorporating a general trend of C avoidance.
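
    GC skew is conventionally defined as (G - C) / (G + C) for a single strand. The following is a minimal sketch of that standard definition only, not the substitution model or the genome-wide analysis described in the abstract; the function names and the example sequence are illustrative assumptions.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one strand; 0.0 if the strand has no G or C."""
    s = seq.upper()
    g = s.count("G")
    c = s.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


# Made-up sequence for illustration: 5 G and 2 C on the top strand.
example = "GGGCATGGTC"
skew_top = gc_skew(example)
skew_bottom = gc_skew(reverse_complement(example))
```

    Because every G on one strand pairs with a C on the other, the two complementary strands of the same duplex always have GC skews of equal magnitude and opposite sign.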

  148. Aris-Brosou, S. and Xia, X. 2008. Phylogenetic analyses: a toolbox expanding towards Bayesian methods. International Journal of Plant Genomics Article ID 683509.

    Abstract The reconstruction of phylogenies is becoming an increasingly simple activity. This is mainly due to two reasons: the democratization of computing power and the increased availability of sophisticated yet user-friendly software. This review describes some of the latest additions to the phylogenetic toolbox, along with some of their theoretical and practical limitations. It is shown that Bayesian methods are under heavy development, as they offer the possibility to solve a number of long-standing issues and to integrate several steps of the phylogenetic analyses into a single framework. Specific topics include not only phylogenetic reconstruction, but also the comparison of phylogenies, the detection of adaptive evolution, and the estimation of divergence times between species.

  149. Carullo, M. and Xia, X. 2008. An extensive study of mutation and selection on the wobble nucleotide in tRNA anticodons in fungal mitochondrial genomes. Journal of Molecular Evolution 66:484-493.

    Abstract Two alternative hypotheses aim to predict the wobble nucleotide of tRNA anticodons in mitochondrion. The codon-anticodon adaptation hypothesis predicts that the wobble nucleotide of tRNA anticodon should evolve toward maximizing the Watson-Crick base pairing with the most frequently used codon within each synonymous codon family. In contrast, the wobble versatility hypothesis argues that the nucleotide at the wobble site should be occupied by a nucleotide most versatile in wobble pairing, i.e., the wobble site of the tRNA anticodon should be G for NNY codon families and U for NNR and NNN codon families (where Y stands for C or U, R for A or G, and N for any nucleotide). We examined codon usage and anticodon wobble sites in 36 fungal genomes to evaluate these two alternative hypotheses and identify exceptional cases that deserve new explanations. While the wobble versatility hypothesis is generally supported, there are interesting exceptions involving tRNA(Arg) translating the CGN codon family, tRNA(Trp) translating the UGR codon family, and tRNA(Met) translating the AUR codon family. Our results suggest that the potential to suppress stop codons, the historical inertia, and the conflict between translation initiation and elongation can all contribute to determining the wobble nucleotide of tRNA anticodons.

  150. Marlatt, V.L., Martyniuk, C. J., Zhang, D., Xiong, H., Watt, J., Xia, X., Moon, T., Trudeau, V.L. 2008. Auto-regulation of estrogen receptor subtypes and gene expression profiling of 17beta-estradiol action in the neuroendocrine axis of male goldfish. Molecular and Cellular Endocrinology 283:38-48.

    Abstract Auto-regulation of the three goldfish estrogen receptor (ER) subtypes was examined simultaneously in multiple tissues, in relation to mRNA levels of liver vitellogenin (VTG) and brain transcripts. Male goldfish were implanted with a silastic implant containing either no steroid or 17beta-estradiol (E2) (100 microg/g body mass) for one and seven days. Liver transcript levels of ERalpha were the most highly up-regulated of the ERs, and a parallel induction of liver VTG was observed. In the testes (7d) and telencephalon (7d), E2 induced ERalpha. In the liver (1d) and hypothalamus (7d) ERbeta1 was down-regulated, while ERbeta2 remained unchanged under all conditions. Although aromatase B levels increased in the brain, the majority of candidate genes identified by microarray in the hypothalamus (1d) decreased. These results demonstrate that ER subtypes are differentially regulated by E2, and several brain transcripts decrease upon short-term elevation of circulating E2 levels.

  151. Xiong, H., Zhang, D., Martyniuk, C.J., Trudeau, V.L., Xia, X. 2008. Using Generalized Procrustes Analysis (GPA) for normalization of cDNA microarray data. BMC Bioinformatics 9:25.

    BACKGROUND: Normalization is essential in dual-labelled microarray data analysis to remove non-biological variations and systematic biases. Many normalization methods have been used to remove such biases within slides (Global, Lowess) and across slides (Scale, Quantile and VSN). However, all these popular approaches make critical assumptions about data distribution, which are often not valid in practice. RESULTS: In this study, we propose a novel assumption-free normalization method based on the Generalized Procrustes Analysis (GPA) algorithm. Using experimental and simulated normal microarray data and boutique array data, we systematically evaluate the ability of the GPA method in normalization compared with six other popular normalization methods including Global, Lowess, Scale, Quantile, VSN, and one boutique array-specific housekeeping gene method. The assessment of these methods is based on three different empirical criteria: across-slide variability, the Kolmogorov-Smirnov (K-S) statistic and the mean square error (MSE). Compared with other methods, the GPA method performs effectively and consistently better in reducing across-slide variability and removing systematic bias. CONCLUSION: The GPA method is an effective normalization approach for microarray data analysis. In particular, it is free from the statistical and biological assumptions inherent in other normalization methods that are often difficult to validate. Therefore, the GPA method has a major advantage in that it can be applied to diverse types of array sets, especially to the boutique array where the majority of genes may be differentially expressed.

  152. Khalouei, S., Xia, X. 2008. Selective pressure against AUG triplets in the 5' untranslated region of human immunodeficiency virus type 1 supports cap-dependent translation initiation mechanism. Retrovirology: Research and Treatment 2:1-8.
  153. Xia, X. 2007. An Improved Implementation of Codon Adaptation Index. Evolutionary Bioinformatics 3:53–58.

    Abstract Codon adaptation index is a widely used index for characterizing gene expression in general and translation efficiency in particular. Current computational implementations have a number of problems leading to various systematic biases. I illustrate these problems and provide a better computer implementation to solve these problems. The improved CAI can predict protein production better than CAI from other commonly used implementations.

    Correction: In discussing the problem arising when a codon is not used in the reference set of highly expressed genes, which would yield w=0, I stated that Sharp & Li (1987) suggested using w=0.5 in that situation. Sharp & Li (1987) actually suggested using Xij=0.5. Michael Bulmer (1988, J. Evol. Biol.) suggested an alternative modification, which is to set the minimum value of w to be 0.01.
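
    The distinction in the correction (a pseudo-count Xij=0.5 versus a value of w=0.5) can be illustrated with a minimal sketch of the classic CAI calculation, in which w is each codon's count divided by the largest count in its synonymous family and CAI is the geometric mean of w over the codons of a gene. This is not the improved implementation described in the paper; the two-family codon table, the reference counts, and all function names are assumptions made up for illustration.

```python
import math

# Toy subset of synonymous codon families (assumption: only two families shown).
FAMILIES = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
}


def relative_adaptiveness(ref_counts, families=FAMILIES, zero_count=0.5):
    """Compute w for each codon from reference-set codon counts.

    A codon that is absent from the reference set gets the pseudo-count
    `zero_count` (the Xij=0.5 convention), so its w is small but non-zero.
    """
    w = {}
    for codons in families.values():
        counts = [ref_counts.get(c, 0) or zero_count for c in codons]
        top = max(counts)
        for codon, n in zip(codons, counts):
            w[codon] = n / top
    return w


def cai(codons, w):
    """Geometric mean of w over the codons of a gene that have a w value."""
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts; GAA is unused in the reference set.
ref = {"AAA": 10, "AAG": 40, "GAA": 0, "GAG": 20}
w = relative_adaptiveness(ref)
gene = ["AAG", "GAG", "AAA", "GAA"]
score = cai(gene, w)
```

    With these toy counts the unused codon GAA receives w = 0.5/20 = 0.025 rather than w = 0.5, which is the difference the correction points out.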

  154. Xia, X., Huang, H., Carullo, M., Betran, E., Moriyama, E. 2007. Conflict between translation initiation and elongation in vertebrate mitochondrial genomes. PLoS ONE 2(2):e227.

    Abstract The strand-biased mutation spectrum in vertebrate mitochondrial genomes results in an AC-rich L-strand and a GT-rich H-strand. Because the L-strand is the sense strand of 12 protein-coding genes out of the 13, the third codon position is overall strongly AC-biased. The wobble site of the anticodon of the 22 mitochondrial tRNAs is either U or G to pair with the most abundant synonymous codon, with only one exception. The wobble site of Met-tRNA is C instead of U, forming the Watson-Crick match with AUG instead of AUA, the latter being much more frequent than the former. This has been attributed to a compromise between translation initiation and elongation; i.e., AUG is not only a methionine codon, but also an initiation codon, and an anticodon matching AUG will increase the initiation rate. However, such an anticodon would impose selection against the use of AUA codons because AUA needs to be wobble-translated. According to this translation conflict hypothesis, AUA should be used relatively less frequently compared to UUA in the UUR codon family. A comprehensive analysis of mitochondrial genomes from a variety of vertebrate species revealed a general deficiency of AUA codons relative to UUA codons. In contrast, urochordate mitochondrial genomes with two tRNA(Met) genes with CAU and UAU anticodons exhibit increased AUA codon usage. Furthermore, six bivalve mitochondrial genomes with both of their tRNA-Met genes with a CAU anticodon have reduced AUA usage relative to three other bivalve mitochondrial genomes with one of their two tRNA-Met genes having a CAU anticodon and the other having a UAU anticodon. We conclude that the translation conflict hypothesis is empirically supported, and our results highlight the fine details of selection in shaping molecular evolution.

  155. Xia, X. 2007. The +4G site in Kozak consensus is not related to the efficiency of translation initiation. PLoS ONE 2(2):e188.

    Abstract The optimal context for translation initiation in mammalian species is GCCRCCaugG (where R = purine and "aug" is the initiation codon), with the -3R and +4G being particularly important. The presence of +4G has been interpreted as necessary for efficient translation initiation. Accumulated experimental and bioinformatic evidence has suggested an alternative explanation based on amino acid constraint on the second codon, i.e., amino acid Ala or Gly are needed as the second amino acid in the nascent peptide for the cleavage of the initiator Met, and the consequent overuse of Ala and Gly codons (GCN and GGN) leads to the +4G consensus. I performed a critical test of these alternative hypotheses on +4G based on 34169 human protein-coding genes and published gene expression data. The result shows that the prevalence of +4G is not related to translation initiation. Among the five G-starting codons, only alanine codons (GCN), and glycine codons (GGN) to a much smaller extent, are overrepresented at the second codon, whereas the other three codons are not overrepresented. While highly expressed genes have more +4G than lowly expressed genes, the difference is caused by GCN and GGN codons at the second codon. These results are inconsistent with +4G being needed for efficient translation initiation, but consistent with the proposal of amino acid constraint hypothesis.
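
    The +4 site is the first nucleotide of the second codon, so a +4G can be attributed to whichever G-starting codon family occupies that codon. The sketch below tallies second codons by amino acid for a handful of made-up coding sequences; it illustrates the bookkeeping only and is not the analysis or the data used in the paper. The function name and the toy sequences are assumptions.

```python
from collections import Counter


def classify_second_codon(cds: str) -> str:
    """Label a coding sequence by the amino acid behind its +4G, if any."""
    cds = cds.upper().replace("U", "T")
    if len(cds) < 6 or cds[:3] != "ATG":
        return "skipped"  # too short, or does not start with ATG
    codon = cds[3:6]
    if any(base not in "ACGT" for base in codon):
        return "skipped"  # ambiguous nucleotide in the second codon
    if codon[0] != "G":
        return "no +4G"
    if codon[1] == "C":
        return "Ala"  # GCN
    if codon[1] == "G":
        return "Gly"  # GGN
    if codon[1] == "T":
        return "Val"  # GTN
    # Remaining case is GAN: GAT/GAC encode Asp, GAA/GAG encode Glu.
    return "Asp" if codon[2] in "TC" else "Glu"


# Made-up coding sequences for illustration only.
toy_cds = ["ATGGCTAAA", "ATGGGATTT", "ATGAAACCC", "ATGGCCTGA"]
tally = Counter(classify_second_codon(s) for s in toy_cds)
```

    Comparing such tallies against the overall codon usage of the same genes is one way to ask whether +4G is overrepresented in general or only through particular codon families such as GCN and GGN.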

  156. Khalouei, S., X. Yao, J. Mennigen, M. Carullo, P. Ma, Z. Song, H. Xiong, and Xia, X. 2007. Bioinformatic Approach to Identify Penultimate Amino Acids Efficient for N-Terminal Methionine Excision. Pp. 386-389. Bioinformatics and Biomedical Engineering, 2007, IEEE. The 1st International Conference on Bioinformatics and Biomedical Engineering (ICBBE2007).
  157. Martyniuk, C. J., Xiong H., Crump, K., Chiu, S., Sardana, R., Nadler, A., Gerrie, E. R., Xia, X., Trudeau, V. L. 2006. Gene expression profiling in the neuroendocrine brain of male goldfish (Carassius auratus) exposed to 17-alpha-ethinylestradiol. Physiol. Genomics 27(3):328-336.

    Abstract 17-alpha ethinylestradiol (EE2), a pharmaceutical estrogen, is detectable in water systems worldwide. Although studies report on the effects of xenoestrogens in tissues such as liver and gonad, few studies to date have investigated the effects of EE2 in the vertebrate brain at a large scale. The purpose of this study was to develop a goldfish brain-enriched cDNA array and use this in conjunction with a mixed tissue carp microarray to study the genomic response to EE2 in the brain. Gonad-intact male goldfish were exposed to nominal concentrations of 0.1 nM (29.6 ng/l) and 1.0 nM (296 ng/l) EE2 for 15 days. Male goldfish treated with the higher dose of EE2 had significantly smaller gonads compared with controls. Males also had a significantly reduced level of circulating testosterone (T) and 17beta-estradiol (E2) in both treatment groups. Candidate genes identified by microarray analysis fall into functional categories that include neuropeptides, cell metabolism, and transcription/translation factors. Differentially expressed genes verified by real-time RT-PCR included brain aromatase, secretogranin-III, and interferon-related developmental regulator 1. Our results suggest that the expression of genes in the sexually mature adult brain appears to be resistant to low EE2 exposure but is affected significantly at higher doses of EE2. This study demonstrates that microarray technology is a useful tool to study the effects of endocrine disrupting chemicals on neuroendocrine function and suggests that exposure to EE2 may have significant effects on localized E2 synthesis in the brain by affecting transcription of brain aromatase.

  158. Xia, X. 2006. Topological Bias in Distance-Based Phylogenetic Methods: Problems with Over- and Underestimated Genetic Distances. Evolutionary Bioinformatics 2:375–387.

    Abstract I show several types of topological biases in distance-based methods that use the least-squares method to evaluate branch lengths and the minimum evolution (ME) or the Fitch-Margoliash (FM) criterion to choose the best tree. For a 6-species tree, there are two tree shapes, one with three cherries (a cherry is a pair of adjacent leaves descending from the most recent common ancestor), and the other with two. When genetic distances are underestimated, the 3-cherry tree shape is favored with either the ME or FM criterion. When the genetic distances are overestimated, the ME criterion favors the 2-cherry tree, but the direction of bias with the FM criterion depends on whether negative branches are allowed, i.e. allowing negative branches favors the 3-cherry tree shape but disallowing negative branches favors the 2-cherry tree shape. The extent of the bias is explored by computer simulation of sequence evolution.

  159. Wang, H. C., Xia, X., D. Hickey. 2006. Thermal adaptation of small ribosomal RNA genes: a comparative study. Journal of Molecular Evolution 63(1):120-126.

    Abstract We carried out a comprehensive survey of small subunit ribosomal RNA sequences from archaeal, bacterial, and eukaryotic lineages in order to understand the general patterns of thermal adaptation in the rRNA genes. Within each lineage, we compared sequences from mesophilic, moderately thermophilic, and hyperthermophilic species. We carried out a more detailed study of the archaea, because of the wide range of growth temperatures within this group. Our results confirmed that there is a clear correlation between the GC content of the paired stem regions of the 16S rRNA genes and the optimal growth temperature, and we show that this correlation cannot be explained simply by phylogenetic relatedness among the thermophilic archaeal species. In addition, we found a significant, positive relationship between rRNA stem length and growth temperature. These correlations are found in both bacterial and archaeal rRNA genes. Finally, we compared rRNA sequences from warm-blooded and cold-blooded vertebrates. We found that, while rRNA sequences from the warm-blooded vertebrates have a higher overall GC content than those from the cold-blooded vertebrates, this difference is not concentrated in the paired regions of the molecule, suggesting that thermal adaptation is not the cause of the nucleotide differences between the vertebrate lineages.

  160. Xia, X., Wang, H. C., Z. Xie, M. Carullo, H. Huang, and D. Hickey. 2006. Cytosine usage modulates the correlation between CDS length and CG content in prokaryotic genomes. Molecular Biology and Evolution 23:1450-1454.

    Abstract Previous studies have argued that, given the AT-rich nature of stop codons, the length and CG% of coding sequences (CDSs) should be positively correlated. This prediction is generally supported empirically by prokaryotic genomes. However, the correlation is weak for a number of species, with 4 species showing a negative correlation. Here we formulate a more general hypothesis incorporating selection against cytosine (C) usage to explain the lack of strong positive correlation between the length and GC% of CDSs. Two factors contribute to the selection against C usage in long CDSs. First, C is the least abundant nucleotide in the cell, and a long CDS should have fewer Cs to increase transcription efficiency. Second, C is prone to mutation to U/T and selection for increased reliability should reduce C usage in long CDSs. Empirical data from prokaryotic genomes lend strong support for this new hypothesis.

  161. Cai, J. J., Smith, D. K., Xia, X., Yuen, K. Y. 2006. MBEToolbox 2.0: An enhanced version of a MATLAB toolbox for Molecular Biology and Evolution. Evolutionary Bioinformatics 2:187-190.

    Abstract MBEToolbox is an extensible MATLAB-based software package for analysis of DNA and protein sequences. MBEToolbox version 2.0 includes enhanced functions for phylogenetic analyses by the maximum likelihood method. For example, it is capable of estimating the synonymous and nonsynonymous substitution rates using a novel or several known codon substitution models. MBEToolbox 2.0 introduces new functions for estimating site-specific evolutionary rates by using a maximum likelihood method or an empirical Bayesian method. It also incorporates several different methods for recombination detection. Multi-platform versions of the software are freely available at http://www.bioinformatics.org/mbetoolbox/.

  162. Xia, X. and G. Palidwor. 2005. Genomic Adaptation to Acidic Environment: Evidence from Helicobacter pylori. American Naturalist 166:776-784.

    Abstract The origin of new functions is fundamental in understanding evolution, and three processes known as adaptation, preadaptation, and exaptation have been proposed as possible evolutionary pathways leading to the origin of new functions. Here we examine the origin of an acid resistance mechanism in the mammalian gastric pathogen Helicobacter pylori, with reference to these three evolutionary pathways. The mechanism involved is that H. pylori, when exposed to the acidic environment in mammalian stomach, restricts the acute proton entry across its membrane by its increased usage of positively charged amino acids in the inner and outer membrane proteins. The results of our comparative genomic analysis between H. pylori, the two closely related species Helicobacter hepaticus and Campylobacter jejuni, and other relevant proteobacterial species are incompatible with the hypotheses invoking preadaptation or exaptation. The acid resistance mechanism most likely arose by selection favoring an increased usage of positively charged lysine in membrane proteins.

  163. Shi, B., and X. Xia. 2005. Genetic variation in clones of Pseudomonas pseudoalcaligenes after ten months of selection in different thermal environments in the laboratory. Curr Microbiol 50:238-45.

    Abstract The random amplification of polymorphic DNA (RAPD) method was used to examine genetic variation in experimental clones of Pseudomonas pseudoalcaligenes in two experimental groups, as well as their common ancestor. Six clones derived from a single colony of P. pseudoalcaligenes were cultured in two different thermal regimes for 10 months. Three clones in the Control group were cultured at constant temperature of 35 degrees C and another three clones in the High Temperature (HT) group were propagated at incremental temperature ranging from 41 to 47 degrees C for 10 months. A total of 45 RAPD primers generated 146 polymorphic markers. Analysis of molecular variance (AMOVA) revealed mild (11%) but significant (P < 0.001) genetic difference between the Control and the HT clones. Phylogenetic analysis based on pairwise genetic distances showed that the HT clones were more divergent from the ancestor and from each other than the Control clones, implying that the HT clones of P. pseudoalcaligenes may have evolved faster than the Control clones.

  164. Xia, X. and K. Y. Yuen. 2005. Differential selection and mutation between dsDNA and ssDNA phages shape the evolution of their genomic AT percentage. BMC Genetics 6:20.

    BACKGROUND: Bacterial genomes differ dramatically in AT%. We have developed a model to show that the genomic AT% in rapidly replicating bacterial species can be used as an index of the availability of nucleotides A and T for DNA replication in cellular medium. This index is then used to (1) study the evolution and adaptation of the bacteriophage genomic AT% in response to the differential nucleotide availability of the host and (2) test the prediction that double-stranded DNA (dsDNA) phage should exhibit better adaptation than single-stranded DNA (ssDNA) phage because the rate of spontaneous deamination, which leads to C→T or C→U mutations depending on whether C is methylated or not, is about 100-fold greater in ssDNA than in dsDNA. RESULTS: We retrieved 79 dsDNA phage and 27 ssDNA phage genomes together with their host genomic sequences. The dsDNA phages have their genomic AT% better adapted to the host genomic AT% than ssDNA phage. The poorer adaptation of the ssDNA phage can be partially accounted for by the C→T(U) mutations mediated by the spontaneous deamination. For ssDNA phage, the genomic A% is more strongly correlated with their host genomic AT% than the genomic T%. CONCLUSION: A significant fraction of variation in the genomic AT% in the dsDNA phage, and that in the genomic A% and T% of the ssDNA phage, can be explained by the difference in selection and mutation between them.

  165. Cai, J., Smith, D., X. Xia, and K. Y. Yuen. 2005. MBEToolbox: a Matlab toolbox for sequence data analysis of molecular biology and evolution. BMC Bioinformatics 6:64.

    BACKGROUND: MATLAB is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modeling, simulation, and scientific and engineering graphics. However, few functions are freely available in MATLAB to perform the sequence data analyses specifically required for molecular biology and evolution. RESULTS: We have developed a MATLAB toolbox, called MBEToolbox, aimed at filling this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible, functional framework for users with more specialized requirements to explore and analyze aligned nucleotide or protein sequences from an evolutionary perspective. The full functions in the toolbox are accessible through the command-line for seasoned MATLAB users. A graphical user interface, that may be especially useful for non-specialist end users, is also provided. CONCLUSION: MBEToolbox is a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/ and http://bioinformatics.org/project/?group_id=454

  166. Xia, X. 2005. Mutation and Selection on the Anticodon of tRNA Genes in Vertebrate Mitochondrial Genomes. Gene 345:13-20.

    Abstract The H-strand of vertebrate mitochondrial DNA is left single-stranded for hours during the slow DNA replication. This facilitates C→U mutations on the H-strand (and consequently G→A mutations on the L-strand) via spontaneous deamination which occurs much more frequently on single-stranded than on double-stranded DNA. For the 12 coding sequences (CDS) collinear with the L-strand, NNY synonymous codon families (where N stands for any of the four nucleotides and Y stands for either C or U) end mostly with C, and NNR and NNN codon families (where R stands for either A or G) end mostly with A. For the lone ND6 gene on the other strand, the codon bias is the opposite, with NNY codon families ending mostly with U and NNR and NNN codon families ending mostly with G. These patterns are consistent with the strand-specific mutation bias. The codon usage biased towards C-ending and A-ending in the 12 CDS sequences affects the codon-anticodon adaptation. The wobble site of the anticodon is always G for NNY codon families dominated by C-ending codons and U for NNR and NNN codon families dominated by A-ending codons. The only, but consistent, exception is the anticodon of tRNA-Met which consistently has a 5'-CAU-3' anticodon base-pairing with the AUG codon (the translation initiation codon) instead of the more frequent AUA. The observed CAU anticodon (matching AUG) would increase the rate of translation initiation but would reduce the rate of peptide elongation because most methionine codons are AUA, whereas the unobserved UAU anticodon (matching AUA) would increase the elongation rate at the cost of translation initiation rate. The consistent CAU anticodon in tRNA-Met suggests the importance of maximizing the rate of translation initiation.

  167. Baron, D., J. Cocquet, X. Xia, M. Fellous, Y. Guiguen, and R. A. Veitia. 2004. An evolutionary and functional analysis of FoxL2 in rainbow trout gonad differentiation. J. Mol. Endocrinol. 33:705-715.

    Abstract FOXL2 is a forkhead transcription factor involved in ovarian development and function. Here, we have studied the evolution and pattern of expression of the FOXL2 gene and its paralogs in fish. We found well conserved FoxL2 sequences (FoxL2a) and divergent genes, whose forkhead domains belonged to the class L2 and were shown to be paralogs of the FoxL2a sequences (named FoxL2b). In the rainbow trout, FoxL2a and FoxL2b were specifically expressed in the ovary, but displayed different temporal patterns of expression. FoxL2a expression correlated with the level of aromatase, the key enzyme in estrogen production, and an estrogen treatment used to feminize genetically male individuals elicited the up-regulation of both paralogs. Conversely, androgens or an aromatase inhibitor down-regulated FoxL2a and FoxL2b in females. We speculate that there is a direct link between estrogens and FoxL2 expression in fish, at least during the period where the identity of the gonad is sensitive to hormonal treatments.

  168. Xia, X. 2004. A peculiar codon usage pattern revealed after removing the effect of DNA methylation. Proceedings of the 4th International Conference on Bioinformatics of Genome Regulation and Structure 1:216-220.
  169. Cocquet, J., E. De Baere, M. Gareil, M. Pannetier, X. Xia, M. Fellous, R. Veitia. 2003. Structure, evolution and expression of the FOXL2 transcription unit. Cytogenetic Genome Res 101:206-211.

    Abstract FOXL2 is a putative transcription factor involved in ovarian development and function. Its mutations in humans are responsible for the blepharophimosis syndrome, characterized by eyelid malformations and premature ovarian failure (POF). Here we have performed a comparative sequence analysis of FOXL2 sequences of ten vertebrate species. We demonstrate that the entire open reading frame (ORF) is under purifying selection leading to strong protein conservation. We also review recent data on FOXL2 transcript and protein expression. FOXL2 has been shown 1) to be the earliest known sex dimorphic marker of ovarian determination/differentiation in vertebrates, 2) to have, at least in mammals, an ovarian expression persisting until adulthood. The conservation of its sequence and pattern of expression suggests that FOXL2 might be a key factor in the early development of the vertebrate female gonad and involved later in adult ovarian function. Finally, we provide arguments for the existence of an alternative transcript in rodents, that may arise from a differential polyadenylation. Although it has only been demonstrated in rodents, its presence/absence in other species deserves further investigation.

  170. Xia, X. 2003. DNA methylation and Mycoplasma genomes. Journal of Molecular Evolution 57:S21-S28.

    Abstract DNA methylation is one of the many hypotheses proposed to explain the observed deficiency in CpG dinucleotides in a variety of genomes covering a wide taxonomic distribution. Recent studies challenged the methylation hypothesis on empirical grounds. First, it cannot explain why the Mycoplasma genitalium genome exhibits strong CpG deficiency without DNA methylation. Second, it cannot explain the great variation in CpG deficiency between M. genitalium and M. pneumoniae that also does not have CpG-specific methyltransferase genes. I analyzed the genomic sequences of these Mycoplasma species together with the recently sequenced genomes of M. pulmonis, Ureaplasma urealyticum, and Staphylococcus aureus, and found the results fully compatible with the methylation hypothesis. In particular, I present compelling empirical evidence to support the following scenario. The common ancestor of the three Mycoplasma species has CpG-specific methyltransferases, and has evolved strong CpG deficiency as a result of the specific DNA methylation. Subsequently, this ancestral genome diverged into M. pulmonis and the common ancestor of M. pneumoniae and M. genitalium. M. pulmonis has retained methyltransferases and exhibits the strongest CpG deficiency. The common ancestor lost the methyltransferase gene and then diverged into M. genitalium and M. pneumoniae. M. genitalium and M. pneumoniae, after losing methylation activities, began to regain CpG dinucleotides through random mutation. M. genitalium evolved more slowly than M. pneumoniae, gained relatively fewer CpG dinucleotides, and is more CpG-deficient.

  171. Shi, B., X. Xia. 2003. Changes in growth parameters of Pseudomonas pseudoalcaligenes after ten months culturing at increasing temperature in the laboratory. FEMS Microbiology Ecology 45:127-134.

    Abstract In this paper, we report the thermal adaptation of Pseudomonas pseudoalcaligenes, characterized as changes in growth parameters. Six clones derived from a single colony of P. pseudoalcaligenes were cultured in two different temperature regimes for 10 months, with three clones forming the control group, cultured at a constant temperature, and another three clones forming the high-temperature (HT) group, cultured at increasing temperature (from 41 to 47 degrees C). Three growth parameters were measured: the lag time (lambda), which is the period between the time of transfer to a new medium and the time when the cell replication starts; the maximum growth rate (mu(m)); and the maximum yield (A). These three parameters are major components of bacterial fitness. The Gompertz and logistic models were used to estimate these three parameters. The two models gave almost identical estimates, but the Gompertz model had R(2) values consistently larger than the logistic model. The HT clones had significantly shorter lambda, but higher mu(m) and A than the control clones when both were grown at the originally stressful temperature of 45 degrees C, suggesting significant thermal adaptation. Interestingly, the HT clones grew equally well as the control clones at 35 degrees C, i.e. improved performance at 45 degrees C was not associated with a reduced performance at 35 degrees C.
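
    The three growth parameters named above (lag time, maximum growth rate, maximum yield) can be read as the parameters of a sigmoid growth curve. The following is a minimal sketch, assuming the modified Gompertz and logistic parameterizations of Zwietering et al. (1990); the abstract does not say which parameterization was fitted, and the parameter values are hypothetical, not taken from the paper.

```python
import math

def gompertz(t, A, mu_m, lam):
    """Modified Gompertz curve (assumed form).
    A = maximum yield, mu_m = maximum growth rate, lam = lag time."""
    return A * math.exp(-math.exp(mu_m * math.e / A * (lam - t) + 1.0))

def logistic(t, A, mu_m, lam):
    """Modified logistic curve (assumed form), same three parameters."""
    return A / (1.0 + math.exp(4.0 * mu_m / A * (lam - t) + 2.0))

# Hypothetical parameter values chosen only for illustration.
A, mu_m, lam = 1.2, 0.3, 2.0
for t in (0, 2, 5, 10, 20):
    print(t, round(gompertz(t, A, mu_m, lam), 4), round(logistic(t, A, mu_m, lam), 4))
```

    Both curves start near zero, rise most steeply after the lag time, and level off at the maximum yield A, which is why the two models can return similar estimates of the same three parameters.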

  172. Xia, X., Z. Xie, K. Kjer. 2003. 18S rRNA and Tetrapod Phylogeny. Systematic Biology 52(3):283-295 (Editor's choice in )

    Abstract Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data, and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.

  173. Xia, X., Z. Xie, W. H. Li. 2003. Effects of GC Content and Mutational Pressure on the Lengths of Exons and Coding Sequences. Journal of Molecular Evolution 56:362-370.

    Abstract It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
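
    The central statistic above is the Spearman correlation between sequence length and %GC. The following is a minimal sketch of that computation using only the standard library; the helper names and the toy sequences are hypothetical and are not data from the paper.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def gc_percent(seq):
    """Percentage of G and C in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

# Toy sequences, invented for illustration only.
toy_exons = ["ATGAAATTT", "ATGGCGCCGGCA", "ATGCATTAA", "ATGGGCGCGCCGGGC"]
lengths = [len(s) for s in toy_exons]
gc = [gc_percent(s) for s in toy_exons]
print(spearman(lengths, gc))
```

    Computing this separately for first, internal, and last exons would reproduce the kind of group-wise comparison the abstract describes.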

  174. Shi, B., X. Xia. 2003. Morphological changes of Pseudomonas pseudoalcaligenes in response to temperature selection. Current Microbiology 46:120-123.

    Abstract Adaptation to novel environments usually entails morphological changes. The cell morphology of six experimental populations of Pseudomonas pseudoalcaligenes and their common ancestor were examined with scanning electron microscopy (SEM). The six experimental populations were propagated under different temperatures for 10 months: three of them cultured at constant normal temperature (35 degrees C) forming the control group, and the other three cultured at incremental higher temperatures (from 41 degrees to 47 degrees C) as the HT group. SEM showed the deformed and elongated cells in the 6-h cultures of both ancestral and control populations at 45 degrees C, indicating that 45 degrees C is stressful for the ancestral and the control populations. In contrast, the HT populations retained normal cell shape in the 6-h cultures at both 35 degrees C and 45 degrees C. The mean cell volumes of control and HT populations increased 29% and 34%, respectively, relative to the ancestor at their respective thermal regimens, suggesting that the culturing conditions might favor larger cells.

  175. Xia, X., Z. Xie, M. Salemi, L. Chen, Y. Wang. 2003. An index of substitution saturation and its application. Molecular Phylogenetics and Evolution 26:1-7.

    Abstract We introduce a new index to measure substitution saturation in a set of aligned nucleotide sequences. The index is based on the notion of entropy in information theory. We derive the critical values of the index based on computer simulation with different sequence lengths, different number of OTUs and different topologies. The critical value enables researchers to quickly judge whether a set of aligned sequences is useful in phylogenetics. We illustrate the index by applying it to an analysis of the aligned sequences of the elongation factor-1alpha gene originally used to resolve the deep phylogeny of major arthropod groups. The method has been implemented in DAMBE.
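
    The abstract states that the index is based on entropy in the information-theoretic sense. The following is a minimal sketch of the underlying ingredient only, the Shannon entropy of each alignment column; it is not the published index, its critical values, or the DAMBE implementation, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter

def site_entropy(column):
    """Shannon entropy (bits) of the nucleotides at one aligned site.
    Characters other than A, C, G, T (gaps, ambiguity codes) are ignored."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mean_site_entropy(alignment):
    """Average per-site entropy over an alignment of equal-length sequences."""
    entropies = [site_entropy("".join(col)) for col in zip(*alignment)]
    return sum(entropies) / len(entropies)

# Toy alignment, invented for illustration only.
toy_alignment = ["AC", "AG", "AT", "AA"]
print(mean_site_entropy(toy_alignment))
```

    An invariant site has entropy 0 and a site with all four nucleotides at equal frequency has entropy 2 bits, so higher average entropy corresponds to sequences that look more randomized by substitution.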

  176. Cocquet, J., E. Pailhoux, F. Jaubert, N. Servel, X. Xia, M. Pannetier, E. De Baere, L. Messiaen, C. Cotinot, M. Fellous, R. Veitia. 2002. Evolution and expression of FOXL2. Journal of Medical Genetics 39:916-921.

    Abstract Mutations in FOXL2, a forkhead transcription factor gene, have recently been shown to cause the blepharophimosis-ptosis-epicanthus inversus syndrome (BPES). This rare genetic disorder leads to a complex eyelid malformation associated or not with premature ovarian failure (BPES type I or II, respectively). We performed a comparative analysis of the FOXL2 sequence in several species (human, goat, mouse, and pufferfish) showing that the FOXL2 coding region is highly conserved in these species. The FOXL2 protein contains a polyalanine tract whose role has not yet been elucidated. Recurrent mutations leading to its expansion result in BPES type II and account for 30% of the deleterious alterations detected in the open reading frame (ORF) of FOXL2. We showed that the number of alanine residues is strictly conserved among the mammals studied, suggesting the existence of strong functional or structural constraints. We provide immunohistochemical evidence indicating that FOXL2 is a nuclear protein specifically expressed in eyelids and in fetal and adult ovarian follicular cells. It does not undergo any major post-translational maturation. FOXL2 is the earliest known marker of ovarian differentiation in mammals and may play a role in ovarian somatic cell differentiation and in further follicle development and/or maintenance.

  177. Xia, X., T. Wei, Z. Xie and A. Danchin. 2002. Genomic changes in nucleotide and di-nucleotide frequencies in Pasteurella multocida cultured under high temperature. Genetics 161:1385-1394.

    Abstract We used 94 RAPD primers of different nucleotide composition to probe the genomic differences between a highly virulent P. multocida strain and an attenuated vaccine strain derived from the virulent strain after culturing the latter under increasing temperature for approximately 14,400 generations. The GC content of the vaccine strain is significantly (P < 0.05) lower than that of the virulent strain, contrary to the popular hypothesis of covariation between the GC content and temperature. The frequencies of AA, TA, and TT dinucleotides were higher, and those of AT, GC, and CG dinucleotides were lower, in the vaccine strain than in the virulent strain. A statistic called genomic RAPD entropy is formulated to measure the randomness of the genome, and the genome of the vaccine strain is more random than that of the virulent strain. These differences between the virulent and vaccine strains are interpreted in terms of mutation and selection under increased culturing temperature. A method for estimating substitution rates is developed in the appendix.

  178. Xia, X. and Z. Xie. 2002. Protein Structure, Neighbor Effect, and a New Index of Amino Acid Dissimilarities. Molecular Biology and Evolution 19:58-67.

    Abstract Amino acids interact with each other, especially with neighboring amino acids, to generate protein structures. We studied the pattern of association and repulsion of amino acids based on 24,748 protein-coding genes from human, 11,321 from mouse, and 15,028 from Escherichia coli, and documented the pattern of neighbor preference of amino acids. All amino acids have different preferences for neighbors. We have also analyzed 7,342 proteins with known secondary structure and estimated the propensity of the 20 amino acids occurring in three of the major secondary structures, i.e., helices, sheets, and turns. Much of the neighbor preference can be explained by the propensity of the amino acids in forming different secondary structures, but there are also a number of intriguing association and repulsion patterns. The similarity in neighbor preference among amino acids is significantly correlated with the number of amino acid substitutions in both mitochondrial and nuclear genes, with amino acids having similar sets of neighbors replacing each other more frequently than those having very different sets of neighbors. This similarity in neighbor preference is incorporated into a new index of amino acid dissimilarities that can predict nonsynonymous codon substitutions better than the two existing indices of amino acid dissimilarities, i.e., Grantham's and Miyata's distances.

  179. Xia, X. and Z. Xie. 2001. AMADA: Analysis of microarray data. Bioinformatics 17:569-570.

    Abstract AMADA is a Windows program for identifying co-expressed genes from microarray data. It performs data transformation, principal component analysis, a variety of cluster analyses and extensive graphic functions for visualizing expression profiles.

  180. Xia, X. and Z. Xie. 2001. DAMBE: Data analysis in molecular biology and evolution. Journal of Heredity 92:371-373.

    Abstract DAMBE (data analysis in molecular biology and evolution) is an integrated software package for converting, manipulating, statistically and graphically describing, and analyzing molecular sequence data with a user-friendly Windows 95/98/2000/NT interface. DAMBE is free and can be downloaded from http://web.hku.hk/~xxia/software/software.htm.

  181. Chen, B., and X. Xia. 2001. The genus Schevodera Borchmann: Phylogeny, historic biogeography and new Chinese records, with description of a new species (Coleoptera: Tenebrionidae: Lagriinae). Oriental Insects 35:3-27.

    Abstract Schevodera Borchmann belongs to the subfamily Lagriinae and its members are phytophagous. A new species, S. glabricollis, is described from China. Redescriptions of the genus and two known species, S. gracilicornis and S. inflata, with new records for China are given. A key to Chinese species is given. The phylogeny of the nine known species and one subspecies is cladistically analysed based on 21 morphological characters from adults. The confidence of the phylogram obtained from the cladistic analysis and its monophylies are examined with PTP and T-PTP tests. The ancestral distribution of the genus is also reconstructed based on the dispersal-vicariance analysis. The results suggest that the genus would be monophyletic. In the late Permian to late Triassic period around 255–220 million years ago, it is hypothesized to have originated from a Lagria-like ancestral species between western Yunnan, China and Burma in the Shan-Thai terrain. It dispersed from western Yunnan and northern Burma to Sumatra and Java, and then northward through Borneo to Palawan, Luzon and finally Mindanao. Based on phylogeny and historic biogeography, the genus is divided into three species groups: Yunnan, Indonesia and Philippines groups. The Yunnan group is the most primitive, consisting of S. inflata, S. glabricollis and S. gracilicornis, and is mainly distributed in Yunnan and Burma. The Indonesia group includes S. hirticollis and S. hirticollis salvazai, S. curticollis and S. dohrni, and occurs primarily in Indonesia but also reaches into Burma and the Philippines. The S. hirticollis salvazai has dispersed from Burma to Laos. The group originated from the ancestor of the Yunnan group after the Eocene, i.e. no more than 50 million years ago. The monophyletic Philippines group is composed of three endemic species: S. setosa, S. spoliata and S. insularis. It originated from the ancestor of the Indonesian group after the Miocene around 20 million years ago and dispersed from Palawan to Luzon and then Mindanao. The synapomorphies between these groups, interspecific phylogenetic relationships, time and place of origin and potential distribution of each species are also discussed in detail.

  182. Xia, X. 2000. Phylogenetic Relationship among Horseshoe Crab Species: The Effect of Substitution Models on Phylogenetic Analyses. Systematic Biology 49:87-100.

    Abstract The horseshoe crabs, known as living fossils, have maintained their morphology almost unchanged for the past 150 million years. The little morphological differentiation among horseshoe crab lineages has resulted in substantial controversy concerning the phylogenetic relationship among the extant species of horseshoe crabs, especially among the three species in the Indo-Pacific region. Previous studies suggest that the three species constitute a phylogenetically unresolvable trichotomy, the result of a cladogenetic process leading to the formation of all three Indo-Pacific species in a short geological time. Data from two mitochondrial genes (for 16S ribosomal RNA and cytochrome oxidase subunit I) and one nuclear gene (for coagulogen) in the four species of horseshoe crabs and outgroup species were used in a phylogenetic analysis with various substitution models. All three genes yield the same tree topology, with Tachypleus gigas and Carcinoscorpius rotundicauda grouped together as a monophyletic taxon. This topology is significantly better than all the alternatives when evaluated with the RELL (resampling estimated log-likelihood) method.

  183. Xia, X. and W.-H. Li 1998. What amino acid properties affect protein evolution? Journal of Molecular Evolution 47:557-564.

    Abstract We studied 10 protein-coding mitochondrial genes from 19 mammalian species to evaluate the effects of 10 amino acid properties on the evolution of the genetic code, the amino acid composition of proteins, and the pattern of nonsynonymous substitutions. The 10 amino acid properties studied are the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation. The genetic code appears to have evolved toward minimizing polarity and hydropathy but not the other seven properties. This can be explained by our finding that the presumably primitive amino acids differed much only in polarity and hydropathy, but little in the other properties. Only the chemical composition (C) and isoelectric point (IE) appear to have affected the amino acid composition of the proteins studied, that is, these proteins tend to have more amino acids with typical C and IE values, so that nonsynonymous mutations tend to result in small differences in C and IE. All properties, except for hydroxythiolation, affect the rate of nonsynonymous substitution, with the observed amino acid changes having only small differences in these properties, relative to the spectrum of all possible nonsynonymous mutations.

  184. Xia, X. 1998. How optimized is the translational machinery in E. coli, S. typhimurium, and S. cerevisiae? Genetics 149: 37-44.

    Abstract The optimization of the translational machinery in cells requires the mutual adaptation of codon usage and tRNA concentration, and the adaptation of tRNA concentration to amino acid usage. Two predictions were derived based on a simple deterministic model of translation which assumes that elongation of the peptide chain is rate-limiting. The highest translational efficiency is achieved when the codon recognized by the most abundant tRNA reaches the maximum frequency. For each codon family, the tRNA concentration is optimally adapted to codon usage when the concentration of different tRNA species matches the square-root of the frequency of their corresponding synonymous codons. When tRNA concentration and codon usage are well adapted to each other, the optimal content of all tRNA species carrying the same amino acid should match the square-root of the frequency of the amino acid. These predictions are examined against empirical data from Escherichia coli, Salmonella typhimurium, and Saccharomyces cerevisiae.
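
    The square-root prediction above can be illustrated numerically. The following is a minimal sketch under a stated assumption that is not spelled out in the abstract: the time to decode a codon is taken to be inversely proportional to the share of tRNA assigned to it, so the average decoding cost of a codon family is the sum of frequency divided by tRNA share. The codon labels and frequencies are hypothetical.

```python
import math

def sqrt_rule(codon_freqs):
    """tRNA shares proportional to the square root of codon frequency."""
    roots = [math.sqrt(f) for f in codon_freqs]
    total = sum(roots)
    return [r / total for r in roots]

def decoding_cost(codon_freqs, trna_shares):
    """Average decoding cost, assuming time per codon is 1 / tRNA share."""
    return sum(f / p for f, p in zip(codon_freqs, trna_shares))

# Hypothetical two-codon family: codon_1 is used 80% of the time.
freqs = [0.8, 0.2]
candidates = {
    "square-root": sqrt_rule(freqs),
    "proportional": freqs,
    "equal": [0.5, 0.5],
}
for name, shares in candidates.items():
    print(name, [round(p, 4) for p in shares], round(decoding_cost(freqs, shares), 4))
```

    Under this assumed cost function the square-root allocation gives a lower cost than matching tRNA shares to codon frequencies directly or splitting them equally, which is the sense in which the abstract calls it optimal.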

  185. Xia, X. 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Molecular Biology and Evolution 15:336-344.

    Abstract Substitution rates at the three codon positions (r1, r2, and r3) of mammalian mitochondrial genes are in the order of r3 > r1 > r2, and the rate heterogeneity at the three positions, as measured by the shape parameter of the gamma distribution (alpha 1, alpha 2, and alpha 3), is in the order of alpha 3 > alpha 1 > alpha 2. The causes for the rate heterogeneity at the three codon positions remain unclear and, in particular, there has been no satisfactory explanation for the observation of alpha 1 > alpha 2. I attempted to dissect the causes of rate heterogeneity by studying the pattern of nonsynonymous substitutions with respect to codon positions in 10 mitochondrial genes from 19 mammalian species. Nonsynonymous substitutions involve more different amino acid replacements at the second than at the first codon position, which results in r1 > r2. The difference between r1 and r2 increases with the intensity of purifying selection, and so does the rate heterogeneity in nonsynonymous substitutions among sites at the same codon position. All mitochondrial genes appear to have functionally important and unimportant codons, with the latter having all three codon positions prone to nonsynonymous substitutions. Within the functionally important codons, the second codon position is much more conservative than the first codon position. This explains why alpha 1 > alpha 2. The result suggests that overweighting of the second codon position in phylogenetic analysis may be a misguided practice.

  186. Xia, X. 1996. Maximising transcription efficiency causes codon usage bias. Genetics 144:1309-1320.

    Abstract The rate of protein synthesis depends on both the rate of initiation of translation and the rate of elongation of the peptide chain. The rate of initiation depends on the encountering rate between ribosomes and mRNA; this rate in turn depends on the concentration of ribosomes and mRNA. Thus, patterns of codon usage that increase transcriptional efficiency should increase mRNA concentration, which in turn would increase the initiation rate and the rate of protein synthesis. An optimality model of the transcriptional process is presented with the prediction that the most frequently used ribonucleotide at the third codon sites in mRNA molecules should be the same as the most abundant ribonucleotide in the cellular matrix where mRNA is transcribed. This prediction is supported by four kinds of evidence. First, A-ending codons are the most frequently used synonymous codons in mitochondria, where ATP is much more abundant than the other three ribonucleotides. Second, A-ending codons are more frequently used in mitochondrial genes than in nuclear genes. Third, protein genes from organisms with a high metabolic rate use more A-ending codons and have higher A content in their introns than those from organisms with a low metabolic rate.

  187. Xia, X., Hafner, M. S. and P. D. Sudman. 1996. On transition bias in mitochondrial genes of pocket gophers. Journal of Molecular Evolution 43:32-40.

    Abstract The relative contribution of mutation and purifying selection to transition bias has not been quantitatively assessed in mitochondrial protein genes. The observed transition/transversion (s/v) ratio is (μs Ps)/(μv Pv), where μs and μv denote the mutation rates of transitions and transversions, respectively, and Ps and Pv denote the fixation probabilities of transitions and transversions, respectively. Because selection against synonymous transitions can be assumed to be roughly equal to that against synonymous transversions, Ps/Pv ≈ 1 at fourfold degenerate sites, so that the s/v ratio at fourfold degenerate sites is approximately μs/μv, which is a measure of mutational contribution to transition bias. Similarly, the s/v ratio at nondegenerate sites is also an estimate of μs/μv if we assume that selection against nonsynonymous transitions is roughly equal to that against nonsynonymous transversions. In two mitochondrial genes, cytochrome oxidase subunit I (COI) and cytochrome b (cyt-b) in pocket gophers, the s/v ratio is about two at nondegenerate and fourfold degenerate sites for both the COI and the cyt-b genes. This implies that mutation contribution to transition bias is relatively small. In contrast, the s/v ratio is much greater at twofold degenerate sites, being 48 for COI and 40 for cyt-b. Given that the μs/μv ratio is about 2, the Ps/Pv ratio at twofold degenerate sites must be on the order of 20 or greater. This suggests a great effect of purifying selection on transition bias in mitochondrial protein genes because transitions are synonymous and transversions are nonsynonymous at twofold degenerate sites in mammalian mitochondrial genes. We also found that nonsynonymous mutations at twofold degenerate sites are more neutral than nonsynonymous mutations at nondegenerate sites, and that the COI gene is subject to stronger purifying selection than is the cyt-b gene. A model is presented to integrate the effect of purifying selection, codon bias, DNA repair and GC content on s/v ratio of protein-coding genes.
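
    The arithmetic behind the "on the order of 20 or greater" statement follows from the decomposition given in the abstract: the observed s/v ratio is the mutation-rate ratio multiplied by the fixation-probability ratio, so dividing the observed ratio by the mutation-rate ratio gives the fixation-probability ratio. A minimal sketch using only the numbers quoted above (the function name is hypothetical):

```python
def fixation_ratio(sv_observed, mutation_ratio):
    """Ps/Pv implied by an observed s/v ratio and a mutation-rate ratio,
    using s/v = (mutation-rate ratio) * (Ps/Pv)."""
    return sv_observed / mutation_ratio

# Values quoted in the abstract: s/v at twofold degenerate sites,
# and a mutation-rate ratio of about 2.
mutation_ratio = 2.0
sv_twofold = {"COI": 48.0, "cyt-b": 40.0}
for gene, sv in sv_twofold.items():
    print(gene, fixation_ratio(sv, mutation_ratio))
```

    This gives 24 for COI and 20 for cyt-b, consistent with the stated lower bound.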

  188. Xia, X. 1995. Body temperature, rate of biosynthesis, and evolution of genome size. Molecular Biology and Evolution 12:834-842.

    Abstract An optimality model relating the rate of biosynthesis to body temperature and gene duplication is presented to account for several observed patterns of genome size variation. The model predicts (1) that poikilotherms living in a warm climate should have a smaller genome than poikilotherms living in a cold climate, (2) that homeotherms should have a small genome as well as a small variation in genome size relative to their poikilothermic ancestors, (3) that cold geological periods should favor the evolution of poikilotherms with a large genome and that warm geological periods should do the opposite, and (4) that poikilotherms with a small genome should be more sensitive to changes in temperature than poikilotherms with a large genome. The model also offers two explanations for the empirically documented trend that organisms with a large cell volume have larger genomes than those with a small cell volume. Relevant empirical evidence is summarized to support these predictions.

  189. Xia, X. 1995. Revisiting Hamilton's rule. American Naturalist 145:483-492.
  190. Xia, X. 1993. A full sibling is not as valuable as an offspring: on Hamilton's rule. American Naturalist 142:174-185.
  191. Boonstra, R., Xia, X. and L. Pavone. 1992. Mating system of the meadow vole, Microtus pennsylvanicus. Behavioral Ecology 4:83-89.
  192. Xia, X. and R. Boonstra. 1992. Measuring temporal variation in population density: a critique. American Naturalist 140:883-892.
  193. Xia, X. 1992. Uncertainty of paternity can select against paternal care. American Naturalist 139:1126-1129.
  194. Xia, X. and J. S. Millar. 1991. Genetic evidence of promiscuity in Peromyscus leucopus. Behavioral Ecology and Sociobiology 28:171-178.
  195. Millar, J. S., Xia, X. and M. B. Norrie. 1991. Relationship among reproductive status, nutritional status and food characteristics in a natural population of Peromyscus maniculatus. Canadian Journal of Zoology 69:555-559.
  196. Xia, X. and J. S. Millar. 1990. Infestation of wild Peromyscus leucopus by bot fly larvae. Journal of Mammalogy 71:255-258.
  197. Xia, X. and J. S. Millar. 1989. Dispersion of adult males in relation to female reproductive status in Peromyscus leucopus. Canadian Journal of Zoology 67:1047-1052.
  198. El-Haddad, M., J. S. Millar and X. Xia. 1989. Offspring recognition by male Peromyscus maniculatus. Journal of Mammalogy 69:811-813.
  199. Xia, X. and J. S. Millar. 1988. Paternal behaviour by Peromyscus leucopus in enclosures. Canadian Journal of Zoology 66:1184-1187.
  200. Xia, X. and J. S. Millar. 1987. Morphological variation in deer mice in relation to sex and habitat. Canadian Journal of Zoology 65:527-533.
  201. Xia, X. and J. S. Millar. 1986. Sex-related dispersion in Peromyscus maniculatus. Canadian Journal of Zoology 64:933-936.