XiaLab at University of Ottawa

Publications

Refereed papers (My students in red italics)

  1. Xia X. 2017. DAMBE6: New tools for microbial genomics, phylogenetics and molecular evolution. J Hered esx033. doi: 10.1093/jhered/esx033.
  2. DAMBE is a comprehensive software workbench for data analysis in molecular biology, phylogenetics and evolution. Several important new functions have been added since version 5 of DAMBE: 1) comprehensive genomic profiling of translation initiation efficiency of different genes in different prokaryotic species, 2) a new index of translation elongation (ITE) that takes into account both tRNA-mediated selection and background mutation on codon-anticodon adaptation, 3) a new and accurate phylogenetic approach based on pairwise alignment only, which is useful for highly divergent sequences from which a reliable multiple sequence alignment is difficult to obtain. Many other functions have been updated and improved including PWM for motif characterization, Gibbs sampler for de novo motif discovery, hidden Markov models for protein secondary structure prediction, self-organizing map for non-linear clustering of transcriptomic data, comprehensive sequence alignment and phylogenetic functions. DAMBE features a graphic, user-friendly and intuitive interface, and is freely available from http://dambe.bio.uottawa.ca.

  3. Abolbaghaei A, Silke JR, Xia X. 2017. How Changes in Anti-SD Sequences Would Affect SD Sequences in Escherichia coli and Bacillus subtilis. G3: Genes, Genomes, Genetics
  4. The 3' end of the small ribosomal RNAs (ssu rRNA) in bacteria is directly involved in the selection and binding of mRNA transcripts during translation initiation via well-documented interactions between a Shine-Dalgarno (SD) sequence located upstream of the initiation codon and an anti-SD (aSD) sequence at the 3' end of the ssu rRNA. Consequently, the 3' end of ssu rRNA (3'TAIL) is strongly conserved among bacterial species because a change in the region may impact the translation of many protein-coding genes. Escherichia coli and Bacillus subtilis differ in their 3' ends of ssu rRNA, being GAUCACCUCCUUA3' in E. coli and GAUCACCUCCUUUCU3' or GAUCACCUCCUUUCUA3' in B. subtilis. Such differences in 3'TAIL lead to species-specific SDs (designated SDEc for E. coli and SDBs for B. subtilis) that can form strong and well-positioned SD/aSD pairing in one species but not in the other. Selection mediated by the species-specific 3'TAIL is expected to favour SDBs against SDEc in B. subtilis but favour SDEc against SDBs in E. coli. Among well-positioned SDs, SDEc is used more in E. coli than in B. subtilis, and SDBs more in B. subtilis than in E. coli. Highly expressed genes and genes of high translation efficiency tend to have longer SDs than lowly expressed genes and genes with low translation efficiency in both species, but more so in B. subtilis than in E. coli. Both species overuse SDs matching the bolded part of 3'TAIL shown above. The 3'TAIL difference contributes to host-specificity of phages.

  5. Xia X. 2017. Bioinformatics and Drug Discovery. Curr Top Med Chem 17(15):1709-1726
  6. Bioinformatic analysis can not only accelerate drug target identification and drug candidate screening and refinement, but also facilitate characterization of side effects and predict drug resistance. High-throughput data such as genomic, epigenetic, genome architecture, cistromic, transcriptomic, proteomic, and ribosome profiling data have all made significant contributions to mechanism-based drug discovery and drug repurposing. Accumulation of protein and RNA structures, as well as development of homology modeling and protein structure simulation, coupled with large structure databases of small molecules and metabolites, paved the way for more realistic protein-ligand docking experiments and more informative virtual screening. I present the conceptual framework that drives the collection of these high-throughput data, summarize the utility and potential of mining these data in drug discovery, outline a few inherent limitations in data and software mining these data, point out new ways to refine analysis of these diverse types of data, and highlight commonly used software and databases relevant to drug discovery.

  7. Wei Y, Xia X. 2017. The Role of +4U as an Extended Translation Termination Signal in Bacteria. Genetics 205:539–549
  8. Termination efficiency of stop codons depends on the first 3’ flanking (+4) base in bacteria and eukaryotes. In both Escherichia coli and Saccharomyces cerevisiae, termination read-through is reduced in the presence of +4U; however, the molecular mechanism underlying +4U function is poorly understood. Here, we perform comparative genomics analysis on 25 bacterial species (covering Actinobacteria, Bacteroidetes, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Proteobacteria and Spirochaetae) with bioinformatics approaches to examine the influence of +4U in bacterial translation termination by contrasting between highly and lowly expressed genes (HEGs and LEGs). We estimated gene expression using the recently formulated Index of Translation Elongation, ITE, and identified stop codon near-cognate tRNAs from well annotated genomes. We show that +4U was consistently over-represented in UAA-ending HEGs relative to LEGs. The result is consistent with the interpretation that +4U enhances termination mainly for UAA. Usage of +4U decreases in GC-rich species where most stop codons are UGA and UAG, with few UAA-ending genes, which is expected if UAA usage in HEGs drives up +4U usage. In highly expressed genes, +4U usage increases significantly with abundance of UAA nc_tRNAs (near-cognate tRNAs which decode codons differing from UAA by a single nucleotide), particularly those with a mismatch at the first stop codon site. UAA is always the preferred stop codon in highly expressed genes, and our results suggest that UAAU is the most efficient translation termination signal in bacteria.

  9. Vlasschaert C, Cook D, Xia X, Gray DA. 2017. The evolution and functional diversification of the deubiquitinating enzyme superfamily. Genome Biol Evol. 9:558-573
  10. Ubiquitin and ubiquitin-like molecules are attached to and removed from cellular proteins in a dynamic and highly regulated manner. Deubiquitinating enzymes are critical to this process, and the genetic catalogue of deubiquitinating enzymes expanded greatly over the course of evolution. Extensive functional redundancy has been noted among the 93 members of the human deubiquitinating enzyme (DUB) superfamily. This is especially true of genes that were generated by duplication (termed paralogs) as they often retain considerable sequence similarity. Since complete redundancy in systems should be eliminated by selective pressure we theorized that many overlapping DUBs must have significant and unique spatiotemporal roles that can be evaluated in an evolutionary context. We have determined the evolutionary history of the entire class of deubiquitinating enzymes, including the sequence and means of duplication for all paralogous pairs. To establish their uniqueness, we have investigated cell-type specificity in developmental and adult contexts, and have investigated the co-emergence of substrates from the same duplication events. Our analysis has revealed examples of DUB gene subfunctionalization, neofunctionalization, and nonfunctionalization.

  11. Xia X. 2016. PhyPA: phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences. Molecular Phylogenetics and Evolution 102:331–343.
  12. While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure to recover the true topology by the ML+MSA is not because of insufficient search of tree space, but by the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing.

  13. Wei, Y., Wang, J., Xia, X. 2016. Coevolution between stop codon usage and release factors in bacterial species. Molecular Biology and Evolution 33:2357-2367.
  14. Three stop codons in bacteria represent different translation termination signals, and their usage is expected to depend on their differences in translation termination efficiency, mutation bias, and relative abundance of release factors (RF1 decoding UAA and UAG, and RF2 decoding UAA and UGA). In 14 bacterial species (covering Proteobacteria, Firmicutes, Cyanobacteria, Actinobacteria and Spirochetes) with cellular RF1 and RF2 quantified, UAA is consistently over-represented in highly expressed genes (HEGs) relative to lowly expressed genes (LEGs), whereas UGA usage is the opposite even in species where RF2 is far more abundant than RF1. UGA usage relative to UAG increases significantly with PRF2 [=RF2/(RF1+RF2)] as expected from adaptation between stop codons and their decoders. PRF2 is greater than 0.5 over a wide range of AT content (measured by PAT3 as the proportion of AT at third codon sites), but decreases rapidly towards zero at the high range of PAT3. This explains why bacterial lineages with high PAT3 often have UGA reassigned because of low RF2. There is no indication that UAG is a minor stop codon in bacteria as claimed in a recent publication. The claim is invalid because of the failure to apply the two key criteria in identifying a minor codon: 1) it is least preferred by HEGs (or most preferred by LEGs) and 2) it corresponds to the least abundant decoder. Our results suggest a more plausible explanation for why UAA usage increases, and UGA usage decreases, with PAT3, but UAG usage remains low over the entire PAT3 range.
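
The two quantities defined in the abstract above, PRF2 = RF2/(RF1+RF2) and PAT3 (the proportion of AT at third codon sites), can be illustrated with a minimal Python sketch. The function names and the input conventions (an in-frame coding sequence given as a string, release-factor abundances given as plain numbers) are assumptions made for illustration and are not taken from the paper or from DAMBE.

```python
def prf2(rf1: float, rf2: float) -> float:
    """Proportion of RF2 among release factors: RF2 / (RF1 + RF2)."""
    total = rf1 + rf2
    if total <= 0:
        raise ValueError("RF1 + RF2 must be positive")
    return rf2 / total


def pat3(cds: str) -> float:
    """Proportion of A or T (U in RNA) at third codon sites.

    Assumes `cds` is in frame, so the third codon sites are
    the characters at 0-based positions 2, 5, 8, ...
    """
    seq = cds.upper().replace("U", "T")
    third_sites = seq[2::3]
    if not third_sites:
        raise ValueError("sequence is shorter than one codon")
    at_count = sum(base in "AT" for base in third_sites)
    return at_count / len(third_sites)


if __name__ == "__main__":
    # Hypothetical abundances, for illustration only.
    print(prf2(1.0, 3.0))
    # Codons ATG, GCA, TAA have third sites G, A, A.
    print(pat3("ATGGCATAA"))
```

Whether the stop codon should be counted when computing PAT3 is not stated in the abstract; this sketch simply counts every codon it is given.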

  15. Vlasschaert, C., Xia, X., Gray, D.A. 2016. Selection preserves Ubiquitin Specific Protease 4 alternative exon skipping in therian mammals. Scientific Reports 6:20039.
  16. Ubiquitin specific protease 4 (USP4) is a highly networked deubiquitinating enzyme with reported roles in cancer, innate immunity and RNA splicing. In mammals it has two dominant isoforms arising from inclusion or skipping of exon 7 (E7). We evaluated two plausible mechanisms for the generation of these isoforms: (A) E7 skipping due to a long upstream intron and (B) E7 skipping due to inefficient 5′ splice sites (5′SS) and/or branchpoint sites (BPS). We then assessed whether E7 alternative splicing is maintained by selective pressure or arose from genetic drift. Both transcript variants were generated from a USP4-E7 minigene construct with short flanking introns, an observation consistent with the second mechanism whereby differential splice signal strengths are the basis of E7 skipping. Optimization of the downstream 5′SS eliminated E7 skipping. Experimental validation of the correlation between 5′SS identity and exon skipping in vertebrates pinpointed the +6 site as the key splicing determinant. Therian mammals invariably display a 5′SS configuration favouring alternative splicing and the resulting isoforms have distinct subcellular localizations. We conclude that alternative splicing of mammalian USP4 is under selective maintenance and that long and short USP4 isoforms may target substrates in various cellular compartments.

  17. Vlasschaert, C., Xia, X., Coulombe, J., Gray, D.A. 2015. Evolution of the highly networked deubiquitinating enzymes USP4, USP15 and USP11. BMC Evolutionary Biology 15:230.
  18. Background: USP4, USP15 and USP11 are paralogous deubiquitinating enzymes as evidenced by structural organization and sequence similarity. Based on known interactions and substrates it would appear that they have partially redundant roles in pathways vital to cell proliferation, development and innate immunity, and elevated expression of all three has been reported in various human malignancies. The nature and order of duplication events that gave rise to these extant genes has not been determined, nor has their functional redundancy been established experimentally at the organismal level. Methods: We have employed phylogenetic and syntenic reconstruction methods to determine the chronology of the duplication events that generated the three paralogs and have performed genetic crosses to evaluate redundancy in mice. Results: Our analyses indicate that USP4 and USP15 arose from whole genome duplication prior to the emergence of jawed vertebrates. Despite having lower sequence identity USP11 was generated later in vertebrate evolution by small-scale duplication of the USP4-encoding region. While USP11 was subsequently lost in many vertebrate species, all available genomes retain a functional copy of either USP4 or USP15, and through genetic crosses of mice with inactivating mutations we have confirmed that viability is contingent on a functional copy of USP4 or USP15. Loss of ubiquitin-exchange regulation, constitutive skipping of the seventh exon and neural-specific expression patterns are derived states of USP11. Post-translational modification sites differ between USP4, USP15 and USP11 throughout evolution. Conclusions: In isolation sequence alignments can generate erroneous USP gene phylogenies. Through a combination of methodologies the gene duplication events that gave rise to USP4, USP15, and USP11 have been established. Although it operates in the same molecular pathways as the other USPs, the rapid divergence of the more recently generated USP11 enzyme precludes its functional interchangeability with USP4 and USP15. Given their multiplicity of substrates the emergence (and in some cases subsequent loss) of these USP paralogs would be expected to alter the dynamics of the networks in which they are embedded.

  19. Prabhakaran, R., Chithambaram, S., Xia, X. 2015. Escherichia coli and Staphylococcus phages: Effect of translation initiation efficiency on differential codon adaptation mediated by virulent and temperate lifestyles. Journal of General Virology 96:1169-1179.
  20. Rapid biosynthesis is key to the success of bacteria and viruses. Highly expressed genes in bacteria exhibit strong codon bias corresponding to differential availability of tRNAs. However, a large clade of lambdoid coliphages exhibit relatively poor codon adaptation to the host translation machinery, in contrast to other coliphages that exhibit strong codon adaptation to the host. Three possible explanations were previously proposed but dismissed: 1) the phage-borne tRNA genes that reduce the dependence of phage translation on host tRNAs, 2) lack of time needed for evolving codon adaptation due to recent host switching, and 3) strong strand asymmetry with biased mutation disrupting codon adaptation. Here we examine the possibility that phages with relatively poor codon adaptation have poor translation initiation which would weaken the selection on codon adaptation. We measure translation initiation by 1) the strength and position of the Shine-Dalgarno (SD) sequence and 2) the stability of secondary structure of sequences flanking SD and start codon, which is known to affect accessibility of SD and start codon. Phage genes with strong codon adaptation have significantly stronger SD sequences than those with poor codon adaptation. The former also have significantly weaker secondary structure in sequences flanking SD and start codon than the latter. Thus, lambdoid phages do not exhibit strong codon adaptation because they have relatively inefficient translation initiation and would benefit little from increased elongation efficiency. We also provide evidence suggesting that phage lifestyle (virulent versus temperate) affects selection intensity on the efficiency of translation initiation and elongation.

  21. Sun X, Xia H, Yang Q. 2015. Dating the origin of the major lineages of Branchiopoda. Palaeoworld 25:303–317.
  22. Despite the well-established phylogeny and good fossil record of branchiopods, a consistent macro-evolutionary timescale for the group remains elusive. This study focuses on the early branchiopod divergence dates where the fossil record is extremely fragmentary or missing. On the basis of a large genomic dataset and carefully evaluated fossil calibration points, we assess the quality of the branchiopod fossil record by calibrating the tree against well-established first occurrences, providing paleontological estimates of divergence times and completeness of their fossil record. The maximum age constraints were set using a quantitative approach of Marshall (2008). We tested the alternative placements of Yicaris and Wujicaris in the referred arthropod tree via the likelihood checkpoints method. Divergence dates were calculated using Bayesian relaxed molecular clock and penalized likelihood methods. Our results show that the stem group of Branchiopoda is rooted in the late Neoproterozoic (563 ± 7 Ma); the crown-Branchiopoda diverged during middle Cambrian to Early Ordovician (478–512 Ma), likely representing the origin of the freshwater biota; the Phyllopoda clade diverged during Ordovician (448–480 Ma) and Diplostraca during Late Ordovician to early Silurian (430–457 Ma). By evaluating the congruence between the observed times of appearance of clades in the fossil record and the results derived from molecular data, we found that the uncorrelated rate model gives more congruent results for shallower divergence events whereas the auto-correlated rate model gives more congruent results for deeper events.

  23. Xia X. 2015. A major controversy in codon-anticodon adaptation resolved by a new codon usage index. Genetics 199:573-579
  24. Two alternative hypotheses attribute different benefits to codon-anticodon adaptation. The first assumes that protein production is rate-limited by both initiation and elongation, and codon-anticodon adaptation would result in higher elongation efficiency and more efficient and accurate protein production, especially for highly expressed genes. The second claims that protein production is rate-limited only by initiation efficiency, but improved codon adaptation and consequently increased elongation efficiency have the benefit of increasing ribosomal availability for global translation. To test these hypotheses, a recent study engineered a synthetic library of 154 genes, all encoding the same protein but differing in degrees of codon adaptation, to quantify the effect of differential codon adaptation on protein production in Escherichia coli. The surprising conclusion that “codon bias did not correlate with gene expression” and that “translation initiation, not elongation, is rate-limiting for gene expression” contradicts the conclusion reached by many other empirical studies. Here I resolve the contradiction by reanalyzing the data from the 154 sequences. I demonstrate that translation elongation accounts for about 17% of total variation in protein production and that the previous conclusion is due to the use of CAI (codon adaptation index) which does not account for the mutation bias in characterizing codon adaptation. The effect of translation elongation becomes undetectable only when translation initiation is unrealistically slow. A new index of translation elongation (ITE) is formulated to facilitate studies on the efficiency and evolution of the translation machinery.

  25. Nikbakht, H., Xia, X., Hickey, D. 2014. The evolution of genomic GC content undergoes a rapid reversal within the genus Plasmodium. Genome 57:507-511
  26. The genome of the malarial parasite, Plasmodium falciparum, is extremely AT-rich. This bias toward a low GC content is a characteristic of several - but not all - species within the genus Plasmodium. We compared 4283 orthologous pairs of protein-coding sequences between P. falciparum and the less AT-biased P. vivax. Our results indicate that the common ancestor of these two species was also extremely AT-rich. This means that, although there was a strong bias toward A+T during the early evolution of the ancestral Plasmodium lineage, there was a subsequent reversal of this trend during the more recent evolution of some species, such as P. vivax. Moreover, we show that not only is the P. vivax genome losing its AT richness, it is actually gaining a very significant degree of GC richness. This example illustrates the potential volatility of nucleotide content during the course of molecular evolution. Such reversible fluxes in nucleotide content within lineages could have important implications for phylogenetic reconstruction based on molecular sequence data.

  27. Chithambaram S, Prabhakaran R, Xia X. 2014. Differential codon adaptation between dsDNA and ssDNA phages in E. coli. Molecular Biology and Evolution 31:1606-1617
  28. Because phages use their host translation machinery, their codon usage should evolve towards that of highly expressed host genes. We used two indices to measure codon adaptation of phages to their host, rRSCU (the correlation in RSCU between phages and their host) and CAI computed with highly expressed host genes as the reference set (because phage translation depends on host translation machinery). These indices used for this purpose are appropriate only when hosts exhibit little mutation bias, so only phages parasitizing Escherichia coli were included in the analysis. For double-stranded (dsDNA) phages, both rRSCU and CAI decrease with increasing number of tRNA genes encoded by the phage genome. rRSCU is greater for dsDNA phages than for ssDNA phages, and the low rRSCU values are mainly due to poor concordance in RSCU values for Y-ending codons between ssDNA phages and the E. coli host, consistent with the predicted effect of C→T mutation bias in the ssDNA phages. Strong C→T mutation bias would improve codon adaptation in codon families (e.g., Gly) where U-ending codons are favored over C-ending codons (“U-friendly” codon families) by highly expressed host genes, but decrease codon adaptation in other codon families where highly expressed host genes favor C-ending codons against U-ending codons (“U-hostile” codon families). It is remarkable that ssDNA phages with increasing C→T mutation bias also increased the usage of codons in the “U-friendly” codon families, thereby achieving CAI values almost as large as those of dsDNA phages. This represents a new type of codon adaptation.
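
As a rough illustration of the rRSCU index mentioned above (the correlation in RSCU between a phage and its host), the sketch below computes RSCU with its standard definition, the observed codon count divided by the mean count over the synonymous codons of the same family, and then a Pearson correlation over the codons scored in both genomes. The function names, the tiny two-family codon table, and the codon counts are assumptions for illustration only; a real analysis would cover every codon family of the relevant genetic code.

```python
import math


def rscu(codon_counts, families):
    """Relative synonymous codon usage for each codon.

    RSCU = observed count / (family total / number of synonymous codons).
    Families with no observed codons are skipped.
    """
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_rscu(phage_counts, host_counts, families):
    """Correlation in RSCU between phage and host over shared codons."""
    phage = rscu(phage_counts, families)
    host = rscu(host_counts, families)
    shared = sorted(set(phage) & set(host))
    return pearson([phage[c] for c in shared], [host[c] for c in shared])


# Hypothetical two-family table and counts, for illustration only.
FAMILIES = {"Phe": ["TTC", "TTT"], "Gly": ["GGA", "GGC", "GGG", "GGT"]}
HOST = {"TTC": 1, "TTT": 3, "GGA": 1, "GGC": 4, "GGG": 1, "GGT": 2}
PHAGE = {"TTC": 2, "TTT": 6, "GGA": 2, "GGC": 8, "GGG": 2, "GGT": 4}

if __name__ == "__main__":
    print(rscu(HOST, FAMILIES))
    print(r_rscu(PHAGE, HOST, FAMILIES))
```

Because RSCU is normalized within each codon family, doubling every count in the hypothetical phage leaves its RSCU values unchanged, so the two profiles here correlate perfectly.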

  29. Prabhakaran R, Chithambaram S, Xia X. 2014. Aeromonas phages encode tRNAs for their overused codons. Int. J. Computational Biology and Drug Design 7:168-183.
  30. The GC-rich bacterial species Aeromonas salmonicida is parasitised by both GC-rich phages (Aeromonas phages phiAS7 and vB_AsaM-56) and GC-poor phages (Aeromonas phages 25, 31, 44RR2.8t, 65, Aes508, phiAS4 and phiAS5). The GC-rich Aeromonas phages phiAS7 and vB_AsaM-56 have codon usage bias nearly identical to that of their host, whereas the remaining seven GC-poor Aeromonas phages differ dramatically in codon usage from their GC-rich host. Here, we investigated whether tRNAs encoded in the genomes of Aeromonas phages facilitate the translation of phage proteins. We found that tRNAs encoded in the phage genome correspond to synonymous codons overused in the phage genes but not in the host genes.

  31. Chithambaram S, Prabhakaran R, Xia X. 2014. The effects of mutation and selection on codon adaptation in E. coli bacteriophage. Genetics 197:301-315
  32. Studying phage codon adaptation is important not only for understanding the process of translation elongation, but also for re-engineering phages for medical and industrial purposes. To evaluate the effect of mutation and selection on phage codon usage, we developed an index to measure selection imposed by host translation machinery, based on the difference in codon usage between all host genes and highly expressed host genes. We developed linear and nonlinear models to estimate the C→T mutation bias in different phage lineages and to evaluate the relative effect of mutation and host selection on phage codon usage. C→T biased mutations occur more frequently in ssDNA phages than in dsDNA phages, and affect not only synonymous codon usage, but also nonsynonymous substitutions at second codon positions, especially in ssDNA phages. The host translation machinery affects codon adaptation in both dsDNA and ssDNA phages, with stronger effect on dsDNA phages than on ssDNA phages. Strand asymmetry with the associated local variation in mutation bias can significantly interfere with codon adaptation in both dsDNA and ssDNA phages.

  33. Xia, X. 2013. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Molecular Biology and Evolution 30:1720-1728.

    Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories: 1) sequence retrieval, editing, manipulation, and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot, and many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely available from http://dambe.bio.uottawa.ca

  34. Sun, X. Y., Yang, Q., Xia, X. 2013. An Improved Implementation of Effective Number of Codons (Nc). Molecular Biology and Evolution 30:191-196.

    The effective number of codons (Nc) is a widely used index for characterizing codon usage bias because it does not require a set of reference genes as does the codon adaptation index (CAI) and because of the freely available computational tools such as CodonW. However, Nc as originally formulated has many problems. For example, it can have values far greater than the number of sense codons; it treats a 6-fold compound codon family as a single-codon family although it is made of a 2-fold and a 4-fold codon family that can be under dramatically different selection for codon usage bias; the existing implementations do not handle all different genetic codes; it is often biased by codon families with a small number of codons. We developed a new Nc that has a number of advantages over the original Nc. Its maximum value equals the number of sense codons when all synonymous codons are used equally, and its minimum value equals the number of codon families when exactly one codon is used in each synonymous codon family. It handles all known genetic codes. It breaks the compound codon families (e.g., those involving amino acids coded by six synonymous codons) into 2-fold and 4-fold codon families. It reduces the effect of codon families with few codons by introducing pseudocount and weighted averages. The new Nc correlates significantly better with CAI than the original Nc from CodonW, based on protein-coding genes from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Bacillus subtilis, Micrococcus luteus, and Mycoplasma genitalium. It also correlates better with protein abundance data from yeast than the original Nc.

  35. Xia, X. 2012. Position Weight Matrix, Gibbs Sampler, and the Associated Significance Tests in Motif Characterization and Prediction. Scientifica, vol. 2012, Article ID 917540. doi:10.6064/2012/917540.

    Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
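
To make the terms PWM and PWM score (PWMS) concrete, here is a minimal Python sketch that builds a log-odds matrix from aligned sites and scores a sequence segment. The pseudocount scheme (a constant added to every count), the uniform background frequencies, the function names, and the example sites are all assumptions chosen for illustration; they are not claimed to match the formulation in the paper or the DAMBE implementation, and none of the significance tests discussed above are included.

```python
import math
from collections import Counter


def build_pwm(sites, background=None, pseudocount=1.0, alphabet="ACGT"):
    """Build a log2-odds position weight matrix from equal-length aligned sites.

    Returns a list with one dict per site position, mapping each
    symbol to log2(site frequency / background frequency).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    if background is None:
        # Assumed uniform background; real genomes are rarely uniform.
        background = {b: 1.0 / len(alphabet) for b in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        column = {
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in alphabet
        }
        pwm.append(column)
    return pwm


def pwm_score(pwm, segment):
    """PWM score of a segment: the sum of the per-position log-odds."""
    if len(segment) != len(pwm):
        raise ValueError("segment length must equal PWM length")
    return sum(column[base] for column, base in zip(pwm, segment))


if __name__ == "__main__":
    # Hypothetical aligned sites, for illustration only.
    example_sites = ["ACG", "ACG", "ACT"]
    example_pwm = build_pwm(example_sites)
    print(pwm_score(example_pwm, "ACG"))
    print(pwm_score(example_pwm, "TTT"))
```

A segment resembling the training sites scores higher than an unrelated one; deciding whether a given score is significant is the subject of the tests reviewed in the paper.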

  36. Vos, R. A., Balhoff, J. P., Caravas, J. A., Holder, M. T., Lapp, H., Maddison, W. P., Midford, P. E., Priyam, A., Sukumaran, S., Xia, X., Stoltzfus, A. 2012. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Systematic Biology 61(4):675–689

    In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.

  37. Xia, X. 2012. DNA Replication and Strand Asymmetry in Prokaryotic and Mitochondrial Genomes. Current Genomics 13, 16-27

    Different patterns of strand asymmetry have been documented in a variety of prokaryotic genomes as well as mitochondrial genomes. Because different replication mechanisms often lead to different patterns of strand asymmetry, much can be learned of replication mechanisms by examining strand asymmetry. Here I summarize the diverse patterns of strand asymmetry among different taxonomic groups to suggest that (1) the single-origin replication may not be universal among bacterial species as the endosymbionts Wigglesworthia glossinidia, Wolbachia species, cyanobacterium Synechocystis 6803 and Mycoplasma pulmonis genomes all exhibit strand asymmetry patterns consistent with the multiple origins of replication, (2) different replication origins in some archaeal genomes leave quite different patterns of strand asymmetry, suggesting that different replication origins in the same genome may be differentially used, (3) mitochondrial genomes from representative vertebrate species share one strand asymmetry pattern consistent with the strand-displacement replication documented in mammalian mtDNA, suggesting that the mtDNA replication mechanism in mammals may be shared among all vertebrate species, and (4) mitochondrial genomes from primitive forms of metazoans such as the sponge and hydra (representing Porifera and Cnidaria, respectively), as well as those from plants, have strand asymmetry patterns similar to single-origin or multi-origin replications observed in prokaryotes and are drastically different from mitochondrial genomes from other metazoans. This may explain why sponge and hydra mitochondrial genomes, as well as plant mitochondrial genomes, evolve much more slowly than those from other metazoans.

  38. Xia, X., MacKay, V., Yao, X., Wu, J., Miura, F., Ito, T., Morris, D. R. 2011. Translation initiation: a regulatory role for poly(A) tracts in front of the AUG codon in Saccharomyces cerevisiae. Genetics 189:469-478

    The 5'-UTR serves as the loading dock for ribosomes during translation initiation and is the key site for translation regulation. Many genes in the yeast Saccharomyces cerevisiae contain poly(A) tracts in their 5'-UTRs. We studied these pre-AUG poly(A) tracts in a set of 3274 recently identified 5'-UTRs in the yeast to characterize their effect on in vivo protein abundance, ribosomal density, and protein synthesis rate in the yeast. The protein abundance and the protein synthesis rate increase with the length of the poly(A), but exhibit a dramatic decrease when the poly(A) length is ≥12. The ribosomal density also reaches the lowest level when the poly(A) length is ≥12. This supports the hypothesis that a pre-AUG poly(A) tract can bind to translation initiation factors to enhance translation initiation, but a long (≥12) pre-AUG poly(A) tract will bind to Pab1p, whose binding size is 12 consecutive A residues in yeast, resulting in repression of translation. The hypothesis explains why a long pre-AUG poly(A) leads to more efficient translation initiation than a short one when PABP is absent, and why pre-AUG poly(A) is short in the early genes but long in the late genes of vaccinia virus.

  39. Ma, P., Xia, X. 2011. Factors affecting splicing strength of yeast genes. Comparative and Functional Genomics. Article ID 212146, 13 pages

    Accurate and efficient splicing is of crucial importance for highly-transcribed intron-containing genes (ICGs) in rapidly replicating unicellular eukaryotes such as the budding yeast Saccharomyces cerevisiae. We characterize the 5' and 3' splice sites (ss) by position weight matrix scores (PWMSs), which are highest for the consensus sequence and lowest for splice sites differing most from the consensus sequence, and use PWMS as a proxy for splicing strength. HAC1, which is known to be spliced by a nonspliceosomal mechanism, has the most negative PWMS for both its 5' ss and 3' ss. Several genes under strong splicing regulation and requiring additional splicing factors for their splicing also have small or negative PWMS values. Splicing strength is higher for highly transcribed ICGs than for lowly transcribed ICGs and higher for transcripts that bind strongly to spliceosomes than those that bind weakly. The 3' splice site features a prominent poly-U tract before the 3'AG. Our results suggest the potential of using PWMS as a screening tool for ICGs that are either spliced by a nonspliceosome mechanism or under strong splicing regulation in yeast and other fungal species.
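
    The position weight matrix score described in the abstract above can be sketched as a sum of per-position log-odds. This is a minimal illustration, not the paper's actual pipeline: the function names, the pseudocount, the uniform background, and the five toy 5' splice-site sequences are all assumptions made up for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PWM from equal-length aligned sites (hypothetical helper).

    pwm[i][n] = log2(p_i(n) / background(n)), with a pseudocount so that
    nucleotides never observed at a position still get a finite score.
    """
    if background is None:
        # Assumption: uniform background frequencies.
        background = {n: 0.25 for n in ALPHABET}
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pwm.append({
            n: math.log2(((counts[n] + pseudocount) / total) / background[n])
            for n in ALPHABET
        })
    return pwm

def pwm_score(pwm, seq):
    """PWMS of a sequence: the sum of its per-position log-odds."""
    return sum(column[n] for column, n in zip(pwm, seq))

# Toy aligned 5' splice sites, invented for illustration only.
toy_sites = ["GUAUGU", "GUAUGU", "GUACGU", "GUAAGU", "GUAUGA"]
pwm = build_pwm(toy_sites)

# The consensus takes the best-scoring nucleotide at every position.
consensus = "".join(max(column, key=column.get) for column in pwm)
```

    With this toy matrix the consensus gets the highest possible score, and a site that differs from it at every position gets a negative score, which matches the ordering the abstract describes.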

  40. Xia, X., Yang, Q. 2011. A Distance-based Least-square Method for Dating Speciation Events. Molecular Phylogenetics and Evolution 59:342-353.

    Distance-based phylogenetic methods are widely used in biomedical research. However, there has been little development of rigorous statistical methods and software for dating speciation and gene duplication events by using evolutionary distances. Here we present a simple, fast and accurate dating method based on the least-squares (LS) method that has already been widely used in molecular phylogenetic reconstruction. Dating methods with a global clock or two different local clocks are presented. Single or multiple fossil calibration points can be used, and multiple data sets can be integrated in a combined analysis. Variation of the estimated divergence time is estimated by resampling methods such as bootstrapping or jackknifing. Application of the method to dating the divergence time among seven ape species or among 35 mammalian species including major mammalian orders shows that the estimated divergence times with the LS criterion are nearly identical to those obtained by the likelihood method or Bayesian inference.

  41. van Weringh, A., M. Ragonnet-Cronin, E. Pranckeviciene, M. Pavon-Eternod, L. Kleiman, X. Xia. 2011. HIV-1 modulates the tRNA pool to improve translation efficiency. Molecular Biology and Evolution 28:1827-1834

    Despite its poorly adapted codon usage, HIV-1 replicates and is expressed extremely well in human host cells. HIV-1 has recently been shown to package non-lysyl transfer RNAs (tRNAs) in addition to the tRNA(Lys) needed for priming reverse transcription and integration of the HIV-1 genome. By comparing the codon usage of HIV-1 genes with that of its human host, we found that tRNAs decoding codons that are highly used by HIV-1 but avoided by its host are overrepresented in HIV-1 virions. In particular, tRNAs decoding A-ending codons, required for the expression of HIV's A-rich genome, are highly enriched. Because the affinity of Gag-Pol for all tRNAs is nonspecific, HIV packaging is most likely passive and reflects the tRNA pool at the time of viral particle formation. Codon usage of HIV-1 early genes is similar to that of highly expressed host genes, but codon usage of HIV-1 late genes was better adapted to the selectively enriched tRNA pool, suggesting that alterations in the tRNA pool are induced late in viral infection. If HIV-1 genes are adapting to an altered tRNA pool, codon adaptation of HIV-1 may be better than previously thought.

  42. Palidwor GA, Perkins TJ, Xia X. 2010. A General Model of Codon Bias Due to GC Mutational Bias. PLoS ONE 5(10): e13431.

    BACKGROUND: In spite of extensive research on the effect of mutation and selection on codon usage, a general model of codon usage bias due to mutational bias has been lacking. Because most amino acids allow synonymous GC content changing substitutions in the third codon position, the overall GC bias of a genome or genomic region is highly correlated with GC3, a measure of third position GC content. For individual amino acids as well, G/C ending codons usage generally increases with increasing GC bias and decreases with increasing AT bias. Arginine and leucine, amino acids that allow GC-changing synonymous substitutions in the first and third codon positions, have codons which may be expected to show different usage patterns. PRINCIPAL FINDINGS: In analyzing codon usage bias in hundreds of prokaryotic and plant genomes and in human genes, we find that two G-ending codons, AGG (arginine) and TTG (leucine), unlike all other G/C-ending codons, show overall usage that decreases with increasing GC bias, contrary to the usual expectation that G/C-ending codon usage should increase with increasing genomic GC bias. Moreover, the usage of some codons appears nonlinear, even nonmonotone, as a function of GC bias. To explain these observations, we propose a continuous-time Markov chain model of GC-biased synonymous substitution. This model correctly predicts the qualitative usage patterns of all codons, including nonlinear codon usage in isoleucine, arginine and leucine. The model accounts for 72%, 64% and 52% of the observed variability of codon usage in prokaryotes, plants and human respectively. When codons are grouped based on common GC content, 87%, 80% and 68% of the variation in usage is explained for prokaryotes, plants and human respectively. CONCLUSIONS: The model clarifies the sometimes-counterintuitive effects that GC mutational bias can have on codon usage, quantifies the influence of GC mutational bias and provides a natural null model relative to which other influences on codon bias may be measured.
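
    For readers unfamiliar with GC3, the quantity the abstract above uses as a measure of third-position GC content, a minimal sketch follows. The function name and the toy sequence are hypothetical, and the handling of trailing partial codons is an assumption of this example rather than anything stated in the paper.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame coding sequence.

    Assumption: the sequence starts in frame; any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # drop an incomplete final codon
    third_positions = cds[2:usable:3]     # bases at index 2, 5, 8, ...
    if not third_positions:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)

# Toy sequence, invented for illustration: codons ATG, GCC, AAA.
toy_gc3 = gc3("ATGGCCAAA")
```

    Here the third positions are G, C and A, so two of the three codons end in G or C.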

  43. Jiang, J.-Y., H. Xiong, M. Cao, X. Xia, M.-A. Sirard, B. Tsang. 2010. Mural granulosa cell gene expression associated with oocyte developmental competence. Journal of Ovarian Research 2010, 3:6.

    BACKGROUND: Ovarian follicle development is a complex process. Paracrine interactions between somatic and germ cells are critical for normal follicular development and oocyte maturation. Studies have suggested that the health and function of the granulosa and cumulus cells may be reflective of the health status of the enclosed oocyte. The objective of the present study is to assess, using an in vivo immature rat model, gene expression profile in granulosa cells, which may be linked to the developmental competence of the oocyte. We hypothesized that expression of specific genes in granulosa cells may be correlated with the developmental competence of the oocyte. METHODS: Immature rats were injected with eCG and 24 h thereafter with anti-eCG antibody to induce follicular atresia or with pre-immune serum to stimulate follicle development. A high percentage (30-50%, normal developmental competence, NDC) of oocytes from eCG/pre-immune serum group developed to term after embryo transfer compared to those from eCG/anti-eCG (0%, poor developmental competence, PDC). Gene expression profiles of mural granulosa cells from the above oocyte-collected follicles were assessed by Affymetrix rat whole genome array. RESULTS: The results showed that twelve genes were up-regulated, while one gene was down-regulated, more than 1.5-fold in the NDC group compared with those in the PDC group. Gene ontology classification showed that the up-regulated genes included lysyl oxidase (Lox) and nerve growth factor receptor associated protein 1 (Ngfrap1), which are important in the regulation of protein-lysine 6-oxidase activity, and in apoptosis induction, respectively. The down-regulated genes included glycoprotein-4-beta galactosyltransferase 2 (Ggbt2), which is involved in the regulation of extracellular matrix organization and biogenesis. CONCLUSIONS: The data in the present study demonstrate a close association between specific gene expression in mural granulosa cells and the developmental competence of oocytes. This finding suggests that the most differentially expressed gene, lysyl oxidase, may be a candidate biomarker of oocyte health and useful for the selection of good quality oocytes for assisted reproduction.

  44. Zhang, D., J. T. Popesku, C. J. Martyniuk, H. Xiong, P. Duarte-Guterman, L. Yao, X. Xia, and V. L. Trudeau. 2009. Profiling neuroendocrine gene expression changes following fadrozole-induced estrogen decline in the female goldfish. Physiol. Genomics 38:351-361.

    Teleost fish represent unique models to study the role of neuroestrogens because of the extremely high activity of brain aromatase (AroB; the product of cyp19a1b). Aromatase respectively converts androstenedione and testosterone to estrone and 17beta-estradiol (E2). Specific inhibition of aromatase activity by fadrozole has been shown to impair estrogen production and influence neuroendocrine and reproductive functions in fish, amphibians, and rodents. However, very few studies have identified the global transcriptomic response to fadrozole-induced decline of estrogens in a physiological context. In our study, sexually mature prespawning female goldfish were exposed to fadrozole (50 microg/l) in March and April when goldfish have the highest AroB activity and maximal gonadal size. Fadrozole treatment significantly decreased serum E2 levels (4.7 times lower; P = 0.027) and depressed AroB mRNA expression threefold in both the telencephalon (P = 0.021) and the hypothalamus (P = 0.006). Microarray expression profiling of the telencephalon identified 98 differentially expressed genes after fadrozole treatment (q value <0.05). Some of these genes have been shown previously to be estrogen responsive in either fish or other species, including rat, mouse, and human. Gene ontology analysis together with functional annotations revealed several regulatory themes for physiological estrogen action in fish brain that include the regulation of calcium signaling pathway and autoregulation of estrogen receptor action. Real-time PCR verified microarray data for decreased (activin-betaA) or increased (calmodulin, ornithine decarboxylase 1) mRNA expression. These data have implications for our understanding of estrogen actions in the adult vertebrate brain.

  45. Li, H., G. Liu, and X. Xia. 2009. Correlations between recombination rate and intron distributions along chromosomes of C. elegans. Progress in Natural Science 19:517.

  46. Xia, X. 2009. Information-theoretic indices and an approximate significance test for testing the molecular clock hypothesis with genetic distances. Molecular Phylogenetics and Evolution 52:665-676.

    Distance-based phylogenetic methods are widely used in biomedical research. However, distance-based dating of speciation events and the test of the molecular clock hypothesis are relatively underdeveloped. Here I develop an approximate test of the molecular clock hypothesis for distance-based trees, as well as information-theoretic indices that have been used frequently in model selection, for use with distance matrices. The results are in good agreement with the conventional sequence-based likelihood ratio test. Among the information-theoretic indices, AICu is the most consistent with the sequence-based likelihood ratio test. The confidence in model selection by the indices can be evaluated by bootstrapping. I illustrate the usage of the indices and the approximate significance test with both empirical and simulated sequences. The tests show that distance matrices from protein gel electrophoresis and from genome rearrangement events do not violate the molecular clock hypothesis, and that the evolution of the third codon position conforms to the molecular clock hypothesis better than the second codon position in vertebrate mitochondrial genes. I outline evolutionary distances that are appropriate for phylogenetic reconstruction and dating.

  47. Xia, X., Holcik, M., 2009. Strong Eukaryotic IRESs Have Weak Secondary Structure. PLoS ONE 4, e4136.

    BACKGROUND: The objective of this work was to investigate the hypothesis that eukaryotic Internal Ribosome Entry Sites (IRES) lack secondary structure and to examine the generality of the hypothesis. METHODOLOGY/PRINCIPAL FINDINGS: IRESs of the yeast and the fruit fly are located in the 5'UTR immediately upstream of the initiation codon. The minimum folding energy (MFE) of 60 nt RNA segments immediately upstream of the initiation codons was calculated as a proxy of secondary structure stability. MFE of the reverse complements of these 60 nt segments was also calculated. The relationship between MFE and empirically determined IRES activity was investigated to test the hypothesis that strong IRES activity is associated with weak secondary structure. We show that IRES activity in the yeast and the fruit fly correlates strongly with the structural stability, with highest IRES activity found in RNA segments that exhibit the weakest secondary structure. CONCLUSIONS: We found that a subset of eukaryotic IRESs exhibits very low secondary structure in the 5'-UTR sequences immediately upstream of the initiation codon. The consistency in results between the yeast and the fruit fly suggests a possible shared mechanism of cap-independent translation initiation that relies on an unstructured RNA segment.

  48. Cong, P., X. Xia, and Q. Yang. 2009. Monophyly of the ring-forming group in Diplopoda (Myriapoda, Arthropoda) based on SSU and LSU ribosomal RNA sequences. Progress in Natural Science 19:1297-1303

  49. Zhang, D., H. Xiong, J. A. Mennigen, J. T. Popesku, V. L. Marlatt, C. J. Martyniuk, K. Crump, A. R. Cossins, X. Xia, and V. L. Trudeau. 2009. Defining Global Neuroendocrine Gene Expression Patterns Associated with Reproductive Seasonality in Fish. PLoS ONE 4:e5816.

    BACKGROUND: Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it has long been believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. METHODOLOGY/PRINCIPAL FINDINGS: In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October that were kept under a long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. CONCLUSIONS/SIGNIFICANCE: Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.

  50. Zhang, D., H. Xiong, J. Shan, X. Xia, and V. Trudeau. 2008. Functional insight into Maelstrom in the germline piRNA pathway: a unique domain homologous to the DnaQ-H 3'-5' exonuclease, its lineage-specific expansion/loss and evolutionarily active site switch. Biology Direct 3:48.

    Maelstrom (MAEL) plays a crucial role in a recently-discovered piRNA pathway; however, its specific function remains unknown. Here a novel MAEL-specific domain characterized by a set of conserved residues (Glu-His-His-Cys-His-Cys, EHHCHC) was identified in a broad range of species including vertebrates, sea squirts, insects, nematodes, and protists. It exhibits ancient lineage-specific expansions in several species but appears to be lost in all examined teleost fish species. Functional involvement of MAEL domains in DNA- and RNA-related processes was further revealed by its association with HMG, SR-25-like and HDAC_interact domains. A distant similarity to the DnaQ-H 3'-5' exonuclease family with the RNase H fold was discovered based on the evidence that all MAEL domains adopt the canonical RNase H fold; and several protist MAEL domains contain the conserved 3'-5' exonuclease active site residues (Asp-Glu-Asp-His-Asp, DEDHD). This evolutionary link together with structural examinations leads to a hypothesis that MAEL domains may have a potential nuclease activity or RNA-binding ability that may be implicated in piRNA biogenesis. The observed transition of two sets of characteristic residues between the ancestral DnaQ-H and the descendant MAEL domains may suggest a new mode for protein function evolution called "active site switch", in which the protist MAEL homologues are the likely evolutionary intermediates due to harboring the specific characteristics of both 3'-5' exonuclease and MAEL domains.

  51. Popesku, J. T., C. J. Martyniuk, J. Mennigen, H. Xiong, D. Zhang, X. Xia, A. R. Cossins, and V. L. Trudeau. 2008. The goldfish (Carassius auratus) as a model for neuroendocrine signaling. Mol Cell Endocrinol. 293(1-2):43-56

    Goldfish (Carassius auratus) are excellent model organisms for the neuroendocrine signaling and the regulation of reproduction in vertebrates. Goldfish also serve as useful model organisms in numerous other fields. In contrast to mammals, teleost fish do not have a median eminence; the anterior pituitary is innervated by numerous neuronal cell types and thus, pituitary hormone release is directly regulated. Here we briefly describe the neuroendocrine control of luteinizing hormone. Stimulation by gonadotropin-releasing hormone and a multitude of classical neurotransmitters and neuropeptides is opposed by the potent inhibitory actions of dopamine. The stimulatory actions of gamma-aminobutyric acid and serotonin are also discussed. We will focus on the development of a cDNA microarray composed of carp and goldfish sequences which has allowed us to examine neurotransmitter-regulated gene expression in the neuroendocrine brain and to investigate potential genomic interactions between these key neurotransmitter systems. We observed that isotocin (fish homologue of oxytocin) and activins are regulated by multiple neurotransmitters, which is discussed in light of their roles in reproduction in other species. We have also found that many novel and uncharacterized goldfish expressed sequence tags in the brain are also regulated by neurotransmitters. Their sites of production and whether they play a role in neuroendocrine signaling and control of reproduction remain to be determined. The transcriptomic tools developed to study reproduction could also be used to advance our understanding of neuroendocrine-immune interactions and the relationship between growth and food intake in fish.

  52. Vinci, G., X. Xia, and R. A. Veitia. 2008. Preservation of Genes Involved in Sterol Metabolism in Cholesterol Auxotrophs: Facts and Hypotheses. PLoS ONE 3:e2883.

    BACKGROUND: It is known that primary sequences of enzymes involved in sterol biosynthesis are well conserved in organisms that produce sterols de novo. However, we provide evidence for a preservation of the corresponding genes in two animals unable to synthesize cholesterol (auxotrophs): Drosophila melanogaster and Caenorhabditis elegans. Principal Findings: We have been able to detect bona fide orthologs of several ERG genes in both organisms using a series of complementary approaches. We have detected strong sequence divergence between the orthologs of the nematode and of the fruitfly; they are also very divergent with respect to the orthologs in organisms able to synthesize sterols de novo (prototrophs). Interestingly, the orthologs in both the nematode and the fruitfly are still under selective pressure. It is possible that these genes, which are not involved in cholesterol synthesis anymore, have been recruited to perform different new functions. We propose a more parsimonious way to explain their accelerated evolution and subsequent stabilization. The products of ERG genes in prototrophs might be involved in several biological roles, in addition to sterol synthesis. In the case of the nematode and the fruitfly, the relevant genes would have lost their ancestral function in cholesterogenesis but would have retained the other function(s), which keep them under pressure. Conclusions: By exploiting microarray data we have noticed a strong expressional correlation between the orthologs of ERG24 and ERG25 in D. melanogaster and genes encoding factors involved in intracellular protein trafficking and folding and with Start1 involved in ecdysteroid synthesis. These potential functional connections are worth being explored not only in Drosophila, but also in Caenorhabditis as well as in sterol prototrophs.

  53. Xia, X. 2008. The cost of wobble translation in fungal mitochondrial genomes: integration of two traditional hypotheses. BMC Evolutionary Biology 8:211.

    BACKGROUND: Fungal and animal mitochondrial genomes typically have one tRNA for each synonymous codon family. The codon-anticodon adaptation hypothesis predicts that the wobble nucleotide of a tRNA anticodon should evolve towards maximizing Watson-Crick base pairing with the most frequently used codon within each synonymous codon family, whereas the wobble versatility hypothesis argues that the nucleotide at the wobble site should be occupied by a nucleotide most versatile in wobble pairing, i.e., the tRNA wobble nucleotide should be G for NNY codon families, and U for NNR and NNN codon families (where Y stands for C or U, R for A or G and N for any nucleotide). RESULTS: We here integrate these two traditional hypotheses on tRNA anticodons into a unified model based on an analysis of the wobble costs associated with different wobble base pairs. This novel approach allows the relative cost of wobble pairing to be qualitatively evaluated. A comprehensive study of 36 fungal genomes suggests very different costs between two kinds of U:G wobble pairs, i.e., (1) between a G at the wobble site of a tRNA anticodon and a U at the third codon position (designated MU3:G) and (2) between a U at the wobble site of a tRNA anticodon and a G at the third codon position (designated MG3:U). CONCLUSION: In general, MU3:G is much smaller than MG3:U, suggesting no selection against U-ending codons in NNY codon families with a wobble G in the tRNA anticodon but strong selection against G-ending codons in NNR codon families with a wobble U at the tRNA anticodon. This finding resolves several puzzling observations in fungal genomics and corroborates previous studies showing that U3:G wobble is energetically more favorable than G3:U wobble.

  54. Mennigen, J. A., C. J. Martyniuk, K. Crump, H. Xiong, E. Zhao, J. Popesku, H. Anisman, A. R. Cossins, X. Xia, and V. L. Trudeau. 2008. Effects of fluoxetine on the reproductive axis of female goldfish (Carassius auratus). Physiol. Genomics 35:273-282.

    We investigated the effects of fluoxetine, a selective serotonin reuptake inhibitor, on neuroendocrine function and the reproductive axis in female goldfish. Fish were given intraperitoneal injections of fluoxetine twice a week for 14 days, resulting in five injections of 5 microg fluoxetine/g body wt. We measured the monoamine neurotransmitters serotonin, dopamine, and norepinephrine in addition to their metabolites with HPLC. Homovanillic acid, a metabolite in the dopaminergic pathway, increased significantly in the hypothalamus. Plasma estradiol levels were measured by radioimmunoassay and were significantly reduced approximately threefold after fluoxetine treatment. We found that fluoxetine also significantly reduced the expression of estrogen receptor (ER)beta1 mRNA by 4-fold in both the hypothalamus and the telencephalon and ERalpha mRNA by 1.7-fold in the telencephalon. Fluoxetine had no effect on the expression of ERbeta2 mRNA in the hypothalamus or telencephalon. Microarray analysis identified isotocin, a neuropeptide that stimulates reproductive behavior in fish, as a candidate gene affected by fluoxetine treatment. Real-time RT-PCR verified that isotocin mRNA was downregulated approximately sixfold in the hypothalamus and fivefold in the telencephalon. Intraperitoneal injection of isotocin (1 microg/g) increased plasma estradiol, providing a potential link between changes in isotocin gene expression and decreased circulating estrogen in fluoxetine-injected fish. Our results reveal targets of serotonergic modulation in the neuroendocrine brain and indicate that fluoxetine has the potential to affect sex hormones and modulate genes involved in reproductive function and behavior in the brain of female goldfish. We discuss these findings in the context of endocrine disruption because fluoxetine has been detected in the environment.

  55. Marin, A. and Xia, X. 2008. GC skew in protein-coding genes between the leading and lagging strands in bacterial genomes: new substitution models incorporating strand-bias. Journal of Theoretical Biology 253(3):508-513

    The DNA strands in most prokaryotic genomes experience strand-biased spontaneous mutation, especially C→T mutations produced by deamination that occur preferentially in the leading strand. This has often been invoked to account for the asymmetry in nucleotide composition, typically measured by GC skew, between the leading and the lagging strand. Casting such strand asymmetry in the framework of a nucleotide substitution model is important for understanding genomic evolution and phylogenetic reconstruction. We present a substitution model showing that the increased C→T mutation will lead to positive GC skew in one strand but negative GC skew in the other, with greater C→T mutation pressure associated with greater differences in GC skew between the leading and the lagging strand. However, the model based on mutation bias alone does not predict any positive correlation in GC skew between the leading and lagging strands. We computed GC skew for coding sequences collinear with the leading and lagging strands across 339 prokaryotic genomes and found a strong and positive correlation in GC skew between the two strands. We show that the observed positive correlation can be satisfactorily explained by an improved substitution model with one additional parameter incorporating a general trend of C avoidance.
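
    The GC skew index mentioned in the abstract above is conventionally computed as (G − C)/(G + C) on one strand. The sketch below illustrates only that index, not the substitution models in the paper; the function names are hypothetical and the short sequence is a made-up example.

```python
def gc_skew(seq):
    """GC skew of one strand: (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        raise ValueError("sequence contains no G or C")
    return (g - c) / (g + c)

def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

# Toy sequence, invented for illustration: three G and one C.
skew_forward = gc_skew("GGGC")
skew_reverse = gc_skew(reverse_complement("GGGC"))
```

    Because every G on one strand pairs with a C on the other, the two complementary strands of the same stretch of DNA always have skews of equal size and opposite sign. The comparison in the abstract is a different one: it is between coding sequences collinear with the leading strand and those collinear with the lagging strand.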

  56. Aris-Brosou, S. and Xia, X. 2008. Phylogenetic analyses: a toolbox expanding towards Bayesian methods. International Journal of Plant Genomics Article ID 683509.

    The reconstruction of phylogenies is becoming an increasingly simple activity. This is mainly due to two reasons: the democratization of computing power and the increased availability of sophisticated yet user-friendly software. This review describes some of the latest additions to the phylogenetic toolbox, along with some of their theoretical and practical limitations. It is shown that Bayesian methods are under heavy development, as they offer the possibility to solve a number of long-standing issues and to integrate several steps of the phylogenetic analyses into a single framework. Specific topics include not only phylogenetic reconstruction, but also the comparison of phylogenies, the detection of adaptive evolution, and the estimation of divergence times between species.

  57. Carullo, M. and Xia, X. 2008. An extensive study of mutation and selection on the wobble nucleotide in tRNA anticodons in fungal mitochondrial genomes. Journal of Molecular Evolution 66:484-493.

    Two alternative hypotheses aim to predict the wobble nucleotide of tRNA anticodons in mitochondria. The codon-anticodon adaptation hypothesis predicts that the wobble nucleotide of tRNA anticodon should evolve toward maximizing the Watson-Crick base pairing with the most frequently used codon within each synonymous codon family. In contrast, the wobble versatility hypothesis argues that the nucleotide at the wobble site should be occupied by a nucleotide most versatile in wobble pairing, i.e., the wobble site of the tRNA anticodon should be G for NNY codon families and U for NNR and NNN codon families (where Y stands for C or U, R for A or G, and N for any nucleotide). We examined codon usage and anticodon wobble sites in 36 fungal genomes to evaluate these two alternative hypotheses and identify exceptional cases that deserve new explanations. While the wobble versatility hypothesis is generally supported, there are interesting exceptions involving tRNA(Arg) translating the CGN codon family, tRNA(Trp) translating the UGR codon family, and tRNA(Met) translating the AUR codon family. Our results suggest that the potential to suppress stop codons, the historical inertia, and the conflict between translation initiation and elongation can all contribute to determining the wobble nucleotide of tRNA anticodons.

  58. Marlatt, V.L., Martyniuk, C. J., Zhang, D., Xiong, H., Watt, J., Xia, X., Moon, T., Trudeau, V.L. 2008. Auto-regulation of estrogen receptor subtypes and gene expression profiling of 17beta-estradiol action in the neuroendocrine axis of male goldfish. Molecular and Cellular Endocrinology 283:38-48.

    Auto-regulation of the three goldfish estrogen receptor (ER) subtypes was examined simultaneously in multiple tissues, in relation to mRNA levels of liver vitellogenin (VTG) and brain transcripts. Male goldfish were implanted with a silastic implant containing either no steroid or 17beta-estradiol (E2) (100 microg/g body mass) for one and seven days. Liver transcript levels of ERalpha were the most highly up-regulated of the ERs, and a parallel induction of liver VTG was observed. In the testes (7d) and telencephalon (7d), E2 induced ERalpha. In the liver (1d) and hypothalamus (7d) ERbeta1 was down-regulated, while ERbeta2 remained unchanged under all conditions. Although aromatase B levels increased in the brain, the majority of candidate genes identified by microarray in the hypothalamus (1d) decreased. These results demonstrate that ER subtypes are differentially regulated by E2, and several brain transcripts decrease upon short-term elevation of circulating E2 levels.

  59. Xiong, H., Zhang, D., Martyniuk, C.J., Trudeau, V.L., Xia, X. 2008. Using Generalized Procrustes Analysis (GPA) for normalization of cDNA microarray data. BMC Bioinformatics 9:25.

    BACKGROUND: Normalization is essential in dual-labelled microarray data analysis to remove non-biological variations and systematic biases. Many normalization methods have been used to remove such biases within slides (Global, Lowess) and across slides (Scale, Quantile and VSN). However, all these popular approaches have critical assumptions about data distribution, which are often not valid in practice. RESULTS: In this study, we propose a novel assumption-free normalization method based on the Generalized Procrustes Analysis (GPA) algorithm. Using experimental and simulated normal microarray data and boutique array data, we systematically evaluate the ability of the GPA method in normalization compared with six other popular normalization methods including Global, Lowess, Scale, Quantile, VSN, and one boutique array-specific housekeeping gene method. The assessment of these methods is based on three different empirical criteria: across-slide variability, the Kolmogorov-Smirnov (K-S) statistic and the mean square error (MSE). Compared with other methods, the GPA method performs effectively and consistently better in reducing across-slide variability and removing systematic bias. CONCLUSION: The GPA method is an effective normalization approach for microarray data analysis. In particular, it is free from the statistical and biological assumptions inherent in other normalization methods that are often difficult to validate. Therefore, the GPA method has a major advantage in that it can be applied to diverse types of array sets, especially to the boutique array where the majority of genes may be differentially expressed.

  60. Khalouei, S., Xia, X. 2008. Selective pressure against AUG triplets in the 5' untranslated region of human immunodeficiency virus type 1 supports cap-dependent translation initiation mechanism. Retrovirology: Research and Treatment 2:1-8.
  61. Xia, X. 2007. An Improved Implementation of Codon Adaptation Index. Evolutionary Bioinformatics 3:53–58.

    Codon adaptation index is a widely used index for characterizing gene expression in general and translation efficiency in particular. Current computational implementations have a number of problems leading to various systematic biases. I illustrate these problems and provide a better computer implementation to solve these problems. The improved CAI can predict protein production better than CAI from other commonly used implementations.

    Correction: In discussing the problem arising when a codon is not used in the reference set of highly expressed genes, which would yield w=0, I stated that Sharp & Li (1987) suggested using w=0.5 in that situation. Sharp & Li (1987) actually suggested using Xij=0.5. Michael Bulmer (1988, J. Evol. Biol.) suggested an alternative modification, which is to set the minimum value of w to be 0.01.
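
    To make the quantities in the correction concrete, the following is a minimal sketch of the classic formulation, in which w for a codon is its count Xij in the reference set divided by the count of the most-used synonymous codon, and CAI is the geometric mean of w over the codons of a gene. It is not the improved implementation described in the paper. The two-family codon table, the reference counts, and the function names are hypothetical, and zero counts are replaced by 0.5 following the Xij=0.5 suggestion noted above.

```python
import math

# Hypothetical, deliberately tiny set of synonymous families (not a full code).
FAMILIES = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
}


def relative_adaptiveness(ref_counts):
    """Return w = X_ij / max_j(X_ij) per family; zero counts become 0.5."""
    w = {}
    for codons in FAMILIES.values():
        counts = [ref_counts.get(codon, 0) or 0.5 for codon in codons]
        top = max(counts)
        for codon, x in zip(codons, counts):
            w[codon] = x / top
    return w


def cai(gene_codons, w):
    """Return the geometric mean of w over the gene's codons found in w."""
    logs = [math.log(w[codon]) for codon in gene_codons if codon in w]
    if not logs:
        raise ValueError("no scorable codons in gene")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts; GAG is unused in the reference set.
reference = {"AAA": 80, "AAG": 20, "GAA": 50, "GAG": 0}
weights = relative_adaptiveness(reference)
gene = ["AAA", "AAG", "GAA", "GAG"]
print(weights, cai(gene, weights))
```

    Working in log space avoids multiplying many small w values directly, and the pseudo-count keeps a single unused codon from driving the whole index to zero.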

  62. Xia, X., Huang, H., Carullo, M., Betran, E., Moriyama, E. 2007. Conflict between translation initiation and elongation in vertebrate mitochondrial genomes. PLoS ONE 2(2): e227.

    The strand-biased mutation spectrum in vertebrate mitochondrial genomes results in an AC-rich L-strand and a GT-rich H-strand. Because the L-strand is the sense strand of 12 protein-coding genes out of the 13, the third codon position is overall strongly AC-biased. The wobble site of the anticodon of the 22 mitochondrial tRNAs is either U or G to pair with the most abundant synonymous codon, with only one exception. The wobble site of Met-tRNA is C instead of U, forming the Watson-Crick match with AUG instead of AUA, the latter being much more frequent than the former. This has been attributed to a compromise between translation initiation and elongation; i.e., AUG is not only a methionine codon, but also an initiation codon, and an anticodon matching AUG will increase the initiation rate. However, such an anticodon would impose selection against the use of AUA codons because AUA needs to be wobble-translated. According to this translation conflict hypothesis, AUA should be used relatively less frequently compared to UUA in the UUR codon family. A comprehensive analysis of mitochondrial genomes from a variety of vertebrate species revealed a general deficiency of AUA codons relative to UUA codons. In contrast, urochordate mitochondrial genomes with two tRNA(Met) genes with CAU and UAU anticodons exhibit increased AUA codon usage. Furthermore, six bivalve mitochondrial genomes with both of their tRNA-Met genes with a CAU anticodon have reduced AUA usage relative to three other bivalve mitochondrial genomes with one of their two tRNA-Met genes having a CAU anticodon and the other having a UAU anticodon. We conclude that the translation conflict hypothesis is empirically supported, and our results highlight the fine details of selection in shaping molecular evolution.

  63. Xia, X. 2007. The +4G site in Kozak consensus is not related to the efficiency of translation initiation. PLoS ONE 2(2):e188.

    The optimal context for translation initiation in mammalian species is GCCRCCaugG (where R = purine and "aug" is the initiation codon), with the -3R and +4G being particularly important. The presence of +4G has been interpreted as necessary for efficient translation initiation. Accumulated experimental and bioinformatic evidence has suggested an alternative explanation based on amino acid constraint on the second codon, i.e., Ala or Gly is needed as the second amino acid in the nascent peptide for the cleavage of the initiator Met, and the consequent overuse of Ala and Gly codons (GCN and GGN) leads to the +4G consensus. I performed a critical test of these alternative hypotheses on +4G based on 34169 human protein-coding genes and published gene expression data. The result shows that the prevalence of +4G is not related to translation initiation. Among the five G-starting codons, only alanine codons (GCN), and glycine codons (GGN) to a much smaller extent, are overrepresented at the second codon, whereas the other three codons are not overrepresented. While highly expressed genes have more +4G than lowly expressed genes, the difference is caused by GCN and GGN codons at the second codon. These results are inconsistent with +4G being needed for efficient translation initiation, but consistent with the amino acid constraint hypothesis.

  64. Khalouei, S., X. Yao, J. Mennigen, M. Carullo, P. Ma, Z. Song, H. Xiong, and Xia, X. 2007. Bioinformatic Approach to Identify Penultimate Amino Acids Efficient for N-Terminal Methionine Excision. Pp. 386-389. Bioinformatics and Biomedical Engineering, 2007, IEEE. The 1st International Conference on Bioinformatics and Biomedical Engineering (ICBBE2007).
  65. Martyniuk, C. J., Xiong H., Crump, K., Chiu, S., Sardana, R., Nadler, A., Gerrie, E. R., Xia, X., Trudeau, V. L. 2006. Gene expression profiling in the neuroendocrine brain of male goldfish (Carassius auratus) exposed to 17-alpha-ethinylestradiol. Physiol. Genomics 27(3):328-336.

    17-alpha ethinylestradiol (EE2), a pharmaceutical estrogen, is detectable in water systems worldwide. Although studies report on the effects of xenoestrogens in tissues such as liver and gonad, few studies to date have investigated the effects of EE2 in the vertebrate brain at a large scale. The purpose of this study was to develop a goldfish brain-enriched cDNA array and use this in conjunction with a mixed tissue carp microarray to study the genomic response to EE2 in the brain. Gonad-intact male goldfish were exposed to nominal concentrations of 0.1 nM (29.6 ng/l) and 1.0 nM (296 ng/l) EE2 for 15 days. Male goldfish treated with the higher dose of EE2 had significantly smaller gonads compared with controls. Males also had a significantly reduced level of circulating testosterone (T) and 17beta-estradiol (E2) in both treatment groups. Candidate genes identified by microarray analysis fall into functional categories that include neuropeptides, cell metabolism, and transcription/translation factors. Differentially expressed genes verified by real-time RT-PCR included brain aromatase, secretogranin-III, and interferon-related developmental regulator 1. Our results suggest that the expression of genes in the sexually mature adult brain appears to be resistant to low EE2 exposure but is affected significantly at higher doses of EE2. This study demonstrates that microarray technology is a useful tool to study the effects of endocrine disrupting chemicals on neuroendocrine function and suggests that exposure to EE2 may have significant effects on localized E2 synthesis in the brain by affecting transcription of brain aromatase.

  66. Xia, X. 2006. Topological Bias in Distance-Based Phylogenetic Methods: Problems with Over- and Underestimated Genetic Distances. Evolutionary Bioinformatics 2:375–387.

    I show several types of topological biases in distance-based methods that use the least-squares method to evaluate branch lengths and the minimum evolution (ME) or the Fitch-Margoliash (FM) criterion to choose the best tree. For a 6-species tree, there are two tree shapes, one with three cherries (a cherry is a pair of adjacent leaves descending from the most recent common ancestor), and the other with two. When genetic distances are underestimated, the 3-cherry tree shape is favored with either the ME or FM criterion. When the genetic distances are overestimated, the ME criterion favors the 2-cherry tree, but the direction of bias with the FM criterion depends on whether negative branches are allowed, i.e. allowing negative branches favors the 3-cherry tree shape but disallowing negative branches favors the 2-cherry tree shape. The extent of the bias is explored by computer simulation of sequence evolution.

  67. Wang, H. C., Xia, X., D. Hickey. 2006. Thermal adaptation of small ribosomal RNA genes: a comparative study. Journal of Molecular Evolution 63(1):120-126.

    We carried out a comprehensive survey of small subunit ribosomal RNA sequences from archaeal, bacterial, and eukaryotic lineages in order to understand the general patterns of thermal adaptation in the rRNA genes. Within each lineage, we compared sequences from mesophilic, moderately thermophilic, and hyperthermophilic species. We carried out a more detailed study of the archaea, because of the wide range of growth temperatures within this group. Our results confirmed that there is a clear correlation between the GC content of the paired stem regions of the 16S rRNA genes and the optimal growth temperature, and we show that this correlation cannot be explained simply by phylogenetic relatedness among the thermophilic archaeal species. In addition, we found a significant, positive relationship between rRNA stem length and growth temperature. These correlations are found in both bacterial and archaeal rRNA genes. Finally, we compared rRNA sequences from warm-blooded and cold-blooded vertebrates. We found that, while rRNA sequences from the warm-blooded vertebrates have a higher overall GC content than those from the cold-blooded vertebrates, this difference is not concentrated in the paired regions of the molecule, suggesting that thermal adaptation is not the cause of the nucleotide differences between the vertebrate lineages.

  68. Xia, X., Wang, H. C., Z. Xie, M. Carullo, H. Huang, and D. Hickey. 2006. Cytosine usage modulates the correlation between CDS length and CG content in prokaryotic genomes. Molecular Biology and Evolution 23:1450-1454.

    Previous studies have argued that, given the AT-rich nature of stop codons, the length and CG% of coding sequences (CDSs) should be positively correlated. This prediction is generally supported empirically by prokaryotic genomes. However, the correlation is weak for a number of species, with 4 species showing a negative correlation. Here we formulate a more general hypothesis incorporating selection against cytosine (C) usage to explain the lack of strong positive correlation between the length and GC% of CDSs. Two factors contribute to the selection against C usage in long CDSs. First, C is the least abundant nucleotide in the cell, and a long CDS should have fewer Cs to increase transcription efficiency. Second, C is prone to mutation to U/T and selection for increased reliability should reduce C usage in long CDSs. Empirical data from prokaryotic genomes lend strong support for this new hypothesis.

  69. Cai, J. J., Smith, D. K., Xia, X., Yuen, K. Y. 2006. MBEToolbox 2.0: An enhanced version of a MATLAB toolbox for Molecular Biology and Evolution. Evolutionary Bioinformatics 2:187-190.

    MBEToolbox is an extensible MATLAB-based software package for analysis of DNA and protein sequences. MBEToolbox version 2.0 includes enhanced functions for phylogenetic analyses by the maximum likelihood method. For example, it is capable of estimating the synonymous and nonsynonymous substitution rates using a novel or several known codon substitution models. MBEToolbox 2.0 introduces new functions for estimating site-specific evolutionary rates by using a maximum likelihood method or an empirical Bayesian method. It also incorporates several different methods for recombination detection. Multi-platform versions of the software are freely available at http://www.bioinformatics.org/mbetoolbox/.

  70. Xia, X. and G. Palidwor. 2005. Genomic Adaptation to Acidic Environment: Evidence from Helicobacter pylori. American Naturalist 166:776-784

    The origin of new functions is fundamental in understanding evolution, and three processes known as adaptation, preadaptation, and exaptation have been proposed as possible evolutionary pathways leading to the origin of new functions. Here we examine the origin of an acid resistance mechanism in the mammalian gastric pathogen Helicobacter pylori, with reference to these three evolutionary pathways. The mechanism involved is that H. pylori, when exposed to the acidic environment in mammalian stomach, restricts the acute proton entry across its membrane by its increased usage of positively charged amino acids in the inner and outer membrane proteins. The results of our comparative genomic analysis between H. pylori, the two closely related species Helicobacter hepaticus and Campylobacter jejuni, and other relevant proteobacterial species are incompatible with the hypotheses invoking preadaptation or exaptation. The acid resistance mechanism most likely arose by selection favoring an increased usage of positively charged lysine in membrane proteins.

  71. Shi, B., and X. Xia. 2005. Genetic variation in clones of Pseudomonas pseudoalcaligenes after ten months of selection in different thermal environments in the laboratory. Curr Microbiol 50:238-45.

    The random amplification of polymorphic DNA (RAPD) method was used to examine genetic variation in experimental clones of Pseudomonas pseudoalcaligenes in two experimental groups, as well as their common ancestor. Six clones derived from a single colony of P. pseudoalcaligenes were cultured in two different thermal regimes for 10 months. Three clones in the Control group were cultured at constant temperature of 35 degrees C and another three clones in the High Temperature (HT) group were propagated at incremental temperature ranging from 41 to 47 degrees C for 10 months. A total of 45 RAPD primers generated 146 polymorphic markers. Analysis of molecular variance (AMOVA) revealed mild (11%) but significant (P < 0.001) genetic difference between the Control and the HT clones. Phylogenetic analysis based on pairwise genetic distances showed that the HT clones were more divergent from the ancestor and from each other than the Control clones, implying that the HT clones of P. pseudoalcaligenes may have evolved faster than the Control clones.

  72. Xia, X. and K. Y. Yuen. 2005. Differential selection and mutation between dsDNA and ssDNA phages shape the evolution of their genomic AT percentage. BMC Genetics 6:20.

    BACKGROUND: Bacterial genomes differ dramatically in AT%. We have developed a model to show that the genomic AT% in rapidly replicating bacterial species can be used as an index of the availability of nucleotides A and T for DNA replication in cellular medium. This index is then used to (1) study the evolution and adaptation of the bacteriophage genomic AT% in response to the differential nucleotide availability of the host and (2) test the prediction that double-stranded DNA (dsDNA) phage should exhibit better adaptation than single-stranded DNA (ssDNA) phage because the rate of spontaneous deamination, which leads to C→T or C→U mutations depending on whether C is methylated or not, is about 100-fold greater in ssDNA than in dsDNA. RESULTS: We retrieved 79 dsDNA phage and 27 ssDNA phage genomes together with their host genomic sequences. The dsDNA phages have their genomic AT% better adapted to the host genomic AT% than ssDNA phage. The poorer adaptation of the ssDNA phage can be partially accounted for by the C→T(U) mutations mediated by the spontaneous deamination. For ssDNA phage, the genomic A% is more strongly correlated with their host genomic AT% than the genomic T%. CONCLUSION: A significant fraction of variation in the genomic AT% in the dsDNA phage, and that in the genomic A% and T% of the ssDNA phage, can be explained by the difference in selection and mutation between them.

  73. Cai, J., Smith, D., X. Xia, and K. Y. Yuen. 2005. MBEToolbox: a Matlab toolbox for sequence data analysis of molecular biology and evolution. BMC Bioinformatics 6:64.

    BACKGROUND: MATLAB is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modeling, simulation, and scientific and engineering graphics. However, few functions are freely available in MATLAB to perform the sequence data analyses specifically required for molecular biology and evolution. RESULTS: We have developed a MATLAB toolbox, called MBEToolbox, aimed at filling this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible, functional framework for users with more specialized requirements to explore and analyze aligned nucleotide or protein sequences from an evolutionary perspective. The full functions in the toolbox are accessible through the command-line for seasoned MATLAB users. A graphical user interface, that may be especially useful for non-specialist end users, is also provided. CONCLUSION: MBEToolbox is a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/ and http://bioinformatics.org/project/?group_id=454

  74. Xia, X. 2005. Mutation and Selection on the Anticodon of tRNA Genes in Vertebrate Mitochondrial Genomes. Gene 345:13-20.

    The H-strand of vertebrate mitochondrial DNA is left single-stranded for hours during the slow DNA replication. This facilitates C→U mutations on the H-strand (and consequently G→A mutations on the L-strand) via spontaneous deamination which occurs much more frequently on single-stranded than on double-stranded DNA. For the 12 coding sequences (CDS) collinear with the L-strand, NNY synonymous codon families (where N stands for any of the four nucleotides and Y stands for either C or U) end mostly with C, and NNR and NNN codon families (where R stands for either A or G) end mostly with A. For the lone ND6 gene on the other strand, the codon bias is the opposite, with NNY codon families ending mostly with U and NNR and NNN codon families ending mostly with G. These patterns are consistent with the strand-specific mutation bias. The codon usage biased towards C-ending and A-ending in the 12 CDS sequences affects the codon-anticodon adaptation. The wobble site of the anticodon is always G for NNY codon families dominated by C-ending codons and U for NNR and NNN codon families dominated by A-ending codons. The only, but consistent, exception is the anticodon of tRNA-Met which consistently has a 5'-CAU-3' anticodon base-pairing with the AUG codon (the translation initiation codon) instead of the more frequent AUA. The observed CAU anticodon (matching AUG) would increase the rate of translation initiation but would reduce the rate of peptide elongation because most methionine codons are AUA, whereas the unobserved UAU anticodon (matching AUA) would increase the elongation rate at the cost of translation initiation rate. The consistent CAU anticodon in tRNA-Met suggests the importance of maximizing the rate of translation initiation.

  75. Baron, D., J. Cocquet, X. Xia, M. Fellous, Y. Guiguen, and R. A. Veitia. 2004. An evolutionary and functional analysis of FoxL2 in rainbow trout gonad differentiation. J. Mol. Endocrinol. 33:705-715.

    FOXL2 is a forkhead transcription factor involved in ovarian development and function. Here, we have studied the evolution and pattern of expression of the FOXL2 gene and its paralogs in fish. We found well conserved FoxL2 sequences (FoxL2a) and divergent genes, whose forkhead domains belonged to the class L2 and were shown to be paralogs of the FoxL2a sequences (named FoxL2b). In the rainbow trout, FoxL2a and FoxL2b were specifically expressed in the ovary, but displayed different temporal patterns of expression. FoxL2a expression correlated with the level of aromatase, the key enzyme in estrogen production, and an estrogen treatment used to feminize genetically male individuals elicited the up-regulation of both paralogs. Conversely, androgens or an aromatase inhibitor down-regulated FoxL2a and FoxL2b in females. We speculate that there is a direct link between estrogens and FoxL2 expression in fish, at least during the period where the identity of the gonad is sensitive to hormonal treatments.

  76. Xia, X. 2004. A peculiar codon usage pattern revealed after removing the effect of DNA methylation. Proceedings of the 4th International Conference on Bioinformatics of Genome Regulation and Structure 1:216-220.
  77. Cocquet, J., E. De Baere, M. Gareil, M. Pannetier, X. Xia, M. Fellous, R. Veitia. 2003. Structure, evolution and expression of the FOXL2 transcription unit. Cytogenetic Genome Res 101:206-211.

    FOXL2 is a putative transcription factor involved in ovarian development and function. Its mutations in humans are responsible for the blepharophimosis syndrome, characterized by eyelid malformations and premature ovarian failure (POF). Here we have performed a comparative sequence analysis of FOXL2 sequences of ten vertebrate species. We demonstrate that the entire open reading frame (ORF) is under purifying selection leading to strong protein conservation. We also review recent data on FOXL2 transcript and protein expression. FOXL2 has been shown 1) to be the earliest known sex dimorphic marker of ovarian determination/differentiation in vertebrates, 2) to have, at least in mammals, an ovarian expression persisting until adulthood. The conservation of its sequence and pattern of expression suggests that FOXL2 might be a key factor in the early development of the vertebrate female gonad and involved later in adult ovarian function. Finally, we provide arguments for the existence of an alternative transcript in rodents, that may arise from a differential polyadenylation. Although it has only been demonstrated in rodents, its presence/absence in other species deserves further investigation.

  78. X. Xia. 2003. DNA methylation and Mycoplasma genomes. Journal of Molecular Evolution 57:S21-S28.

    DNA methylation is one of the many hypotheses proposed to explain the observed deficiency in CpG dinucleotides in a variety of genomes covering a wide taxonomic distribution. Recent studies challenged the methylation hypothesis on empirical grounds. First, it cannot explain why the Mycoplasma genitalium genome exhibits strong CpG deficiency without DNA methylation. Second, it cannot explain the great variation in CpG deficiency between M. genitalium and M. pneumoniae that also does not have CpG-specific methyltransferase genes. I analyzed the genomic sequences of these Mycoplasma species together with the recently sequenced genomes of M. pulmonis, Ureaplasma urealyticum, and Staphylococcus aureus, and found the results fully compatible with the methylation hypothesis. In particular, I present compelling empirical evidence to support the following scenario. The common ancestor of the three Mycoplasma species has CpG-specific methyltransferases, and has evolved strong CpG deficiency as a result of the specific DNA methylation. Subsequently, this ancestral genome diverged into M. pulmonis and the common ancestor of M. pneumoniae and M. genitalium. M. pulmonis has retained methyltransferases and exhibits the strongest CpG deficiency. The common ancestor lost the methyltransferase gene and then diverged into M. genitalium and M. pneumoniae. M. genitalium and M. pneumoniae, after losing methylation activities, began to regain CpG dinucleotides through random mutation. M. genitalium evolved more slowly than M. pneumoniae, gained relatively fewer CpG dinucleotides, and is more CpG-deficient.
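
    CpG deficiency is commonly summarized as the ratio of the observed CpG dinucleotide frequency to the frequency expected from the C and G contents alone. The sketch below computes that ratio; the function name and the toy sequences are assumptions for illustration, and the exact index used in the paper may differ.

```python
def cpg_obs_exp(seq: str) -> float:
    """Return observed/expected CpG, P(CG) / (P(C) * P(G)), for one strand."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    p_c = seq.count("C") / n
    p_g = seq.count("G") / n
    if p_c == 0 or p_g == 0:
        raise ValueError("sequence lacks C or G; expected CpG is zero")
    # Count CG dinucleotides over the n - 1 adjacent positions.
    n_cg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    p_cg = n_cg / (n - 1)
    return p_cg / (p_c * p_g)


# Hypothetical toy sequences: values below 1 indicate CpG deficiency.
print(cpg_obs_exp("CATGCATGCCGG"), cpg_obs_exp("CATGCATGCATG"))
```

    A ratio near 1 means CpG occurs about as often as the base composition predicts, while values well below 1 correspond to the deficiency discussed in the abstract.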

  79. Shi, B., X. Xia. 2003. Changes in growth parameters of Pseudomonas pseudoalcaligenes after ten months culturing at increasing temperature in the laboratory. FEMS Microbiology Ecology 45:127-134.

    In this paper, we report the thermal adaptation of Pseudomonas pseudoalcaligenes, characterized as changes in growth parameters. Six clones derived from a single colony of P. pseudoalcaligenes were cultured in two different temperature regimes for 10 months, with three clones forming the control group, cultured at a constant temperature, and another three clones forming the high-temperature (HT) group, cultured at increasing temperature (from 41 to 47 degrees C). Three growth parameters were measured: the lag time (lambda), which is the period between the time of transfer to a new medium and the time when the cell replication starts; the maximum growth rate (mu(m)); and the maximum yield (A). These three parameters are major components of bacterial fitness. The Gompertz and logistic models were used to estimate these three parameters. The two models gave almost identical estimates, but the Gompertz model had R(2) values consistently larger than the logistic model. The HT clones had significantly shorter lambda, but higher mu(m) and A than the control clones when both were grown at the originally stressful temperature of 45 degrees C, suggesting significant thermal adaptation. Interestingly, the HT clones grew equally well as the control clones at 35 degrees C, i.e. improved performance at 45 degrees C was not associated with a reduced performance at 35 degrees C.
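
    The three growth parameters can be read off a fitted growth curve. The sketch below uses one common reparameterization of the Gompertz curve in which lambda, mu(m) and A appear explicitly. Which parameterization the paper fitted is not stated in the abstract, so the formula, the parameter values, and the function names here are assumptions for illustration only.

```python
import math


def gompertz(t, A, mu_m, lam):
    """Modified Gompertz curve: asymptote A, maximum slope mu_m, lag time lam."""
    return A * math.exp(-math.exp(mu_m * math.e / A * (lam - t) + 1.0))


def inflection_time(A, mu_m, lam):
    """Time at which the curve is steepest; the slope there equals mu_m."""
    return lam + A / (mu_m * math.e)


# Hypothetical parameter values (arbitrary units), not estimates from the paper.
A, MU_M, LAM = 2.0, 0.5, 3.0
for t in (0.0, LAM, inflection_time(A, MU_M, LAM), 20.0):
    print(round(t, 3), round(gompertz(t, A, MU_M, LAM), 4))
```

    In this form the tangent at the steepest point has slope mu(m) and crosses zero at t = lambda, so a shorter lag shifts the curve left while a larger A raises the plateau.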

  80. Xia, X., Z. Xie, K. Kjer. 2003. 18S rRNA and Tetrapod Phylogeny. Systematic Biology 52(3):283-295 (Editor's choice in )

    Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data, and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.

  81. Xia, X., Z. Xie, W. H. Li. 2003. Effects of GC Content and Mutational Pressure on the Lengths of Exons and Coding Sequences. Journal of Molecular Evolution 56:362-370.

    It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).

  82. Shi, B., X. Xia. 2003. Morphological changes of Pseudomonas pseudoalcaligenes in response to temperature selection. Current Microbiology 46:120-123.

    Adaptation to novel environments usually entails morphological changes. The cell morphology of six experimental populations of Pseudomonas pseudoalcaligenes and their common ancestor were examined with scanning electron microscopy (SEM). The six experimental populations were propagated under different temperatures for 10 months: three of them cultured at constant normal temperature (35 degrees C) forming the control group, and the other three cultured at incremental higher temperatures (from 41 degrees to 47 degrees C) as the HT group. SEM showed the deformed and elongated cells in the 6-h cultures of both ancestral and control populations at 45 degrees C, indicating that 45 degrees C is stressful for the ancestral and the control populations. In contrast, the HT populations retained normal cell shape in the 6-h cultures at both 35 degrees C and 45 degrees C. The mean cell volumes of control and HT populations increased 29% and 34%, respectively, relative to the ancestor at their respective thermal regimens, suggesting that the culturing conditions might favor larger cells.

  83. Xia, X., Z. Xie, M. Salemi, L. Chen, Y. Wang. 2003. An index of substitution saturation and its application. Molecular Phylogenetics and Evolution 26:1-7.

    We introduce a new index to measure substitution saturation in a set of aligned nucleotide sequences. The index is based on the notion of entropy in information theory. We derive the critical values of the index based on computer simulation with different sequence lengths, different numbers of OTUs and different topologies. The critical value enables researchers to quickly judge whether a set of aligned sequences is useful in phylogenetics. We illustrate the index by applying it to an analysis of the aligned sequences of the elongation factor-1alpha gene originally used to resolve the deep phylogeny of major arthropod groups. The method has been implemented in DAMBE.
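The abstract states only that the index is built on entropy from information theory. The following Python sketch shows one ingredient such an index could use, the mean per-site Shannon entropy of an alignment. It does not reproduce the published index, its normalization, or its critical values, and the function names and toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

def site_entropy(column):
    """Shannon entropy (bits) of the nucleotide frequencies in one alignment column."""
    counts = Counter(b for b in column.upper() if b in "ACGT")  # gaps and ambiguity codes ignored
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in counts.values())

def mean_site_entropy(alignment):
    """Average the per-site entropy over all columns of equal-length sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")
    columns = ("".join(seq[i] for seq in alignment) for i in range(length))
    return sum(site_entropy(col) for col in columns) / length

# Hypothetical four-sequence alignment, for illustration only.
toy = ["ACGT", "ACGA", "ACCA", "ATCA"]
h = mean_site_entropy(toy)
print(round(h, 4))
```

An invariant column contributes zero entropy and a column with all four nucleotides at equal frequency contributes the maximum of 2 bits, so higher mean entropy indicates more variable, potentially more saturated, sites.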

  84. Cocquet, J., E. Pailhoux, F. Jaubert, N. Servel, X. Xia, M. Pannetier, E. De Baere, L. Messiaen, C. Cotinot, M. Fellous, R. Veitia. 2002. Evolution and expression of FOXL2. Journal of Medical Genetics 39:916-921.

    Mutations in FOXL2, a forkhead transcription factor gene, have recently been shown to cause the blepharophimosis-ptosis-epicanthus inversus syndrome (BPES). This rare genetic disorder leads to a complex eyelid malformation associated or not with premature ovarian failure (BPES type I or II, respectively). We performed a comparative analysis of the FOXL2 sequence in several species (human, goat, mouse, and pufferfish) showing that the FOXL2 coding region is highly conserved in these species. The FOXL2 protein contains a polyalanine tract whose role has not yet been elucidated. Recurrent mutations leading to its expansion result in BPES type II and account for 30% of the deleterious alterations detected in the open reading frame (ORF) of FOXL2. We showed that the number of alanine residues is strictly conserved among the mammals studied, suggesting the existence of strong functional or structural constraints. We provide immunohistochemical evidence indicating that FOXL2 is a nuclear protein specifically expressed in eyelids and in fetal and adult ovarian follicular cells. It does not undergo any major post-translational maturation. FOXL2 is the earliest known marker of ovarian differentiation in mammals and may play a role in ovarian somatic cell differentiation and in further follicle development and/or maintenance.

  85. Xia, X., T. Wei, Z. Xie and A. Danchin. 2002. Genomic changes in nucleotide and di-nucleotide frequencies in Pasteurella multocida cultured under high temperature. Genetics 161:1385-94.

    We used 94 RAPD primers of different nucleotide composition to probe the genomic differences between a highly virulent P. multocida strain and an attenuated vaccine strain derived from the virulent strain after culturing the latter under increasing temperature for approximately 14,400 generations. The GC content of the vaccine strain is significantly (P < 0.05) lower than that of the virulent strain, contrary to the popular hypothesis of covariation between the GC content and temperature. The frequencies of AA, TA, and TT dinucleotides were higher, and those of AT, GC, and CG dinucleotides were lower, in the vaccine strain than in the virulent strain. A statistic called genomic RAPD entropy is formulated to measure the randomness of the genome, and the genome of the vaccine strain is more random than that of the virulent strain. These differences between the virulent and vaccine strains are interpreted in terms of mutation and selection under increased culturing temperature. A method for estimating substitution rates is developed in the appendix.

  86. Xia, X. and Z. Xie. 2002. Protein Structure, Neighbor Effect, and a New Index of Amino Acid Dissimilarities. Molecular Biology and Evolution 19:58-67.

    Amino acids interact with each other, especially with neighboring amino acids, to generate protein structures. We studied the pattern of association and repulsion of amino acids based on 24,748 protein-coding genes from human, 11,321 from mouse, and 15,028 from Escherichia coli, and documented the pattern of neighbor preference of amino acids. All amino acids have different preferences for neighbors. We have also analyzed 7,342 proteins with known secondary structure and estimated the propensity of the 20 amino acids occurring in three of the major secondary structures, i.e., helices, sheets, and turns. Much of the neighbor preference can be explained by the propensity of the amino acids in forming different secondary structures, but there are also a number of intriguing association and repulsion patterns. The similarity in neighbor preference among amino acids is significantly correlated with the number of amino acid substitutions in both mitochondrial and nuclear genes, with amino acids having similar sets of neighbors replacing each other more frequently than those having very different sets of neighbors. This similarity in neighbor preference is incorporated into a new index of amino acid dissimilarities that can predict nonsynonymous codon substitutions better than the two existing indices of amino acid dissimilarities, i.e., Grantham's and Miyata's distances.

  87. Xia, X. and Z. Xie. 2001. AMADA: Analysis of microarray data. Bioinformatics 17:569-570.

    AMADA is a Windows program for identifying co-expressed genes from microarray data. It performs data transformation, principal component analysis, a variety of cluster analyses and extensive graphic functions for visualizing expression profiles.

  88. Xia, X. and Z. Xie. 2001. DAMBE: Data analysis in molecular biology and evolution. Journal of Heredity 92:371-373.

    DAMBE (data analysis in molecular biology and evolution) is an integrated software package for converting, manipulating, statistically and graphically describing, and analyzing molecular sequence data with a user-friendly Windows 95/98/2000/NT interface. DAMBE is free and can be downloaded from http://web.hku.hk/~xxia/software/software.htm.

  89. Chen, B., and X. Xia. 2001. The genus Schevodera Borchmann: Phylogeny, historic biogeography and new Chinese records, with description of a new species (Coleoptera: Tenebrionidae: Lagriinae). Oriental Insects 35:3-27.

    Schevodera Borchmann belongs to the subfamily Lagriinae and its members are phytophagous. A new species, S. glabricollis, is described from China. Redescriptions of the genus and two known species, S. gracilicornis and S. inflata, with new records for China are given. A key to Chinese species is given. The phylogeny of the nine known species and one subspecies is cladistically analysed based on 21 morphological characters from adults. The confidence of the phylogram obtained from the cladistic analysis and its monophylies are examined with PTP and T-PTP tests. The ancestral distribution of the genus is also reconstructed based on the dispersal-vicariance analysis. The results suggest that the genus would be monophyletic. In the late Permian to late Triassic period, around 255–220 million years ago, it is hypothesized to have originated from a Lagria-like ancestral species between western Yunnan, China and Burma in the Shan-Thai terrain. It dispersed from western Yunnan and northern Burma to Sumatra and Java, and then northward through Borneo to Palawan, Luzon and finally Mindanao. Based on phylogeny and historic biogeography, the genus is divided into three species groups: Yunnan, Indonesia and Philippines groups. The Yunnan group is the most primitive, consisting of S. inflata, S. glabricollis and S. gracilicornis, and is mainly distributed in Yunnan and Burma. The Indonesia group includes S. hirticollis and S. hirticollis salvazai, S. curticollis and S. dohrni, and occurs primarily in Indonesia but also reaches into Burma and the Philippines. S. hirticollis salvazai has dispersed from Burma to Laos. The group originated from the ancestor of the Yunnan group after the Eocene, i.e. no longer than 50 million years ago. The monophyletic Philippines group is composed of three endemic species: S. setosa, S. spoliata and S. insularis. It originated from the ancestor of the Indonesian group after the Miocene around 20 million years ago and dispersed from Palawan to Luzon and then Mindanao. The synapomorphies between these groups, interspecific phylogenetic relationships, time and place of origin and potential distribution of each species are also discussed in detail.

  90. Xia, X. 2000. Phylogenetic Relationship among Horseshoe Crab Species: The Effect of Substitution Models on Phylogenetic Analyses. Systematic Biology 49:87-100.

    The horseshoe crabs, known as living fossils, have maintained their morphology almost unchanged for the past 150 million years. The little morphological differentiation among horseshoe crab lineages has resulted in substantial controversy concerning the phylogenetic relationship among the extant species of horseshoe crabs, especially among the three species in the Indo-Pacific region. Previous studies suggest that the three species constitute a phylogenetically unresolvable trichotomy, the result of a cladogenetic process leading to the formation of all three Indo-Pacific species in a short geological time. Data from two mitochondrial genes (for 16S ribosomal RNA and cytochrome oxidase subunit I) and one nuclear gene (for coagulogen) in the four species of horseshoe crabs and outgroup species were used in a phylogenetic analysis with various substitution models. All three genes yield the same tree topology, with Tachypleus gigas and Carcinoscorpius rotundicauda grouped together as a monophyletic taxon. This topology is significantly better than all the alternatives when evaluated with the RELL (resampling estimated log-likelihood) method.

  91. Xia, X. and W.-H. Li 1998. What amino acid properties affect protein evolution? Journal of Molecular Evolution 47:557-564.

    We studied 10 protein-coding mitochondrial genes from 19 mammalian species to evaluate the effects of 10 amino acid properties on the evolution of the genetic code, the amino acid composition of proteins, and the pattern of nonsynonymous substitutions. The 10 amino acid properties studied are the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation. The genetic code appears to have evolved toward minimizing polarity and hydropathy but not the other seven properties. This can be explained by our finding that the presumably primitive amino acids differed much only in polarity and hydropathy, but little in the other properties. Only the chemical composition (C) and isoelectric point (IE) appear to have affected the amino acid composition of the proteins studied, that is, these proteins tend to have more amino acids with typical C and IE values, so that nonsynonymous mutations tend to result in small differences in C and IE. All properties, except for hydroxythiolation, affect the rate of nonsynonymous substitution, with the observed amino acid changes having only small differences in these properties, relative to the spectrum of all possible nonsynonymous mutations.

  92. Xia, X. 1998. How optimized is the translational machinery in E. coli, S. typhimurium, and S. cerevisiae? Genetics 149: 37-44.

    The optimization of the translational machinery in cells requires the mutual adaptation of codon usage and tRNA concentration, and the adaptation of tRNA concentration to amino acid usage. Two predictions were derived based on a simple deterministic model of translation which assumes that elongation of the peptide chain is rate-limiting. The highest translational efficiency is achieved when the codon recognized by the most abundant tRNA reaches the maximum frequency. For each codon family, the tRNA concentration is optimally adapted to codon usage when the concentration of different tRNA species matches the square-root of the frequency of their corresponding synonymous codons. When tRNA concentration and codon usage are well adapted to each other, the optimal content of all tRNA species carrying the same amino acid should match the square-root of the frequency of the amino acid. These predictions are examined against empirical data from Escherichia coli, Salmonella typhimurium, and Saccharomyces cerevisiae.
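The square-root prediction above can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the abstract, that each synonymous codon is read by its own tRNA species and that "matches" means "is proportional to". The function name and the codon frequencies are hypothetical.

```python
from math import sqrt

def optimal_trna_proportions(codon_freqs):
    """Normalize the square roots of codon frequencies into tRNA proportions."""
    roots = {codon: sqrt(f) for codon, f in codon_freqs.items()}
    total = sum(roots.values())
    return {codon: r / total for codon, r in roots.items()}

# Hypothetical two-codon family with usage 0.8 and 0.2, for illustration only.
family = {"GAA": 0.8, "GAG": 0.2}
props = optimal_trna_proportions(family)
for codon, share in props.items():
    print(codon, round(share, 3))
```

Under these assumptions a fourfold difference in codon usage corresponds to only a twofold difference in tRNA concentration, which is the qualitative point of the square-root relationship.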

  93. Xia, X. 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Molecular Biology and Evolution 15:336-344.

    Substitution rates at the three codon positions (r1, r2, and r3) of mammalian mitochondrial genes are in the order of r3 > r1 > r2, and the rate heterogeneity at the three positions, as measured by the shape parameter of the gamma distribution (alpha 1, alpha 2, and alpha 3), is in the order of alpha 3 > alpha 1 > alpha 2. The causes for the rate heterogeneity at the three codon positions remain unclear and, in particular, there has been no satisfactory explanation for the observation of alpha 1 > alpha 2. I attempted to dissect the causes of rate heterogeneity by studying the pattern of nonsynonymous substitutions with respect to codon positions in 10 mitochondrial genes from 19 mammalian species. Nonsynonymous substitutions involve more different amino acid replacements at the second than at the first codon position, which results in r1 > r2. The difference between r1 and r2 increases with the intensity of purifying selection, and so does the rate heterogeneity in nonsynonymous substitutions among sites at the same codon position. All mitochondrial genes appear to have functionally important and unimportant codons, with the latter having all three codon positions prone to nonsynonymous substitutions. Within the functionally important codons, the second codon position is much more conservative than the first codon position. This explains why alpha 1 > alpha 2. The result suggests that overweighting of the second codon position in phylogenetic analysis may be a misguided practice.

  94. Xia, X. 1996. Maximising transcription efficiency causes codon usage bias. Genetics 144:1309-1320.

    The rate of protein synthesis depends on both the rate of initiation of translation and the rate of elongation of the peptide chain. The rate of initiation depends on the encountering rate between ribosomes and mRNA; this rate in turn depends on the concentration of ribosomes and mRNA. Thus, patterns of codon usage that increase transcriptional efficiency should increase mRNA concentration, which in turn would increase the initiation rate and the rate of protein synthesis. An optimality model of the transcriptional process is presented with the prediction that the most frequently used ribonucleotide at the third codon sites in mRNA molecules should be the same as the most abundant ribonucleotide in the cellular matrix where mRNA is transcribed. This prediction is supported by four kinds of evidence. First, A-ending codons are the most frequently used synonymous codons in mitochondria, where ATP is much more abundant than the three other ribonucleotides. Second, A-ending codons are more frequently used in mitochondrial genes than in nuclear genes. Third, protein genes from organisms with a high metabolic rate use more A-ending codons and have higher A content in their introns than those from organisms with a low metabolic rate.

  95. Xia, X., Hafner, M. S. and P. D. Sudman. 1996. On transition bias in mitochondrial genes of pocket gophers. Journal of Molecular Evolution 43:32-40.

    The relative contribution of mutation and purifying selection to transition bias has not been quantitatively assessed in mitochondrial protein genes. The observed transition/transversion (s/v) ratio is (μ_s P_s)/(μ_v P_v), where μ_s and μ_v denote the mutation rates of transitions and transversions, respectively, and P_s and P_v denote the fixation probabilities of transitions and transversions, respectively. Because selection against synonymous transitions can be assumed to be roughly equal to that against synonymous transversions, P_s/P_v ≈ 1 at fourfold degenerate sites, so that the s/v ratio at fourfold degenerate sites is approximately μ_s/μ_v, which is a measure of mutational contribution to transition bias. Similarly, the s/v ratio at nondegenerate sites is also an estimate of μ_s/μ_v if we assume that selection against nonsynonymous transitions is roughly equal to that against nonsynonymous transversions. In two mitochondrial genes, cytochrome oxidase subunit I (COI) and cytochrome b (cyt-b) in pocket gophers, the s/v ratio is about two at nondegenerate and fourfold degenerate sites for both the COI and the cyt-b genes. This implies that the mutational contribution to transition bias is relatively small. In contrast, the s/v ratio is much greater at twofold degenerate sites, being 48 for COI and 40 for cyt-b. Given that the μ_s/μ_v ratio is about 2, the P_s/P_v ratio at twofold degenerate sites must be on the order of 20 or greater. This suggests a great effect of purifying selection on transition bias in mitochondrial protein genes because transitions are synonymous and transversions are nonsynonymous at twofold degenerate sites in mammalian mitochondrial genes. We also found that nonsynonymous mutations at twofold degenerate sites are more neutral than nonsynonymous mutations at nondegenerate sites, and that the COI gene is subject to stronger purifying selection than is the cyt-b gene. A model is presented to integrate the effect of purifying selection, codon bias, DNA repair and GC content on s/v ratio of protein-coding genes.
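The arithmetic behind the "on the order of 20 or greater" statement can be checked with a few lines of Python. The sketch simply rearranges the relation given in the abstract, s/v = (mutation rate ratio) × (fixation probability ratio), and plugs in the rounded values quoted there. The function name is hypothetical.

```python
def fixation_ratio(sv_ratio, mutation_ratio):
    """Solve s/v = (mu_s / mu_v) * (P_s / P_v) for P_s / P_v."""
    return sv_ratio / mutation_ratio

# mu_s / mu_v is taken as about 2, the s/v ratio reported for fourfold degenerate sites.
mutation_ratio = 2.0

# s/v ratios at twofold degenerate sites, as quoted in the abstract.
twofold_sv = {"COI": 48.0, "cyt-b": 40.0}

for gene, sv in twofold_sv.items():
    print(gene, fixation_ratio(sv, mutation_ratio))
```

With these inputs the implied fixation probability ratio is 24 for COI and 20 for cyt-b, consistent with the abstract's conclusion that selection, not mutation, accounts for most of the transition bias at twofold degenerate sites.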

  96. Xia, X. 1995. Body temperature, rate of biosynthesis, and evolution of genome size. Molecular Biology and Evolution 12:834-842.

    An optimality model relating the rate of biosynthesis to body temperature and gene duplication is presented to account for several observed patterns of genome size variation. The model predicts (1) that poikilotherms living in a warm climate should have a smaller genome than poikilotherms living in a cold climate, (2) that homeotherms should have a small genome as well as a small variation in genome size relative to their poikilothermic ancestors, (3) that cold geological periods should favor the evolution of poikilotherms with a large genome and that warm geological periods should do the opposite, and (4) that poikilotherms with a small genome should be more sensitive to changes in temperature than poikilotherms with a large genome. The model also offers two explanations for the empirically documented trend that organisms with a large cell volume have larger genomes than those with a small cell volume. Relevant empirical evidence is summarized to support these predictions.

  97. Xia, X. 1995. Revisiting Hamilton's rule. American Naturalist 145:483-492.
  98. Xia, X. 1993. A full sibling is not as valuable as an offspring: on Hamilton's rule. American Naturalist 142:174-185.
  99. Boonstra, R., Xia, X. and L. Pavone. 1992. Mating system of the meadow vole, Microtus pennsylvanicus. Behavioral Ecology 4:83-89.
  100. Xia, X. and R. Boonstra. 1992. Measuring temporal variation in population density: a critique. American Naturalist 140:883-892.
  101. Xia, X. 1992. Uncertainty of paternity can select against paternal care. American Naturalist 139:1126-1129.
  102. Xia, X. and J. S. Millar. 1991. Genetic evidence of promiscuity in Peromyscus leucopus. Behavioral Ecology and Sociobiology 28:171-178.
  103. Millar, J. S., Xia, X. and M. B. Norrie. 1991. Relationship among reproductive status, nutritional status and food characteristics in a natural population of Peromyscus maniculatus. Canadian Journal of Zoology 69:555-559.
  104. Xia, X. and J. S. Millar. 1990. Infestation of wild Peromyscus leucopus by bot fly larvae. Journal of Mammalogy 71:255-258.
  105. Xia, X. and J. S. Millar. 1989. Dispersion of adult males in relation to female reproductive status in Peromyscus leucopus. Canadian Journal of Zoology 67:1047-1052.
  106. El-Haddad, M., J. S. Millar and X. Xia. 1989. Offspring recognition by male Peromyscus maniculatus. Journal of Mammalogy 69:811-813.
  107. Xia, X. and J. S. Millar. 1988. Paternal behaviour by Peromyscus leucopus in enclosures. Canadian Journal of Zoology 66:1184-1187.
  108. Xia, X. and J. S. Millar. 1987. Morphological variation in deer mice in relation to sex and habitat. Canadian Journal of Zoology 65:527-533.
  109. Xia, X. and J. S. Millar. 1986. Sex-related dispersion in Peromyscus maniculatus. Canadian Journal of Zoology 64:933-936.

Books:

  1. Xia, X. 2013. Comparative genomics. Springer. VIII, 67 pp. Hardcopy for $24.99 at SpringerLink
  2. Xia, X. 2007. Bioinformatics and the cell: modern computational approaches in genomics, proteomics and transcriptomics. Springer. 361 pp. Hardcopy for $24.99 at SpringerLink
  3. Xia, X. 2000. Data Analysis in Molecular Biology and Evolution. Kluwer Academic Publishers. Find the book in a library near you.

Book chapters:

  1. Xia, X. 2014. Phylogenetic Bias in the Likelihood Method Caused by Missing Data Coupled with Among-Site Rate Variation: An Analytical Approach. pp. 12-23 in M. Basu, Y. Pan, J. Wang, eds. Bioinformatics Research and Applications. Springer.
  2. More and more researchers in phylogenetics are concatenating gene sequences to produce supermatrices in the hope that larger data sets will lead to better phylogenetic resolution. Almost all of these supermatrices contain a high proportion of missing data which could potentially cause phylogenetic bias. Previous studies aiming to identify the missing-data-mediated bias in the maximum likelihood method have noted a bias associated with among-site rate variation. However, this finding is by sequence simulation and has been challenged by other simulation studies, with the controversy still unresolved. Here I illustrate analytically this bias caused by missing data coupled with among-site rate variation. This approach allows one to see how much the bias can contribute to likelihood differences among different topologies. The study highlights the point that, while supermatrices may lead to “robust” trees, such “robust” trees may be purchased with illegal phylogenetic currency.

  3. Xia, X. and Q. Yang. 2013. Cenancestor. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition, Academic Press, San Diego. Volume 1, pp. 493-494.
  4. Abstract: Cenancestor, the last universal common ancestor, is assumed to exist on the basis of extensive sharing of inferred homologous characters among representatives of living cellular organisms, such as the near universal genetic code, the concordance of phylogenetic trees from different genes, the sharing of fundamental biochemical processes, and the existence of numerous transitional fossils. It is a logical necessity if the cellular structure originated only once, given the cell theory stating that new cells are created by old cells dividing into two. One early concept of cenancestor is a genome that codes a minimal set of core genes essential for cellular life (the minimal genome) and from which all other genomes are derived. However, few genes are shared universally because a biological function can often be performed by unrelated genes. Even if such a set of core genes can be identified, the identification and dating of the cenancestor is difficult because of the lack of a universal global molecular clock and the rampant horizontal gene transfer. The scientific consensus of the cenancestor is neither a single cell nor a single genome, but is instead an entangled bank of heterogeneous genomes with relatively free flow of genetic information. Out of this entangled bank of frolicking genomes arose probably many evolutionary lineages with horizontal gene transfer gradually reduced and confined within individual lineages. Only three (Archaea, Eubacteria, and Eukarya) of these early lineages have representatives that have survived to this day.

  5. Xia, X. 2013. Codominance. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition, Academic Press, San Diego. Volume 2, pp. 63-64.
  6. Abstract: Codominance pertains to the genetic phenomenon in which gene products from the two alleles in a heterozygote are produced in roughly equal amount, where gene products refer to either different transcripts from the two alleles, different proteins from cellular processing of the transcripts, or different metabolites specifically associated with the enzymatic activity of the allele-specific transcripts or proteins. The AB heterozygote at the classical blood type locus (the ABO locus) expresses both the A and B blood type antigens and has been considered as a classical case of codominance. Another example of codominance is the beta-thalassemia minor involving a mutant hemoglobin β-chain. The heterozygote (β°β) exhibits codominance because both alleles produce roughly equal amount of their respective proteins. Incomplete dominance pertains to the genetic phenomenon in which the distinct gene products from the two codominant alleles in a heterozygote blend to form a phenotype intermediate between those of the two homozygotes. A regression method for characterizing different degrees of dominance is numerically illustrated.

  7. Xia, X. and C.R. Primmer. 2013. Genotypic Frequency. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 3, pp. 319-320.
  8. Abstract: The genotypic frequency is the frequency of a particular genotype in a population. It is frequently used to estimate allele frequencies, the inbreeding coefficient F, and a variety of other genetic parameters including genetic distances for building phylogenetic trees and dating cladogenetic events. Maximum likelihood and least-squares methods are often used for such estimations and for model testing, for example, between a model that assumes a hidden allele and a model that does not.
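As a small illustration of how genotypic frequencies lead to allele frequencies and F, the following Python sketch uses simple gene counting at a biallelic codominant locus and the textbook relation F = 1 - H_obs / (2pq). It is not the maximum likelihood or least-squares machinery the entry refers to, and the function name and genotype counts are hypothetical.

```python
def allele_freq_and_f(n_aa, n_ab, n_bb):
    """Return (p, F) from counts of AA, AB and BB genotypes at one locus."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A by gene counting
    q = 1 - p
    h_obs = n_ab / n                 # observed heterozygosity
    h_exp = 2 * p * q                # heterozygosity expected under random mating
    if h_exp == 0:
        raise ValueError("F is undefined when only one allele is present")
    return p, 1 - h_obs / h_exp

# Hypothetical sample of 100 individuals, for illustration only.
p, f = allele_freq_and_f(30, 40, 30)
print(p, round(f, 3))
```

Here p is 0.5, the expected heterozygosity is 0.5 and the observed heterozygosity is 0.4, so F is 0.2, indicating a deficit of heterozygotes relative to random mating.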

  9. Xia, X. 2013. Wobble hypothesis. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 7, pp. 347-349.
  10. Abstract: The original wobble hypothesis with its extended codon–anticodon base pairs played a crucial role in understanding the working of the messenger RNA (mRNA) translation machinery. Wobble pairing reduces the number of transfer RNAs (tRNAs) needed for mRNA translation, but also tends to reduce translation efficiency and accuracy. Many nucleotide modifications have been discovered that either increase or decrease the wobble versatility of nucleotides, leading to increased decoding capacity without serious reduction in translation efficiency and accuracy. Recent studies on tRNA have led to the expanded wobble hypothesis that extends the wobble hypothesis by invoking wobble pairing between the third anticodon site (NIII) and the first codon site (N1), conditional on a CII/G2 or GII/C2 with three hydrogen bonds. This hypothesis implies that the anticodon UCG would wobble pair with stop codon UGA through a wobble UIII/G1 pair, and should therefore be strongly selected against. The hypothesis explains not only the avoidance of tRNAArg/UCG in diverse evolutionary lineages, but in particular why tRNAArg/UCG should be avoided in most eubacterial species and ancestral mitochondrial lineages where UGA is used as a stop codon, and why it is present in derived mitochondrial lineages such as vertebrate mitochondrial genomes, where UGA is no longer used as a stop codon. Wobble pairing implies the theoretical possibility of adding new base pairs of novel nucleotides to protein-coding genes to increase the coding capacity.

  11. Xia, X. 2012. Rapid evolution in animal mitochondria. Pp. 73-82 In R S. Singh, J. Xu and R. J. Kulathinal (eds.) Rapidly Evolving Genes and Genetic Systems. Oxford University Press.

    Conclusions: Three factors may account for the rapid evolution, as well as the rate heterogeneity, among animal mtDNA lineages. First, animal mtDNAs, except for those in Porifera and Cnidaria, exhibit strong local and global strand bias and may share the error-prone strand-displacement replication documented in mammals. The strand bias, associated with genes switching from one strand to the other, contributes significantly to increased evolution rates. Poriferan and Cnidarian mtDNAs, similar to plant mtDNA, do not exhibit global strand bias, have local strand asymmetric patterns similar to that of eubacterial species with single-origin replication, and also have extremely slow rates of evolution comparable to those in plant mtDNA and the nuclear genome. Second, in contrast to plant mtDNA with a single standard genetic code, animal mtDNAs feature a variety of different genetic codes and much of coding sequence evolution may be attributed to changes in genetic codes. Third, changes in tRNA pool in animal mitochondria, mediated by the gain/loss of tRNA genes in mtDNA, can contribute significantly to codon replacements in mitochondrial genes. All these factors are expected to result in accelerated and episodic evolution. Recent progress in mtDNA research suggests that, while laboratory experiments remain important, many questions concerning mtDNA evolution can be addressed with the availability of genomic data and a comparative genomic approach.

  12. Xia, X. 2011. Comparative genomics. Pp. 567-600 in H. H-S Lu, B. Schölkopf, H. Zhao, eds. Handbook of Computational Statistics: Statistical Bioinformatics. Springer.

    Abstract: Comparative genomics was previously misguided by the naïve dogma that what is true in E. coli is also true in the elephant. With the rejection of such a dogma, comparative genomics has been positioned in proper evolutionary context. Here I numerically illustrate the application of phylogeny-based comparative methods in comparative genomics involving both continuous and discrete characters to solve problems from characterizing functional association of genes to detection of horizontal gene transfer and viral genome recombination, together with a detailed explanation and numerical illustration of statistical significance tests based on the false discovery rate (FDR). FDR methods are essential for multiple comparisons associated with almost any large-scale comparative genomic studies. I discuss the strength and weakness of the methods and provide some guidelines on their proper applications.
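The abstract does not name a specific FDR procedure. As one commonly used example, the Benjamini-Hochberg step-up procedure can be sketched in Python as follows. The choice of procedure, the function name and the p-values are assumptions made for illustration and are not taken from the chapter.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below that cutoff (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from five comparisons, for illustration only.
p_values = [0.01, 0.045, 0.025, 0.005, 0.5]
decisions = benjamini_hochberg(p_values, q=0.05)
print(decisions)
```

Because the rule is step-up, a p-value that fails its own threshold is still rejected when a larger p-value passes a later threshold. This is what distinguishes the procedure from testing each sorted p-value in isolation.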

  13. Xia, X. and Lemey, P. 2009. Assessing substitution saturation with DAMBE. Pp. 615-630 in Philippe Lemey, Marco Salemi and Anne-Mieke Vandamme, eds. The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. 2nd edition. Cambridge University Press
  14. Xia, X. 2007. Molecular phylogenetics: mathematical framework and unsolved problems. Pp. 169-189 in U. Bastolla, M. Porto, H. E. Roman and M. Vendruscolo, eds. Structural approaches to sequence evolution. Springer-Verlag.

    Abstract: Phylogenetic relationship is essential in dating evolutionary events, reconstructing ancestral genes, predicting sites that are important to natural selection and, ultimately, understanding genomic evolution. Three categories of phylogenetic methods are currently used: the distance-based, the maximum parsimony, and the maximum likelihood method. Here I present the mathematical framework of these methods and their rationales, provide computational details for each of them, illustrate analytically and numerically the potential biases inherent in these methods, and outline computational challenges and unresolved problems. This is followed by a brief discussion of the Bayesian approach that has recently been used in molecular phylogenetics.

  15. Xia, X. and Kumar, S. 2006. Codon-based detection of positive selection can be biased by heterogeneous distribution of polar amino acids along protein sequences. In: Markstein P, Xu Y (eds) COMPUTATIONAL SYSTEMS BIOINFORMATICS: Proceedings of the Conference CSB 2006. Imperial College Press, pp. 335-340.

    Abstract: The ratio of the number of nonsynonymous substitutions per site (Ka) over the number of synonymous substitutions per site (Ks) has often been used to detect positive selection. Investigators now commonly generate Ka/Ks ratio profiles in a sliding window to look for peaks and valleys in order to identify regions under positive selection. Here we show that the interpretation of peaks in the Ka/Ks profile as evidence for positive selection can be misleading. Genic regions with Ka/Ks > 1 in the MRG gene family, previously claimed to be under positive selection, are associated with a high frequency of polar amino acids with a high mutability. This association between an increased Ka and a high proportion of polar amino acids appears general and not limited to the MRG gene family or the sliding-window approach. For example, the sites detected to be under positive selection in the HIV1 protein-coding genes with a high posterior probability turn out to be mostly occupied by polar amino acids. These findings caution against invoking positive selection from Ka/Ks ratios and highlight the need for considering biochemical properties of the protein domains showing high Ka/Ks ratios. In short, a high Ka/Ks ratio may arise from the intrinsic properties of amino acids instead of from extrinsic positive selection.

  16. Xia, X. 2005. Content sensors based on codon structure and DNA methylation for gene finding in vertebrate genomes. Pp. 21-29 in N. Kolchanov and R. Hofestadt (eds) Bioinformatics of Genome Regulation And Structure II. Springer Science+Business Media, Inc.
  17. Xia, X., Li, C., Yang, Q. 2003. Routine analysis of molecular data with software DAMBE. Pp. 149-167 in Yang, Q. ed. Fundamental concepts and methodology in molecular palaeontology. Science Publishers, China.
  18. Xia, X. and Z. Xie. 2003. Data exploration and tetrapod phylogeny. Pp. 329-347 in Marco Salemi and Anne-Mieke Vandamme, eds. The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. Cambridge University Press June, 2003.
  19. Xia, X. 1999. Estimating the frequency of litters with multiple paternity by using molecular data. in The application, methods and theories in molecular ecology, eds. Zhu, Y. G, Sun, M, Le, K. CHEP-Springer. Pp. 136-151.

    PhD thesis

    Xia, X. 1990. Mating system of natural populations of the white-footed mouse, Peromyscus leucopus. University of Western Ontario.

© 2016. XiaLab. All Rights Reserved.