XiaLab at University of Ottawa

My book chapters

Xia, X. 2022. Improved Method for Rooting and Tip-Dating a Viral Phylogeny. pp 397–410 In: Lu, H.HS., Schölkopf, B., Wells, M.T., Zhao, H. (eds) Handbook of Statistical Bioinformatics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-65902-1_19

Abstract: Each viral outbreak caused by a zoonotic transmission is associated with two urgent "When" and "Where" questions. The "When" question addresses the time of the zoonotic event, and the "Where" question addresses the geographic location of the zoonotic event. These two questions become difficult when there is no good outgroup for rooting the viral tree. Viral outbreaks and the subsequent intensive sequencing of viral genomes typically lead to many nearly identical viral strains isolated from human patients with no closely related viruses of animal origin to serve as an outgroup to root the tree. For example, the SARS-CoV-2 genomes are closely related to each other, with an average distance of ~0.0002, whereas the closest related virus derived from animals (RaTG13 from bat) has a sequence divergence of about 0.04. Including such a distant relative in the tree with SARS-CoV-2 will essentially shrink the SARS-CoV-2 genomes into a dot so that the tree would be roughly equivalent to a single branch with RaTG13 at one end and all SARS-CoV-2 genomes at the other end. Based on the assumption of a constant molecular clock, a least-squares method for rooting a viral phylogeny without an outgroup has previously been developed and applied to address the "When" and "Where" questions. However, the assumption of a constant evolutionary rate is often violated during viral evolution, especially when the viral population size increases with initial spread but decreases dramatically with various isolation and mass vaccination measures. I present an extended method by modeling the evolutionary rate as a linear function of time instead of a constant. This substantially improves the accuracy of dating the common ancestor of sampled SARS-CoV-2 genomes. Based on two large viral trees, one with 83,688 leaves and the other with 455,251 leaves, the common ancestor was dated May 27, 2019, and June 2, 2019, respectively.

Xia, X. 2019. Bioinformatic Approaches for Repurposing and Repositioning Antibiotics, Antiprotozoals, and Antivirals. pp. 679-700 in Kunal Roy, ed. In Silico Drug Design: Repurposing Techniques and Methodologies. Elsevier.

Abstract: Drug development is a time-consuming and expensive process. This problem is particularly pronounced in vaccine and antibiotics development because frequent occurrence of drug resistance often renders a costly developed drug useless. Drug repurposing and repositioning (DRR) offer a cost-effective alternative in drug development because 1) pharmacodynamics (what a drug does to the body or to the pathogen in the body) and pharmacokinetics (what the body or pathogen in the body does to the drug) of the drug typically are already known, 2) the potential side effects have already been thoroughly tested for getting the drug through the regulatory authority, and 3) the problem of synthesis and mass-production of the drug has already been solved. Successful DRR depends on knowledge of relationships among drugs, drug targets, pathogens and hosts. If a drug is known to have a specific drug target present in a specific pathogen, then it can be repurposed to a new pathogen with the same drug target. Genomic and transcriptomic data analyses have contributed to the identification and validation of these relationships. This chapter reviews methods in bioinformatics, especially in transcriptomic data analysis, that are relevant to DRR. Deficiencies in bioinformatic tools used in RNA-Seq analysis are highlighted and possible solutions suggested.

Xia, X. 2014. Phylogenetic Bias in the Likelihood Method Caused by Missing Data Coupled with Among-Site Rate Variation: An Analytical Approach. pp. 12-23 in M. Basu, Y. Pan, J. Wang, eds. Bioinformatics Research and Applications. Springer.

Abstract: More and more researchers in phylogenetics are concatenating gene sequences to produce supermatrices in the hope that larger data sets will lead to better phylogenetic resolution. Almost all of these supermatrices contain a high proportion of missing data which could potentially cause phylogenetic bias. Previous studies aiming to identify the missing-data-mediated bias in the maximum likelihood method have noted a bias associated with among-site rate variation. However, this finding is based on sequence simulation and has been challenged by other simulation studies, with the controversy still unresolved. Here I illustrate analytically this bias caused by missing data coupled with among-site rate variation. This approach allows one to see how much the bias can contribute to likelihood differences among different topologies. The study highlights the point that, while supermatrices may lead to "robust" trees, such "robust" trees may be purchased with illegal phylogenetic currency.

Xia, X. and Q. Yang. 2013. Cenancestor. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 1, pp. 493-494.

Abstract: Cenancestor, the last universal common ancestor, is assumed to exist on the basis of extensive sharing of inferred homologous characters among representatives of living cellular organisms, such as the near universal genetic code, the concordance of phylogenetic trees from different genes, the sharing of fundamental biochemical processes, and the existence of numerous transitional fossils. It is a logical necessity if the cellular structure originated only once, given the cell theory stating that new cells are created by old cells dividing into two. One early concept of cenancestor is a genome that codes a minimal set of core genes essential for cellular life (the minimal genome) and from which all other genomes are derived. However, few genes are shared universally because a biological function can often be performed by unrelated genes. Even if such a set of core genes can be identified, the identification and dating of the cenancestor is difficult because of the lack of a universal global molecular clock and the rampant horizontal gene transfer. The scientific consensus is that the cenancestor is neither a single cell nor a single genome, but instead an entangled bank of heterogeneous genomes with relatively free flow of genetic information. Out of this entangled bank of frolicking genomes arose probably many evolutionary lineages with horizontal gene transfer gradually reduced and confined within individual lineages. Only three (Archaea, Eubacteria, and Eukarya) of these early lineages have representatives that have survived to this day.

Xia, X. 2013. Codominance. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 2, pp. 63-64.

Abstract: Codominance pertains to the genetic phenomenon in which gene products from the two alleles in a heterozygote are produced in roughly equal amounts, where gene products refer to either different transcripts from the two alleles, different proteins from cellular processing of the transcripts, or different metabolites specifically associated with the enzymatic activity of the allele-specific transcripts or proteins. The AB heterozygote at the classical blood type locus (the ABO locus) expresses both the A and B blood type antigens and has been considered a classical case of codominance. Another example of codominance is the beta-thalassemia minor involving a mutant hemoglobin β-chain. The heterozygote (β°β) exhibits codominance because both alleles produce roughly equal amounts of their respective proteins. Incomplete dominance pertains to the genetic phenomenon in which the distinct gene products from the two codominant alleles in a heterozygote blend to form a phenotype intermediate between those of the two homozygotes. A regression method for characterizing different degrees of dominance is numerically illustrated.

Xia, X. and C.R. Primmer. 2013. Genotypic Frequency. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 3, pp. 319-320.

Abstract: The genotypic frequency is the frequency of a particular genotype in a population. It is frequently used to estimate allele frequencies, the inbreeding coefficient F, and a variety of other genetic parameters including genetic distances for building phylogenetic trees and dating cladogenic events. Maximum likelihood and least-squares methods are often used for such estimations and for model testing, for example, between a model that assumes a hidden allele and a model that does not.
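
As a minimal illustration of how genotypic frequencies feed such estimates, the sketch below derives allele frequencies and the inbreeding coefficient F from genotype counts at a single biallelic locus. It assumes the textbook definitions (allele frequency by gene counting, F = 1 - H_obs/H_exp); the function name and the counts are hypothetical and are not taken from the encyclopedia entry.

```python
def allele_freqs_and_f(n_AA, n_Aa, n_aa):
    """Estimate allele frequencies (p, q) and the inbreeding coefficient F
    from genotype counts at one biallelic locus (hypothetical helper)."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    # Gene counting: each AA carries two A alleles, each Aa carries one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    h_obs = n_Aa / n          # observed heterozygote frequency
    h_exp = 2.0 * p * q       # heterozygote frequency expected under Hardy-Weinberg
    # F is undefined when the locus is monomorphic (h_exp == 0).
    f = 1.0 - h_obs / h_exp if h_exp > 0 else float("nan")
    return p, q, f


# Made-up counts: 50 AA, 20 Aa, 30 aa.
p, q, f = allele_freqs_and_f(50, 20, 30)
print(f"p = {p:.3f}, q = {q:.3f}, F = {f:.3f}")
```

A positive F indicates a deficit of heterozygotes relative to Hardy-Weinberg expectation; counts that match the expectation give F close to zero.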

Xia, X. 2013. Wobble hypothesis. In: S Maloy, K Hughes, editors. Brenner's Encyclopedia of Genetics, 2nd edition. Academic Press, San Diego. Volume 7, pp. 347-349.

Abstract: The original wobble hypothesis with its extended codon–anticodon base pairs played a crucial role in understanding the working of the messenger RNA (mRNA) translation machinery. Wobble pairing reduces the number of transfer RNAs (tRNAs) needed for mRNA translation, but also tends to reduce translation efficiency and accuracy. Many nucleotide modifications have been discovered that either increase or decrease the wobble versatility of nucleotides, leading to increased decoding capacity without serious reduction in translation efficiency and accuracy. Recent studies on tRNA have led to the expanded wobble hypothesis that extends the wobble hypothesis by invoking wobble pairing between the third anticodon site (N_III) and the first codon site (N₁), conditional on a C_II/G₂ or G_II/C₂ with three hydrogen bonds. This hypothesis implies that the anticodon UCG would wobble pair with stop codon UGA through a wobble U_III/G₁ pair, and should therefore be strongly selected against. The hypothesis explains not only the avoidance of tRNA^Arg/UCG in diverse evolutionary lineages, but in particular why tRNA^Arg/UCG should be avoided in most eubacterial species and ancestral mitochondrial lineages where UGA is used as a stop codon, and why it is present in derived mitochondrial lineages such as vertebrate mitochondrial genomes, where UGA is no longer used as a stop codon. Wobble pairing implies the theoretical possibility of adding new base pairs of novel nucleotides to protein-coding genes to increase the coding capacity.

Xia, X. 2012. Rapid evolution in animal mitochondria. Pp. 73-82 In R S. Singh, J. Xu and R. J. Kulathinal (eds.) Rapidly Evolving Genes and Genetic Systems. Oxford University Press.
Conclusions: Three factors may account for the rapid evolution, as well as the rate heterogeneity, among animal mtDNA lineages. First, animal mtDNAs, except for those in Porifera and Cnidaria, exhibit strong local and global strand bias and may share the error-prone strand-displacement replication documented in mammals. The strand bias, associated with genes switching from one strand to the other, contributes significantly to increased evolutionary rates. Poriferan and Cnidarian mtDNAs, similar to plant mtDNA, do not exhibit global strand bias, have local strand asymmetric patterns similar to that of eubacterial species with single-origin replication, and also have extremely slow rates of evolution comparable to those in plant mtDNA and the nuclear genome. Second, in contrast to plant mtDNA with a single standard genetic code, animal mtDNAs feature a variety of different genetic codes and much of coding sequence evolution may be attributed to changes in genetic codes. Third, changes in tRNA pool in animal mitochondria, mediated by the gain/loss of tRNA genes in mtDNA, can contribute significantly to codon replacements in mitochondrial genes. All these factors are expected to result in accelerated and episodic evolution. Recent progress in mtDNA research suggests that, while laboratory experiments remain important, many questions concerning mtDNA evolution can be addressed with the availability of genomic data and a comparative genomic approach.
Xia, X. 2011. Comparative genomics. Pp. 567-600 in H. H-S Lu, B. Schölkopf, H. Zhao, eds. Handbook of Computational Statistics: Statistical Bioinformatics. Springer.
Abstract: Comparative genomics was previously misguided by the naïve dogma that what is true in E. coli is also true in the elephant. With the rejection of such a dogma, comparative genomics has been positioned in proper evolutionary context. Here I numerically illustrate the application of phylogeny-based comparative methods in comparative genomics involving both continuous and discrete characters to solve problems from characterizing functional association of genes to detection of horizontal gene transfer and viral genome recombination, together with a detailed explanation and numerical illustration of statistical significance tests based on the false discovery rate (FDR). FDR methods are essential for multiple comparisons associated with almost any large-scale comparative genomic study. I discuss the strengths and weaknesses of the methods and provide some guidelines on their proper applications.
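
To make the idea of FDR-based multiple-comparison control concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not say which FDR procedure the chapter uses, so the choice of Benjamini-Hochberg, the function name, and the p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Made-up p-values from eight hypothetical tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_vals, alpha=0.05))
```

The procedure is a step-up test: a p-value that fails its own threshold is still rejected if a larger p-value passes at a higher rank.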
Xia, X. and Lemey, P. 2009. Assessing substitution saturation with DAMBE. Pp. 615-630 in Philippe Lemey, Marco Salemi and Anne-Mieke Vandamme, eds. The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. 2nd edition. Cambridge University Press.
Xia, X. 2007. Molecular phylogenetics: mathematical framework and unsolved problems. Pp. 169-189 in U. Bastolla, M. Porto, H. E. Roman and M. Vendruscolo, eds. Structural approaches to sequence evolution. Springer-Verlag.
Abstract: Phylogenetic relationship is essential in dating evolutionary events, reconstructing ancestral genes, predicting sites that are important to natural selection and, ultimately, understanding genomic evolution. Three categories of phylogenetic methods are currently used: the distance-based, the maximum parsimony, and the maximum likelihood method. Here I present the mathematical framework of these methods and their rationales, provide computational details for each of them, illustrate analytically and numerically the potential biases inherent in these methods, and outline computational challenges and unresolved problems. This is followed by a brief discussion of the Bayesian approach that has recently been used in molecular phylogenetics.
Xia, X. and Kumar, S. 2006. Codon-based detection of positive selection can be biased by heterogeneous distribution of polar amino acids along protein sequences. In: Markstein P, Xu Y (eds) COMPUTATIONAL SYSTEMS BIOINFORMATICS: Proceedings of the Conference CSB 2006. Imperial College Press, pp. 335-340.
Abstract: The ratio of the number of nonsynonymous substitutions per site (Ka) over the number of synonymous substitutions per site (Ks) has often been used to detect positive selection. Investigators now commonly generate Ka/Ks ratio profiles in a sliding window to look for peaks and valleys in order to identify regions under positive selection. Here we show that the interpretation of peaks in the Ka/Ks profile as evidence for positive selection can be misleading. Genic regions with Ka/Ks > 1 in the MRG gene family, previously claimed to be under positive selection, are associated with a high frequency of polar amino acids with a high mutability. This association between an increased Ka and a high proportion of polar amino acids appears general and not limited to the MRG gene family or the sliding-window approach. For example, the sites detected to be under positive selection in the HIV1 protein-coding genes with a high posterior probability turn out to be mostly occupied by polar amino acids. These findings caution against invoking positive selection from Ka/Ks ratios and highlight the need for considering biochemical properties of the protein domains showing high Ka/Ks ratios. In short, a high Ka/Ks ratio may arise from the intrinsic properties of amino acids instead of from extrinsic positive selection.
Xia, X. 2005. Content sensors based on codon structure and DNA methylation for gene finding in vertebrate genomes. Pp. 21-29 in N. Kolchanov and R. Hofestadt (eds) Bioinformatics of Genome Regulation And Structure II. Springer Science+Business Media, Inc.
Xia, X., Li, C., Yang, Q. 2003. Routine analysis of molecular data with software DAMBE. Pp. 149-167 in Yang, Q. ed. Fundamental concepts and methodology in molecular palaeontology. Science Publishers, China.
Xia, X. and Z. Xie. 2003. Data exploration and tetrapod phylogeny. Pp. 329-347 in Marco Salemi and Anne-Mieke Vandamme, eds. The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. Cambridge University Press, June 2003.
Xia, X. 1999. Estimating the frequency of litters with multiple paternity by using molecular data. in The application, methods and theories in molecular ecology, eds. Zhu, Y. G, Sun, M, Le, K. CHEP-Springer. Pp. 136-151.