XiaLab at University of Ottawa

Xia, X. 2018. Bioinformatics and the cell: modern computational approaches in genomics, proteomics and transcriptomics. Springer. 2nd Edition.

Institutions with SpringerLink subscription can download the book for free or get a hardcopy at a much reduced price. Let me know if you do not have access to SpringerLink.

Chapter 1. String Mathematics, BLAST, and FASTA.

What is an e-value for ungapped and gapped BLAST? What are the Karlin-Altschul parameters that affect e-value calculation? How nucleotide frequencies and match-mismatch matrices affect such parameters? What are the key algorithms for FASTA and BLAST? How do their differences affect sensitivity of sequence search? This chapter addresses these questions and illustrates applications of string matching in genomics, transcriptomics, and proteomics, as well as in drug discovery.

Chapter 2. Sequence Alignment.

Sequence alignment serves two purposes: global alignment is mainly for identifying site homology between sequences to facilitate the inference of ancestral-descendant relationships through molecular phylogenetics, and local alignment mainly for identifying sequence similarities that may be due to either inheritance or convergence. Dynamic programming algorithm for pairwise alignment and profile alignment is illustrated in detail, for both constant gap penalty and affine function gap penalty, followed by progressive multiple sequence alignment using a guide tree, and by how to align protein-coding nucleotide sequences against aligned amino acid sequences. Derivation of PAM and BLOSUM matrices were numerically illustrated in detail, and how the effect of purifying selection on substitution patterns is discussed. Also discussed are pros and cons of representing internal sequences by a profile or by a reconstructed sequence in multiple sequence alignment.

Chapter 3. Position weight matrix and Perceptron.

This chapter covers two frequently used algorithms for motif characterization and prediction. The first part is on position weight matrix (PWM) algorithm which takes in a set of aligned motif sequences (e.g., 5’ splice sites) and generates a list of PWM scores which may be taken as the signal strength for each input sequence. In this context, PWM is somewhat a sequence equivalent of principle component analysis in multivariate statistics. Also generated are a PWM for highlighting the non-randomness of motif patterns and the significance tests of the site-specific motif patterns. PWM also serves as an essential component in algorithms for de novo motif discovery, such as Gibbs sampler. How to specify background frequencies and pseudo-counts in computing PWM? What are their effects on the PWM outcome? How to control for Type I error rate involving multiple comparisons by using false discovery rate? All these topics are detailed in this chapter. The second part of the chapter covers perceptron which is the simplest artificial neural network with only a single neuron. Its function is equivalent to two-group discriminant function analysis in multivariate statistics, i.e., to identify nucleotide or amino acid sites that provide the greatest discriminant power between two sets of sequences. The algorithm of perceptron as well as its application and limitations are illustrated in detail.

Chapter 4. Gibbs sampler. Bioinformatics and the Cell, Springer, Cham: 99-111.

Gibbs sampler is for de novo motif discovery. Suppose we have a set of sequences each containing a regulatory motif located in different locations of the sequences, but we do not know what the motif looks like or where it is located within each sequence. Gibbs sampler will find such a motif if it is well represented in these sequences. If we have a set of yeast intron sequences each containing a branchpoint site (BPS) somewhere, but we do not know what BPS looks like or where it is located along the intron sequence, Gibbs sampler will find these BPSs. Another scenario involves the discovery of protein binding sites (e.g., transcription factor binding site) given a set of sequences from ChIP-Seq. Each of these sequences has a short sequence segment with affinity to a protein, but we do not know what the short sequence segment looks like or where it is located within the sequence. Gibbs sampler shines in discovering such protein-binding sites. This chapter breaks the black box of Gibbs sampler and numerically illustrates each of its computational steps, including the site sampler (which assumes that each input sequence harbors a signal motif) and motif sampler (which is used when some sequences may contain multiple signal motifs and some none).

Chapter 5. Transcriptomics and RNA-Seq Data Analysis.

High-throughput sequencing data (HTS) has been used in detecting not only differential gene expression, but also alternative splicing events and different transcription start and termination sites. It has also been used in ribosome profiling for characterizing translation efficiency and Hi-C method for constructing genome 3-D architecture. There are two main difficulties in analyzing HTS data: the large file size, often in terabytes, and the allocation of reads to paralogous genes which impacts the accuracy of computed RPKM values, especially for multicellular eukaryotes with many paralogues. This chapter provides a conceptual framework for analyzing HTS data and offers numerical illustrations of solutions to both problems mentioned above. It includes examples from real data on how to compare performance of different software packages.

Chapter 6. Self-Organizing Map and Other Clustering Methods in Transcriptomics.

Self-organizing map (SOM) is an artificial neural network algorithm, having been used frequently with transcriptomic data analysis, in particular for clustering co-expressed genes as a basis to infer co-regulated genes. It can be applied to any set of objects as long as a distance function can be defined between objects. SOM is numerically illustrated together with a simple UPGMA method to contrast between the two. A less known application of SOM is in discovering heterogeneous motifs present in a set of sequences, making it more general than Gibbs sampler in de novo motif discovery. These two approaches, one with a (gene × expression) matrix as input and the other with a set of sequences as input (where each sequence may contain multiple but heterogeneous protein-binding sites), are illustrated.

Chapter 7. Hidden Markov Models and Protein Secondary Structure Prediction.

Hidden Markov model (HMM) is for inferring hidden states of a Markov model based on observed data. For example, intron and exon are hidden states and need to be inferred from the observed nucleotide sequences. Similarly, secondary structural elements such as alpha helices and beta sheets are hidden states and need to be inferred from observed amino acid sequences. The accuracy of HMM in inferring hidden states depends on the transition probability matrix and emission probability matrix derived from training HMM with representative observations. If different states have very different probability to transit into each other, and if the emission probability matrix of the hidden states are highly different from each other, then HMM can be quite accurate. This chapter details the key algorithms used in HMM, such as Viterbi algorithm for reconstructing the hidden states and the forward algorithm to compute the probability of the observed sequence of events. Both Viterbi and forward algorithms are dynamic programming algorithms that we were first exposed to in the chapter on sequence alignment. HMM is applied to reconstructing protein secondary structure as an illustrative example.

Chapter 8. Bioinformatics and Translation Initiation.

Translation consists of three subprocesses: initiation, elongation, and termination, all involving signal motifs and their decoders. Shine-Dalgarno (SD) sequences in prokaryotic mRNA are decoded by anti-SD (aSD) sequences at the 3’ end of the small subunit rRNA (3’TAIL). Sense codons are decoded by tRNAs and stop codons by release factors. The signal motifs are stronger in highly expressed genes than lowly expressed genes. This chapter focuses on translation initiation and presents several new bioinformatics methods for characterizing evolution and co-evolution of signal motifs and their decoders. These methods are amply illustrated with real prokaryotic and eukaryotic data. The main factors determining translation initiation efficiency is the strength of the signal motif (e.g., location and pairing strength of SD and the nature of start codon), abundance of their decoder (e.g., number of ribosomes), and secondary structure that may embed SD and start codon rendering them inaccessible by their decoders. Mutations that result in SD/aSD incompatibility between a bacteriophage and a subset of its hosts leads to a more restricted host range.

Chapter 9. Bioinformatics and Translation Elongation.

Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codon could be misread by release factors and stop codons misread by tRNAs, which also contribute to codon usage in rare cases. This chapter outlines the conceptual framework on codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index for codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with codon adaptation index (CAI) which does not consider background mutation bias. They are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.

Chapter 10. Bioinformatics and Translation Termination in Bacteria.

The codon-anticodon adaptation theory is applicable to the coevolution between stop codons and their decoders (release factors). In most prokaryotes, UAA and UAG are decoded by release factor 1 (RF1) and UAA and UGA by release factor 2 (RF2). Stop codon usage is strongly affected by both mutation bias and differential abundance of RF1 and RF2, and to a less extent by the abundance of tRNAs that may misread stop codons, e.g., misreading of UGA by tRNATrp that decodes UGG. An explanation is offered for the nearly universal rarity of UAG usage in bacterial species. Empirical evidence is also presented to demonstrate the strong effect of flanking nucleotides on stop codon signal strength.

Chapter 11. Genomic Features: Content Sensors, Nucleotide Skew Plot, Strand Asymmetry, and DNA Methylation.

This chapter introduces tools to characterize genomic features and illustrates how a phylogenetic perspective can fundamentally alter one’s conclusion on genomic evolution. The chapter starts by explaining content sensors (e.g., nucleotide, dinucleotide, triplet frequencies, etc.) in contrast to signal sensors (e.g., 5’ and 3’ splice sites, branchpoint sites, SD sequences, anti-SD sequences, Kozak consensus in mammalian mRNAs, sense and stop codons, etc.). Frequently used indices for characterizing genomic content sensors include various word skews, with the simplest being GC skew often used to identify the origin of DNA replication in prokayrotes. Single-origin replication in most bacterial genomes results in strong mutation bias between leading and lagging strands which can be graphically revealed by various skew plots. Confounding these strand biases is the genomic modification by DNA methylation which can have profound effect on genome evolution. Association between CpG deficiency and CpG-specific DNA methylation was challenged previously with two mycoplasma genomes but is restored by a phylogeny-based reanalysis and re-interpretation of the genomic data.

Chapter 12. Nucleotide Substitution Models and Evolutionary Distances.

Genomes change over time, so are the interactions of genes and gene products that breathe life into a genome. To have the most advantageous view of genomes, genes and their interactions, we need to see things from the very beginning. Substitution models enable us to trace history back to the very early forms of life, by reconstructing the genomic "books" erased and obliterated by billions of years of mutations. This chapter focuses on nucleotide-based substitution models, presenting three different ways of deriving, for various substitution models, transition probability matrices that are needed to compute evolutionary distances and to compute likelihood of a tree. These three different ways should allow students of different mathematical background to understand substitution models and their uses in phylogenetics. Almost all frequently used substitution models are nested models with the simple one being a special case of the more general one. The likelihood ratio test, as well as the information-theoretic indices as an alternative approach to model selection, is numerically illustrated in choosing the substitution model that best describes the aligned sequences.

Chapter 13. Protein Substitution Model and Evolutionary Distance.

In addition to nucleotide-based substitution models, there are also models based on amino acid and codon sequences. Observed substitutions between two sense codons depend on codon frequencies, the difference between the two encoded amino acids, and the number of nucleotide site differences between the two codons (which could differ at 1, 2, or all 3 sites). Similarly, observed substitutions between two amino acids depend on amino acid frequencies and amino acid dissimilarities. This chapter focuses on amino acid substitution models with their parameters derived from empirical substitution matrices. How is an empirical substitution matrix compiled? How to derive transition probability and rate matrices from an empirical matrix? How to derive evolutionary distances from these matrices? Under what circumstances one may fail to obtain an evolutionary distance? These questions are addressed in detail with numerical illustrations.

Chapter 14. Maximum Parsimony Method in Phylogenetics.

A phylogenetic tree offers a simple way of visualizing evolutionary history. Out of many possible trees one can draw to link different species, we must have a criterion to choose the best tree, just as we need a criterion to choose the best alignment (the highest alignment score given the scoring scheme) or the best substitution model (the smallest information-theoretic indices). Maximum parsimony (MP) is one of the criteria for choosing the best phylogenetic tree, i.e., the tree that requires the smallest number of changes to account for the sequence variation is the best tree. This chapter details two frequently used algorithms used in the MP framework to obtain the minimum number of changes: the Fitch algorithm which assumes all nucleotides or amino acids replacing each other with equal frequency, and the Sankoff algorithm which uses the cost matrix to accommodate differential substitution rate among different nucleotides or different amino acids. Both belong to the dynamic programming algorithm we were first exposed to in the chapter on sequence alignment. The long-branch attraction problem, which is associated not only with MP method but also with other methods that do not correct, or insufficiently correct, for multiple substitutions, is numerically illustrated in detail. Two different ways of reporting statistical support for a phylogenetic tree are presented: the resampling method (bootstrapping and jackknifing) and the statistical test of alternative topologies. Various ways of improving the power of the statistical tests are included and discussed.

Chapter 15. Distance-Based Phylogenetic Methods.

The distance-based phylogenetic method is fast and remains the most popular one in molecular phylogenetics, especially in the big-data age when researchers often build phylogenetic trees with hundreds or even thousands of leaves. A distance-based method has two components: the evolutionary distance matrix typically derived from a substitution model, and the tree-building algorithm that constructs a tree from the distance matrix. Evolutionary distances differ not only by substitution models, but also by whether they are estimated independently (based on only information between two aligned sequences) or simultaneously (based on information from all pairwise comparisons). Simultaneous estimated distances perform much better in molecular phylogenetics than independently estimated distances. Frequently used tree-building algorithms include neighbor-joining which clusters two species by weighing two measures: how close they are to each other (which is used in a simpler method known as UPGMA) and how far they jointly are from all other species. An alternative method is based on the least-squares framework. Alternative trees first have their branches evaluated by the least-squares method, and then the best tree is chosen by either the least-square criterion or the minimum evolution criterion. Distance-based dating methods are also illustrated numerically, with both fossil-calibrated dating and tip-dating, together with the estimation of variation in the inferred dates by resampling.

Chapter 16. Maximum Likelihood in Molecular Phylogenetics.

Maximum likelihood (ML) methods remain the gold standard in molecular phylogenetics. The calculation of likelihood, given a topology and a substitution model, is illustrated with both a brute-force approach and the pruning algorithm which is the most fundamental algorithm in likelihood calculation. The pruning algorithm is also a dynamic programming algorithm. The likelihood calculation is separately presented without and with a molecular clock. While ML is the most robust of all methods in molecular phylogenetics, it may suffer from bias when handling missing data coupled with rate heterogeneity over sites.

Chapter 17. Protein Isoelectric Point and Helicobacter pylori.

The last three chapters are on proteins which are the workhorses in the cell. The most fundamental property of a protein is its isoelectric point (pI) which determines what cellular components a protein can interact electrostatically (e.g., a positively charged substrate would demand a negatively charged enzyme). This chapter first explains the iterative approach to compute the theoretical protein pI and then apply it to a study of genomic adaptation of Helicobacter pylori, a bacterial gastric pathogen. Two research questions are addressed. How did H. pylori acquire its proteins with extraordinarily high pI? Are these proteins with high pI contribute to acid resistance of the pathogen? A detailed genomic analysis illustrates the hypothesis-testing approach in science.

Chapter 18. Bioinformatics and In Silico 2D Gel Electrophoresis.

Proteins can be separated in a 2-D gel based on protein isoelectric point (pI) and molecular weight (MW), and the more abundant proteins will manifest themselves with a larger and darker dots in the gel than less abundant proteins. Because protein pI and MW can be easily calculated, and protein abundance can be approximated by predicted translation efficiency, we can do in silico 2-D gel and compare the separation pattern against that in the empirical 2-D gel. Differences between the two suggest post-translational modifications. The approach of in silico 2-D gel is detailed in this chapter.

Chapter 19. Fundamentals of Proteomics.

Basic proteomic data acquisition involved protein separation, digestion and peptide mass fingerprinting to identify the proteins. This chapter outlines the main types of protein mass spectrometry, and charge deconvolution for computing the molecular mass from the distribution of multiple charged ions of the molecule of interest. Once we obtain a list of peptides with computed peptide mass, we can search them against a set of in silico digested peptides for matches. Statistical methods on how to identify proteins from such matches are presented.