Showing posts with the label Bioinformatics

Performing Local BLAST Search using BioEdit

BioEdit is a free biological sequence alignment editor. It has an intuitive multiple-document interface and offers convenient features that make aligning and manipulating sequences relatively easy on Windows desktop computers. It also provides several sequence manipulation and analysis options, plus links to external analysis programs, so users can view and manipulate sequences with simple point-and-click operations. BioEdit provides automated local and web BLAST (Basic Local Alignment Search Tool) searches through a simple graphical user interface (GUI) to the command-line BLAST program. This is a simple video tutorial on how to construct a custom nucleotide/protein database for NCBI BLAST using BioEdit. The tutorial also covers how to perform a query search, specifically pairwise sequence alignment, against the database.

Interactive Phylogenetic Tree Visualization using iTOL

A phylogenetic tree (a.k.a. cladogram or dendrogram) is a diagrammatic representation of the genetic/evolutionary relationships among species, organisms, or genes. A phylogenetic tree helps to identify common ancestors. The Interactive Tree Of Life (iTOL) is an online tool to display and manipulate phylogenetic trees. iTOL offers free access (limited access), a standard subscription (unlimited access), and an iTOL annotation editor subscription (unlimited and versatile access). It supports interactive, customizable tree layouts, manual drawing, and annotation. iTOL can visualize trees with 50,000 or more nodes. This is a simple video tutorial for interactive visualization of phylogenetic trees using iTOL. iTOL displays phylogenetic trees in various layouts, such as rectangular, slanted, curved, and radial. It accepts input in Newick, Nexus, or PhyloXML file format.

Converting image to 3D molecule using CACTUS OSRA

OSRA (Optical Structure Recognition Application) is a free and open-source optical graph recognition program. The stand-alone version of OSRA is a command-line program. OSRA converts graphical representations of chemical structures in images, as they appear in journal articles, patent documents, textbooks, trade magazines, etc., to SMILES (Simplified Molecular Input Line Entry Specification) format. In addition, the online OSRA tool converts the SMILES string to a 3D molecule in SDF (Structure-Data File) format. OSRA can read documents in over 90 graphical formats that are parseable by the ImageMagick software. The standard file formats include BMP, GIF, ICO, JPEG, PNG, TIFF, WMF, PDF, PS, etc. The CACTUS OSRA tool parses the graphical input to a SMILES string and converts it to a 3D molecule. The video tutorial below demonstrates the conversion of an image in GIF format to a 3D structure. Note that any software designed for optical recognition is unlikely to b...

Converting image to 3D molecule using VEGA ZZ OSRA

VEGA ZZ is a free (for non-profit academic use) molecular modelling suite. It includes many third-party packages that are accessed through the VEGA ZZ interface. OSRA (Optical Structure Recognition Application) is a free and open-source optical graph recognition program. The stand-alone version of OSRA is a command-line program. OSRA converts graphical representations of chemical structures in images to SMILES (Simplified Molecular Input Line Entry Specification) format. In addition, the online OSRA tool converts the SMILES string to a 3D molecule in SDF (Structure-Data File) format. OSRA can read documents in over 90 graphical formats that are parseable by the ImageMagick software. The standard file formats include BMP, GIF, ICO, JPEG, PNG, TIFF, WMF, PDF, PS, etc. VEGA ZZ includes an OSRA plug-in that acts as an interface to OSRA. Moreover, it supports imaging devices such as cameras and image scanners for acquiring documents through the TWAIN interface. The VEG...

RNA Secondary Structure Prediction using Nussinov Algorithm

The Nussinov algorithm is an RNA secondary structure (folding) prediction method based on dynamic programming. Ruth Nussinov introduced the algorithm in 1978. It involves computing a two-dimensional (2D) matrix with the same sequence along both dimensions. Scores are assigned according to whether two characters are complementary (1) or non-complementary (0). Solving the matrix consists of three stages: (i) initialization, (ii) matrix filling, and (iii) trace-back of arrows to recover the structure. \( \text{RNA sequence, } S = a_1 a_2 a_3 \ldots a_{l-1} a_l \), where \(a\) denotes a character (A, U, C, G) and \(l\) is the length of the sequence. In this tutorial, I have taken the sample RNA sequence (S) GGGAAAUCC for prediction. Initialization: The initialization step presets the diagonal cells to zero (0) before matrix filling. \(...
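
The three stages named in this excerpt (initialization, matrix filling, trace-back) can be sketched in Python as follows. This is a minimal illustration, not the tutorial's own code: it assumes only Watson-Crick pairs (A-U, G-C) score 1, imposes no minimum loop length, and returns just one of possibly several optimal structures. The function name `nussinov` is hypothetical.

```python
# Assumption: only Watson-Crick pairs count as complementary (score 1).
WATSON_CRICK = frozenset({("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")})


def nussinov(seq, pairs=WATSON_CRICK):
    """Return (max number of base pairs, one optimal dot-bracket string)."""
    n = len(seq)
    if n == 0:
        return 0, ""

    # Initialization: every cell, including the diagonal, starts at zero.
    score = [[0] * n for _ in range(n)]

    # Matrix filling: process sub-sequences i..j in order of increasing span.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])  # i or j unpaired
            if (seq[i], seq[j]) in pairs:                 # i pairs with j
                best = max(best, score[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # bifurcation
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best

    # Trace-back: follow one path of choices that reproduces the best score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if score[i][j] == score[i + 1][j]:
            stack.append((i + 1, j))
        elif score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in pairs and score[i][j] == score[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if score[i][j] == score[i][k] + score[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return score[0][n - 1], "".join(structure)


best, structure = nussinov("GGGAAAUCC")
print(best, structure)
```

For the sample sequence GGGAAAUCC the maximum is three base pairs under these assumptions; which three pairs are reported depends on the order in which the trace-back tests its options.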

Best Fonts to Display Biological Sequence and Alignment

Biological sequence alignment is a method to find similarities between two sequences. The characters (letters of the biological sequence alphabet) are arranged row-wise and column-wise, and symbols mark the matches/identities and mismatches between them. Row-wise, a gap character (hyphen ‘-’) is inserted to align the sequence characters for matching without changing their order. Column-wise, the alignment is annotated with symbols for a match (the same character, pipe ‘|’, colon ‘:’, asterisk ‘*’, or dot ‘.’), a mismatch (any one of the sequence characters, or a blank space ‘ ’), a positive (plus sign ‘+’), and a gap/indel (blank space ‘ ’). A sequence or sequence alignment must be properly formatted in a text editor/word processor using fixed-width fonts to find the differences of characters in the s...
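
To make the column-wise symbols concrete, here is a small Python sketch that builds the middle annotation line for two already-aligned rows. It assumes one specific convention out of those listed above (pipe for identity, blank for mismatch or gap); the function names are hypothetical.

```python
def match_line(top, bottom, gap="-"):
    """Build the annotation row: '|' for identical residues, ' ' otherwise."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


def format_alignment(top, bottom):
    """Stack the two rows with the annotation row between them."""
    return "\n".join([top, match_line(top, bottom), bottom])


print(format_alignment("GATTACA", "GA-TACA"))
```

The three rows only line up column by column when the output is shown in a fixed-width font, which is the point the excerpt makes.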

Creating Custom Database using Standalone NCBI BLAST+

Basic Local Alignment Search Tool (BLAST) is a collection of programs, written in C++ and based on a heuristic algorithm, for comparing DNA, RNA, and protein sequences. The standalone command-line interface (CLI) version of BLAST is named BLAST+. The latest version of NCBI BLAST+ can be downloaded from the NCBI FTP server ( ftp://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST ). This is a simple tutorial for creating a custom database, accessing the database, and performing a sequence search using BLAST+.

1. Creating a Custom Database

A nucleotide ( nucl ) or protein ( prot ) database can be created using the -dbtype parameter of the makeblastdb program. We can create two types of database using the command lines below.

Non-indexed Database:

    ./makeblastdb -in DBX.fasta -out DBX -dbtype prot

    Building a new DB, current time: 12/04/2020 10:10:06
    New DB name: C:\NCBI\blast-2.6.0+\bin\DBX
    New DB title: DBX.fasta
    Sequence type: Protein
    Keep MBits: T
    Maximum file size: 1000000000B
    Adding sequences ...
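
The non-indexed makeblastdb call shown in this excerpt could also be driven from a script. The Python sketch below only assembles the same three flags that appear above ( -in , -out , -dbtype ); the helper names `makeblastdb_command` and `build_database` are hypothetical, and it assumes the makeblastdb executable is on the PATH.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="prot",
                        executable="makeblastdb"):
    """Assemble the argument list for a non-indexed makeblastdb run."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return [executable, "-in", fasta_path, "-out", db_name, "-dbtype", dbtype]


def build_database(fasta_path, db_name, dbtype="prot"):
    """Run makeblastdb and return its console output (needs BLAST+ installed)."""
    cmd = makeblastdb_command(fasta_path, db_name, dbtype)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


print(" ".join(makeblastdb_command("DBX.fasta", "DBX", "prot")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.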

Prediction of 3D Structure (Folding) of a RNA Sequence

Ribonucleic acid (RNA) is a linear single-stranded molecule that takes part in translation to protein. Intramolecular interactions (a.k.a. folding) between RNA bases (A=U and C≡G pairs) form a secondary structure. The Nussinov and Zuker algorithms are dynamic programming approaches for predicting the secondary structure of RNA. The dot-bracket structure with the minimum free energy is the stable secondary structure. This tutorial demonstrates prediction of the three-dimensional (3D) structure of RNA using the mFold and RNAComposer tools. mFold predicts the RNA folding result in dot-bracket notation from the RNA sequence, while RNAComposer predicts the 3D structure from the dot-bracket notation. The resources used in this tutorial are NCBI GenBank, mFold, and RNAComposer. Note: The length of the RNA sequence input in mFold is limited to 4000 bases, and the sequence/dot-bracket notation input in RNAComposer is limited to 500 bases, due to the c...
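
Dot-bracket notation, the intermediate format between the two tools, encodes each base pair as a matching "(" and ")" and each unpaired base as ".". The Python sketch below recovers the paired positions from such a string; it assumes plain parentheses only (no pseudoknot brackets), and the function name and example string are hypothetical.

```python
def dot_bracket_pairs(structure):
    """Return the sorted (i, j) index pairs encoded by a dot-bracket string."""
    open_positions = []  # stack of indices of unmatched '('
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


print(dot_bracket_pairs("((..))"))
```

A check like this is also a cheap way to validate a dot-bracket string before submitting it to a structure-building tool.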

Constructing Entropy Plot from Multiple Sequence Alignment

In sequence analysis, entropy is a measure of the variation of characters in each column of a set of aligned sequences. An entropy plot can be computed from a multiple sequence alignment using different entropy formulas, namely Shannon's Entropy, Schneider's Entropy, Shenkin's Entropy, Gerstein's Entropy, and Gap normalized Entropy. Constructing an entropy plot consists of two phases: (i) performing multiple sequence alignment and deriving the consensus, and (ii) calculating the entropy number for each column through the consensus of the multiple sequence alignment. The entropy plot is generated by drawing vertical lines, with the consensus sequence positions on the x-axis and the entropy numbers on the y-axis. This simple video tutorial demonstrates how to construct an entropy plot from a multiple sequence alignment. The tools used in this tutorial are ClustalW and Entropy Plotter. Note: We can choose any multiple sequence alignment tool, but the alignment output must...
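
As an illustration of phase (ii), the Python sketch below computes Shannon's entropy, \(H = -\sum_i p_i \log_2 p_i\), for every column of an alignment. It covers only the Shannon formula from the list above, and ignoring gap characters within a column is an assumption made here, not necessarily what Entropy Plotter does. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column, gaps ignored."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # p * log2(1/p) summed over residues equals -sum(p * log2(p)).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropy(aligned_sequences):
    """Entropy value for each column of equally long aligned sequences."""
    if len({len(s) for s in aligned_sequences}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*aligned_sequences)]


toy_alignment = ["ACGT", "ACGA", "ACTC", "AGTG"]
for position, value in enumerate(alignment_entropy(toy_alignment), start=1):
    print(position, round(value, 3))
```

A fully conserved column scores 0 bits, and a column in which four different residues each occur once scores 2 bits; these per-column values are what the y-axis of the plot shows.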

Biological Sequence Pattern Matching using Perl

This article is a simple Perl programming tutorial for matching patterns in biological sequences using regular expressions. In this tutorial, I have used ActiveState Perl 5.24.3 to run the Perl script. Pattern Matching: In bioinformatics, string matching or pattern matching is a fundamental and popular method used in a wide range of applications, from sequence alignment to functional prediction. Pattern matching is classified into exact pattern matching and approximate pattern matching. Exact pattern matching does not allow any insertion, deletion, or substitution of characters while matching against the target sequence, whereas approximate pattern matching allows them within certain limits. In computational biology, a pattern is an expression written as a sequence of characters with a defined set of symbolic representations. Example: N{P}-[ST]{P}A(2,3). Source Code: system('cls'); print "\n+-----------------------------------+...
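
Exact pattern matching with a regular expression can be sketched as follows. The example is in Python rather than Perl, and it does not implement the pattern quoted in the excerpt; it uses the related, widely cited PROSITE-style N-glycosylation motif N-{P}-[ST]-{P} (N, any residue except P, S or T, any residue except P). The function name and test sequence are hypothetical.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression. The lookahead (?=...)
# lets overlapping occurrences be reported as separate matches.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (0-based start, matched text) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


print(find_motif("ANGTAANPSA"))
```

In the sample sequence the second N is followed by P, so only the first site is reported. Without the lookahead, a regex engine would skip occurrences that overlap an earlier match.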

Computing Amino Acid Composition using C++

This article explains a simple method to compute the composition of amino acids in a protein sequence using C++. In this tutorial, I have used Dev C++ v5.11 to compile the C++ program. Length of the Protein Sequence: The length of the protein sequence is the count (C) of the total number of amino acid characters in the protein sequence. Let the protein sequence be \(S = S_1 S_2 S_3 \ldots S_{l-1} S_l\), where \(S_i \in \{\text{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}\}\). Then \(l\) is the length of the protein sequence \(S\). Amino Acid Composition of the Protein Sequence: The amino acid composition is the count (C) of each amino acid in the protein sequence. The counts of the individual amino acids are \(C_A, C_C, C_D, C_E, C_F, C_G, C_H, C_I, C_K, C_L, C_M, C_N, C_P, C_Q, C_R, C_S, C_T, C_V, C_W,\) and \(C_Y\). Source Code: // Computing Composition of Amino Acids in the Protein Sequence #include <iostream> #include <iomanip> #include <stri...
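
The definitions above translate directly into a few lines of code. The sketch below is in Python, not the article's C++, and is only an illustration: it counts each of the 20 standard amino acids and rejects any other character, which is an assumption about how invalid input should be handled. The function name and sample sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet listed above


def amino_acid_composition(sequence):
    """Return {amino acid: count} for all 20 standard amino acids."""
    counts = Counter(sequence.upper())
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard characters: {sorted(unknown)}")
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}


composition = amino_acid_composition("MKKLA")
length = sum(composition.values())  # l, the length of the sequence
print(length, {aa: c for aa, c in composition.items() if c})
```

Summing the individual counts gives back the sequence length \(l\), which is a convenient sanity check on the result.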

Frequency Plot of Protein Sequence using PHP and R

A frequency plot is a graphical data analysis technique for summarizing the distributional information of a variable. The response variable is divided into equal-sized intervals (or bins), and the number of occurrences of the response variable is calculated for each bin. In this tutorial, the number of occurrences of each amino acid in the protein sequence (response variable) is calculated and sorted in ascending order. The frequency plot then consists of: Vertical Axis = Amino acids; Horizontal Axis = Frequencies of the amino acids. There are 4 types of frequency plots: frequency plot (absolute counts); relative frequency plot (counts converted to proportions); cumulative frequency plot; cumulative relative frequency plot. The frequency plot and the histogram carry the same information, except that the frequency plot has lines connecting the frequency values, whereas the histogram has bars at the frequency values. Frequency plot using PHP and R: In this tutorial, the programming langu...
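
The four kinds of frequency values listed above differ only in how the counts are post-processed. The Python sketch below (not the tutorial's PHP and R code) computes all four for a protein sequence, with residues sorted by ascending count as the excerpt describes; breaking ties alphabetically is an assumption. The function name and sample sequence are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def frequency_tables(sequence):
    """Absolute, relative, cumulative and cumulative relative frequencies."""
    # Sort by ascending count, then alphabetically to break ties.
    ordered = sorted(Counter(sequence.upper()).items(),
                     key=lambda item: (item[1], item[0]))
    labels = [residue for residue, _ in ordered]
    absolute = [count for _, count in ordered]
    total = sum(absolute)
    cumulative = list(accumulate(absolute))
    return {
        "labels": labels,
        "absolute": absolute,
        "relative": [count / total for count in absolute],
        "cumulative": cumulative,
        "cumulative_relative": [value / total for value in cumulative],
    }


tables = frequency_tables("AAKKKG")
print(tables["labels"], tables["absolute"], tables["cumulative"])
```

The last cumulative value always equals the sequence length, and the last cumulative relative value equals 1.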

Local NCBI BLAST+ in WebServer - Easy Steps

This is a simple tutorial that explains how to design your own web interface for NCBI BLAST+ to perform local and online database searches using PHP on a web server. The PHP library locBLAST executes the NCBI BLAST+ programs using the exec() function, passing parameters from the HTML form fields. In locBLAST, two list boxes are used to select the program and database, and a text area and file upload are used to input the query sequence in FASTA format. A FASTA file validation function is included to validate the query sequence before executing the BLAST programs. The locBLAST PHP library and test database files are freely available on GitHub. Requirements for locBLAST Setup: In this tutorial, I have given a brief explanation of embedding the latest NCBI BLAST+ (the latest version of NCBI BLAST+ as of November 17, 2020 is 2.11.0) in any PHP-enabled web server. The latest version of NCBI BLAST+ (standalone command-line BLAST programs) can be downloaded from the FTP s...
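
When form fields are passed on to a command-line program, it is worth checking them against fixed lists before anything is executed. The Python sketch below illustrates that idea only; it is not locBLAST's PHP code. The program names and the -query and -db flags are standard BLAST+ ones, while `blast_command` and the database names are hypothetical.

```python
# The five BLAST+ search programs a form could reasonably offer.
ALLOWED_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def blast_command(program, database, query_path, allowed_databases):
    """Build a BLAST+ argument list from form values after whitelisting."""
    if program not in ALLOWED_PROGRAMS:
        raise ValueError(f"program not allowed: {program!r}")
    if database not in allowed_databases:
        raise ValueError(f"database not allowed: {database!r}")
    # A list of arguments (not one shell string) keeps user-supplied
    # values from being interpreted by a shell.
    return [program, "-query", query_path, "-db", database]


print(blast_command("blastp", "DBX", "query.fasta", {"DBX"}))
```

The two list boxes described above map naturally onto the two whitelist checks, since a list box only ever needs to submit one of a known set of values.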

Extracting Multiple FASTA Sequences using PHP

This is a simple tutorial that explains how to safely extract multiple sequences from a FASTA file using a PHP script. I have used four functions to perform different tasks: read_fas_file() - to check for an invalid file, fas_check() - to check the FASTA file format, get_seq() - to retrieve sequence and sequence name pairs, and fas_get() - to extract and display multiple FASTA-formatted sequences. The full source code and the multiple protein sequences in FASTA file format used in this tutorial are given below.

Source Code

    function read_fas_file($x) { // Check for Empty File
        if (!file_exists($x)) {
            print "File Not Exist!!";
            exit();
        } else {
            $fh = fopen($x, 'r');
            if (filesize($x) == 0) {
                print "File is Empty!!";
                fclose($fh);
                exit();
            } else {
                $f = fread($fh, filesize($x));
                fclose($fh);
                return $f;
            }
        }
    }

    function fas_check($x) { // Check FASTA File Format
        $gt = substr($x, 0, 1);
        if ($gt != ">") {
            print "Not F...
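
For comparison, the format check and the extraction of name/sequence pairs can be sketched in Python as below. This is not a port of the PHP functions (their full source is truncated in this excerpt); it only illustrates the same two ideas under simple assumptions: the text must begin with ">", blank lines are skipped, and sequence lines are concatenated. The function name `parse_fasta` is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from multi-FASTA text."""
    stripped = text.lstrip()
    if not stripped.startswith(">"):
        raise ValueError("not FASTA: the first record must start with '>'")

    records = []
    name, chunks = None, []
    for line in stripped.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if name is not None:          # close the previous record
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)           # sequence may span several lines
    records.append((name, "".join(chunks)))
    return records


example = ">seq1 test protein\nMKT\nLLV\n\n>seq2\nGG\n"
for record_name, record_sequence in parse_fasta(example):
    print(record_name, record_sequence)
```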