Showing posts with the label Bioinformatics

2D & 3D Structure and Interaction of RNA-RNA Complex Prediction

The prediction of 3D RNA-RNA structure and interaction is a multi-step process involving loop-loop recognition, coaxial stacking, and metal-ion stabilization, resulting in a complex topological structure essential for biological function. Unlike DNA, which primarily exists as a double helix, RNA is single-stranded and can adopt diverse three-dimensional shapes, allowing it to perform catalytic, structural, and regulatory roles. Structural Hierarchy and Driving Forces RNA-RNA interactions are governed by a hierarchical folding process where secondary structure elements (helices, loops, and bulges) form first, followed by the arrangement of these elements into a specific tertiary structure. The primary physical forces driving these interactions include: Hydrogen Bonding: Beyond standard Watson-Crick base pairing (A-U, G-C), RNA frequently utilizes non-canonical interactions such as wobble base pairs (G-U) and Hoogsteen pairings. Base Stacking: The hydrophobic effec...

Compiling and Creating locBLAST Image using Docker

locBLAST is a PHP library that provides a graphical user interface (GUI) for the command-line NCBI BLAST+ programs. The official Docker image of locBLAST is available on Docker Hub. 💿 Using NCBI Docker Images for Web BLAST The most straightforward way to run a web-based BLAST service with Docker is to use a pre-built image provided by the NCBI: NCBI BLAST+ Command Line Tools: The NCBI provides official Docker images for the standalone command-line BLAST+ suite, which can be found on their GitHub page and Docker Hub. Latest version: The ncbi/blast:latest tag generally points to the most recent stable release. Specific versions: You can specify a particular version, e.g., ncbi/blast:2.14.1, to ensure reproducibility. 🛠️ Setting up locBLAST in a Docker Environment To use locBLAST, you would need to deploy it within a web server environment that also has access to the NCBI BLAST+ binaries. locBLAST requires a web serv...

3D Protein Structure Prediction Server AlphaFold 3

The AlphaFold Server is a free, web-based platform launched by Google DeepMind and Isomorphic Labs to provide the scientific community with access to AlphaFold 3. While the original AlphaFold was revolutionary for predicting protein structures, the AlphaFold 3 server expands this capability to virtually all “life’s molecules,” allowing researchers to model how proteins interact with DNA, RNA, ligands, and more in a single, unified system. 🧬 Key Capabilities Unlike its predecessors, the AlphaFold 3 server is a multimodal model. It doesn't just fold proteins; it predicts the joint 3D structure of complex molecular assemblies. Protein-Ligand Interactions: Accurately models how small molecules (drugs) bind to proteins, showing a 50% improvement over traditional docking methods. Nucleic Acids: Predicts the structures of DNA and RNA and how they interact with proteins (e.g., transcription factors or CRISPR complexes). Chemical Modifications: Pr...

Bioinformatics Protocol for NGS Data Analysis

A step-by-step bioinformatics protocol for Next-Generation Sequencing (NGS) data analysis consists of the following stages: data quality control using tools like FastQC to assess raw data; data preprocessing for adapter trimming and low-quality base removal with tools like Trimmomatic or fastp; read mapping to a reference genome using aligners such as BWA or Bowtie2; post-alignment processing, including duplicate removal with Picard and variant calling with GATK or Samtools; and downstream analysis and visualization for specific applications like differential gene expression or variant interpretation using tools like R packages or IGV. A more detailed breakdown of these stages is given below. 1. Data Quality Control (QC) Purpose: To check the quality of the raw sequencing reads and identify any potential issues. Tools: FastQC: A widely used tool to generate quality control reports for raw sequencing data. Output: A report summarizing metrics like Phred scores, adapter contamination, and sequence qu...

Molecular Dynamics Simulation of Micromolecules using Chimera

Performing a Molecular Dynamics (MD) simulation of a small molecule in UCSF Chimera involves a series of steps to prepare the molecule, set up the simulation environment, run the simulation, and finally, analyze the resulting trajectory. Here's a step-by-step guide for the same: 1. Loading and Preparing the Small Molecule Structure Open Chimera: Launch UCSF Chimera or ChimeraX. Load your molecule: Import your small molecule structure into Chimera using File > Open or File > Fetch by ID if the structure is available in a database like the Protein Data Bank (PDB). Add Hydrogens: Use the "Molecular Dynamics Simulation" tool's "Prep Structure" section to add hydrogens. You might also be able to use the addh command. Assign Force Field Parameters: Since you are working with a small molecule (a nonstandard residue), you will use Amber's Antechamber module, which is included in Chimera, to assign force field parameters. This involves ass...

Constructing Phylogenetic Tree using UPGMA Method

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a distance-based method for constructing phylogenetic trees. It works by iteratively merging the two closest clusters of sequences into a new cluster until all sequences are grouped into a single tree. The distance between two clusters is the average of all pairwise distances between a sequence in one cluster and a sequence in the other. UPGMA produces rooted trees, meaning each tree has a defined root representing the common ancestor. Here's a more detailed explanation: 1. Distance Matrix UPGMA begins with a distance matrix, which contains the pairwise distances between all sequences being compared. These distances can be based on sequence alignment, protein structure comparisons, or other relevant metrics. \[D_{i,j}=\max\begin{cases}D_{i-1,j-1} & + & s(a_i,b_j) \\D_{i-1,j} & + & s(a_i,-) \\D_{i,j-1} & + & s(-,b_j)\end{cases}=\max\begin{cases}D_{i-1,j-1}& + ...
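
The clustering loop described in this excerpt can be sketched in Python as follows. This is a minimal illustration, not the code from the full post: the function name `upgma`, the nested-tuple tree representation, and the toy distances in the usage note are all assumptions made for the example, and branch lengths are left out.

```python
from itertools import combinations


def upgma(names, distances):
    """Return a nested-tuple tree built by UPGMA (hypothetical helper).

    names: list of unique leaf labels.
    distances: dict mapping a (label, label) pair to its distance.
    """
    # Store distances under unordered keys so (a, b) and (b, a) agree.
    d = {frozenset(pair): value for pair, value in distances.items()}
    # Map each current cluster (leaf label or nested tuple) to its leaf count.
    sizes = {name: 1 for name in names}

    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # The size-weighted mean equals the plain average over all leaf pairs
        # drawn one from the merged cluster and one from cluster c.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b

    # The single remaining cluster is the rooted tree.
    return next(iter(sizes))
```

With the toy distances A-B = 2, A-C = 4, B-C = 8 and every distance to D = 10, the sketch first joins A and B, then C (at average distance 6), and finally D, giving `('D', ('C', ('A', 'B')))`.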

Predicting Functional Regions in the Protein Sequence using SMART

Prediction of functional regions in a protein sequence plays a crucial role in computer-aided drug discovery. SMART (Simple Modular Architecture Research Tool) helps identify and annotate protein domains and analyze domain architectures by BLAST search. In bioinformatics, domains refer to distinct functional, structural, or evolutionary units within proteins, DNA, or RNA. Here are some key types of domains in bioinformatics: 1. Protein Domains Structural Domains: Compact, independently folding units within a protein (e.g., SH3, zinc finger, immunoglobulin domains). Functional Domains: Regions responsible for specific biochemical activities (e.g., kinase domain, DNA-binding domain). Evolutionary Domains: Conserved regions indicating common ancestry (e.g., Pfam domains). 2. DNA/RNA Domains Regulatory Domains: DNA regions controlling gene expression (e.g., promoters, enhancers). Functional RNA Domains: Motifs in non-coding RNAs (e.g., ribozyme catalytic cor...

Binary Matrix to Cladogram Construction using NTSYSpc

NTSYSpc (Numerical Taxonomy and Multivariate Analysis System) is a simple and lightweight statistical software package used for cluster analysis of qualitative molecular genetic data. The broad features of NTSYSpc include similarity/dissimilarity, clustering, graph theoretic methods, ordination, interactive graphics, multivariate tests, geometric morphometrics, and comparison of matrices. This brief video tutorial illustrates how to use NTSYSpc to create a cladogram (a tree diagram that shows the cladistic relationship between several populations/species) from a binary matrix. The binary matrix is a table made up of 0s and 1s, in which 1 denotes the presence of a character (trait), 0 denotes its absence, and a blank cell likewise denotes the absence of the character.

Performing Local BLAST Search using BioEdit

BioEdit is a free biological sequence alignment editor. It has an intuitive multiple document interface and offers convenient features that make alignment and manipulation of sequences relatively easy on Windows desktop computers. Additionally, there are several sequence manipulation and analysis options and links to external analysis programs that facilitate a working environment, allowing users to view and manipulate sequences with simple point-and-click operations. The BioEdit software provides automated local and web BLAST (Basic Local Alignment Search Tool) searches with a simple graphical user interface (GUI) to the command-line BLAST program. This is a simple video tutorial on how to construct a custom nucleotide/protein database using BioEdit software to support NCBI BLAST. The tutorial also covers how to perform a query search, specifically pairwise sequence alignment, against the database.

Interactive Phylogenetic Tree Visualization using iTOL

A phylogenetic tree (a.k.a. cladogram or dendrogram) is a graphical representation of the genetic/evolutionary relationship of species/organisms/genes. A phylogenetic tree helps to find the common ancestor. The Interactive Tree Of Life (iTOL) is an online tool to display and manipulate phylogenetic trees. iTOL offers free access (limited access), standard subscription (unlimited access), and iTOL annotation editor subscription (unlimited and versatile access). It supports user-interactive customizable tree layouts, manual drawing, and annotation. iTOL can visualize trees with 50,000 or more nodes. This is a simple video tutorial for user-interactive visualization of phylogenetic trees using the iTOL tool. The iTOL tool produces phylogenetic trees in various formats, such as rectangular, slanting, curved, and radial. It accepts input in Newick, Nexus, or PhyloXML file format.

Constructing Phylogenetic Tree using MEGA Software

A phylogenetic tree (a.k.a. cladogram or dendrogram) is a graphical representation of the genetic/evolutionary relationship of species/organisms/genes. It helps to find the common ancestor. Construction of a phylogenetic tree consists of two phases: multiple sequence alignment and computation of the distance matrix. This is a simple video tutorial for constructing a phylogenetic tree using Molecular Evolutionary Genetics Analysis (MEGA) software. The MEGA software produces phylogenetic trees from multiple sequences in various formats: rectangular, slanting, curved, and radial.

Converting image to 3D molecule using CACTUS OSRA

OSRA (Optical Structure Recognition Application) is a free and open-source optical graph recognition program. The stand-alone version of OSRA is a command-line program. OSRA converts graphical representations of chemical structures in images, as they appear in journal articles, patent documents, textbooks, trade magazines etc., to SMILES (Simplified Molecular Input Line Entry System) format. In addition, the online OSRA tool converts the SMILES string to a 3D molecule in SDF (Structure-Data File) format. OSRA can recognize documents in over 90 graphical formats by parsing them through ImageMagick software. The standard file formats include BMP, GIF, ICO, JPEG, PNG, TIFF, WMF, PDF, PS, etc. The CACTUS OSRA tool parses the graphical input to a SMILES string and converts it to a 3D molecule. The video tutorial below demonstrates the conversion of a GIF image to a 3D structure. Note that any software designed for optical recognition ...

Converting image to 3D molecule using VEGA ZZ OSRA

VEGA ZZ is a free (for non-profit academic uses) molecular modelling suite. It consists of many third-party packages, which act as an interface to the VEGA ZZ software. OSRA (Optical Structure Recognition Application) is a free and open-source optical graph recognition program. The stand-alone version of OSRA is a command-line program. OSRA converts graphical representations of chemical structures in images to SMILES (Simplified Molecular Input Line Entry System) format. In addition, the online OSRA tool converts the SMILES string to a 3D molecule in SDF (Structure-Data File) format. OSRA can recognize documents in over 90 graphical formats by parsing them through ImageMagick software. The standard file formats include BMP, GIF, ICO, JPEG, PNG, TIFF, WMF, PDF, PS, etc. VEGA ZZ includes an OSRA plug-in that acts as an interface to it. Moreover, it supports imaging devices such as cameras and image scanners for acquiring documents through the TWAIN interface. The VEG...

RNA Secondary Structure Prediction using Nussinov Algorithm

The Nussinov algorithm is an RNA secondary structure (folding) prediction method that uses a dynamic programming approach. Ruth Nussinov introduced this algorithm in 1978. It involves computing a two-dimensional (2D) diagonal matrix with the same sequence along both dimensions. The scores are given based on complementary (1) or non-complementary (0) matches of characters. Matrix solving consists of three stages: (i) initialization, (ii) matrix-filling, and (iii) trace-back of arrows for structures. The RNA sequence is \(S = a_1 a_2 a_3 \ldots a_{l-1} a_l\), where \(a\) is a character (A, U, C, G) and \(l\) is the length of the sequence. In this tutorial, I have taken the sample RNA sequence (S) GGGAAAUCC for prediction. Initialization The initialization step is to preset the diagonal cells with zero (0) values to perform matrix filling. \(...
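
The initialization and matrix-filling stages described in this excerpt can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the post's own worked solution: only A-U and G-C count as complementary (score 1), no minimum hairpin loop length is enforced, and the trace-back stage is omitted. The function name `nussinov_max_pairs` is hypothetical.

```python
# Assumption: only Watson-Crick pairs count as complementary.
COMPLEMENTARY = frozenset({"AU", "UA", "GC", "CG"})


def nussinov_max_pairs(seq):
    """Return the maximum number of nested base pairs for seq."""
    n = len(seq)
    if n == 0:
        return 0
    # Initialization: every cell starts at 0, including the diagonal and
    # the cells just below it that the recurrence reads.
    m = [[0] * n for _ in range(n)]

    # Matrix filling, one diagonal (span = j - i) at a time.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = 1 if seq[i] + seq[j] in COMPLEMENTARY else 0
            best = max(
                m[i + 1][j],              # position i left unpaired
                m[i][j - 1],              # position j left unpaired
                m[i + 1][j - 1] + score,  # i paired with j
            )
            # Bifurcation: split the interval into two independent parts.
            for k in range(i + 1, j):
                best = max(best, m[i][k] + m[k + 1][j])
            m[i][j] = best

    # The top-right cell holds the score for the whole sequence.
    return m[0][n - 1]
```

Under these assumptions, `nussinov_max_pairs("GGGAAAUCC")` returns 3 (two G-C pairs and one A-U pair).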

Best Fonts to Display Biological Sequence and Alignment

Biological sequence alignment is a method to find similarities between two sequences. The characters (letters or biological sequence alphabets) are arranged row-wise and column-wise according to match/identity and mismatch of characters through symbol representation. Row-wise, a gap character (hyphen symbol ‘-’) is used to align/adjust the sequence characters for matching without changing their order. Correspondingly, column-wise characters such as match (same character, pipe symbol ‘|’, colon symbol ‘:’, asterisk symbol ‘*’, or dot symbol ‘.’), mismatch (any one of the sequence characters, or a blank space ‘ ’), positive (plus symbol ‘+’), and gap/indel (blank space ‘ ’) are used to represent sequence alignment. A sequence or sequence alignment must be properly formatted in a text editor/word processor using fixed-width fonts to find the differences of characters in the s...

Creating Custom Database using Standalone NCBI BLAST+

Basic Local Alignment Search Tool (BLAST) is a collection of programs, developed in C++ using a heuristic algorithm, for comparing DNA, RNA, and protein sequences. The standalone command-line interface (CLI) of BLAST is named BLAST+. The latest version of NCBI BLAST+ can be downloaded from the FTP server of NCBI ( ftp://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST ). This is a simple tutorial for creating a custom database, accessing the database, and performing a sequence search using BLAST+. 1. Creating a Custom Database A nucleotide (nucl) or protein (prot) database can be created using the -dbtype parameter of the makeblastdb program. makeblastdb is a command-line utility from NCBI's BLAST+ suite that creates searchable databases from sequence files (like FASTA), generating indexed files that allow tools like BLAST to quickly find matches to query sequences. It requires options like -in for input, -dbtype to set the database...

Prediction of 3D Structure (Folding) of a RNA Sequence

Ribonucleic acid (RNA) is a linear single-stranded molecule that takes part in translation to protein. The intramolecular interactions (a.k.a. folding) of RNA base pairs (A=U and C≡G) form a secondary structure. The Nussinov or Zuker algorithm is a dynamic programming approach used for the prediction of the secondary structure of RNA. The dot-bracket structure with the minimum free energy is a stable secondary structure. Prediction of the three-dimensional (3D) structure of RNA using the mFold and RNAComposer tools is demonstrated in this tutorial. The mFold tool predicts the RNA folding result in dot-bracket notation from the RNA sequence, while the RNAComposer tool predicts the 3D structure from the dot-bracket notation. The resources used in this tutorial are NCBI GenBank, mFold, and RNAComposer. Note: The length of RNA sequence input in mFold is limited to 4000 bases and sequence/dot-bracket notation input in RNAComposer is limited to 500 bases, due to...

Constructing Entropy Plot from Multiple Sequence Alignment

Entropy in sequence analysis is a measure of the variation of characters in each column of a multiple sequence alignment. An entropy plot for a multiple sequence alignment can be computed using different entropy formulas, namely Shannon's Entropy, Schneider's Entropy, Shenkin's Entropy, Gerstein's Entropy, and Gap normalized Entropy. Constructing an entropy plot consists of two phases: (i) performing multiple sequence alignment and consensus, and (ii) calculating the entropy number for each column through the consensus of the multiple sequence alignment. The entropy plot is generated by plotting vertical lines in the order of the consensus sequence on the x-axis, and the entropy number on the y-axis. This simple video tutorial demonstrates how to construct an entropy plot from a multiple sequence alignment. The tools used in this tutorial are ClustalW and Entropy Plotter. Note: We can choose any multiple sequence alignment tool, but the ...
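
The per-column entropy number mentioned in this excerpt can be illustrated with Shannon's formula, \(H = -\sum_i p_i \log_2 p_i\), where \(p_i\) is the frequency of character \(i\) in a column. The Python sketch below is only an assumption-laden illustration and not the Entropy Plotter implementation: it uses log base 2, treats the gap character as an ordinary symbol, and the function name `column_entropies` is hypothetical.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of each alignment column.

    alignment: list of equal-length aligned sequences (strings).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        # Sum -p * log2(p) over the characters observed in this column.
        entropy = sum(
            -(count / total) * math.log2(count / total)
            for count in counts.values()
        )
        entropies.append(entropy)
    return entropies
```

A fully conserved column scores 0, while a column holding four different characters in equal proportion scores 2 bits; plotting these values against column position gives the entropy plot.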

Biological Sequence Pattern Matching using Perl

This article is a simple Perl programming tutorial for matching patterns in a biological sequence using regular expressions. In this tutorial, I have used ActiveState Perl 5.24.3 software for running the Perl script. Pattern Matching In Bioinformatics, string matching or pattern matching is a fundamental and popular method used in a wide range of applications ranging from sequence alignment to functional prediction. Pattern matching is classified into exact pattern matching and approximate pattern matching. The exact pattern matching method does not allow any insertion, deletion, or substitution of characters while matching with the target sequence, whereas the approximate pattern matching method allows them within certain limits. In Computational Biology, a pattern is an expression written as a sequence of characters with a defined set of symbolic representations. Example: N{P}-[ST]{P}A(2,3). Source Code system('cls'); print "\n+-----------------------------------+...
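
The post's own Perl source is cut off in this excerpt, so the following is an independent Python sketch of exact pattern matching with regular expressions. It assumes the example pattern follows PROSITE-style conventions ({P} means any residue except P, [ST] means S or T, A(2,3) means two to three A residues, x means any residue, and hyphens only separate positions). The function names `prosite_to_regex` and `find_matches` and the test sequence are hypothetical.

```python
import re


def _repeat(match):
    """Turn a PROSITE repeat such as (2,3) or (4) into {2,3} or {4}."""
    low, high = match.group(1), match.group(2)
    return "{" + low + ("," + high if high else "") + "}"


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    regex = pattern.replace("-", "")                       # separators carry no meaning
    regex = regex.replace("{", "[^").replace("}", "]")     # {P} -> [^P]
    regex = re.sub(r"\((\d+)(?:,(\d+))?\)", _repeat, regex)  # (2,3) -> {2,3}
    return regex.replace("x", ".")                         # x -> any residue


def find_matches(pattern, sequence):
    """Return (start index, matched text) for each exact match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

For the example pattern, `prosite_to_regex("N{P}-[ST]{P}A(2,3)")` yields `N[^P][ST][^P]A{2,3}`, and searching the made-up sequence `MKNGSLAAAQ` reports one match, `NGSLAAA`, starting at index 2.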

Computing Amino Acid Composition using C++

This article explains a simple method to compute the composition of amino acids in a protein sequence using C++. In this tutorial, I have used Dev C++ v5.11 software for compiling the C++ program. Length of the Protein Sequence The length of the protein sequence is the count (C) of the total number of amino acid characters in the protein sequence. Let the protein sequence be \(S = S_1 S_2 S_3 \ldots S_{l-1} S_l\), where each \(S_i \in \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}\). Then, \(l\) is the length of the protein sequence (S). Amino Acid Composition of the Protein Sequence Amino acid composition is the count (C) of each amino acid in the protein sequence. The counts of the individual amino acids are \(C_A, C_C, C_D, C_E, C_F, C_G, C_H, C_I, C_K, C_L, C_M, C_N, C_P, C_Q, C_R, C_S, C_T, C_V, C_W\), and \(C_Y\). Source Code // Computing Composition of Amino Acids in the Protein Sequence #include <iostream> #include <iomanip> #include <stri...
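
The C++ source is cut off in this excerpt, so the counting idea is sketched here in Python instead, purely as an illustration of the definitions above. The function name `amino_acid_composition` and the choice to reject characters outside the 20-letter alphabet are assumptions of this sketch, not details taken from the post.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes listed in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a dict mapping each amino acid letter to its count."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        # Assumption: non-standard characters are treated as an error.
        raise ValueError("unexpected characters: " + "".join(sorted(unknown)))
    counts = Counter(sequence)
    # Counter returns 0 for letters that never occur in the sequence.
    return {aa: counts[aa] for aa in AMINO_ACIDS}
```

The sequence length is the sum of all counts, so for the made-up sequence `MKKLA` the counts are K = 2 and A = L = M = 1, with every other letter at 0 and a total of 5.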