Bioinformatics Group Project: Protein Sequence Analysis Pipeline
This project performs comprehensive analysis on a set of DNA sequences. It includes sequence property exploration, translation to protein, multiple and pairwise alignments, phylogenetic tree construction, and BLAST functional annotation.
A tutorial can be found here.
Project Overview
This pipeline, written in Python and run via Google Colab, processes input DNA sequences (.fna) to explore their properties and translate them into protein sequences. It then performs various bioinformatics analyses including:
- GC content and sequence length visualization
- Translation to proteins
- Multiple sequence alignment (MSA) using Clustal Omega
- Sequence conservation analysis
- Pairwise similarity and clustering
- Phylogenetic tree generation
- Functional annotation via BLAST
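As a rough illustration of the first two analyses in the list above, the following dependency-free Python sketch computes GC content and translates reading frame 1 using the standard genetic code. It is not the project's code, which relies on Biopython. The function names and the example sequence are hypothetical, and the handling of ambiguous codons (reported as X) is an assumption.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def translate(seq: str) -> str:
    """Translate frame 1 of a DNA sequence; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Hypothetical example record (not taken from sequences.fna).
example = "ATGGCCAAATAA"
print(f"GC: {gc_content(example):.1f}%  length: {len(example)} bp")
print("protein:", translate(example))
```

Stop codons are kept as `*` here; whether the pipeline trims at the first stop codon is not stated in this README.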
Dependencies
Ensure the following packages are installed:
pip install biopython matplotlib numpy scipy pandas seaborn
apt-get install -y clustalo
Steps in the Pipeline
- Data Preparation
  - Input: sequences.fna (FASTA format)
  - Parsing: Biopython loads the sequences
  - Verification: Displays each sequence's ID, length, and a snippet
- Sequence Properties
  - GC Content:
    - Calculated per sequence
    - Bar plot (gc_contents.png)
  - Sequence Length:
    - Measured in base pairs
    - Bar plot (lengths.png)
  - Summary Output:
    - CSV (properties.csv) with Sequence ID, GC Content, Length (bp)
    - Combined DataFrame visualizations (distribution_lengths_GC.png)
- Protein Translation
  - Translates nucleotide sequences to protein using Biopython
  - Saves results in protein_sequences.fasta and temp_proteins.fasta
- Sequence Alignment and Conservation
  - MSA via Clustal Omega
    - Input: temp_proteins.fasta
    - Output: sequence_alignments_proteins.faa
  - Conservation Analysis
    - Calculates a position-wise conservation score
    - Visualization: conservation_plot.png
- Pairwise Similarity
  - Matrix heatmap (distance_matrix.png)
  - Hierarchical clustering (hierarchical_clustering.png)
- Phylogenetic Tree Construction
  - Method: UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
  - Tool: Biopython Phylo
  - Input: MSA from Clustal Omega
  - Output Visualization: phylogenetic_tree.png
- Functional Annotation via BLAST
  - Tool: NCBI BLASTP (online query)
  - For Each Protein:
    - Identifies top hit, organism, function, E-value, score
  - Summary Table: blast_df (pandas DataFrame)
  - Visualizations:
    - Top organisms (organism_frequency_distribution.png)
    - Top protein functions (function_distribution.png)
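The conservation and pairwise-similarity steps above do not say which scoring scheme is used. The sketch below shows one simple possibility in plain Python. It assumes conservation is the fraction of sequences sharing the most common residue in each alignment column, and that pairwise distance is one minus the fraction of identical residues over columns where neither sequence has a gap. Both definitions, the function names, and the toy alignment are illustrative assumptions, not the project's actual implementation.

```python
from collections import Counter
from itertools import combinations


def conservation_scores(alignment):
    """Per-column fraction of sequences sharing the most common residue.

    Gaps ('-') never count as the conserved residue, so an all-gap
    column scores 0.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / n_seqs)
    return scores


def identity_distance(a, b):
    """1 - identity, computed over columns where neither row is gapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    matches = sum(x == y for x, y in pairs)
    return 1.0 - matches / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise identity distances."""
    n = len(alignment)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = identity_distance(alignment[i], alignment[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy alignment (rows must have equal length).
toy = ["MKV-L", "MKVAL", "MRVAL"]
print(conservation_scores(toy))
for row in distance_matrix(toy):
    print(row)
```

A symmetric distance matrix of this kind is the sort of input that a heatmap, hierarchical clustering, and UPGMA tree building all operate on.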
Output Summary
Output File | Description |
---|---|
gc_contents.png | GC content bar plot |
lengths.png | Sequence lengths plot |
properties.csv | Sequence ID, GC content, length |
distribution_lengths_GC.png | Combined histogram & boxplot |
protein_sequences.fasta | Translated protein sequences |
sequence_alignments_proteins.faa | MSA results |
conservation_plot.png | Position-wise conservation scores |
distance_matrix.png | Pairwise distance heatmap |
hierarchical_clustering.png | Sequence clustering dendrogram |
phylogenetic_tree.png | Phylogenetic tree |
organism_frequency_distribution.png | Top organisms from BLAST |
function_distribution.png | Top protein functions from BLAST |
Notes
- The pipeline is designed to be run in Google Colab.
- BLAST queries are performed online using Biopython’s NCBIWWW.qblast(), so an internet connection is required.
- Clustal Omega must be available via command line (clustalo).
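To illustrate the last note, here is a minimal sketch of calling clustalo from Python with an early check that the executable is on the PATH. The helper names are hypothetical, and the choice of FASTA output with `--force` (overwrite an existing output file) is an assumption. The file names are the ones listed for the alignment step.

```python
import shutil
import subprocess


def build_clustalo_command(in_fasta, out_fasta):
    """Return the argument list for a FASTA-in, FASTA-out Clustal Omega run."""
    return [
        "clustalo",
        "-i", in_fasta,
        "-o", out_fasta,
        "--outfmt=fasta",
        "--force",  # overwrite an existing output file
    ]


def run_clustalo(in_fasta, out_fasta):
    """Run Clustal Omega, failing early if the executable is missing."""
    if shutil.which("clustalo") is None:
        raise RuntimeError("clustalo not found on PATH; install it first")
    subprocess.run(build_clustalo_command(in_fasta, out_fasta), check=True)


cmd = build_clustalo_command(
    "temp_proteins.fasta", "sequence_alignments_proteins.faa"
)
print(" ".join(cmd))
```

Only the command is built and printed here; calling `run_clustalo(...)` performs the actual alignment once clustalo is installed.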
Credits
Developed by Of Mice and Microbes as part of the Bioinformatics course group project, Spring 2025.