Steps in the Pipeline

  1. Data Preparation
  • Input: sequences.fna (FASTA format). An example sequence is available to use as pipeline input.
  • Parsing: Sequences are loaded with Biopython
  • Verification: Displays each sequence's ID, length, and a short snippet
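
The parsing and verification step can be sketched without any dependencies. The pipeline itself loads sequences with Biopython; the block below is a plain-Python stand-in, and `read_fasta`, `describe`, and the two toy records are hypothetical names and data used only for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text lines."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def describe(seq_id, seq, snippet_len=10):
    """Format the ID, length, and leading snippet of one sequence."""
    return f"{seq_id}\t{len(seq)} bp\t{seq[:snippet_len]}"


# Toy input standing in for sequences.fna (made-up records).
example = StringIO(">seq1 toy record\nATGGCC\nATTGTA\n>seq2\nATGAAATAG\n")
records = list(read_fasta(example))
for seq_id, seq in records:
    print(describe(seq_id, seq))
```

With Biopython installed, `SeqIO.parse("sequences.fna", "fasta")` yields record objects that expose the same information through `record.id` and `record.seq`.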
  2. Sequence Properties
  • GC Content:
    • Calculated per sequence
    • Bar plot (gc_contents.png)
  • Sequence Length:
    • Measured in base pairs
    • Bar plot (lengths.png)
  • Summary Output:
    • CSV (properties.csv) with columns Sequence ID, GC Content, Length (bp)
    • Combined histogram and boxplot of lengths and GC content (distribution_lengths_GC.png)
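
A minimal sketch of how the per-sequence properties and the CSV summary could be computed. It assumes GC content is reported as a percentage rounded to two decimals, which the text above does not specify, and it uses the standard `csv` module rather than pandas. The function names and toy sequences are hypothetical.

```python
import csv
import io

COLUMNS = ["Sequence ID", "GC Content", "Length (bp)"]


def gc_content(seq):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize(records):
    """Build one row per (sequence_id, sequence) pair."""
    return [
        {
            "Sequence ID": seq_id,
            "GC Content": round(gc_content(seq), 2),  # assumed: percent, 2 decimals
            "Length (bp)": len(seq),
        }
        for seq_id, seq in records
    ]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy sequences (made up for illustration).
rows = summarize([("seq1", "ATGC"), ("seq2", "GGCCAT")])
print(to_csv(rows))
```

Writing the returned text to properties.csv would give a file with the three columns listed above.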
  3. Protein Translation
  • Translates nucleotide sequences to protein using Biopython
  • Saves results to protein_sequences.fasta and temp_proteins.fasta (the latter is the input to Clustal Omega)
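
To illustrate what the translation step does, here is a dependency-free sketch using the standard genetic code. The pipeline relies on Biopython for this; `translate` and `CODON_TABLE` below are hypothetical names, and mapping unrecognized codons to `X` and dropping a trailing partial codon are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered by codon with bases cycling T, C, A, G.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, to_stop=False):
    """Translate a nucleotide string codon by codon from position 0.

    Stop codons become '*'. With to_stop=True, translation ends at the
    first stop codon and the '*' is not included.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for start in range(0, len(seq) - 2, 3):
        amino = CODON_TABLE.get(seq[start:start + 3], "X")  # assumed: X for unknown
        if amino == "*" and to_stop:
            break
        protein.append(amino)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", to_stop=True))
```

Biopython's `Seq.translate()` accepts a `to_stop` argument with the same meaning, so the choice between keeping and truncating at stop codons carries over directly.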
  4. Sequence Alignment and Conservation
  • MSA via Clustal Omega
    • Input: temp_proteins.fasta
    • Output: sequence_alignments_proteins.faa
  • Conservation Analysis
    • Calculates position-wise conservation score
    • Visualization: conservation_plot.png
  • Pairwise Similarity
    • Matrix heatmap (distance_matrix.png)
    • Hierarchical clustering (hierarchical_clustering.png)
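
The exact conservation and similarity formulas are not stated above, so the following is only one plausible reading: conservation as the fraction of sequences that share the most common residue in each column, and pairwise distance as one minus the fraction of identical columns. All function names and the toy alignment are hypothetical.

```python
from collections import Counter


def conservation_scores(aligned):
    """Score each alignment column by its most common residue's frequency."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def pairwise_distance(a, b):
    """Return 1 - identity, skipping columns where both sequences have a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)


def distance_matrix(aligned):
    """Build the full symmetric matrix of pairwise distances."""
    return [[pairwise_distance(a, b) for b in aligned] for a in aligned]


# Toy protein alignment (made up for illustration).
alignment = ["MKV-L", "MKVAL", "MRV-L"]
scores = conservation_scores(alignment)
matrix = distance_matrix(alignment)
print([round(s, 2) for s in scores])
for row in matrix:
    print([round(d, 2) for d in row])
```

A matrix of this shape is what a heatmap or a hierarchical clustering routine would take as input.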
  5. Phylogenetic Tree Construction
  • Method: UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
  • Tool: Biopython Phylo
  • Input: MSA from Clustal Omega
  • Output Visualization: phylogenetic_tree.png
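
For readers unfamiliar with UPGMA, the sketch below shows the clustering rule in plain Python: repeatedly merge the two closest clusters and average their distances to the remaining clusters, weighted by cluster size. This is not the pipeline's code, which uses Biopython; the function name `upgma` and the three-taxon distance matrix are made up for illustration.

```python
def upgma(labels, matrix):
    """Cluster with UPGMA and return the tree as a Newick string.

    labels: sequence names; matrix: symmetric distances, matrix[i][j].
    """
    if not labels:
        raise ValueError("at least one label is required")
    n = len(labels)
    # Each cluster stores (newick_text, member_count, node_height).
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        del dist[(a, b)]
        height = d_ab / 2  # the new node sits halfway between the two clusters
        merged = (
            f"({text_a}:{height - height_a:.3f},"
            f"{text_b}:{height - height_b:.3f})"
        )
        # Size-weighted average distance from the merged cluster to the rest.
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = (merged, size_a + size_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Toy distances: A and B are close, C is equally far from both.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree)
```

In Biopython, the equivalent functionality is provided by `DistanceTreeConstructor` in `Bio.Phylo.TreeConstruction`, which has a `upgma` method that operates on a distance matrix.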
  6. Functional Annotation via BLAST
  • Tool: NCBI BLASTP (online query)
  • For Each Protein:
    • Identifies top hit, organism, function, E-value, score
  • Summary Table: blast_df (pandas DataFrame)
  • Visualizations:
    • Top organisms (organism_frequency_distribution.png)
    • Top protein functions (function_distribution.png)
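
How the organism and function are extracted from each hit is not described above. One common approach, sketched here as an assumption, relies on the convention that NCBI protein titles usually end with the organism name in square brackets. The helper `split_title` and the example titles are hypothetical and do not come from a real query.

```python
import re
from collections import Counter

# Matches a trailing "[Organism name]" at the end of a hit title.
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def split_title(title):
    """Split a hit title into (function, organism).

    Falls back to organism "unknown" when no trailing bracket is present.
    """
    match = ORGANISM_RE.search(title)
    if match is None:
        return title.strip(), "unknown"
    return title[:match.start()].strip(), match.group(1)


# Illustrative titles only, not real BLAST output.
hit_titles = [
    "DNA polymerase III subunit beta [Escherichia coli]",
    "chromosomal replication initiator protein DnaA [Escherichia coli]",
    "DNA gyrase subunit B [Bacillus subtilis]",
]
parsed = [split_title(title) for title in hit_titles]
organism_counts = Counter(organism for _, organism in parsed)
print(organism_counts.most_common())
```

Counts of this kind are what a frequency bar plot of top organisms would display; the same pattern applies to the function strings.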

Output Summary

Output File                          Description
gc_contents.png                      GC content bar plot
lengths.png                          Sequence lengths plot
properties.csv                       Sequence ID, GC content, length
distribution_lengths_GC.png          Combined histogram & boxplot
protein_sequences.fasta              Translated protein sequences
sequence_alignments_proteins.faa     MSA results
conservation_plot.png                Consensus conservation scores
distance_matrix.png                  Pairwise distance heatmap
hierarchical_clustering.png          Sequence clustering dendrogram
phylogenetic_tree.png                Phylogenetic tree
organism_frequency_distribution.png  Top organisms from BLAST
function_distribution.png            Top protein functions from BLAST

Notes

  • The pipeline is designed to be run in Google Colab.
  • BLAST queries are performed online using Biopython’s NCBIWWW.qblast(), so an internet connection is required.
  • Clustal Omega must be available via command line (clustalo).
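
Before running the alignment step, it can help to confirm that `clustalo` is reachable. The sketch below checks the PATH and assembles a command line for the input and output files named earlier; the helper names are hypothetical, and the choice of `--outfmt=fasta` and `--force` (overwrite existing output) is an assumption about how the pipeline invokes the tool.

```python
import shutil


def clustalo_available():
    """Return True if a `clustalo` executable is found on PATH."""
    return shutil.which("clustalo") is not None


def build_clustalo_command(infile, outfile):
    """Build the argument list for a FASTA-in, FASTA-out alignment."""
    return ["clustalo", "-i", infile, "-o", outfile, "--outfmt=fasta", "--force"]


command = build_clustalo_command(
    "temp_proteins.fasta", "sequence_alignments_proteins.faa"
)
if clustalo_available():
    print("Ready to run:", " ".join(command))
else:
    print("clustalo not found on PATH; install Clustal Omega first.")
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the executable is confirmed to be present.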