Tutorial | Bioinformatics Of Mice and Microbes

Steps in the Pipeline

Data Preparation

Input: sequences.fna (FASTA format). An example sequence is provided that can be used as input to the pipeline.
Parsing: Sequences are loaded with Biopython
Verification: Displays each sequence's ID, length, and a short snippet
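
The loading and verification steps can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's own code: the function name preview_records, the 30-base snippet length, and the in-memory example records are all hypothetical. In the pipeline the handle would point at sequences.fna instead.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical stand-in for sequences.fna so the sketch runs without the file.
EXAMPLE_FASTA = """>seq1 example record
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
>seq2 example record
ATGAAACGCATTAGCACCACCATTACCACCACCATCACC
"""


def preview_records(handle, snippet_len=30):
    """Return (id, length, snippet) for each FASTA record in the handle."""
    rows = []
    for record in SeqIO.parse(handle, "fasta"):
        rows.append((record.id, len(record.seq), str(record.seq[:snippet_len])))
    return rows


for seq_id, length, snippet in preview_records(StringIO(EXAMPLE_FASTA)):
    print(f"{seq_id}\t{length} bp\t{snippet}...")
```

To run this against a real file, open it and pass the handle, for example with open("sequences.fna") as handle: preview_records(handle).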

Sequence Properties

GC Content:
- Calculated per sequence
- Bar plot (gc_contents.png)
Sequence Length:
- Measured in base pairs
- Bar plot (lengths.png)
Summary Output:
- CSV (properties.csv) with Sequence ID, GC Content, Length (bp)
- Combined histogram and boxplot of lengths and GC content (distribution_lengths_GC.png)
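
A minimal sketch of the per-sequence calculations is shown below. It makes several assumptions: GC content is expressed as a percentage of all bases, values are rounded to two decimals, and the standard-library csv module stands in for whatever the pipeline uses to build its table. The helper names and the two example sequences are hypothetical.

```python
import csv
import io

FIELDS = ["Sequence ID", "GC Content", "Length (bp)"]


def gc_content(sequence):
    """Percentage of G and C bases in a nucleotide sequence (0.0 for empty input)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize(sequences):
    """Build one row per sequence: ID, GC content (%), length (bp)."""
    return [
        {
            "Sequence ID": seq_id,
            "GC Content": round(gc_content(seq), 2),
            "Length (bp)": len(seq),
        }
        for seq_id, seq in sequences.items()
    ]


# Hypothetical sequences, for illustration only.
example = {"seq1": "ATGCGC", "seq2": "ATATATGC"}
rows = summarize(example)

# Written to an in-memory buffer here; swap in
# open("properties.csv", "w", newline="") to write the file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Ambiguous bases such as N count toward the length but not toward G or C in this sketch, which lowers the reported GC content for sequences that contain them.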

Protein Translation

Translates nucleotide sequences to protein using Biopython
Saves results in protein_sequences.fasta and temp_proteins.fasta (the latter is the input to the alignment step)
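
The translation step might look like the sketch below. The choices here are assumptions rather than the pipeline's documented behavior: the standard genetic code, reading frame 1, stopping at the first stop codon, and trimming trailing bases that do not form a full codon. The function name translate_to_protein is hypothetical.

```python
from Bio.Seq import Seq


def translate_to_protein(nucleotides):
    """Translate a nucleotide string with the standard genetic code,
    stopping at the first stop codon."""
    seq = Seq(nucleotides)
    # Trim trailing bases so the length is a multiple of three;
    # this avoids Biopython's partial-codon warning.
    usable = len(seq) - len(seq) % 3
    return str(seq[:usable].translate(to_stop=True))


print(translate_to_protein("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
```

Sequences that are not in frame 1, or that use a different genetic code, would need a different starting offset or the table argument of translate().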

Sequence Alignment and Conservation

MSA via Clustal Omega
- Input: temp_proteins.fasta
- Output: sequence_alignments_proteins.faa
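
One way to call Clustal Omega from Python is sketched below. The helper names are hypothetical, and FASTA output with --force (overwrite an existing output file) is an assumption suggested by the .faa extension, not something the pipeline description confirms.

```python
import shutil
import subprocess


def clustalo_command(infile, outfile):
    """Build the argument list for a Clustal Omega run that writes FASTA output."""
    return ["clustalo", "-i", infile, "-o", outfile, "--outfmt=fasta", "--force"]


def run_clustalo(infile, outfile):
    """Run Clustal Omega, failing early if the executable is not on PATH."""
    if shutil.which("clustalo") is None:
        raise FileNotFoundError("clustalo not found on PATH")
    subprocess.run(clustalo_command(infile, outfile), check=True)


# Show the command that would be executed for the pipeline's file names.
print(" ".join(clustalo_command("temp_proteins.fasta", "sequence_alignments_proteins.faa")))
```

Passing the arguments as a list avoids shell quoting problems, and check=True raises an error if the alignment fails instead of silently continuing.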
Conservation Analysis
- Calculates position-wise conservation score
- Visualization: conservation_plot.png
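
The pipeline description does not say how the conservation score is defined. The sketch below assumes one common and simple definition: for each alignment column, the fraction of sequences that carry the most frequent residue, with gaps never counted as the conserved residue. The function name and the example alignment are hypothetical.

```python
from collections import Counter


def conservation_scores(aligned):
    """Per-column fraction of sequences sharing the most common residue.

    `aligned` is a list of equal-length aligned sequences; '-' marks a gap.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so gappy columns score lower.
        scores.append(top_count / len(column))
    return scores


# Hypothetical aligned proteins, for illustration only.
example = ["MKV-L", "MKI-L", "MRVAL"]
print([round(s, 2) for s in conservation_scores(example)])
```

Entropy-based scores are a frequent alternative; they weigh the whole residue distribution in a column and not only the most common residue.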
Pairwise Similarity
- Matrix heatmap (distance_matrix.png)
- Hierarchical clustering (hierarchical_clustering.png)
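
A pairwise matrix of this kind can be derived from the alignment as sketched below. The definitions are assumptions: similarity is the fraction of identical aligned positions (columns where both sequences have a gap are skipped), and distance is one minus that fraction. The function names and example sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of aligned columns where two sequences carry the same residue.

    Columns where both sequences have a gap are ignored; a gap opposite a
    residue counts as a mismatch.
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Return (names, matrix) where matrix[i][j] = 1 - identity of sequences i and j."""
    names = list(aligned)
    matrix = [
        [1.0 - identity(aligned[p], aligned[q]) for q in names]
        for p in names
    ]
    return names, matrix


# Hypothetical aligned proteins, for illustration only.
example = {"seqA": "MKV-L", "seqB": "MKI-L", "seqC": "MRVAL"}
names, matrix = distance_matrix(example)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

A symmetric matrix like this is the kind of input a heatmap or a hierarchical clustering routine would take.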

Phylogenetic Tree Construction

Method: UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Tool: Biopython Phylo
Input: MSA from Clustal Omega
Output Visualization: phylogenetic_tree.png
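
Building a UPGMA tree with Biopython could look like the sketch below. The "identity" distance model and the small in-memory alignment are assumptions made so the example is self-contained; the pipeline would read the Clustal Omega output instead.

```python
from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical aligned proteins; in the pipeline these would come from
# AlignIO.read("sequence_alignments_proteins.faa", "fasta").
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("MKV-L"), id="seqA"),
    SeqRecord(Seq("MKI-L"), id="seqB"),
    SeqRecord(Seq("MRVAL"), id="seqC"),
])

# Distances are computed from the alignment, then clustered with UPGMA.
calculator = DistanceCalculator("identity")
constructor = DistanceTreeConstructor(calculator, "upgma")
tree = constructor.build_tree(alignment)

# Text rendering of the tree; Phylo.draw(tree) would plot it with matplotlib.
Phylo.draw_ascii(tree)
```

UPGMA assumes a constant rate of change across lineages, so the resulting tree is rooted and its leaves are equidistant from the root.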

Functional Annotation via BLAST

Tool: NCBI BLASTP (online query)
For Each Protein:
- Identifies top hit, organism, function, E-value, score
Summary Table: blast_df (pandas DataFrame)
Visualizations:
- Top organisms (organism_frequency_distribution.png)
- Top protein functions (function_distribution.png)
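
The BLAST step can be sketched as below. Several things here are assumptions: the nr database, keeping only the single best hit, and hit titles that follow the common "function [Organism]" pattern. The function names are hypothetical, and the example titles are made up for illustration; they are not real BLAST output. top_blast_hit needs an internet connection and is defined but not called here.

```python
import re
from collections import Counter


def split_hit_title(title):
    """Split a hit title of the form 'function [Organism]' into (function, organism).

    Uses the first bracketed group; titles without brackets get organism 'unknown'.
    """
    match = re.search(r"^(.*?)\s*\[([^\[\]]+)\]", title)
    if match is None:
        return title.strip(), "unknown"
    return match.group(1).strip(), match.group(2).strip()


def top_blast_hit(protein_sequence):
    """Query NCBI BLASTP online and summarize the best hit (needs internet)."""
    # Imported here so the offline helper above works without Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast("blastp", "nr", protein_sequence, hitlist_size=1)
    record = NCBIXML.read(handle)
    if not record.alignments:
        return None
    alignment = record.alignments[0]
    hsp = alignment.hsps[0]
    function, organism = split_hit_title(alignment.hit_def)
    return {
        "function": function,
        "organism": organism,
        "e_value": hsp.expect,
        "score": hsp.score,
    }


# Made-up hit titles, used only to show the counting step.
titles = [
    "DNA gyrase subunit A [Escherichia coli]",
    "hypothetical protein [Mus musculus]",
    "DNA gyrase subunit A [Escherichia coli]",
]
organism_counts = Counter(split_hit_title(t)[1] for t in titles)
function_counts = Counter(split_hit_title(t)[0] for t in titles)
print(organism_counts.most_common())
print(function_counts.most_common())
```

A list of the dictionaries returned by top_blast_hit can be passed to pandas.DataFrame to build a summary table, and the two counters supply the values for the organism and function frequency plots.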

Output Summary

Output File	Description
gc_contents.png	GC content bar plot
lengths.png	Sequence lengths plot
properties.csv	Sequence ID, GC content, length
distribution_lengths_GC.png	Combined histogram & boxplot
protein_sequences.fasta	Translated protein sequences
sequence_alignments_proteins.faa	MSA results
conservation_plot.png	Position-wise conservation scores
distance_matrix.png	Pairwise distance heatmap
hierarchical_clustering.png	Sequence clustering dendrogram
phylogenetic_tree.png	Phylogenetic tree
organism_frequency_distribution.png	Top organisms from BLAST
function_distribution.png	Top protein functions from BLAST

Notes

The pipeline is designed to be run in Google Colab.
BLAST queries are performed online using Biopython’s NCBIWWW.qblast(), so an internet connection is required.
Clustal Omega must be installed and available on the command line as clustalo.