Steps in the Pipeline
- Data Preparation
  - Input: sequences.fna (FASTA format). An example sequence is provided that can be used as input to the pipeline.
  - Parsing: Sequences are loaded with Biopython
  - Verification: Displays each sequence's ID, length, and a short snippet
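
The parsing and verification step can be sketched without any dependencies. The pipeline itself loads sequences with Biopython (presumably through `Bio.SeqIO.parse`), so the `read_fasta` helper below is a hypothetical stand-in that only illustrates what is read and displayed. The two records are made up for the example.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after ">".
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Made-up records for illustration; the pipeline reads sequences.fna instead.
example = StringIO(">seq1 demo\nATGGCC\nTAA\n>seq2\nGGGCCC\n")
records = list(read_fasta(example))

# Verification: show ID, length, and the first few bases of each record.
for seq_id, seq in records:
    print(f"{seq_id}\tlength={len(seq)}\tsnippet={seq[:10]}")
```

To read a real file, pass an open file object, for example `open("sequences.fna")`, instead of the `StringIO` buffer.
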
- Sequence Properties
  - GC Content:
    - Calculated per sequence
    - Bar plot (gc_contents.png)
  - Sequence Length:
    - Measured in base pairs
    - Bar plot (lengths.png)
  - Summary Output:
    - CSV (properties.csv) with Sequence ID, GC Content, Length (bp)
    - Combined histogram and boxplot of sequence lengths and GC content (distribution_lengths_GC.png)
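
A minimal sketch of the per-sequence properties and the summary CSV follows. It assumes GC content is reported as a percentage of all bases, which the list above does not state, and it writes to an in-memory buffer rather than to properties.csv. The function names and the sample records are illustrative only.

```python
import csv
import io

def gc_content(seq):
    """Return GC content as a percentage of all bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def summarize(records):
    """Build one (id, GC %, length) row per (id, sequence) pair."""
    return [(seq_id, round(gc_content(seq), 2), len(seq)) for seq_id, seq in records]

def write_csv(rows, handle):
    """Write the rows with the column names used in the summary output."""
    writer = csv.writer(handle)
    writer.writerow(["Sequence ID", "GC Content", "Length (bp)"])
    writer.writerows(rows)

# Made-up records for illustration.
records = [("seq1", "ATGGCCTAA"), ("seq2", "GGGCCC")]
rows = summarize(records)

buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The same rows could be loaded into a pandas DataFrame to produce the bar plots and the combined distribution figure.
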
- Protein Translation
  - Translates nucleotide sequences to protein using Biopython
  - Saves results in protein_sequences.fasta and in temp_proteins.fasta (the input to the alignment step)
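
To show what the translation step does, here is a dependency-free sketch using the standard genetic code. The pipeline uses Biopython for this, so `translate` below is a hypothetical stand-in, and two of its behaviors are assumptions: translation stops at the first stop codon by default, and codons containing ambiguous bases become `X`.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides, to_stop=True):
    """Translate frame 1 of a DNA/RNA string; trailing partial codons are ignored."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # unknown codon -> X
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

def to_fasta(seq_id, sequence):
    """Format a single FASTA record."""
    return f">{seq_id}\n{sequence}\n"

# Made-up sequence for illustration.
protein = translate("ATGGCCTAA")
print(to_fasta("seq1", protein), end="")
```
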
- Sequence Alignment and Conservation
  - Multiple sequence alignment (MSA) via Clustal Omega
    - Input: temp_proteins.fasta
    - Output: sequence_alignments_proteins.faa
  - Conservation Analysis
    - Calculates a position-wise conservation score
    - Visualization: conservation_plot.png
  - Pairwise Similarity
    - Distance matrix heatmap (distance_matrix.png)
    - Hierarchical clustering (hierarchical_clustering.png)
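
The scoring scheme behind the conservation plot and the distance matrix is not specified above, so the sketch below makes two assumptions: the conservation score of a column is the fraction of sequences that share its most common residue, and the pairwise distance is the fraction of compared columns that differ. The alignment is a made-up example, and both function names are hypothetical.

```python
from collections import Counter

def conservation_scores(alignment):
    """Per-column fraction of sequences carrying the most common non-gap residue."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)
        scores.append(top[0][1] / len(column) if top else 0.0)
    return scores

def identity_distance(a, b):
    """Fraction of differing columns; columns that are gaps in both rows are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

# Made-up aligned protein sequences (equal length, "-" marks a gap).
alignment = ["MKV-L", "MKVAL", "MRV-L"]

scores = conservation_scores(alignment)
matrix = [[identity_distance(a, b) for b in alignment] for a in alignment]

print([round(s, 2) for s in scores])
for row in matrix:
    print([round(d, 2) for d in row])
```

The resulting `matrix` is symmetric with zeros on the diagonal, which is the shape a heatmap or a hierarchical clustering routine expects.
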
- Phylogenetic Tree Construction
  - Method: UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
  - Tool: Biopython Phylo
  - Input: MSA from Clustal Omega
  - Output Visualization: phylogenetic_tree.png
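
UPGMA repeatedly merges the two closest clusters and recomputes distances to the merged cluster as a size-weighted average. The sketch below illustrates only that idea in plain Python: it returns the tree topology as a Newick string without branch lengths, and it is not the Biopython implementation the pipeline uses. The `upgma` function and the distances are hypothetical.

```python
from itertools import combinations

def upgma(names, distances):
    """Cluster labels by UPGMA and return a Newick topology string.

    distances maps frozenset({a, b}) to the distance between labels a and b.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(distances)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first pair wins on ties).
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to every remaining cluster is the size-weighted mean.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"

# Made-up distances: A and B are close, C and D are close, the pairs are far apart.
names = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "C")): 10.0,
    frozenset(("B", "D")): 10.0,
}
tree = upgma(names, distances)
print(tree)
```
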
- Functional Annotation via BLAST
  - Tool: NCBI BLASTP (online query)
  - For each protein:
    - Identifies top hit, organism, function, E-value, score
  - Summary Table: blast_df (pandas DataFrame)
  - Visualizations:
    - Top organisms (organism_frequency_distribution.png)
    - Top protein functions (function_distribution.png)
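
How the organism and function are extracted from each top hit is not described above. One plausible approach, sketched here as an assumption, relies on the common NCBI protein title convention of `function [Organism]`. The hit titles are made up for illustration, and `split_title` is a hypothetical helper; no BLAST query is performed.

```python
import re
from collections import Counter

# Matches a trailing "[Organism name]" at the end of a hit title.
ORGANISM_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")

def split_title(title):
    """Split 'function [Organism]' into (function, organism); organism is None if absent."""
    match = ORGANISM_PATTERN.search(title)
    if match is None:
        return title.strip(), None
    return title[:match.start()].strip(), match.group(1)

# Made-up top-hit titles, one per query protein.
top_hit_titles = [
    "DNA gyrase subunit A [Escherichia coli]",
    "hypothetical protein [Bacillus subtilis]",
    "DNA gyrase subunit A [Escherichia coli]",
]

parsed = [split_title(title) for title in top_hit_titles]
function_counts = Counter(function for function, _ in parsed)
organism_counts = Counter(organism for _, organism in parsed if organism)

print(organism_counts.most_common(3))
print(function_counts.most_common(3))
```

Counts like these are the kind of data the two frequency plots would be drawn from.
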
Output Summary
Output File | Description |
---|---|
gc_contents.png | GC content bar plot |
lengths.png | Sequence length bar plot |
properties.csv | Sequence ID, GC content, length (bp) |
distribution_lengths_GC.png | Combined histogram & boxplot |
protein_sequences.fasta | Translated protein sequences |
sequence_alignments_proteins.faa | MSA results |
conservation_plot.png | Position-wise conservation scores |
distance_matrix.png | Pairwise distance heatmap |
hierarchical_clustering.png | Sequence clustering dendrogram |
phylogenetic_tree.png | Phylogenetic tree |
organism_frequency_distribution.png | Top organisms from BLAST |
function_distribution.png | Top protein functions from BLAST |
Notes
- The pipeline is designed to be run in Google Colab.
- BLAST queries are performed online using Biopython’s NCBIWWW.qblast(), so an internet connection is required.
- Clustal Omega must be installed and available on the command line as clustalo.
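
A quick way to check the Clustal Omega requirement before running the alignment step is sketched below. The file names come from the pipeline description, while the helper names and the choice of FASTA output format are assumptions. The command is only printed here, not executed.

```python
import shutil

def clustalo_available():
    """Return True if a clustalo executable is found on PATH."""
    return shutil.which("clustalo") is not None

def clustalo_command(input_fasta, output_fasta):
    """Build an argument list for aligning input_fasta into output_fasta."""
    return [
        "clustalo",
        "-i", input_fasta,
        "-o", output_fasta,
        "--outfmt=fasta",  # assumed output format
        "--force",         # overwrite an existing output file
    ]

cmd = clustalo_command("temp_proteins.fasta", "sequence_alignments_proteins.faa")
if clustalo_available():
    # To actually run it: subprocess.run(cmd, check=True)
    print("Ready to run:", " ".join(cmd))
else:
    print("clustalo not found on PATH; install Clustal Omega first.")
```
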