From Millions of Reads to Complete Genomes

Genome Assembly
Transform fragmented sequencing data into contiguous genome representations through computational reconstruction.
The Assembly Challenge
Reconstructing the Genomic Puzzle
Genome assembly faces a daunting mathematical challenge: reconstructing a complete genome from millions of short, overlapping fragments. Consider the human genome — 3 billion base pairs of information that must be pieced together from reads of only 150 base pairs each.
Covering every base just once (1x) already takes 20 million such reads, and a typical 30x sequencing depth takes about 600 million. The assembler must determine how these fragments overlap, which ones connect to which, and in what orientation they should be placed.
How do we order and orient all these pieces correctly? That's where sophisticated assembly algorithms come in.
Challenge
The Repeat Problem
When Identical Sequences Create Ambiguity
Repetitive DNA is the nemesis of genome assembly. When a sequence appears multiple times in the genome, reads originating from these regions look identical. The assembler cannot determine which genomic copy each read came from.
Example scenario: A transposable element appears 10 times throughout the genome. Reads mapping to this element could belong to any of those 10 copies. Without additional information, the assembler must make arbitrary choices, potentially creating misassemblies.
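The ambiguity described above can be reproduced on a toy sequence. The following is a minimal sketch, assuming exact string matching and a made-up genome; `find_occurrences`, the 8 bp `repeat`, and the flanking sequences are all hypothetical names and data chosen for illustration.

```python
def find_occurrences(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: one 8 bp element inserted three times between unique flanks.
repeat = "ACGTTGCA"
genome = "GGATCC" + repeat + "TTAGGC" + repeat + "CATGAA" + repeat + "TCCGGA"

# A read that lies entirely inside the repeat cannot be placed uniquely.
read_inside_repeat = repeat[1:7]
# A read that also covers unique flanking sequence can be placed.
read_spanning_edge = "ATCC" + repeat[:4]

repeat_hits = find_occurrences(genome, read_inside_repeat)
unique_hits = find_occurrences(genome, read_spanning_edge)

print("read inside repeat matches at:", repeat_hits)
print("read spanning a repeat boundary matches at:", unique_hits)
```

The read that reaches into unique flanking sequence has a single placement, which is why reads (or read pairs) longer than the repeat are what resolve it.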
Comparing Genome Assembly Approaches
Reference-based: Aligning sequence reads to an existing reference genome to reconstruct the target genome.
De novo: Building the genome from scratch by assembling overlapping sequence reads without a reference template.
Illustration of how overlapping reads are assembled
Image Reference: https://www.cs.hku.hk/research/research-highlights/De_Novo_Genome_Assembly
De Novo vs. Reference-Based Assembly
De Novo Assembly
Advantages:
No reference genome required
Discovers novel sequences and structural variants
Captures organism-specific genomic features
Disadvantages:
Computationally intensive and time-consuming
Struggles with highly repetitive regions
Requires high sequencing coverage (50-100x)
Reference-Based Assembly
Advantages:
Fast and computationally straightforward
Works well with lower coverage (10-30x)
Leverages existing genomic knowledge
Disadvantages:
Introduces reference bias in variant calling
Cannot discover novel sequences
Requires closely related reference genome
Assembly Graphs: De Bruijn Approach
Graph-Based Genome Reconstruction
Modern short-read assemblers use de Bruijn graphs to represent relationships between sequence fragments. In this approach, nodes represent short k-mers (typically 21-127 bp sequences), and edges connect k-mers that overlap by k-1 bases, meaning they are adjacent in at least one read.
A path through the graph reconstructs the original genomic sequence. This method is remarkably efficient and handles sequencing errors gracefully through graph simplification algorithms.
However, repetitive sequences create ambiguous branching points where multiple valid paths exist, leading to assembly fragmentation. The choice of k-mer size critically affects graph structure and assembly quality.
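One simple way to see why sequencing errors can be cleaned up is k-mer abundance filtering: a true k-mer is seen roughly as many times as the coverage, while a k-mer created by a random error is usually seen once. The sketch below illustrates that idea only; it is a simpler relative of the tip and bubble removal that assemblers apply to the graph itself, and the function names, reads, and `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads: list[str], k: int, min_count: int = 2) -> set[str]:
    """Keep k-mers seen at least `min_count` times; rarer ones are treated as likely errors."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_count}


# Hypothetical data: three error-free copies of a read plus one copy with a single substitution.
true_read = "ATGGCGTGCA"
error_read = "ATGGCTTGCA"  # G -> T at position 5
reads = [true_read] * 3 + [error_read]

kept = solid_kmers(reads, k=4, min_count=2)
dropped = set(count_kmers(reads, 4)) - kept

print("kept:", sorted(kept))
print("dropped as likely errors:", sorted(dropped))
```

A single substitution creates up to k erroneous k-mers, so larger k values make each error more costly to the graph and raise the coverage needed to separate true k-mers from noise.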
De Bruijn Graph Assembly
K-mers are short DNA subsequences of length k extracted from sequencing reads, used as building blocks in graph-based assembly.
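The k-mer and graph definitions above can be sketched in a few lines. This is a minimal illustration, assuming error-free reads from one strand and the node-centric convention used in this deck (nodes are k-mers, edges join k-mers adjacent in a read); `kmers`, `build_de_bruijn`, and `walk_unambiguous` are hypothetical helper names, not part of any assembler.

```python
def kmers(read: str, k: int) -> list[str]:
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each k-mer to the sorted list of k-mers that follow it in some read."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        read_kmers = kmers(read, k)
        for kmer in read_kmers:
            graph.setdefault(kmer, set())
        # Consecutive k-mers in a read overlap by k-1 bases.
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


def walk_unambiguous(graph: dict[str, list[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:  # stop on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATGG")
print(contig)
```

The walk stops as soon as a node has zero or several successors, which is exactly where a repeat would fragment the assembly. A real assembler also checks incoming edges, handles both strands, and simplifies the graph before extracting contigs.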
Standard de Bruijn Graph
An assembly graph of a mouse gut metagenome.
The assembly graph was built with metaSPAdes and visualized with Bandage. Image courtesy of Silas Kieser.
Compact de Bruijn Graph
Simplified graph merging linear paths for efficiency
Image Reference: https://spacegraphcats.github.io/spacegraphcats/0a-primer/
Assembly Methods Compared
This diagram illustrates two primary computational approaches to genome assembly:
On the left side, the Overlap-Layout-Consensus (OLC) method is shown. It begins by identifying overlaps between individual sequencing reads (i), then positions these reads to form contigs (ii), and finally generates a consensus sequence (iii) by resolving discrepancies.
The right side depicts the de Bruijn graph assembly method. This approach converts reads into a collection of fixed-length subsequences called k-mers (i). These k-mers are then used to construct a de Bruijn graph where nodes are k-mers and edges represent overlaps (ii). The final genomic sequence (contigs) is generated by traversing paths through this graph (iii).
While both methods effectively reconstruct genomic sequences from fragmented data, they employ fundamentally different computational strategies to achieve this common goal.
Image Reference: https://spacegraphcats.github.io/spacegraphcats/0a-primer/
De Bruijn vs. Overlap-Layout-Consensus Assembly
De Bruijn Graph Approach:
Method: Breaks reads into k-mers, builds graph where nodes are k-mers and edges represent overlaps
Computational Efficiency: Fast and memory-efficient for short reads
Best For: Illumina short reads (50-300 bp)
Advantages: Handles high coverage well, computationally scalable, error-tolerant through graph simplification
Disadvantages: Struggles with repeats longer than k-mer size, loses read connectivity information
Examples: SPAdes, Velvet, SOAPdenovo
Overlap-Layout-Consensus (OLC) Approach:
Method: Finds all pairwise overlaps between reads, builds overlap graph, finds consensus path
Computational Efficiency: Computationally intensive, requires all-vs-all comparisons
Best For: Long reads (PacBio, Nanopore) with higher error rates
Advantages: Preserves read connectivity, better handles long repeats, produces longer contigs
Disadvantages: Slow for large datasets, memory-intensive, sensitive to sequencing errors
Examples: Canu, Miniasm, Flye
Modern hybrid assemblers often combine both approaches to leverage their complementary strengths.
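The overlap step of the OLC method, and why it is expensive, can be sketched as follows. This is a simplified illustration assuming exact suffix-prefix matches; real long-read assemblers use error-tolerant alignment and indexing to avoid the brute-force comparison shown here. The function names, reads, and `min_len` threshold are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0 if below min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads: list[str], min_len: int = 3) -> dict[tuple[str, str], int]:
    """All-vs-all comparison: n * (n - 1) ordered read pairs are tested."""
    edges = {}
    for a, b in permutations(reads, 2):
        length = suffix_prefix_overlap(a, b, min_len)
        if length:
            edges[(a, b)] = length
    return edges


# Hypothetical reads sampled from one strand of a short sequence.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
edges = overlap_graph(reads)
for (a, b), length in edges.items():
    print(f"{a} -> {b} (overlap {length})")
```

Because whole reads stay intact as graph nodes, read connectivity is preserved, but the number of pairs to test grows quadratically with the number of reads.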
Coverage & Read Depth
Understanding Sequencing Depth
Coverage is calculated as total bases sequenced divided by genome size. It indicates how many times, on average, each genomic position has been sequenced.
$$\text{Coverage} = \frac{\text{Total Bases Sequenced}}{\text{Genome Size}}$$
Adequate coverage is essential for confident base calling and successful assembly. Insufficient coverage creates gaps, while excessive coverage wastes resources without proportional quality improvement.
< 10x: Assembly nearly impossible
10-30x: Acceptable for small genomes
50-100x: Gold standard for complex genomes
> 100x: Overkill for most applications
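The coverage formula can be applied directly to plan a sequencing run. The sketch below assumes all reads have the same length and uses the human-genome figures quoted earlier in this deck (3 billion bp, 150 bp reads); the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total bases sequenced divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


HUMAN_GENOME_BP = 3_000_000_000
READ_LENGTH_BP = 150

depth_20m = coverage(20_000_000, READ_LENGTH_BP, HUMAN_GENOME_BP)
reads_for_30x = reads_needed(30, READ_LENGTH_BP, HUMAN_GENOME_BP)

print(f"20 million reads give {depth_20m:.1f}x coverage")
print(f"30x coverage needs {reads_for_30x:,} reads")
```

Average depth says nothing about evenness: some regions will still fall well below the mean, which is one reason assembly targets sit far above 1x.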
Metrics
Key Assembly Metrics
1. N50 Length
The contig length at which 50% of the assembly is contained in contigs of this size or larger. Higher is better: aim for >50 kb for bacterial genomes. This metric balances both contig length and quantity.
2. L50 Count
The smallest number of contigs whose combined length represents 50% of the assembly. Lower is better: fewer contigs indicate a more contiguous assembly with less fragmentation.
3. Total Contig Count
The total number of assembled sequences produced. Fewer is better, since it indicates the assembler successfully joined fragments. Highly fragmented assemblies have thousands of small contigs.
4. BUSCO Score
Percentage of conserved single-copy orthologs found complete in the assembly. >90% is excellent, indicating high completeness. This metric assesses biological completeness (expected gene content), not just sequence contiguity.
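N50 and L50 are easy to confuse, so a worked calculation helps. The sketch below follows the definitions above: sort contigs from longest to shortest and accumulate lengths until half the assembly is reached. The function name and the contig lengths are made up for illustration.

```python
def n50_l50(contig_lengths: list[int]) -> tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the N50; `count` contigs were needed, which is the L50.
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: seven contigs totalling 300 kb.
lengths_kb = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(lengths_kb)
print(f"N50 = {n50} kb, L50 = {l50} contigs")
```

Here the two largest contigs (80 + 70 = 150 kb) already hold half of the 300 kb assembly, so N50 is 70 kb and L50 is 2. Neither number says anything about correctness: a misassembly that wrongly joins contigs raises N50.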
SPAdes Assembler
St. Petersburg Genome Assembler
SPAdes is one of the most widely used de novo assemblers for short-read Illumina data. Developed at St. Petersburg Academic University, it has become a standard choice for bacterial and small eukaryotic genome assembly thanks to its multi-k-mer graph construction, built-in error correction, and consistent performance.
Multi-Kmer Strategy
Automatically builds and combines graphs using multiple k-mer sizes (21, 33, 55, 77) to balance sensitivity and specificity
Error Correction
Aggressive built-in error correction algorithms clean data before assembly, improving graph quality
Paired-End Aware
Intelligently uses paired-end information to resolve ambiguities and create longer scaffolds
Dual Output
Produces both scaffolds (with estimated gaps) and contigs (continuous sequences only)
How SPAdes Works
SPAdes employs a sophisticated multi-stage pipeline. It begins by constructing de Bruijn graphs at multiple k-mer sizes, each capturing different aspects of the genome structure. Small k-mers provide sensitivity for low-coverage regions, while large k-mers span repeats more effectively.
Graph simplification removes erroneous branches caused by sequencing errors and low-complexity sequences. The repeat graph structure explicitly represents repetitive elements. Finally, paired-end relationships connect nearby contigs into scaffolds, with estimated gap sizes represented by Ns in the output sequence.
Choosing K Values
Read Length < 100 bp: Use k = 21, 31, 41 (small k-mers match shorter reads)
Read Length 100-150 bp: Use k = 21, 33, 55, 77 (SPAdes default settings)
Read Length > 150 bp: Use k = 33, 55, 77, 99 (larger k-mers for longer reads)
Coverage Considerations
Low coverage (<30x): Favor smaller k values that require less depth to form confident overlaps. Large k-mers may lack sufficient support at low coverage.
High coverage (>50x): Larger k values become safe and effective, providing better repeat resolution and more specific overlaps.
General rule: k ≈ 1/3 of read length
Multiple k values allow SPAdes to extract maximum information from your data, with each size contributing unique insights about genome structure.
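The read-length guidance above can be encoded as a small lookup. This is a sketch of this deck's table only, not of SPAdes' internal selection logic; `suggest_k_values` and `is_valid_k_list` are hypothetical helpers. The validity check assumes the constraints stated in the SPAdes manual (k values odd, below 128, in ascending order) plus the common-sense requirement that k be shorter than the reads.

```python
def suggest_k_values(read_length: int) -> list[int]:
    """Pick a k-mer list from the read-length tiers in this deck."""
    if read_length < 100:
        candidates = [21, 31, 41]
    elif read_length <= 150:
        candidates = [21, 33, 55, 77]
    else:
        candidates = [33, 55, 77, 99]
    # A k-mer cannot be extracted from a read shorter than k + 1 bases.
    return [k for k in candidates if k < read_length]


def is_valid_k_list(k_values: list[int], read_length: int) -> bool:
    """Check the assumed constraints: odd, < 128, ascending, shorter than the read."""
    return (
        bool(k_values)
        and k_values == sorted(k_values)
        and all(k % 2 == 1 and k < 128 and k < read_length for k in k_values)
    )


for length in (36, 75, 150, 250):
    ks = suggest_k_values(length)
    print(f"{length} bp reads -> -k {','.join(map(str, ks))} (valid: {is_valid_k_list(ks, length)})")
```

At low coverage it may still be worth dropping the largest values from the suggested list, following the coverage considerations above.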
Challenges
Common Assembly Challenges
Highly Repetitive Genomes
Genomes rich in transposable elements or tandem repeats produce fragmented assemblies with many small contigs. The assembler cannot confidently traverse ambiguous repeat regions. Solution: Use long-read sequencing to span repeats; additional short-read coverage alone cannot resolve repeats longer than the reads.
Insufficient Coverage
Low sequencing depth creates gaps where regions lack adequate read support. Assembly fragmentation increases dramatically below 30x coverage. Solution: Sequence additional libraries to increase depth.
Sample Contamination
Non-target DNA introduces foreign sequences into the assembly, creating spurious contigs and inflating genome size estimates. Solution: Screen samples carefully and use contamination detection tools.
High Heterozygosity
In diploid or polyploid organisms, sequence variants between homologous chromosomes confuse assemblers, leading to collapsed or duplicated regions. Solution: Use heterozygosity-aware assemblers or haplotype-specific approaches.
Running SPAdes
Basic Command Structure
SPAdes uses a straightforward command-line interface with intuitive parameters for common scenarios.
spades.py \
  -1 reads_R1.fastq.gz \
  -2 reads_R2.fastq.gz \
  -o output_dir \
  --careful \
  -t 4 \
  -k 21,33,55,77
-1, -2: Forward and reverse paired-end read files
-o: Output directory for results
--careful: Tries to reduce the number of mismatches and short indels in the assembly (adds a post-assembly mismatch correction step)
-t: Number of parallel threads (adjust for your system)
-k: K-mer sizes to use (comma-separated list)
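When SPAdes is launched from a script rather than typed by hand, building the argument list programmatically keeps the parameters documented in one place. The sketch below only assembles the same command shown above, using the flags listed on this slide; `build_spades_command` is a hypothetical helper and the file names are the placeholders from the example.

```python
import shlex


def build_spades_command(
    reads_r1: str,
    reads_r2: str,
    output_dir: str,
    threads: int = 4,
    k_values: tuple[int, ...] = (21, 33, 55, 77),
    careful: bool = True,
) -> list[str]:
    """Return the spades.py argument list for a paired-end run."""
    command = ["spades.py", "-1", reads_r1, "-2", reads_r2, "-o", output_dir]
    if careful:
        command.append("--careful")
    command += ["-t", str(threads), "-k", ",".join(str(k) for k in k_values)]
    return command


command = build_spades_command("reads_R1.fastq.gz", "reads_R2.fastq.gz", "output_dir")
# Print a copy-pasteable version for the lab notebook or log file.
print(shlex.join(command))
# To actually run it (requires SPAdes on PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with unusual file names, and logging the joined string supports the reproducibility goal stressed in the takeaways.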
SPAdes Output Files
scaffolds.fasta (PRIMARY OUTPUT)
Your assembled genome with estimated gaps between contigs represented as Ns. This is the main result file you'll use for downstream analysis and annotation.
contigs.fasta
Continuous sequences without any predicted gaps. Use this when you need only high-confidence assembled sequence without gap estimations.
assembly_graph_with_scaffolds.gfa
Graph structure file for visualization in tools like Bandage. Allows you to inspect the assembly graph structure and identify problematic regions.
spades.log
Detailed execution log documenting each assembly stage, k-mer statistics, and any warnings or errors encountered during the run.
corrected/ directory
Contains error-corrected reads that SPAdes generated before assembly. These cleaned reads can be useful for other analyses.
tmp/ directory
Temporary intermediate files used during assembly. Can be deleted after successful completion to save disk space.
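The relationship between scaffolds and gap-free sequence can be made concrete by splitting scaffolds at their runs of Ns. This is a minimal sketch using only the standard library; the FASTA text and record names are invented, and the pieces it produces are not guaranteed to match the records in contigs.fasta, which SPAdes writes separately.

```python
import re


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {record: "".join(parts) for record, parts in records.items()}


def split_on_gaps(scaffold: str) -> list[str]:
    """Split a scaffold into gap-free pieces at every run of N characters."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Hypothetical scaffolds file content; real SPAdes headers look different.
fasta_text = """>scaffold_1
ACGTACGTNNNNNGGCCTTAA
TTGG
>scaffold_2
GATTACA
"""

scaffolds = read_fasta(fasta_text)
for name, sequence in scaffolds.items():
    pieces = split_on_gaps(sequence)
    gap_bases = sequence.upper().count("N")
    print(f"{name}: {len(sequence)} bp, {gap_bases} gap bases, {len(pieces)} gap-free piece(s)")
```

For real files, pass `open("output_dir/scaffolds.fasta").read()` to `read_fasta`, or use an established parser such as Biopython's `SeqIO`.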
Advanced Topics
Hybrid Assembly, Metagenomics & More
Expanding beyond standard assembly approaches to tackle complex genomic challenges.
Hybrid Assembly: Combining Read Types
Best of Both Worlds
Hybrid assembly combines the complementary strengths of different sequencing technologies. Short reads from Illumina provide high accuracy (>99.9%) but limited ability to span repetitive regions. Long reads from PacBio or Oxford Nanopore can span repeats and resolve structural complexity, but raw long reads have historically had higher error rates (5-15%); newer chemistries such as PacBio HiFi are far more accurate.
SPAdes hybrid mode builds the assembly graph from the accurate short reads, then maps the long reads onto that graph to close gaps and resolve repeats, producing more contiguous assemblies.
spades.py \
  -1 short_R1.fq.gz \
  -2 short_R2.fq.gz \
  --pacbio long_reads.fastq \
  -o hybrid_output
Metagenomic Assembly
Assembling Mixed Communities
Metagenomic assembly reconstructs genomes from environmental samples containing multiple organisms. This presents unique challenges: organisms present at vastly different abundances create highly variable coverage, and similar species can cause cross-contamination between assemblies.
SPAdes metagenomic mode adjusts algorithms to handle heterogeneous coverage and reduces chimeric assemblies. However, post-processing binning remains essential to separate individual organism genomes from the mixed assembly.
spades.py \
  --meta \
  -1 reads_R1.fq.gz \
  -2 reads_R2.fq.gz \
  -o metagenomic_output
Follow up with binning tools like MetaBAT, MaxBin, or CONCOCT to assign contigs to source organisms.
Summary
Key Takeaways
QC is Non-Negotiable
Quality control forms the foundation of reliable NGS analysis. Never skip this critical step.
Assembly is Tractable
Despite complexity, genome assembly becomes manageable with good data and proper tools.
SPAdes is Powerful
SPAdes provides versatile, sophisticated algorithms suitable for diverse assembly challenges.
Always Validate
Assembly quality assessment reveals problems and confirms your results meet standards.
Document Everything
Reproducibility requires thorough documentation of all analysis decisions and parameters.
Next step: Apply these concepts in hands-on practice sessions with real sequencing datasets!
Resources
Essential Tools & Resources
FastQC
Quality control tool for high throughput sequence data
bioinformatics.babraham.ac.uk/projects/fastqc/
MultiQC
Aggregate results from multiple bioinformatics analyses
multiqc.info
SPAdes
Genome assembler for single-cell and standard assemblies
cab.spbu.ru/software/spades/
Trimmomatic
Flexible read trimming tool for Illumina NGS data
usadellab.org/cms/?page=trimmomatic
fastp
Ultra-fast all-in-one FASTQ preprocessor
github.com/OpenGene/fastp