A Bioinformatics Contest · Deadline: 20 March

Your Rewards

What You Can Win

Tier	Who qualifies	Reward
🥇 Explorer	Full genome + comparison + report	Free 30-min 1-on-1 + course acknowledgement + 40% discount
🥈 Builder	Partial pipeline + documentation	25% discount + personal feedback on your submission
🌱 Curious	Explored repo + documented questions	10% discount + every question answered personally

Deadline: 20 March · betweendatadreams.com

Step by Step

How to Make Your Own Genome

Read this in full before you start. The steps connect — skipping ahead will cost you time.

Before You Start

You need the following on your computer:

Linux, macOS, or Windows with WSL2
Conda (Miniconda or Anaconda) — docs.conda.io
Git — git-scm.com
A free GitHub account — github.com
Stable internet connection

Never used the command line? Spend 30 minutes on a basic terminal tutorial first — it will save you hours later.

Find Your Species on NCBI

NCBI is the world's largest public genomic database. Your journey starts here.

Choose your organism:

Smaller genome = easier. Bacteria, fungi, and other small eukaryotes are ideal.
Diplomonads (Spironucleus, Hexamita, Giardia) are perfect — GenoDiplo was built for them.
Avoid plants and vertebrates for a first attempt — their genomes are too large.

Find raw sequencing data:

Go to ncbi.nlm.nih.gov/sra
Search for your organism's name
Filter: Library Strategy = WGS | Platform = OXFORD_NANOPORE
Note the SRR accession number (e.g. SRR12345678)

Do NOT download an assembled genome. You need raw reads in FASTQ format — look for SRA run files, not GenBank assemblies.

conda install -c conda-forge -c bioconda sra-tools
fasterq-dump SRR12345678 --split-files -O ./raw_data/

Large datasets can take 30–60 minutes or longer to download. If yours is large, start the download in the evening and let it run overnight.
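Once the download finishes, it can help to confirm that the FASTQ file is complete and holds enough data before moving on. The sketch below is not part of GenoDiplo or sra-tools. `fastq_stats` and `coverage` are hypothetical helper names, and the code assumes an uncompressed FASTQ file with the standard four lines per read.

```shell
# Summarise a FASTQ file: number of reads, total bases, mean read length.
# Assumes the standard 4-line FASTQ layout (the sequence is line 2 of each record).
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END {
             if (reads == 0) { print "reads=0 bases=0 mean_length=0.0"; exit }
             printf "reads=%.0f bases=%.0f mean_length=%.1f\n", reads, bases, bases / reads
         }' "$1"
}

# Rough depth of coverage = total bases in the reads / expected genome size in bases.
coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

# Example use (paths are placeholders):
#   fastq_stats raw_data/your_reads.fastq
#   coverage 5000000000 50000000
```

Dividing the total bases by the expected genome size gives a rough depth of coverage. For instance, 5,000,000,000 bases of reads over a 50 Mbp genome is about 100x. If the read count is far lower than the number NCBI lists for the run, the download may have been interrupted.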

Create Your GitHub Repository

Go to github.com → click + → New repository
Name it: MyGenome-[YourSpeciesName]
Set to Public — must be viewable for submission
Add a README → Create repository

git clone https://github.com/YOUR_USERNAME/MyGenome-YourSpecies.git
cd MyGenome-YourSpecies
git clone https://github.com/zeyak/GenoDiplo.git

Move your downloaded reads into a raw_data/ folder inside the repository. Your folder structure should then look like this:

MyGenome-YourSpecies/
├── GenoDiplo/
│   ├── workflow/
│   ├── README.md
│   └── environment.yml
└── raw_data/
    └── your_reads.fastq

Install Dependencies and Configure

Create the Conda environment:

cd GenoDiplo
conda env create -f environment.yml
conda activate GenoDiplo

If conda env create fails, run conda update conda, then retry.

Open config.yaml and edit the key fields for your species:

samples:
  your_species:
    reads: "../raw_data/your_reads.fastq"

genome_size: "50m"     # search '[species] genome size' on NCBI
min_read_length: 1000  # lower to 500 if too few reads pass
threads: 4             # number of CPU cores on your machine

Search '[species name] genome size' on Google Scholar or the NCBI Genome page to find the right estimate.
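Two of the config values above can be worked out from the command line. The sketch below uses hypothetical helper names (`cpu_count`, `size_to_bases`) that are not part of GenoDiplo. It assumes the k, m, and g suffixes in genome_size mean thousand, million, and billion bases, which is a common convention but is not confirmed here for this pipeline.

```shell
# Print the number of CPU cores available, for the "threads" field and --cores.
# Tries getconf first, then nproc (Linux), then sysctl (macOS).
cpu_count() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || sysctl -n hw.ncpu
}

# Convert a genome size such as "50m" or "1.2g" into a plain number of bases.
# Assumption: k, m, g mean thousand, million, and billion bases.
size_to_bases() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))
        mult = 1
        if (unit == "k") mult = 1000
        else if (unit == "m") mult = 1000000
        else if (unit == "g") mult = 1000000000
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}

# Example use:
#   cpu_count
#   size_to_bases 50m
```

Under that assumption, `size_to_bases 50m` prints 50000000, which is the number to divide your total read bases by when estimating coverage.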

Run the Pipeline

Do a dry run first to check everything is configured correctly:

snakemake --cores 4 --dry-run

If no errors — run the full pipeline:

snakemake --cores 4

The pipeline will likely break somewhere. This is intentional — it mirrors real research. Read the error, Google it, fix it, try again. Document every error and fix in your report.

# If a job fails, keep running the jobs that do not depend on it:
snakemake --cores 4 --keep-going
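Because every error and fix belongs in your report, it can help to capture the output of each run in a file. The sketch below assumes a hypothetical helper named `run_logged` (not part of Snakemake or GenoDiplo). It runs any command, appends the output to a timestamped file under logs/, and returns the command's own exit status.

```shell
# Run any command, append its output to a timestamped log, and keep the exit status.
run_logged() {
    mkdir -p logs
    log="logs/run_$(date +%Y%m%d_%H%M%S).log"
    # Record the command line itself so the log is self-explanatory.
    printf '%s\n' "\$ $*" >> "$log"
    "$@" >> "$log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "Command failed with exit status $status; last lines of $log:" >&2
        tail -n 20 "$log" >&2
    fi
    return "$status"
}

# Example use:
#   run_logged snakemake --cores 4
```

Output goes to the log rather than the screen, so to watch a long run you would follow the file from a second terminal with `tail -f`. The log files can then be quoted directly in the errors section of your report.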

Compare to a Reference Genome

Go to ncbi.nlm.nih.gov/genome and find a reference genome for a closely related species. Download in FASTA format, then run QUAST:

conda install -c conda-forge -c bioconda quast
quast.py your_assembly.fasta -r reference_genome.fasta -o quast_output/

Key metrics to report from QUAST:

N50 — assembly contiguity (higher = better)
Total assembly size vs. expected genome size
Number of contigs (fewer, longer = better)
% of reference covered by your assembly (QUAST reports this as "Genome fraction (%)")
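To make the N50 metric concrete, here is a minimal sketch that computes it directly from a FASTA file. This is not how QUAST is implemented, and `contig_lengths` and `n50` are hypothetical helper names. Sort the contigs from longest to shortest, add up their lengths, and N50 is the length of the contig at which the running total first reaches half of the assembly.

```shell
# Print contig lengths from a FASTA file, one per line (handles wrapped sequences).
contig_lengths() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1"
}

# N50: the contig length at which the longest contigs together
# first cover at least half of the total assembly length.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { lengths[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += lengths[i]
                if (running * 2 >= total) { print lengths[i]; exit }
            }
        }'
}

# Example use (file name is a placeholder):
#   n50 your_assembly.fasta
```

For contigs of 8, 6, 4, and 2 bases the total is 20. The longest contig alone covers 8, the two longest cover 14, which passes the halfway mark of 10, so N50 is 6. QUAST ignores contigs shorter than 500 bp by default, so its figure can differ slightly from this one.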

Write Your Evaluation Report

Your report doesn't need to be a scientific paper. It needs to be honest and clear. Include:

Species chosen and why
NCBI accession number and data source
Changes made to config.yaml and pipeline
Errors encountered and how you resolved them
QUAST comparison results — table or screenshot
What you would explore or improve next

Submit by 20 March

Push everything to GitHub:

git add .
git commit -m "Final genome assembly and report"
git push origin main
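Two things commonly go wrong at this step, and both can be checked before running git add. GitHub rejects any file larger than 100 MiB, which raw FASTQ files often are. Git also records a repository cloned inside another repository as a pointer only, so its files are not uploaded. The sketch below uses hypothetical helper names (`large_files`, `nested_repos`, `ignore_path`) and makes no assumption about how the contest expects these cases to be handled.

```shell
# List files larger than a size limit (default 100 MiB), ignoring Git's own data.
large_files() {
    dir=${1:-.}
    limit_bytes=${2:-104857600}
    find "$dir" -type f -size +"${limit_bytes}c" ! -path '*/.git/*'
}

# List Git repositories nested inside the directory (for example a cloned tool).
nested_repos() {
    dir=${1:-.}
    find "$dir" -name .git ! -path "$dir/.git" | sed 's|/\.git$||'
}

# Add a pattern to .gitignore in the current directory unless it is already there.
ignore_path() {
    touch .gitignore
    grep -qxF "$1" .gitignore || printf '%s\n' "$1" >> .gitignore
}

# Example use, run from the top of your repository:
#   large_files .
#   nested_repos .
#   ignore_path raw_data/
```

If `large_files` lists your reads, ignoring raw_data/ keeps them out of the commit. The SRR accession in your report is enough for anyone to download them again. If `nested_repos` lists GenoDiplo, your config changes inside it will not reach GitHub through a plain git add, so check the repository page after pushing to confirm the files you expect are there.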

Then email zeynepakdeniz@betweendatadreams.com with:

Your GitHub repository link
Your assembled genome FASTA file (attached)
Your evaluation report — PDF or Markdown

You did it. The data dream is real.

Go Deeper

Resources

Everything you need — the tool, the science behind it, and the research that inspired this contest.

The Tool

GitHub Repository

GenoDiplo — Genome Analysis Pipeline

The open-source Snakemake pipeline you are using in this contest. Clone it, read the README, adapt it for your species.

github.com/zeyak/GenoDiplo

The Science

PhD Thesis — Uppsala University

Sequencing and Comparative Analyses of Diplomonad Genomes

The doctoral thesis by Zeynep Akdeniz that underpins the GenoDiplo pipeline. Covers the full genomic landscape of diplomonads, the assembly methodology, and comparative analysis across the group. Essential reading to understand why this organism and this pipeline exist.

uu.diva-portal.org/smash/get/diva2:1893897/FULLTEXT01.pdf

Published Paper — Scientific Data, Nature · February 2025

The Expanded Genome of Hexamita inflata, a Free-Living Diplomonad

The peer-reviewed paper describing the first reference genome of Hexamita inflata — a free-living diplomonad with a 142 Mbp genome encoding 79,341 proteins. This is the destination: the kind of work GenoDiplo is designed to produce.

nature.com/articles/s41597-025-04514-x

Databases & Tools

NCBI SRA

Sequence Read Archive — Raw Genomic Data

Find and download raw sequencing reads (FASTQ) for thousands of organisms. Start here for Step 1.

ncbi.nlm.nih.gov/sra

NCBI Genome

Reference Genome Database

Find assembled reference genomes for comparison in Step 5.

ncbi.nlm.nih.gov/genome

Snakemake

Workflow Management System

Full documentation for the Snakemake system that powers GenoDiplo.

snakemake.readthedocs.io

QUAST

Genome Assembly Quality Assessment

Used in Step 5 to compare your assembly to a reference genome.

quast.sourceforge.net

Bioconda

Bioinformatics Software via Conda

All the bioinformatics tools you need — installable with a single conda command.

bioconda.github.io