Between Data Dreams
A Bioinformatics Contest  ·  Deadline: 20 March
Your Rewards

What You Can Win

Tier | Who qualifies | Reward
🥇 Explorer | Full genome + comparison + report | Free 30-min 1-on-1 + course acknowledgement + 40% discount
🥈 Builder | Partial pipeline + documentation | 25% discount + personal feedback on your submission
🌱 Curious | Explored repo + documented questions | 10% discount + every question answered personally
Deadline: 20 March  ·  betweendatadreams.com

Step by Step

How to Make Your Own Genome

Read this in full before you start. The steps connect — skipping ahead will cost you time.

Before You Start

You need the following on your computer:

Never used the command line? Spend 30 minutes on a basic terminal tutorial first — it will save you hours later.

Step 1: Find Your Species on NCBI

NCBI is the world's largest public genomic database. Your journey starts here.

Choose your organism:

Find raw sequencing data:

Do NOT download an assembled genome. You need raw reads in FASTQ format — look for SRA run files, not GenBank assemblies.
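# Install the SRA Toolkit (Bioconda packages also need the conda-forge channel)
conda install -c conda-forge -c bioconda sra-tools
# Download the run as FASTQ; replace SRR12345678 with your own run accession
fasterq-dump SRR12345678 --split-files -O ./raw_data/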
Large datasets can take 30–60 minutes to download. Start this overnight if needed.
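
Once the download finishes, it helps to confirm that the file really contains reads before you go further. The sketch below counts reads and total bases in a FASTQ file. The function name fastq_stats is made up for this guide, and it assumes an uncompressed file in which every record spans exactly four lines.

```bash
# fastq_stats FILE: print "<reads> <bases>" for an uncompressed FASTQ file.
# Assumes each record is exactly four lines: header, sequence, '+', quality.
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "%.0f %.0f\n", reads, bases }' "$1"
}
```

For example, fastq_stats raw_data/your_reads.fastq prints two numbers: the read count and the total number of bases. Keep the second number, since it is useful when you estimate coverage later.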

Step 2: Create Your GitHub Repository
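# Clone your own repository (create it on github.com first)
git clone https://github.com/YOUR_USERNAME/MyGenome-YourSpecies.git
cd MyGenome-YourSpecies
# Clone the GenoDiplo pipeline inside it
git clone https://github.com/zeyak/GenoDiplo.git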

Your folder structure should look like this:

MyGenome-YourSpecies/
├── GenoDiplo/
│   ├── workflow/
│   ├── README.md
│   └── environment.yml
└── raw_data/
    └── your_reads.fastq
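
A quick way to compare your folder against the tree above is a small check like this one. It is only a sketch: check_layout is a hypothetical helper, and the list of paths is copied from the tree shown here, so adjust it if the GenoDiplo repository changes.

```bash
# check_layout DIR: report which expected paths are missing under DIR.
# Returns 0 when everything is present, 1 otherwise.
check_layout() {
    missing=0
    for path in GenoDiplo/workflow GenoDiplo/README.md GenoDiplo/environment.yml raw_data; do
        if [ ! -e "$1/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it from inside your repository as check_layout . and fix anything it reports before moving on.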

Step 3: Install Dependencies and Configure

Create the Conda environment:

cd GenoDiplo
conda env create -f environment.yml
conda activate GenoDiplo
If creating the environment fails, update Conda (conda update conda) and retry.
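
After activating the environment, you may want to confirm that the commands you are about to use are actually on your PATH. The helper below is a minimal sketch; missing_tools is a name invented for this guide, and the tool names you pass to it are up to you.

```bash
# missing_tools NAME...: print each command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, missing_tools git conda snakemake prints nothing when all three are available and lists the ones that are not.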

Open config.yaml and edit the key fields for your species:

samples:
  your_species:
    reads: "../raw_data/your_reads.fastq"

genome_size: "50m"     # search '[species] genome size' on NCBI
min_read_length: 1000  # lower to 500 if too few reads pass
threads: 4             # number of CPU cores on your machine
Search '[species name] genome size' on Google Scholar or the NCBI Genome page to find the right estimate.
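
Two quick numbers can help you choose these values: how many reads would survive a given min_read_length, and roughly how many times your reads cover the genome (total bases divided by genome size). The sketch below computes both. The function names are hypothetical, the FASTQ is assumed to be uncompressed with four lines per record, and it assumes the pipeline keeps reads whose length is at least min_read_length, which you should confirm in the GenoDiplo README.

```bash
# reads_at_least FILE MIN: count reads in a four-line-per-record FASTQ
# whose sequence is at least MIN bases long.
reads_at_least() {
    awk -v min="$2" 'NR % 4 == 2 && length($0) >= min { n++ }
                     END { printf "%.0f\n", n }' "$1"
}

# expected_coverage BASES GENOME_SIZE: total bases divided by genome size.
# GENOME_SIZE may end in k, m or g (for example 50m = 50,000,000).
expected_coverage() {
    awk -v bases="$1" -v size="$2" 'BEGIN {
        unit = tolower(substr(size, length(size)))
        mult = 1
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        printf "%.1f\n", bases / (size * mult)
    }'
}
```

For example, expected_coverage 2500000000 50m prints 50.0, meaning 2.5 billion bases of reads cover a 50 Mbp genome about fifty times. Compare reads_at_least ../raw_data/your_reads.fastq 1000 with the same call at 500 to see how much lowering the threshold would gain you.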

Step 4: Run the Pipeline

Do a dry run first to check everything is configured correctly:

snakemake --cores 4 --dry-run

If no errors — run the full pipeline:

snakemake --cores 4
The pipeline will likely break somewhere. This is intentional — it mirrors real research. Read the error, Google it, fix it, try again. Document every error and fix in your report.
# If a job fails, keep running the jobs that do not depend on it:
snakemake --cores 4 --keep-going
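
Since you are asked to document every error and fix, it can help to capture each run in a log file rather than copying messages from the terminal. This is one possible sketch, not part of GenoDiplo: run_logged is a hypothetical wrapper that appends the command, its output, and its exit status to a file of your choice.

```bash
# run_logged LOGFILE COMMAND...: append the command line, its combined
# output, and its exit status to LOGFILE, then return that exit status.
run_logged() {
    log_file=$1
    shift
    printf '=== %s ===\n' "$*" >> "$log_file"
    cmd_status=0
    "$@" >> "$log_file" 2>&1 || cmd_status=$?
    printf 'exit status: %d\n' "$cmd_status" >> "$log_file"
    return "$cmd_status"
}
```

For example, run_logged pipeline.log snakemake --cores 4 leaves a record in pipeline.log that you can quote in your report. The output goes only to the file, so open it (or follow it with tail -f) to watch progress.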

Step 5: Compare to a Reference Genome

Go to ncbi.nlm.nih.gov/genome and find a reference genome for a closely related species. Download in FASTA format, then run QUAST:

conda install -c conda-forge -c bioconda quast
quast.py your_assembly.fasta -r reference_genome.fasta -o quast_output/
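
QUAST reports N50 among its statistics: the contig length at which contigs of that length or longer contain at least half of the assembled bases. To see how that number comes about, here is a small sketch that computes it directly from a FASTA file. The function name n50 is invented for this guide. Its result can differ from QUAST's, because QUAST by default ignores contigs shorter than 500 bp and this sketch counts every sequence.

```bash
# n50 FILE: print the N50 of the sequences in a FASTA file.
n50() {
    # First pass: one length per sequence (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second pass: walk from longest to shortest until half the total is reached.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

For contigs of 8, 6, 4 and 2 bases (20 in total), the two longest together reach 14, which passes the halfway mark of 10, so the N50 is 6.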

Key metrics to report from QUAST:

Step 6: Write Your Evaluation Report

Your report doesn't need to be a scientific paper. It needs to be honest and clear. Include:

Step 7: Submit by 20 March

Push everything to GitHub:

git add .
git commit -m "Final genome assembly and report"
git push origin main
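
If the push is rejected, a likely cause is file size: GitHub blocks files larger than 100 MiB, and raw FASTQ files may well exceed that. The sketch below lists oversized files so you can spot them before committing. large_files is a hypothetical helper, and the default threshold of 104857600 bytes is 100 MiB.

```bash
# large_files DIR [BYTES]: list files under DIR larger than BYTES
# (default 104857600, i.e. 100 MiB), skipping the .git directory.
large_files() {
    find "$1" -path "$1/.git" -prune -o -type f -size +"${2:-104857600}"c -print
}
```

Run large_files . from the top of your repository. One option for anything it lists is to add the path to .gitignore and describe in your report where the data came from.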

Then email zeynepakdeniz@betweendatadreams.com with:

You did it. The data dream is real.

Go Deeper

Resources

Everything you need — the tool, the science behind it, and the research that inspired this contest.

The Tool

GitHub Repository
GenoDiplo: Genome Analysis Pipeline
The open-source Snakemake pipeline you are using in this contest. Clone it, read the README, adapt it for your species.
github.com/zeyak/GenoDiplo

The Science

PhD Thesis, Uppsala University
Sequencing and Comparative Analyses of Diplomonad Genomes
The doctoral thesis by Zeynep Akdeniz that underpins the GenoDiplo pipeline. Covers the full genomic landscape of diplomonads, the assembly methodology, and comparative analysis across the group. Essential reading to understand why this organism and this pipeline exist.
uu.diva-portal.org/smash/get/diva2:1893897/FULLTEXT01.pdf

Published Paper: Scientific Data, Nature · February 2025
The Expanded Genome of Hexamita inflata, a Free-Living Diplomonad
The peer-reviewed paper describing the first reference genome of Hexamita inflata, a free-living diplomonad with a 142 Mbp genome encoding 79,341 proteins. This is the destination: the kind of work GenoDiplo is designed to produce.
nature.com/articles/s41597-025-04514-x

Databases & Tools

NCBI SRA
Sequence Read Archive: Raw Genomic Data
Find and download raw sequencing reads (FASTQ) for thousands of organisms. Start here for Step 1.
ncbi.nlm.nih.gov/sra

NCBI Genome
Reference Genome Database
Find assembled reference genomes for comparison in Step 5.
ncbi.nlm.nih.gov/genome

Snakemake
Workflow Management System
Full documentation for the Snakemake system that powers GenoDiplo.
snakemake.readthedocs.io

QUAST
Genome Assembly Quality Assessment
Used in Step 5 to compare your assembly to a reference genome.
quast.sourceforge.net

Bioconda
Bioinformatics Software via Conda
All the bioinformatics tools you need, installable with a single conda command.
bioconda.github.io