A Bioinformatics Contest · Deadline: 20 March
Your Rewards
What You Can Win
| Tier | Who qualifies | Reward |
| --- | --- | --- |
| 🥇 Explorer | Full genome + comparison + report | Free 30-min 1-on-1 + course acknowledgement + 40% discount |
| 🥈 Builder | Partial pipeline + documentation | 25% discount + personal feedback on your submission |
| 🌱 Curious | Explored repo + documented questions | 10% discount + every question answered personally |
Deadline: 20 March · betweendatadreams.com
Step by Step
How to Make Your Own Genome
Read this in full before you start. The steps connect — skipping ahead will cost you time.
Before You Start
You need the following on your computer: a terminal (command line), Git, and Conda. All three are used in the steps below.
Never used the command line? Spend 30 minutes on a basic terminal tutorial first — it will save you hours later.
1. Find Your Species on NCBI
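NCBI (the US National Center for Biotechnology Information) hosts some of the world's largest public genomic databases, including the Sequence Read Archive (SRA). Your journey starts here.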
Choose your organism:
- Smaller genome = easier. Bacteria, fungi, small eukaryotes are ideal.
- Diplomonads (Spironucleus, Hexamita, Giardia) are perfect — GenoDiplo was built for them.
- Avoid plants and vertebrates for a first attempt — their genomes are too large.
Find raw sequencing data:
- Go to ncbi.nlm.nih.gov/sra
- Search your organism name
- Filter: Library Strategy = WGS | Platform = OXFORD_NANOPORE
- Note the SRR accession number (e.g. SRR12345678)
Do NOT download an assembled genome. You need raw reads in FASTQ format — look for SRA run files, not GenBank assemblies.
conda install -c conda-forge -c bioconda sra-tools
fasterq-dump SRR12345678 --split-files -O ./raw_data/
Large datasets can take 30–60 minutes or longer to download. Start very large ones overnight.
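Once the download finishes, a quick sanity check can confirm that the FASTQ file actually contains reads. The sketch below is one way to do it in plain shell with awk. The function name `fastq_stats` is made up for this guide, and it assumes the standard 4-line FASTQ record layout (header, sequence, "+", qualities).

```shell
# Summarise a FASTQ file: read count, total bases, mean read length.
# Assumption: every record is exactly 4 lines, so the sequence is
# always line 2 of each group of 4.
fastq_stats() (
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END {
         mean = (reads > 0) ? bases / reads : 0
         printf "reads=%.0f bases=%.0f mean_length=%.0f\n", reads, bases, mean
       }' "$1"
)

# Example call (the file name is a placeholder for your own download):
# fastq_stats raw_data/SRR12345678.fastq
```

If the read count is zero or the mean length looks far too short for Nanopore data, re-check the accession and the download before moving on.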
2. Create Your GitHub Repository
- Go to github.com → click + → New repository
- Name it: MyGenome-[YourSpeciesName]
- Set to Public — must be viewable for submission
- Add a README → Create repository
git clone https://github.com/YOUR_USERNAME/MyGenome-YourSpecies.git
cd MyGenome-YourSpecies
git clone https://github.com/zeyak/GenoDiplo.git
Your folder structure should look like this:
MyGenome-YourSpecies/
├── GenoDiplo/
│   ├── workflow/
│   ├── README.md
│   └── environment.yml
└── raw_data/
    └── your_reads.fastq
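To confirm that your folder matches the structure above, a small check like the following may help. It is only a sketch: `check_layout` is a hypothetical helper written for this guide, and it looks for nothing beyond the paths shown in the tree.

```shell
# Usage: check_layout [PROJECT_DIR]   (defaults to the current directory)
# Prints each missing item and returns a non-zero status if anything is absent.
check_layout() (
  root=${1:-.}
  status=0
  [ -d "$root/GenoDiplo/workflow" ] || { echo "missing: GenoDiplo/workflow/"; status=1; }
  [ -f "$root/GenoDiplo/environment.yml" ] || { echo "missing: GenoDiplo/environment.yml"; status=1; }
  # Expect at least one FASTQ file in raw_data/. If the glob matches
  # nothing, $1 stays the literal pattern and the -e test fails.
  set -- "$root"/raw_data/*.fastq
  [ -e "$1" ] || { echo "missing: raw_data/*.fastq"; status=1; }
  [ "$status" -eq 0 ] && echo "layout OK"
  exit "$status"
)

# Example call from inside MyGenome-YourSpecies/:
# check_layout .
```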
3. Install Dependencies and Configure
Create the Conda environment:
cd GenoDiplo
conda env create -f environment.yml
conda activate GenoDiplo
If creating the environment fails, run: conda update -n base conda, then retry.
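After activating the environment, you can check that the tools used in this guide are on your PATH. The helper below is a minimal sketch (`check_tools` is a name invented here). Which tools the GenoDiplo environment installs is not listed in this guide, so pass whichever commands you expect to have.

```shell
# Usage: check_tools TOOL [TOOL...]
# Reports each tool as ok or MISSING; returns non-zero if any is missing.
check_tools() (
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok       $tool"
    else
      echo "MISSING  $tool"
      missing=1
    fi
  done
  exit "$missing"
)

# Example call with commands that appear in this guide:
# check_tools git conda snakemake
```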
Open config.yaml and edit the key fields for your species:
samples:
  your_species:
    reads: "../raw_data/your_reads.fastq"
genome_size: "50m" # search '[species] genome size' on NCBI
min_read_length: 1000 # lower to 500 if too few reads pass
threads: 4 # number of CPU cores on your machine
Search '[species name] genome size' on Google Scholar or the NCBI Genome page to find the right estimate.
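To judge whether min_read_length is too strict, it can help to count how many reads pass the cutoff and what coverage they give. Coverage here is simply the bases kept divided by the genome size. The sketch below makes three assumptions: a 4-line FASTQ layout, a cutoff that keeps reads of exactly the minimum length (the pipeline may treat that edge differently), and a genome size given in base pairs (so "50m" is taken to mean 50000000). `read_filter_summary` is a hypothetical helper, not part of GenoDiplo.

```shell
# Usage: read_filter_summary FASTQ MIN_LENGTH GENOME_SIZE_BP
# Prints how many reads are at least MIN_LENGTH long and the coverage
# those reads provide (bases kept / genome size).
read_filter_summary() (
  awk -v min="$2" -v gsize="$3" '
    NR % 4 == 2 {
      total++
      if (length($0) >= min) { kept++; kept_bases += length($0) }
    }
    END {
      pct = (total > 0) ? 100 * kept / total : 0
      cov = (gsize > 0) ? kept_bases / gsize : 0
      printf "kept=%.0f/%.0f (%.1f%%) coverage=%.1fx\n", kept, total, pct, cov
    }' "$1"
)

# Example call (placeholder file name, 50 Mbp genome):
# read_filter_summary raw_data/your_reads.fastq 1000 50000000
```

Running it once with 1000 and once with 500 shows how much data the lower cutoff would recover.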
4. Run the Pipeline
Do a dry run first to check everything is configured correctly:
snakemake --cores 4 --dry-run
If no errors — run the full pipeline:
snakemake --cores 4
The pipeline will likely break somewhere. This is intentional — it mirrors real research. Read the error, Google it, fix it, try again. Document every error and fix in your report.
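# If a job fails, keep running the jobs that do not depend on it:
snakemake --cores 4 --keep-going
# After fixing the error, rerun and redo any half-finished outputs:
snakemake --cores 4 --rerun-incomplete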
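Since every error belongs in your report, it may be worth saving the output of each run to a file. The wrapper below is one possible sketch: `run_logged` is a hypothetical helper that runs any command, stores stdout and stderr in a timestamped log, prints the log when the command finishes, and passes on the exit status. Output appears only at the end, so for long runs you can follow the log with tail -f in a second terminal.

```shell
# Usage: run_logged LOG_DIR COMMAND [ARGS...]
run_logged() (
  log_dir=$1
  shift
  mkdir -p "$log_dir"
  log="$log_dir/run_$(date +%Y%m%d_%H%M%S).log"
  # Capture stdout and stderr together and remember the exit status.
  status=0
  "$@" > "$log" 2>&1 || status=$?
  cat "$log"
  echo "exit status $status, log saved to $log"
  exit "$status"
)

# Example call:
# run_logged logs snakemake --cores 4
```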
5. Compare to a Reference Genome
Go to ncbi.nlm.nih.gov/genome and find a reference genome for a closely related species. Download in FASTA format, then run QUAST:
conda install -c conda-forge -c bioconda quast
quast.py your_assembly.fasta -r reference_genome.fasta -o quast_output/
Key metrics to report from QUAST:
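- N50: the contig length at which contigs of that length or longer hold at least half of the assembly (higher = more contiguous)
- Total length: total assembly size vs. expected genome size
- # contigs: number of contigs (fewer, longer = better)
- Genome fraction (%): share of the reference covered by your assembly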
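To see what N50 means in practice, you can compute it yourself: sort the contig lengths from longest to shortest and walk down the list until the running total reaches half of the assembly. The sketch below does this with awk and sort. `assembly_summary` is a hypothetical helper for illustration, not a replacement for QUAST.

```shell
# Usage: assembly_summary ASSEMBLY.fasta
# 1) print one length per contig (sequence lines under each ">" header are summed)
# 2) sort lengths, longest first
# 3) accumulate lengths until half of the total is reached; that length is N50
assembly_summary() (
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }' "$1" |
  sort -rn |
  awk '{ l[NR] = $1; total += $1 }
       END {
         for (i = 1; i <= NR; i++) {
           cum += l[i]
           if (cum >= total / 2) { n50 = l[i]; break }
         }
         printf "contigs=%d total_length=%.0f N50=%.0f\n", NR, total, n50
       }'
)

# Example call:
# assembly_summary your_assembly.fasta
```

QUAST by default leaves out contigs shorter than 500 bp, so its numbers can differ slightly from this count of every contig.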
6. Write Your Evaluation Report
Your report doesn't need to be a scientific paper. It needs to be honest and clear. Include:
- Species chosen and why
- NCBI accession number and data source
- Changes made to config.yaml and pipeline
- Errors encountered and how you resolved them
- QUAST comparison results — table or screenshot
- What you would explore or improve next
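If you write the report in Markdown, a skeleton with one section per item above can be a starting point. This is only a suggested layout, not a required template. `new_report` is a name invented for this guide.

```shell
# Usage: new_report FILE SPECIES ACCESSION
# Writes a Markdown skeleton with one heading per report item.
new_report() (
  printf '%s\n' \
    "# Evaluation Report: $2" \
    "" \
    "## Species chosen and why" \
    "" \
    "## Data source" \
    "NCBI accession: $3" \
    "" \
    "## Changes to config.yaml and the pipeline" \
    "" \
    "## Errors encountered and fixes" \
    "" \
    "## QUAST comparison results" \
    "" \
    "## What I would explore or improve next" \
    > "$1"
)

# Example call (species and accession are placeholders):
# new_report report.md "Your species" SRR12345678
```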
7. Submit by 20 March
Push everything to GitHub:
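# GitHub rejects files larger than 100 MB, so keep raw reads out of the commit:
echo "raw_data/" >> .gitignore
git add .
git commit -m "Final genome assembly and report"
git push origin main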
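Before pushing, two checks may save you a failed upload. GitHub blocks files larger than 100 MiB, and raw FASTQ files often exceed that. Also, because GenoDiplo/ was cloned inside your repository, it carries its own .git folder, and git add . records such a folder as an embedded repository instead of adding its files. The helpers below are a sketch with hypothetical names. They only report what they find and change nothing.

```shell
# Usage: find_large_files DIR [LIMIT_MB]   (limit defaults to 100)
# Lists regular files above the limit, skipping Git's own .git folders.
find_large_files() (
  limit_bytes=$(( ${2:-100} * 1024 * 1024 ))
  find "$1" -name .git -prune -o -type f -size +"${limit_bytes}c" -print
)

# Usage: find_nested_repos DIR
# Lists .git entries below DIR other than DIR's own .git folder.
find_nested_repos() (
  find "$1" -name .git ! -path "$1/.git"
)

# Example calls from inside MyGenome-YourSpecies/:
# find_large_files . 100
# find_nested_repos .
```

If a nested repository shows up, your edits inside it (for example to config.yaml) will not appear on GitHub unless you also copy the changed files into your own repository or describe them in the report.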
Then email zeynepakdeniz@betweendatadreams.com with:
- Your GitHub repository link
- Your assembled genome FASTA file (attached)
- Your evaluation report — PDF or Markdown
You did it. The data dream is real.
Go Deeper
Resources
Everything you need — the tool, the science behind it, and the research that inspired this contest.
The Tool
GitHub Repository
GenoDiplo — Genome Analysis Pipeline
The open-source Snakemake pipeline you are using in this contest. Clone it, read the README, adapt it for your species.
github.com/zeyak/GenoDiplo
The Science
PhD Thesis — Uppsala University
Sequencing and Comparative Analyses of Diplomonad Genomes
The doctoral thesis by Zeynep Akdeniz that underpins the GenoDiplo pipeline. Covers the full genomic landscape of diplomonads, the assembly methodology, and comparative analysis across the group. Essential reading to understand why this organism and this pipeline exist.
uu.diva-portal.org/smash/get/diva2:1893897/FULLTEXT01.pdf
Published Paper — Scientific Data, Nature · February 2025
The Expanded Genome of Hexamita inflata, a Free-Living Diplomonad
The peer-reviewed paper describing the first reference genome of Hexamita inflata — a free-living diplomonad with a 142 Mbp genome encoding 79,341 proteins. This is the destination: the kind of work GenoDiplo is designed to produce.
nature.com/articles/s41597-025-04514-x
Databases & Tools
NCBI SRA
Sequence Read Archive — Raw Genomic Data
Find and download raw sequencing reads (FASTQ) for thousands of organisms. Start here for Step 1.
ncbi.nlm.nih.gov/sra
NCBI Genome
Reference Genome Database
Find assembled reference genomes for comparison in Step 5.
ncbi.nlm.nih.gov/genome
Snakemake
Workflow Management System
Full documentation for the Snakemake system that powers GenoDiplo.
snakemake.readthedocs.io
QUAST
Genome Assembly Quality Assessment
Used in Step 5 to compare your assembly to a reference genome.
quast.sourceforge.net
Bioconda
Bioinformatics Software via Conda
All the bioinformatics tools you need — installable with a single conda command.
bioconda.github.io