By Dr. Daniela Vergara
Cannabis sativa has a diploid genome, which means it has two sets of chromosomes, for a total of 20. One set of the chromosomes comes from the mother and the other set from the father.
Eighteen of those chromosomes, which are not related to sex, are called autosomes, and the last two are sex chromosomes. Female plants and monoecious plants (those that have both male and female flowers; please check our post on sex chromosomes) usually have two X chromosomes (XX). Male plants have one X and one Y chromosome (XY) (Baek and Vergara 2025).
However, if we count only the half that comes from one parent (the haploid set), C. sativa has 10 chromosomes: 9 autosomes and 1 sex chromosome (X or Y).
In wild or “feral” cannabis plants, having more than two sets of chromosomes (called polyploidy) is rare, but the industry is developing polyploid varieties. If you’re interested, I can talk more about them.
The Y Chromosome
The Y chromosome in male C. sativa plants is the largest chromosome in the genome, and bigger than the X chromosome. The X chromosome is around 104 million base pairs (Mbp), while the Y is between 131 and 151 Mbp (Baek and Vergara 2025). We have two posts on these sex chromosomes so please check them out!
Mitochondrial and Chloroplast Genomes
Plants, C. sativa included, don’t just have the DNA in the nucleus (nuclear DNA), but also have DNA in two organelles (tiny organs) called the chloroplasts and mitochondria. These two organelles are very important because they are in charge of photosynthesis and of the cell’s energy, respectively. This is the story of our next blog post, so stay tuned!
Genome Sequencing in Cannabis sativa
Multiple C. sativa varieties have been sequenced. The first, sequenced in 2011, were Purple Kush, Finola, and USO31 (van Bakel et al. 2011). Even though we have a lot of information, some parts of the C. sativa genome, like the complete Y chromosome or the X chromosomes from monoecious plants, are still not fully assembled (Baek and Vergara 2025).
Today I’m going to walk you through some of the publicly available information about C. sativa from the governmental repository NCBI.
NCBI
The National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) is a U.S. government resource that provides access to a wide range of biological data, including DNA and protein sequences, scientific literature, and genome information.
I’m going to use some of the terminology we defined in our previous blog post, where we talked about genome sequencing, assembly, and related topics.
If you go to the main website, type “Cannabis sativa”, and hit search, you can browse the genomes and assemblies available. As of today, May 23, 2025, there are 17 assemblies publicly available. You can find them at the following link:
https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=3483

These assemblies represent different varieties that have been sequenced and assembled and are publicly available, for example, Purple Kush, Finola, and the one that I was involved with assembling, Pineapple Banana Bubba Kush.
Several of these genomes have been annotated, including the current reference genome Pink Pepper (Ryu et al. 2024) and the previous reference genome cs10 (Grassa et al. 2021).
This database helps users compare genomes across different C. sativa types, study genetic diversity, and develop tools for breeding, trait selection, and cannabinoid production research.
Pink Pepper
If you want to further explore the current reference assembly, Pink Pepper, you can find it at this link: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_029168945.1/
This is currently the most complete assembly, and the page at that link lists its assembly statistics. Before going through them, let’s define two terms. In genomics:
- A contig is a long stretch of DNA sequence with no gaps. It’s a smooth, uninterrupted piece of the genome.
- A scaffold is an even larger stretch of DNA. It’s made by joining multiple contigs together, often with small gaps in between.
So contigs are like puzzle pieces, and scaffolds are like parts of the puzzle that have been partly put together to show more of the picture.
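To make the puzzle analogy concrete, here is a minimal Python sketch. It is not part of any NCBI tool. It assumes the common FASTA convention that gaps inside a scaffold are written as runs of the letter N, and the function name split_into_contigs and the toy sequence are made up for illustration.

```python
import re


def split_into_contigs(scaffold, min_gap=1):
    """Split a scaffold into its contigs at runs of N (gap) characters.

    Assumption: a gap is a run of at least `min_gap` N's, upper or lower case.
    """
    gap_pattern = re.compile(rf"[Nn]{{{min_gap},}}")
    # Drop empty pieces that appear when the scaffold starts or ends with a gap.
    return [piece for piece in gap_pattern.split(scaffold) if piece]


# Toy scaffold (hypothetical): three contigs joined by two gaps.
toy_scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "ATAT"
contigs = split_into_contigs(toy_scaffold)
print(contigs)  # ['ACGTACGT', 'GGCCTTAA', 'ATAT']
```

The scaffold keeps the contigs in order and records that something sits between them, even though the exact sequence in the gap is unknown.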
Now let’s go through the statistics:
- Genome size: About 770.3 megabase pairs (Mbp). Remember that the C. sativa genome size is estimated to be about 818 Mbp for a female, so Pink Pepper may be missing some base pairs.
- Number of chromosomes: 10, which matches the haploid chromosome number of C. sativa.
- Scaffold N50: 77 Mb means that at least half of the genome is found in scaffold pieces that are 77 million base pairs (Mb) long or longer—this shows a high-quality genome assembly.
- Contig N50: 23.5 Mb means that at least half of all the DNA letters (nucleotides) are in contigs that are 23.5 million base pairs or longer. In other words, at least half of the genome is made up of pieces this size or bigger.
- GC content: 34%, which reflects the proportion of guanine (G) and cytosine (C) in the genome. Therefore, the other 66% is composed of adenine (A) and thymine (T).
- Coverage: 299x, meaning each base in the genome was sequenced, on average, 299 times.
- Assembly level: Chromosome, indicating that the pieces are placed at the chromosome level and most of the chromosomes are assembled. Many of the other assemblies are not at the chromosome level and are highly fragmented.
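If you want to see how a few of these numbers fit together, the following Python sketch may help. It is only an illustration: the n50 function and the toy piece lengths are made up, and the last calculation assumes that the 299x coverage is measured against the 770.3 Mbp assembly size, which the NCBI page does not spell out.

```python
def n50(lengths):
    """Return the N50 of a list of piece lengths (contigs or scaffolds).

    Sort pieces from longest to shortest and add them up; the N50 is the
    length of the piece at which the running sum first reaches half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy piece lengths in Mb (hypothetical, not the Pink Pepper scaffolds).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70, because 80 + 70 = 150 is half of the 300 total

# Values quoted in the statistics above.
assembly_mbp = 770.3
estimated_female_mbp = 818
coverage = 299

# Share of the estimated female genome that the assembly captures.
fraction_assembled = assembly_mbp / estimated_female_mbp
print(f"{fraction_assembled:.1%}")  # 94.2%

# Rough amount of raw sequence implied by the coverage (assumption: coverage
# is relative to the assembly size).
sequenced_gbp = assembly_mbp * coverage / 1000
print(f"{sequenced_gbp:.1f} Gbp")  # 230.3 Gbp
```

So by this rough estimate, the assembly covers about 94% of the expected female genome size.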

Sample Details:
- The genome comes from a C. sativa plant named Pink Pepper, submitted by Kangwon National University in Korea (Ryu et al. 2024).
- The sample ID is SAMN31276239, and the isolate name is KNU-18-1.
Sequencing & Assembly Methods:
- Sequencing was done using Oxford Nanopore, PacBio RSII, and Illumina NovaSeq technologies. Oxford Nanopore and PacBio produce long reads, meaning they sequence large sections of DNA at once, though they may have higher error rates. Illumina, which is one of the most widely used methods, produces shorter reads but with higher accuracy. Using all three technologies together helps produce a more complete and reliable genome assembly.
- The genome was assembled using NextDenovo and SMARTdenovo, bioinformatic tools for piecing together genome assemblies.
Annotation

The annotation section of the NCBI page shows the annotation details for the C. sativa Pink Pepper genome, including the date, annotation name, and the provider, NCBI’s RefSeq database.
Through this annotation we can see that there are 35,194 genes, of which 28,747 are protein-coding genes. Protein-coding genes are those that carry the instructions for making proteins. Non-protein-coding genes are usually those that regulate expression or are involved in other cellular processes.
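As a quick check on those two numbers, here is a small Python calculation. The variable names are made up for illustration, and the breakdown of the remaining genes is not stated in the post, so the code only reports how many there are.

```python
# Gene counts quoted from the Pink Pepper annotation above.
total_genes = 35_194
protein_coding_genes = 28_747

# Genes in the annotation that are not protein-coding.
other_genes = total_genes - protein_coding_genes
protein_coding_share = protein_coding_genes / total_genes

print(f"Protein-coding: {protein_coding_share:.1%}")  # Protein-coding: 81.7%
print(f"Other genes: {other_genes:,}")  # Other genes: 6,447
```

In other words, roughly four out of every five annotated genes are protein-coding.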
Quality

The good news is that this is a high quality assembly, because at least 98.2% of the BUSCO (Benchmarking Universal Single-Copy Orthologs) genes are there. BUSCO genes are those that are found in almost all living things within a certain group—like plants—and usually appear only once in each genome. Because they are so common and consistent, BUSCO genes are used to check if a genome assembly is complete and accurate. If most of these genes are found and complete, it means the genome is high quality.
There is still some work to do (besides assembling the Y chromosome and the X chromosomes of monoecious plants). For example, some of the BUSCO genes appear to be duplicated, so the assembly may contain redundant sequence. And there are other improvements that I will mention below.
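Here is a minimal Python sketch of how a BUSCO-style completeness score is put together from its categories: complete single-copy, complete duplicated, fragmented, and missing. The function name is made up, and the counts below are hypothetical; I chose them only so that the complete share comes out to the 98.2% mentioned above. They are not the real Pink Pepper counts.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Turn BUSCO category counts into percentages of all genes searched."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO gene is required")
    return {
        # "Complete" counts both single-copy and duplicated genes.
        "complete": 100 * (single + duplicated) / total,
        "single": 100 * single / total,
        "duplicated": 100 * duplicated / total,
        "fragmented": 100 * fragmented / total,
        "missing": 100 * missing / total,
    }


# Hypothetical counts for illustration only (NOT the real Pink Pepper values).
summary = busco_percentages(single=950, duplicated=32, fragmented=8, missing=10)
for category, percent in summary.items():
    print(f"{category}: {percent:.1f}%")
```

A high "complete" share suggests that most expected genes were found, while a noticeable "duplicated" share is the kind of signal that points to possible redundancy in an assembly.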
Chromosome Information

The genome contains 10 chromosomes, including one X chromosome so we know this was a female plant.
Each chromosome is listed with:
- A GenBank and RefSeq accession number (unique IDs for the DNA sequences and the annotation).
- Its size in base pairs (bp), ranging from ~51 million to ~92 million bp. As the chromosome table shows, the X chromosome is one of the largest, but some autosomes are larger.
- Its GC content, which as explained above is the percent of guanine (G) and cytosine (C) bases. With that information we can estimate the percent adenine (A) and thymine (T).
The assembly also has 7 unplaced scaffolds, which are bits of DNA sequence that haven’t been assigned to a specific chromosome yet. This is much better than previous assemblies, which had many more unplaced scaffolds. For example, cs10, which was the reference assembly until Pink Pepper appeared, had about 200 unplaced scaffolds. However, this is another place that needs improvement in future genome assemblies.
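The GC content mentioned in the list above is simple to compute from a sequence. Here is a minimal Python sketch. The function name gc_fraction and the toy sequence are made up for illustration, and the sketch assumes that only A, C, G, and T are counted, so gap characters such as N are ignored.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C among the A, C, G, T bases of a sequence."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc_count = sum(base in "GC" for base in bases)
    return gc_count / len(bases)


# Toy sequence (hypothetical): 10 countable bases, 2 of them G or C.
toy_sequence = "ATATGCNNATAT"
gc = gc_fraction(toy_sequence)
print(f"GC: {gc:.0%}, AT: {1 - gc:.0%}")  # GC: 20%, AT: 80%

# The same subtraction for the genome-wide value quoted for Pink Pepper.
pink_pepper_gc_percent = 34
pink_pepper_at_percent = 100 - pink_pepper_gc_percent
print(pink_pepper_at_percent)  # 66
```

This is why knowing the GC percentage is enough to estimate the AT percentage: the two add up to 100%.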
I hope you enjoyed this post about the C. sativa genome and the publicly available genome assemblies. I know it was a bit technical, but it’s helpful to understand where some of your tax dollars go when it comes to research. NCBI plays a key role in supporting genomic work by hosting databases where this kind of information is stored. When researchers publish studies using genome data, they usually have to submit it to public databases. This not only keeps things transparent but also supports collaboration, allowing others—sometimes from across the world—to build on that data for future discoveries.
References
Baek, Y. and D. Vergara. 2025. A review of sexual strategies in Cannabis sativa L. under genomic and environmental controls. Agrosystems, Geosciences & Environment 8:e70050.
Grassa, C. J., G. D. Weiblen, J. P. Wenger, C. Dabney, S. G. Poplawski, S. Timothy Motley, T. P. Michael, and C. Schwartz. 2021. A new Cannabis genome assembly associates elevated cannabidiol (CBD) with hemp introgressed into marijuana. New Phytologist 230:1665-1679.
Ryu, B.-R., G.-J. Gim, Y.-R. Shin, M.-J. Kang, M.-J. Kim, T.-H. Kwon, Y.-S. Lim, S.-H. Park, and J.-D. Lim. 2024. Chromosome-level Haploid Assembly of Cannabis sativa L. cv. Pink Pepper. Scientific Data 11:1442.
van Bakel, H., J. M. Stout, A. G. Cote, C. M. Tallon, A. G. Sharpe, T. R. Hughes, and J. E. Page. 2011. The draft genome and transcriptome of Cannabis sativa. Genome Biology 12.


