(i) Develop a shared state-of-the-art actinobacterial strain collection and genome database to revitalize natural products discovery
(ii) Serve the broad scientific community by providing strains with curated draft genomes to promote research and development on natural products and associated applications
The Actinobacterial Strain Collection at the Natural Products Discovery Center (NPDC) at UF Scripps contains a total of 122,550 strains. These strains, isolated over the last eight decades from 77 different countries, represent microbial and natural product diversity that is not available anywhere else and would be impossible to reproduce in laboratory settings today. The potential for natural product discovery from the NPDC at UF Scripps is immense. Assuming about 30 biosynthetic gene clusters (BGCs) per strain, the collection's roughly 125,000 strains could encode on the order of 3.75 million BGCs, potentially producing a comparable number of natural products. Compared with the ~20,000 natural products of actinobacterial origin known to date, this leaves millions of compounds to be discovered. Although many strains may produce the same or very similar products, these redundancies are unlikely to fundamentally reduce the total number of novel natural products encoded in the NPDC. The millions of new BGCs will also serve as an unprecedented treasure trove for the discovery of new enzymes and biocatalysts, while enabling a suite of innovative synthetic biology applications.
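The estimate above is simple arithmetic, and a short Python sketch makes the inputs explicit. All numbers are the figures quoted in the text; the function name is illustrative, and the 30 BGCs per strain is the stated assumption, not a measured average.

```python
# Back-of-envelope BGC estimate using only the figures quoted in the text.
BGCS_PER_STRAIN = 30           # assumed average number of BGCs per strain
STRAINS_EXACT = 122_550        # strain count reported for the collection
STRAINS_ROUNDED = 125_000      # rounded figure used in the estimate
KNOWN_PRODUCTS = 20_000        # approximate known actinobacterial natural products


def estimate_bgcs(n_strains, bgcs_per_strain=BGCS_PER_STRAIN):
    """Return the number of BGCs implied by a strain count (hypothetical helper)."""
    return n_strains * bgcs_per_strain


if __name__ == "__main__":
    for label, n in (("rounded", STRAINS_ROUNDED), ("exact", STRAINS_EXACT)):
        total = estimate_bgcs(n)
        print(f"{label}: {total:,} BGCs, {total / KNOWN_PRODUCTS:.1f}x the known products")
```

With the rounded strain count the estimate is 3,750,000 BGCs; with the exact count of 122,550 it is 3,676,500.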
Reads Processing
BBDuk [1] was used to remove contaminants, to trim reads that contained adapter sequence or homopolymer runs of five or more G's at the ends of the reads, and to remove reads that contained one or more 'N' bases or whose length was <= 51 bp or <= 33% of the full read length. Reads that mapped with BBMap [1] to masked human references at 93% identity were separated into a chaff file, as were reads that aligned to masked common microbial contaminants.
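The read-level rules described above can be illustrated with a minimal Python sketch. This is not BBDuk and does not reproduce its options. The function names are hypothetical, quality scores are ignored, and the length rule is read as "discard if length is at most 51 bp or at most 33% of the full read length", which is one interpretation of the text.

```python
import re


def trim_poly_g(seq, min_run=5):
    """Trim a run of at least `min_run` G's from either end of a read."""
    seq = re.sub(r"^G{%d,}" % min_run, "", seq)
    seq = re.sub(r"G{%d,}$" % min_run, "", seq)
    return seq


def keep_read(seq, full_length, min_len=51, min_frac=0.33):
    """Return True if a (trimmed) read passes the N-base and length rules."""
    if "N" in seq.upper():
        return False                      # one or more ambiguous bases
    if len(seq) <= min_len:
        return False                      # too short in absolute terms
    if len(seq) <= min_frac * full_length:
        return False                      # too short relative to the full read
    return True


if __name__ == "__main__":
    read = "GGGGGG" + "ACGT" * 20         # 6 leading G's plus 80 bp
    trimmed = trim_poly_g(read)
    print(len(trimmed), keep_read(trimmed, full_length=150))
```

In the actual pipeline these steps are performed by BBDuk, and reads matching human or common microbial contaminant references are separated by mapping, which this sketch does not attempt.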
Assembly
The following steps were then performed for assembly: (1) artifact-filtered and normalized Illumina reads were assembled with SPAdes (version v3.14.1; --phred-offset 33 --cov-cutoff auto -t 16 -m 64 --careful -k 25,55,95) [2]; (2) contigs shorter than 1 kb were discarded (BBTools reformat.sh: minlength=1000 ow=t).
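As an illustration of step (2), the sketch below applies the same 1 kb length cutoff to FASTA records in plain Python. It stands in for the reformat.sh call only conceptually; the function names and the toy input are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Keep contigs whose length is at least `min_length` bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    toy = [">contig_1", "A" * 600, "C" * 600, ">contig_2", "G" * 999]
    for name, seq in filter_contigs(parse_fasta(toy)):
        print(name, len(seq))
```

Here contig_1 (1,200 bp) is kept and contig_2 (999 bp) is dropped, matching the "<1kb" rule in the text.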
Genomes QC
CheckM [3] was used to calculate the completeness and contamination levels of the genomes. Genomes with >=95% completeness and <=10% contamination were kept; all others were discarded.
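The QC thresholds above amount to a two-condition filter. The sketch below assumes the CheckM completeness and contamination percentages have already been parsed into a dictionary; the genome identifiers and values are invented for illustration and are not NPDC data.

```python
def passes_qc(completeness, contamination,
              min_completeness=95.0, max_contamination=10.0):
    """Return True if a genome meets both QC thresholds (values in percent)."""
    return completeness >= min_completeness and contamination <= max_contamination


def filter_genomes(stats):
    """Keep genome IDs whose (completeness, contamination) pair passes QC."""
    return [gid for gid, (comp, cont) in stats.items() if passes_qc(comp, cont)]


if __name__ == "__main__":
    # Hypothetical example values, not real results.
    example = {
        "genome_A": (99.2, 1.3),
        "genome_B": (94.9, 0.5),   # fails completeness
        "genome_C": (98.0, 10.4),  # fails contamination
    }
    print(filter_genomes(example))
```

Both boundaries are inclusive, so a genome at exactly 95% completeness and 10% contamination is retained.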
Annotations
GTDB-Toolkit [4] was used to assign taxonomy to the genomes. Prokka [5] was used to predict and annotate coding sequences in the genomes, while antiSMASH [6] was used to predict the biosynthetic gene clusters. Finally, BiG-SLiCE [7] was used to calculate BGC families (gene cluster families, GCFs), using an L2-normalized cutoff of 0.5.
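To give a sense of what a distance cutoff on L2-normalized features means, here is a minimal sketch. It is an assumption-laden illustration, not BiG-SLiCE's implementation: it assumes each BGC is represented by a numeric feature vector and that membership is decided by Euclidean distance between unit-length vectors. The function names and vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_cutoff(features_a, features_b, cutoff=0.5):
    """True if the L2-normalized vectors lie within `cutoff` of each other."""
    return euclidean(l2_normalize(features_a), l2_normalize(features_b)) <= cutoff


if __name__ == "__main__":
    print(within_cutoff([3.0, 4.0], [6.0, 8.0]))   # same direction, distance 0
    print(within_cutoff([1.0, 0.0], [0.0, 1.0]))   # orthogonal, distance sqrt(2)
```

After normalization only the direction of a feature vector matters, so distances between unit vectors fall between 0 and 2, and a cutoff of 0.5 groups only fairly similar profiles.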
References:
1. B. Bushnell: BBTools software package (version 38.90), URL https://bbtools.jgi.doe.gov.
2. Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19:455–77.
3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55.
4. Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks, GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database, Bioinformatics, Volume 36, Issue 6, 15 March 2020, Pages 1925–1927.
5. Torsten Seemann, Prokka: rapid prokaryotic genome annotation, Bioinformatics, Volume 30, Issue 14, 15 July 2014, Pages 2068–2069.
6. Kai Blin, Simon Shaw, Katharina Steinke, Rasmus Villebro, Nadine Ziemert, Sang Yup Lee, Marnix H Medema, Tilmann Weber, antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W81–W87.
7. Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154.
Kalkreuter, E.; Pan, G.; Cepeda, A. J.; Shen, B. Targeting bacterial genomes for natural product discovery: opportunities, challenges, and strategies. Trends Pharmacol. Sci. 2020, 41, 13-26.
Steele, A. D.; Teijaro, C. N.; Yang, D.; Shen, B. Leveraging a large microbial strain collection for natural product discovery. J. Biol. Chem. 2019, 294, 16567-16576.