(i) Develop a shared state-of-the-art actinobacterial strain collection and genome database to revitalize natural products discovery
(ii) Serve the broad scientific community to promote research and development on natural products and associated applications
The Actinobacterial Strain Collection at the Natural Products Discovery Center (NPDC) at UF Scripps currently contains a total of 125,127 strains, of which at least 62,328 are actinobacteria, 14,465 are other bacteria, and 48,334 are unidentified. These strains were isolated over the last eight decades, capturing natural product diversity that cannot be reproduced today because it reflects evolution and changes in environmental cues over that period. The 62,328 actinobacteria in the collection span at least 88 genera and were isolated from 69 countries with different climates and ecological factors, which should further increase the chemical diversity of the natural products waiting to be discovered. The potential for natural product discovery from the NPDC at UF Scripps is immense. Assuming about 30 biosynthetic gene clusters (BGCs) per strain, the collection's roughly 125,000 strains could encode more than 3.75 million BGCs, potentially producing more than 3.75 million natural products. Compared with the roughly 40,000 known natural products isolated from bacteria, this leaves millions of compounds to be discovered. Although many strains may produce the same or very similar products, this redundancy is unlikely to fundamentally reduce the total number of novel natural products encoded in the NPDC. The millions of new BGCs will also serve as an unprecedented treasure trove for the discovery of new enzymes and biocatalysts, while enabling a suite of innovative synthetic biology applications.
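The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the counts and the 30-BGCs-per-strain assumption come from the paragraph above, while the variable names are illustrative.

```python
# Strain counts as reported in the text above.
actinobacteria = 62_328
other_bacteria = 14_465
unidentified = 48_334
total_strains = actinobacteria + other_bacteria + unidentified

# Assumption stated in the text: roughly 30 BGCs per strain,
# applied to the collection size rounded to 125,000 strains.
BGCS_PER_STRAIN = 30
ROUNDED_STRAINS = 125_000
estimated_bgcs = ROUNDED_STRAINS * BGCS_PER_STRAIN

# Approximate number of known bacterial natural products quoted in the text.
known_bacterial_products = 40_000

print(total_strains)                               # 125127
print(estimated_bgcs)                              # 3750000
print(estimated_bgcs / known_bacterial_products)   # 93.75
```

Under these assumptions the estimated BGC count is roughly two orders of magnitude larger than the number of known bacterial natural products, which is why some redundancy between strains does not change the overall picture.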
BBDuk was used to remove contaminants, trim reads containing adapter sequence, trim homopolymers of five or more G's at the ends of reads, and remove reads that contained one or more 'N' bases or that had a length <= 51 bp or <= 33% of the full read length. Reads that mapped with BBMap to masked human references at 93% identity were separated into a chaff file, as were reads that aligned to masked common microbial contaminants.
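The read-level rules described above can be illustrated with a small Python sketch. It is not BBDuk's implementation and does not reproduce its options; the function names are hypothetical, and the reading of the length rule (discard reads at or below 51 bp, or at or below 33% of the full read length) is an assumption.

```python
def trim_poly_g(seq: str, min_run: int = 5) -> str:
    """Trim runs of G at either end of a read if they are at least min_run long."""
    tail = len(seq) - len(seq.rstrip("G"))
    if tail >= min_run:
        seq = seq[: len(seq) - tail]
    head = len(seq) - len(seq.lstrip("G"))
    if head >= min_run:
        seq = seq[head:]
    return seq


def keep_read(seq: str, full_length: int,
              min_len: int = 51, min_fraction: float = 0.33) -> bool:
    """Return True if a (trimmed) read passes the filters described in the text."""
    if "N" in seq.upper():                      # one or more ambiguous bases
        return False
    if len(seq) <= min_len:                     # too short in absolute terms
        return False
    if len(seq) <= min_fraction * full_length:  # too short relative to the full read
        return False
    return True


# Illustrative 150 bp read with a poly-G tail of 6 bases.
read = "ACGT" * 36 + "GGGGGG"
trimmed = trim_poly_g(read)
print(len(trimmed), keep_read(trimmed, full_length=150))  # 144 True
```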
The following steps were then performed for assembly: (1) artifact-filtered and normalized Illumina reads were assembled with SPAdes (v3.14.1; --phred-offset 33 --cov-cutoff auto -t 16 -m 64 --careful -k 25,55,95); (2) contigs shorter than 1 kb were discarded (BBTools reformat.sh: minlength=1000 ow=t).
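The contig length filter in step (2) amounts to keeping only sequences of at least 1,000 bp. A minimal pure-Python sketch of that idea follows; it is not how reformat.sh is implemented, and the helper names and example contigs are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records: Iterable[Tuple[str, str]],
                   min_length: int = 1000) -> List[Tuple[str, str]]:
    """Keep contigs whose length is at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Two made-up contigs: one of 1,200 bp (split over two lines) and one of 999 bp.
example = [">contig_1", "A" * 600, "C" * 600, ">contig_2", "G" * 999]
kept = filter_contigs(read_fasta(example))
print([name for name, _ in kept])  # ['contig_1']
```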
CheckM was used to calculate the completeness and contamination of each genome. Genomes with >=95% completeness and <=10% contamination were kept; all others were discarded.
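The quality gate described above can be expressed as a simple predicate. The sketch below assumes that completeness and contamination have already been extracted from CheckM's output as percentages; the record type, function name, and genome identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GenomeQuality:
    genome_id: str
    completeness: float   # percent, as estimated by CheckM
    contamination: float  # percent, as estimated by CheckM


def passes_quality(q: GenomeQuality,
                   min_completeness: float = 95.0,
                   max_contamination: float = 10.0) -> bool:
    """Apply the thresholds from the text: >=95% complete and <=10% contaminated."""
    return (q.completeness >= min_completeness
            and q.contamination <= max_contamination)


# Made-up values for illustration only.
genomes = [
    GenomeQuality("genome_A", 99.2, 1.4),
    GenomeQuality("genome_B", 94.9, 0.5),   # fails on completeness
    GenomeQuality("genome_C", 98.0, 12.3),  # fails on contamination
]
kept_ids = [g.genome_id for g in genomes if passes_quality(g)]
print(kept_ids)  # ['genome_A']
```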
GTDB-Toolkit was used to annotate the taxonomy of genomes. Prokka was used to predict and annotate coding sequences in the genomes, while antiSMASH was used to predict the biosynthetic gene clusters. Finally, BiG-SLiCE was used to group BGCs into BGC families (GCFs), using an l2-normalized cutoff of 0.5.
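To give a rough sense of what a cutoff on l2-normalized features means, the sketch below normalizes two feature vectors to unit length and compares their Euclidean distance with 0.5. This is an illustration only: it assumes the cutoff is applied to the Euclidean distance between l2-normalized vectors, it does not reproduce BiG-SLiCE's feature extraction or clustering, and all names and values are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_cutoff(features: Sequence[float], centroid: Sequence[float],
                  cutoff: float = 0.5) -> bool:
    """True if the normalized BGC features lie within the cutoff of the normalized centroid."""
    return euclidean(l2_normalize(features), l2_normalize(centroid)) <= cutoff


# Vectors pointing in the same direction have distance 0 after normalization;
# orthogonal vectors have distance sqrt(2), which exceeds the 0.5 cutoff.
print(within_cutoff([3.0, 4.0], [6.0, 8.0]))  # True
print(within_cutoff([1.0, 0.0], [0.0, 1.0]))  # False
```

After normalization, distances between vectors range from 0 to 2, so a cutoff of 0.5 admits only BGCs whose feature profiles point in nearly the same direction as the family they are compared against.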
1. B. Bushnell: BBTools software package (version 38.90), URL https://bbtools.jgi.doe.gov.
2. Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19:455–77.
3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55.
4. Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks, GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database, Bioinformatics, Volume 36, Issue 6, 15 March 2020, Pages 1925–1927.
5. Torsten Seemann, Prokka: rapid prokaryotic genome annotation, Bioinformatics, Volume 30, Issue 14, 15 July 2014, Pages 2068–2069.
6. Kai Blin, Simon Shaw, Katharina Steinke, Rasmus Villebro, Nadine Ziemert, Sang Yup Lee, Marnix H Medema, Tilmann Weber, antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W81–W87.
7. Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154.
Kalkreuter, E.; Pan, G.; Cepeda, A. J.; Shen, B. Targeting bacterial genomes for natural product discovery: opportunities, challenges, and strategies. Trends Pharmacol. Sci. 2020, 41, 13-26.
Steele, A. D.; Teijaro, C. N.; Yang, D.; Shen, B. Leveraging a large microbial strain collection for natural product discovery. J. Biol. Chem. 2019, 294, 16567-16576.