An Integrated Machine Learning Pipeline for Genome-Level Bacterial Identification and In Silico Prioritization of Candidate Antibacterial and Antifungal Proteins from Public Microbial Sequence Data Resources

Authors

  • Ravi Dayani, Roswell Park Comprehensive Cancer Center, New York, USA

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V4I1P116

Keywords:

Bacterial Genome Classification, Antimicrobial Protein Discovery, Machine Learning, Genome Mining, In Silico Screening, Bioinformatics

Abstract

This paper introduces an integrated machine learning workflow that first classifies bacterial nucleotide sequences and then prioritizes, in silico, proteins that may warrant follow-up as antibacterial or antifungal candidates. Public nucleotide records are collected automatically from NCBI and organized into labeled classes for model development. Sequence fragments are encoded with compositional descriptors and used to train a supervised bacterial identifier that reports probability-based confidence and includes a simple novelty-screening step. The trained classifier achieved an accuracy of 0.9904 on held-out fragments, with weighted precision, recall, and F1-score of 0.9905, 0.9904, and 0.9887, respectively. The framework was then applied to the NCBI case-study genome LJCX01000023.1, which was assigned to the Mycobacterium smegmatis class with a mean confidence of 0.7789 across 450 fragments. To extend the analysis beyond organism labeling, predicted open reading frames from the case-study genome were translated and ranked using sequence-informed heuristics for downstream antimicrobial screening. The study offers a reproducible computational route from public genome retrieval to candidate prioritization and supports future data-driven antimicrobial discovery.
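The fragment-level steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the descriptor type, fragment size, aggregation rule, or novelty cutoff, so the k-mer frequency encoding, the function names (fragment, kmer_frequencies, summarize_genome), the averaging of per-fragment probabilities, and the 0.5 threshold are all assumptions chosen for illustration. The per-fragment probabilities would come from whatever supervised classifier is trained on the encoded fragments.

```python
from collections import Counter
from itertools import product
from statistics import mean


def fragment(seq, size, step=None):
    """Split a nucleotide sequence into fixed-size windows.

    The window size and step are illustrative assumptions.
    """
    step = step or size
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def kmer_frequencies(frag, k=3):
    """Return normalized k-mer frequencies over ACGT in lexicographic order.

    k-mer composition is one possible compositional descriptor (assumed here).
    k-mers containing other symbols (for example N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(frag[i:i + k] for i in range(len(frag) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def summarize_genome(fragment_probs, novelty_threshold=0.5):
    """Aggregate per-fragment class probabilities into a genome-level call.

    fragment_probs: list of dicts mapping class label -> probability,
    one dict per fragment. Averaging and the threshold are assumptions.
    Returns (label, mean confidence, flagged-as-possibly-novel).
    """
    classes = fragment_probs[0].keys()
    mean_probs = {c: mean(p[c] for p in fragment_probs) for c in classes}
    label = max(mean_probs, key=mean_probs.get)
    confidence = mean_probs[label]
    return label, confidence, confidence < novelty_threshold


# Usage with toy inputs (hypothetical values, not data from the study).
toy_fragments = fragment("ACGTACGTAC", size=4)
toy_features = [kmer_frequencies(f, k=2) for f in toy_fragments]
toy_probs = [{"class_a": 0.9, "class_b": 0.1}, {"class_a": 0.7, "class_b": 0.3}]
toy_call = summarize_genome(toy_probs)
```

Under this sketch, a genome whose mean confidence for the top class falls below the cutoff would be flagged for manual review rather than assigned a label outright, which is one simple way to read the "novelty-screening step" mentioned above.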

References

[1] B. G. Bell, F. Schellevis, E. Stobberingh, H. Goossens, and M. Pringle, “A systematic review and meta-analysis of the effects of antibiotic consumption on antibiotic resistance,” BMC Infectious Diseases, vol. 14, p. 13, 2014, doi: 10.1186/1471-2334-14-13.

[2] N. A. R. Gow, C. Johnson, J. Berman, A. T. Coste, C. A. Cuomo, D. S. Perlin, T. Bicanic, T. S. Harrison, N. Wiederhold, M. Bromley, and T. Chiller, “The importance of antimicrobial resistance in medical mycology,” Nature Communications, vol. 13, p. 5352, 2022, doi: 10.1038/s41467-022-32249-5.

[3] D. E. Wood, J. Lu, and B. Langmead, “Improved metagenomic analysis with Kraken 2,” Genome Biology, vol. 20, no. 1, p. 257, 2019, doi: 10.1186/s13059-019-1891-0.

[4] Q. Liang et al., “DeepMicrobes: taxonomic classification for metagenomics with deep learning,” NAR Genomics and Bioinformatics, vol. 2, no. 1, lqaa009, 2020, doi: 10.1093/nargab/lqaa009.

[5] K. Blin, S. Shaw, A. M. Kloosterman, Z. Charlop-Powers, G. P. van Wezel, M. H. Medema, and T. Weber, “antiSMASH 6.0: improving cluster detection and comparison capabilities,” Nucleic Acids Research, vol. 49, no. W1, pp. W29-W35, 2021, doi: 10.1093/nar/gkab335.

[6] G. Wang, X. Li, and Z. Wang, “APD3: the antimicrobial peptide database as a tool for research and education,” Nucleic Acids Research, vol. 44, no. D1, pp. D1087-D1093, 2016, doi: 10.1093/nar/gkv1278.

[7] M. Pirtskhalava et al., “DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics,” Nucleic Acids Research, vol. 49, no. D1, pp. D288-D297, 2021, doi: 10.1093/nar/gkaa991.

[8] G. Shi et al., “DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides,” Nucleic Acids Research, vol. 50, no. D1, pp. D488-D496, 2022, doi: 10.1093/nar/gkab651.

[9] E. Sayers, “A general introduction to the E-utilities,” NCBI Bookshelf, 2022. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK25497/

[10] D. Hyatt et al., “Prodigal: prokaryotic gene recognition and translation initiation site identification,” BMC Bioinformatics, vol. 11, p. 119, 2010, doi: 10.1186/1471-2105-11-119.

[11] S. C. Potter et al., “HMMER web server: 2018 update,” Nucleic Acids Research, vol. 46, no. W1, pp. W200-W204, 2018, doi: 10.1093/nar/gky448.

[12] P. Jones et al., “InterProScan 5: genome-scale protein function classification,” Bioinformatics, vol. 30, no. 9, pp. 1236-1240, 2014, doi: 10.1093/bioinformatics/btu031.

[13] J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583-589, 2021, doi: 10.1038/s41586-021-03819-2.

[14] National Center for Biotechnology Information, “Actinobacteria bacterium OV320 ctg31, whole genome shotgun sequence,” GenBank accession LJCX01000023.1. [Online]. Available: https://www.ncbi.nlm.nih.gov/nuccore/LJCX01000023.1

Published

2023-03-30

Section

Articles

How to Cite

1. Dayani R. An Integrated Machine Learning Pipeline for Genome-Level Bacterial Identification and In Silico Prioritization of Candidate Antibacterial and Antifungal Proteins from Public Microbial Sequence Data Resources. IJAIDSML [Internet]. 2023 Mar. 30 [cited 2026 Jun. 8];4(1):141-4. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/564