Volume 42 Issue 2
Mar. 2021
Yu-Fang Mao, Xi-Guo Yuan, Yu-Peng Cun. A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data. Zoological Research, 2021, 42(2): 246-249. doi: 10.24272/j.issn.2095-8137.2021.014

A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data

doi: 10.24272/j.issn.2095-8137.2021.014
Funds: This study was supported by the CAS Pioneer Hundred Talents Program and the National Natural Science Foundation of China (32070683) to Y.P.C.
  • Somatic mutations are a large category of genetic variation and play an essential role in tumorigenesis. Detecting somatic single nucleotide variants (SNVs) can facilitate downstream analysis of tumorigenesis. Many computational methods have been developed to detect SNVs, but most require matched normal samples to differentiate somatic SNVs from the normal state, and such samples can be difficult to obtain. Developing new approaches that detect somatic SNVs without matched samples is therefore crucial. In this work, we detected somatic mutations from individual tumor samples with a novel machine learning approach, svmSomatic, using next-generation sequencing (NGS) data. Because somatic SNV detection can be confounded by other mutation types, with germline mutations and co-occurring copy number variations (CNVs) being common in organisms, we used the approach to distinguish somatic from germline mutations based on NGS data from individual tumor samples. In summary, svmSomatic: (1) considers the influence of CNV co-occurrence when detecting somatic mutations; and (2) trains a support vector machine algorithm to distinguish between somatic and germline mutations without requiring matched normal samples. We further tested svmSomatic and compared it with other common methods. Results showed that the performance of svmSomatic, as measured by F1-score, was significantly better than that of the other methods on both simulated and real NGS data.
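The F1-score used above as the performance measure is the harmonic mean of precision and recall. As a minimal sketch of how it could be computed for somatic versus germline calls, the following uses a hypothetical helper `f1_score` and made-up toy labels; neither the function name nor the data come from the study, and this is not the svmSomatic implementation.

```python
def f1_score(true_labels, predicted_labels, positive="somatic"):
    """Return the F1-score for the `positive` class (hypothetical helper)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: somatic sites correctly called somatic.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: non-somatic sites called somatic.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: somatic sites that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not data from the study).
truth = ["somatic", "somatic", "germline", "germline", "somatic"]
calls = ["somatic", "germline", "germline", "somatic", "somatic"]
score = f1_score(truth, calls)
```

Because F1 combines precision and recall, a caller that labels every site as somatic, or almost none, scores poorly even if one of the two measures is high.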
  • ZR-2021-014 Supplementary Material.pdf
    Figures(1)  / Tables(1)
