SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method

Research output: Working paper › Preprint › Research

We introduce the “butterfly method” for SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method: it uses neither other samples nor population frequencies to call SNPs. The butterfly method is based on a three-component mixture of normal distributions whose parameters are easily estimated using the open-source statistical software R. This makes the method transparent, makes it straightforward to adjust parameters to the user’s needs, and makes it easy to analyse the data within R once the SNPs have been called. We contribute two open-source R packages that make SNP calling easy by helping with bookkeeping and by giving easy access to meta-information about the SNPs on the Illumina Infinium Omni5-4 BeadChip kit (including chromosome, probe type, and SNP bases). We test our method on more than 4 million SNPs and compare the results with those obtained with the GenTrain method used by Illumina GenomeStudio as well as with SNPs obtained by PCR-free whole genome sequencing (WGS). We demonstrate two variants of our method: one that accounts for potential probe type bias by estimating a separate model for each probe type (type I and type II), and another that uses a general model, such that the model’s parameter estimates do not depend on the sample being analysed. We focus on varying the no-call rate and show how this changes the concordance with the WGS calls. The no-call rate is varied by applying a threshold to the a posteriori probability of belonging to a SNP cluster and by using the number of beads to adjust the stringency of the no-call mechanism. With the butterfly method, we achieve a SNP call rate of around 99% and a SNP concordance of around 99% with the WGS data. By lowering the a posteriori probability threshold for no-calls, we can obtain a higher call rate than GenomeStudio, and by raising the threshold, we can achieve a higher concordance with the WGS data than GenomeStudio.
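The calling idea described above (fit a three-component mixture of normal distributions, then issue a no-call when the largest a posteriori probability falls below a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation, which is provided as R packages; it is written in Python purely for illustration. The one-dimensional per-SNP signal, the simulated data, the genotype labels `AA`/`AB`/`BB`, and the function names `fit_mixture`, `posterior`, and `call_genotype` are all assumptions made for this example, and the bead-count adjustment mentioned in the abstract is not modelled.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, means, sds, weights):
    """A posteriori probability that x belongs to each mixture component."""
    dens = [w * normal_pdf(x, m, s) for m, s, w in zip(means, sds, weights)]
    total = sum(dens)
    if total == 0.0:  # x is far from every component: no information
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]


def fit_mixture(xs, means, sds, weights, n_iter=50):
    """Plain EM for a 1-D mixture of normals, started from the given values."""
    means, sds, weights = list(means), list(sds), list(weights)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation.
        resp = [posterior(x, means, sds, weights) for x in xs]
        # M-step: weighted updates of mean, sd, and mixing weight.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
            weights[j] = nj / len(xs)
    return means, sds, weights


def call_genotype(x, params, threshold=0.99, labels=("AA", "AB", "BB")):
    """Return the most probable label, or None (no-call) below the threshold."""
    post = posterior(x, *params)
    best = max(range(len(post)), key=post.__getitem__)
    return labels[best] if post[best] >= threshold else None


# Simulated signal for one sample (hypothetical values, not real array data).
random.seed(1)
signal = (
    [random.gauss(-3.0, 0.3) for _ in range(300)]
    + [random.gauss(0.0, 0.3) for _ in range(400)]
    + [random.gauss(3.0, 0.3) for _ in range(300)]
)
params = fit_mixture(signal, means=[-2.0, 0.0, 2.0], sds=[1.0] * 3, weights=[1 / 3] * 3)
calls = [call_genotype(x, params) for x in signal]
call_rate = sum(c is not None for c in calls) / len(calls)
```

Raising `threshold` turns more borderline observations into no-calls, which lowers `call_rate`; lowering it has the opposite effect. This is the trade-off between call rate and concordance that the abstract describes.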
Original language: English
Publisher: bioRxiv
Number of pages: 15
DOIs
Publication status: Published - 20 Jan 2022

ID: 302456466