TY - JOUR
T1 - A genotype imputation method for de-identified haplotype reference information by using recurrent neural network
AU - Kojima, Kaname
AU - Tadaka, Shu
AU - Katsuoka, Fumiki
AU - Tamiya, Gen
AU - Yamamoto, Masayuki
AU - Kinoshita, Kengo
N1 - Funding Information:
This work was supported by the Tohoku Medical Megabank Project from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) and the Japan Agency for Medical Research and Development (AMED) under Grant Number JP20km0105002, and by the Facilitation of R&D Platform for AMED Genome Medicine Support conducted by AMED under Grant Number JP20km0405001. The funders had no role in study design, data collection and analysis, or decision to publish.
Publisher Copyright:
© 2020 Kojima et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2020/10/1
Y1 - 2020/10/1
N2 - Genotype imputation estimates the genotypes of unobserved variants from the genotype data of observed variants, based on a collection of haplotypes for thousands of individuals known as a haplotype reference panel. In general, a larger haplotype reference panel yields more accurate imputation results. Most existing genotype imputation methods explicitly require the haplotype reference panel in precise form, but access to haplotype data is often limited because it requires agreements from the donors. Since de-identified information such as summary statistics or model parameters can be used publicly, imputation methods that use de-identified haplotype reference information might help improve imputation quality where access to the haplotype data is limited. In this study, we propose a novel imputation method that uses a bidirectional recurrent neural network (RNN) to handle the reference panel as model parameters. The model parameters are presented as de-identified information from which restoring individual-level genotype data is almost impossible. Using haplotype datasets from the 1000 Genomes Project (1KGP) and the Haplotype Reference Consortium, we demonstrate that the proposed method achieves imputation accuracy comparable to that of existing imputation methods. We also consider a scenario in which a subset of the haplotypes in the reference panel is available only in de-identified form. In the evaluation using the 1KGP dataset under this scenario, the imputation accuracy of the proposed method is much higher than that of the existing imputation methods. We therefore conclude that our RNN-based method is a promising way to promote the sharing of sensitive genome data amid the recent movement to protect individuals’ privacy.
UR - http://www.scopus.com/inward/record.url?scp=85089293874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089293874&partnerID=8YFLogxK
U2 - 10.1371/journal.pcbi.1008207
DO - 10.1371/journal.pcbi.1008207
M3 - Article
C2 - 33001993
AN - SCOPUS:85089293874
SN - 1553-734X
VL - 16
JO - PLoS Computational Biology
JF - PLoS Computational Biology
IS - 10
M1 - e1008207
ER -