TY - JOUR
T1 - Artificial intelligence powered statistical genetics in biobanks
AU - Narita, Akira
AU - Ueki, Masao
AU - Tamiya, Gen
N1 - Funding Information:
Acknowledgements: This work was supported by a grant for the Tohoku Medical Megabank Project and the Centre for Advanced Intelligence Project from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We are grateful to Dr. Satoshi Makino and Miho Kuriki for their special assistance.
Publisher Copyright:
© 2020, The Author(s), under exclusive licence to The Japan Society of Human Genetics.
PY - 2021/1
Y1 - 2021/1
AB - Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine and tissues have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components underlying common complex diseases. There are, however, two major challenges for statistical genetics in pursuing this goal: small sample size relative to high dimensionality, and multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This is a widely acknowledged difficulty in data analysis, the so-called “p ≫ n problem” in statistics or the “curse of dimensionality” in the machine-learning field. On the other hand, there are a great many measurements of individual health status, that is, endophenotypes, such as health check-up data, images and psychological test scores, in addition to metabolomics and proteomics data. These endophenotypes are rich but not very tractable because of their even higher dimensionality, their substantial correlation, and the sometimes confusing causal relationships among them. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.
UR - http://www.scopus.com/inward/record.url?scp=85089297634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089297634&partnerID=8YFLogxK
U2 - 10.1038/s10038-020-0822-y
DO - 10.1038/s10038-020-0822-y
M3 - Review article
C2 - 32782383
AN - SCOPUS:85089297634
SN - 1434-5161
VL - 66
SP - 61
EP - 65
JO - Journal of Human Genetics
JF - Journal of Human Genetics
IS - 1
ER -