Toshiaki Katayama - ASPiRE-Advanced Data Science Center for Protein Research Institute for Protein Research THE UNIVERSITY OF OSAKA

Toshiaki Katayama Guest Professor

Laboratory: Laboratory of Protein Databases (PDBj)
Research Interests: Developing core technologies to enable seamless standardization and interoperability of life science data.

Overview

We are standardizing life science databases and enhancing their interoperability while also developing foundational technologies for data science. In life sciences, numerous databases have been created due to the diversity of living organisms and the complexity of biological systems. However, to truly understand biological systems and their environments, it is essential to interconnect these databases. For example, combining disease data with genomic information can help identify disease-related genes. To understand the mechanisms underlying diseases, we need to integrate additional layers of information—such as genetic variants, gene regulation, epigenetic modifications, protein structures, molecular interactions, biological pathways, and cellular and organ-level data. Facilitating the integrated use of such diverse life science databases requires the unification of identifiers, terminology, and concepts, along with the standardization of data formats. However, traditional data models, such as tables or trees, are insufficient to represent the inherent complexity of life science data. Therefore, we utilize knowledge graphs as flexible data structures that conform to W3C standards, and we are annotating data with common identifiers and ontologies to support this integration. In addition, we are working on the construction of genome graphs using long-read sequencing technologies, which have been rapidly advancing in recent years. Through this effort, we aim to enhance the interpretation of next-generation genome analysis by leveraging integrated life science data.
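As a minimal illustration of why shared identifiers matter, the sketch below stores facts from two separate sources as subject-predicate-object triples, the basic unit of the W3C RDF data model, and follows a shared variant identifier from a disease to a gene. It uses plain Python rather than an RDF library, and every identifier, predicate name, and function name in it is a hypothetical placeholder, not an entry or API from any actual database.

```python
# Hypothetical sketch: all identifiers and predicate names are made-up placeholders.

# Each fact is one (subject, predicate, object) triple.
disease_db = [
    ("disease:D1", "associated_with_variant", "variant:V1"),
    ("disease:D1", "label", "example disease"),
]
genome_db = [
    ("variant:V1", "located_in_gene", "gene:G1"),
    ("variant:V2", "located_in_gene", "gene:G2"),
    ("gene:G1", "label", "example gene"),
]


def build_index(triples):
    """Index triples by (subject, predicate) for direct lookup."""
    index = {}
    for subject, predicate, obj in triples:
        index.setdefault((subject, predicate), []).append(obj)
    return index


def genes_for_disease(index, disease):
    """Follow disease -> variant -> gene links across the merged sources."""
    genes = set()
    for variant in index.get((disease, "associated_with_variant"), []):
        genes.update(index.get((variant, "located_in_gene"), []))
    return sorted(genes)


# Merging is simple concatenation because both sources use the same
# identifier for the variant.
graph = build_index(disease_db + genome_db)
print(genes_for_disease(graph, "disease:D1"))
```

The link from disease to gene exists only because both sources refer to the variant by the same identifier; if each source used its own naming scheme, the lookup would return nothing until the identifiers were mapped to each other.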

Q&A

What are the unique aspects or strengths of your research?

While numerous public databases have been developed and maintained in the life sciences, integrated systems that make full use of the information within these databases remain underdeveloped. Achieving such integration requires international collaboration, which we promote by organizing the annual BioHackathon to foster a global network of database developers and researchers and to advance data integration and collaborative research. We believe that the combined use of knowledge graphs and genome graphs will serve as a foundation for next-generation life sciences and will play an increasingly important role in the future.

How do you think the results of your research will benefit society or industry?

Knowledge graphs have been most actively used in the field of drug discovery, but this framework is broadly applicable across domains ranging from basic life science to environmental research. In recent years, the use of large language models (LLMs) has become widespread; however, there are concerns that they are running out of high-quality information to learn from. At the same time, LLMs struggle to provide accurate answers on highly specialized topics such as genetic variants and protein 3D structures. Accordingly, it is essential that the academic community take responsibility for maintaining accurate scientific knowledge in databases, making it possible to extract reliable facts, determine what is known and what remains unknown, and identify the supporting data and literature. This effort ultimately contributes to providing reliable information for solving real-world problems in society and industry.

How is data science utilized in your research?

Research that applies machine learning methods, such as classification, to integrated data is advancing. Knowledge graphs, which clearly represent the meaning of data and are highly machine-readable, are an ideal foundation for such analysis. However, efficiently selecting appropriate explanatory and target variables from vast and complex datasets remains a challenge, and we are developing technologies to address this issue. Combining large language models with databases and knowledge graphs is also an emerging area in data science. Progress in this field is expected to significantly enhance the accessibility and usability of integrated data.

Please share examples of collaborative research or the potential for future collaboration.

Recent advances in long-read sequencing technology have enabled us to decode individual genomes with near telomere-to-telomere (T2T) accuracy. As a result, we have entered an era where both single nucleotide polymorphisms (SNPs) and a variety of structural variants can be identified. A major challenge in future genomic medicine is to understand how structural variants and copy number variations influence gene regulation, three-dimensional structure, and biological function. To address this challenge, we are collaborating with leading research centers in Japan that manage major genomic variant databases, forming a working group dedicated to standardizing the representation and annotation of structural variants in databases. We are also conducting joint research projects to develop foundation models that draw on the knowledge stored in our knowledge graphs and to provide technical support for data integration efforts in biobanks.

What are the prospects and goals for your research?

As life science research grows in scale and complexity, it is becoming increasingly important to develop strategic approaches to data management from experimental design to future data reuse. With research becoming more automated, there will be a deeper focus on how well data is organized and its potential to inspire new ideas. In this context, the accumulation and preservation of accessible datasets aimed at uncovering the full scope of human knowledge will be essential as a research infrastructure for data science. Since the mechanisms of life and disease remain largely unknown, it is essential to ensure that everyone can access the latest findings and actively participate in research efforts to unravel these mysteries. At the same time, we aim to promote data science through collaborative research, leveraging newly integrated data to achieve new scientific breakthroughs.

Selected papers

G.-J. Bekker, C. Nagao, M. Shirota, T. Nakamura, T. Katayama, D. Kihara, K. Kinoshita and G. Kurisu, Protein Data Bank Japan: Improved tools for sequence-oriented analysis of protein structures, Protein Science, 34(3), e70052 (2025).
https://doi.org/10.1002/pro.70052
S. Ikeda, K. F. Aoki-Kinoshita, H. Chiba, S. Goto, M. Hosoda, S. Kawashima, J.-D. Kim, Y. Moriya, T. Ohta, H. Ono, …, Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features, Journal of Biomedical Semantics, 16(1) (2025).
https://doi.org/10.1186/s13326-024-00322-1
S. Nakagawa, T. Katayama, L. Jin, J. Wu, K. Kryukov, R. Oyachi, J. S. Takeuchi, T. Fujisawa, S. Asano, M. Komatsu, …, SARS-CoV-2 HaploGraph: visualization of SARS-CoV-2 haplotype spread in Japan, Genes & Genetic Systems (2023).
https://doi.org/10.1266/ggs.23-00085
N. Mitsuhashi, L. Toyo-Oka, T. Katayama, M. Kawashima, S. Kawashima, K. Miyazaki and T. Takagi, TogoVar: A comprehensive Japanese genetic variation database, Human Genome Variation, 9(1), 44 (2022).
https://doi.org/10.1038/s41439-022-00222-9
L. Garcia, E. Antezana, A. Garcia, E. Bolton, R. Jimenez, P. Prins, J. M. Banda and T. Katayama, Ten simple rules to run a successful BioHackathon, PLoS Computational Biology, 16(5), e1007808 (2020).
https://doi.org/10.1371/journal.pcbi.1007808
