Professional Certificate in Data Analysis in Bioinformatics · Guide

Biological Databases

Updated 4 May 2026

Biological databases are central to bioinformatics: they provide organized repositories of biological data that researchers around the world can access, query, and analyze. They hold a wide range of information, including DNA sequences, protein structures, and gene expression data. This section covers key terms and vocabulary related to biological databases.

1. **Database**: A database is a structured collection of data that is stored and organized in a way that allows for easy access, retrieval, and manipulation. In the context of bioinformatics, biological databases store information related to various aspects of biology, such as genes, proteins, sequences, and pathways.

2. **Sequence Database**: Sequence databases store biological sequences, such as nucleotide (DNA or RNA) and protein sequences. These databases are essential for comparing sequences, identifying similarities, and predicting the function of unknown sequences. Examples of sequence databases include GenBank, UniProt, and Ensembl.

3. **GenBank**: GenBank is a comprehensive database of nucleotide sequences that is maintained by the National Center for Biotechnology Information (NCBI). It contains sequences from a wide range of organisms, including bacteria, viruses, plants, and animals.

4. **UniProt**: UniProt is a database of protein sequences and functional information that is maintained by the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). It provides a wealth of information on protein function, structure, and interactions.

5. **Ensembl**: Ensembl is a genome browser and database that provides access to annotated genome sequences for a wide range of species. It allows researchers to explore gene structures, regulatory elements, genetic variation, and evolutionary relationships.

6. **Structure Database**: Structure databases store information on the three-dimensional structures of biological molecules, such as proteins and nucleic acids. These databases are essential for studying protein folding, protein-ligand interactions, and drug design. Examples of structure databases include the Protein Data Bank (PDB) and CATH.

7. **Protein Data Bank (PDB)**: The Protein Data Bank is a repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and complex assemblies. It is a valuable resource for structural biologists, biochemists, and drug designers.

8. **CATH**: The CATH database, named for its four main classification levels (Class, Architecture, Topology, and Homologous superfamily), classifies protein domain structures into a hierarchy. It helps researchers understand the relationships between protein structures and functions.

9. **Gene Expression Database**: Gene expression databases store information on the expression levels of genes in different tissues, cell types, or conditions. These databases are essential for studying gene regulation, identifying biomarkers, and understanding disease mechanisms. Examples of gene expression databases include GEO and ArrayExpress.

10. **GEO**: The Gene Expression Omnibus (GEO) is a database maintained by the National Center for Biotechnology Information (NCBI) that archives and shares gene expression data from a wide range of experimental platforms. It allows researchers to compare gene expression profiles across different studies.

11. **ArrayExpress**: ArrayExpress is a database maintained by the European Bioinformatics Institute (EBI) that stores gene expression data from microarray and high-throughput sequencing experiments. It provides tools for data visualization, analysis, and interpretation.

12. **Ontology**: An ontology is a formal representation of knowledge in a specific domain, including terms, concepts, relationships, and rules. In bioinformatics, ontologies are used to standardize and organize biological data, making it easier to search, query, and integrate information from different sources. Examples of ontologies include the Gene Ontology (GO) and the Human Phenotype Ontology (HPO).

13. **Gene Ontology (GO)**: The Gene Ontology is a standardized ontology that describes genes and gene products in terms of three aspects: molecular function, biological process, and cellular component. It provides a controlled vocabulary for annotating genes and proteins in biological databases, allowing researchers to perform functional enrichment analysis and pathway analysis.

14. **Human Phenotype Ontology (HPO)**: The Human Phenotype Ontology is a standardized vocabulary that describes phenotypic abnormalities associated with human diseases. It provides a framework for annotating clinical data, genetic variants, and disease models, enabling researchers to identify genotype-phenotype correlations and diagnose rare genetic disorders.

15. **Pathway Database**: Pathway databases store information on biological pathways, including metabolic pathways, signaling pathways, and regulatory networks. These databases help researchers understand the interactions between genes, proteins, and metabolites in living organisms. Examples of pathway databases include KEGG and Reactome.

16. **KEGG**: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database that integrates genomic, chemical, and systemic information on biological pathways and networks. It provides a valuable resource for studying metabolism, signaling, and diseases.

17. **Reactome**: Reactome is a curated database of biological pathways and reactions that covers a wide range of species and biological processes. It provides detailed information on molecular interactions, enzyme-catalyzed reactions, and cellular events.

18. **Variant Database**: Variant databases store information on genetic variants, such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). These databases are essential for studying genetic diversity, population genetics, and disease susceptibility. Examples of variant databases include dbSNP and ClinVar.

19. **dbSNP**: The Single Nucleotide Polymorphism Database (dbSNP) is a database maintained by the National Center for Biotechnology Information (NCBI) that catalogs short genetic variations in the human genome, including single nucleotide variants and small insertions and deletions. It is a widely used resource for studying genetic diversity, disease associations, and evolutionary history.

20. **ClinVar**: ClinVar is a publicly accessible database maintained by the National Center for Biotechnology Information (NCBI) that aggregates information on genetic variants and their clinical significance. It helps researchers, clinicians, and patients interpret the impact of genetic variants on health and disease.

21. **Metabolomics Database**: Metabolomics databases store information on small molecules, metabolites, and metabolic pathways in living organisms. These databases are essential for studying metabolism, biomarkers, and drug metabolism. Examples of metabolomics databases include HMDB and MetaboLights.

22. **HMDB**: The Human Metabolome Database (HMDB) is a comprehensive database of small molecule metabolites found in the human body. It provides information on metabolite structures, pathways, concentrations, and disease associations, making it a valuable resource for metabolomics research.

23. **MetaboLights**: MetaboLights is a database maintained by the European Bioinformatics Institute (EBI) that stores metabolomics data from a wide range of species and experimental platforms. It provides tools for data sharing, analysis, and visualization, enabling researchers to explore metabolic pathways and networks.

24. **Data Mining**: Data mining is the process of discovering patterns, trends, and relationships in large datasets. In the context of biological databases, data mining techniques are used to extract valuable insights from complex biological data, such as gene expression profiles, protein structures, and metabolic pathways.

25. **Data Integration**: Data integration is the process of combining data from different sources, formats, or databases to create a unified view of the information. In bioinformatics, data integration techniques are used to merge and harmonize biological data from diverse sources, enabling researchers to perform comprehensive analyses and make informed decisions.
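The idea behind data integration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by any particular database: the record layouts, the identifiers (`GENE_A`, `TERM:0001`, `PATHWAY:01`), the values, and the helper names `normalize_id` and `integrate` are all hypothetical. The sketch harmonizes gene identifiers that are written inconsistently across three made-up sources, then merges the records into one unified view per gene.

```python
from collections import defaultdict

# Hypothetical rows, shaped like exports from three different sources.
# All identifiers and values are invented for illustration only.
expression_rows = [
    {"gene_id": "GENE_A", "tissue": "liver", "level": 12.5},
    {"gene_id": "GENE_A", "tissue": "brain", "level": 3.0},
    {"gene_id": "GENE_B", "tissue": "liver", "level": 7.25},
]
annotation_rows = [
    {"gene_id": "gene_a", "term_id": "TERM:0001"},   # lower-case identifier
    {"gene_id": "GENE_A ", "term_id": "TERM:0002"},  # trailing whitespace
    {"gene_id": "GENE_C", "term_id": "TERM:0001"},
]
pathway_rows = [
    {"gene_id": "GENE_B", "pathway_id": "PATHWAY:01"},
    {"gene_id": "GENE_C", "pathway_id": "PATHWAY:01"},
]


def normalize_id(raw):
    """Harmonize an identifier: trim whitespace and upper-case it."""
    return raw.strip().upper()


def integrate(expression, annotations, pathways):
    """Merge the three record lists into one dict keyed by gene identifier."""
    merged = defaultdict(
        lambda: {"expression": {}, "terms": set(), "pathways": set()}
    )
    for row in expression:
        gene = normalize_id(row["gene_id"])
        merged[gene]["expression"][row["tissue"]] = row["level"]
    for row in annotations:
        merged[normalize_id(row["gene_id"])]["terms"].add(row["term_id"])
    for row in pathways:
        merged[normalize_id(row["gene_id"])]["pathways"].add(row["pathway_id"])
    return dict(merged)


unified = integrate(expression_rows, annotation_rows, pathway_rows)

# Print one summary line per gene from the unified view.
for gene in sorted(unified):
    record = unified[gene]
    print(gene, record["expression"], sorted(record["terms"]), sorted(record["pathways"]))
```

Keying every source on a normalized identifier is about the simplest form of harmonization. Genes missing from one source still appear in the unified view with empty fields, which keeps gaps in coverage visible. In practice, mapping between the identifier systems of different databases usually takes more than case and whitespace cleanup.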

In conclusion, biological databases are critical to bioinformatics because they store, organize, and share biological data. They hold information on genes, proteins, sequences, structures, pathways, and variants that researchers worldwide can access, query, and analyze. Understanding the key terms above helps researchers navigate and use these resources effectively in their work across biology and biomedicine.

Key takeaways

Biological databases give researchers around the world organized repositories of biological data that can be accessed, queried, and analyzed.
They store information on many aspects of biology, such as genes, proteins, sequences, and pathways.
Sequence databases are essential for comparing sequences, identifying similarities, and predicting the function of unknown sequences.
GenBank is a comprehensive database of nucleotide sequences maintained by the National Center for Biotechnology Information (NCBI).
UniProt is a database of protein sequences and functional information maintained by the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).
Ensembl is a genome browser and database that provides access to annotated genome sequences for a wide range of species.
Structure databases store information on the three-dimensional structures of biological molecules, such as proteins and nucleic acids.
