Research Support: Bioinformatics

DNA image by To Uyen from the Noun Project

From Information to Insight

Biological databases contain a vast amount of information that can be used to generate hypotheses and advance your research without even stepping into the lab.

This page contains information about Databases and Analysis tools that facilitate data reuse. The analysis tools can also be used to analyze your own data.


The most comprehensive sets of molecular data can be found in databases that participate in the International Nucleotide Sequence Database Collaboration (INSDC).

  • National Center for Biotechnology Information (NCBI)
  • European Nucleotide Archive (ENA)
  • DNA Data Bank of Japan (DDBJ)

Their search interfaces differ, but the three organizations synchronize their data, so you can use whichever one you prefer.

If you're looking for more specialized databases and tools, browse the categories below or use the Online Bioinformatics Resource Collection (OBRC).



Database Description


Berkeley Cancer Morphometric Data

Berkeley Morphometric Visualization and Quantification from H&E sections, sponsored by the Lawrence Berkeley National Laboratory, allows the TCGA community to download computed histology-based information, and visualize images and overlaid computed information.


Cancer Digital Slide Archive

The Cancer Digital Slide Archive (CDSA) is a browser-based, interactive tool for viewing and annotating (in beta) TCGA diagnostic and tissue slide images. Pathology reports, clinical metadata, and genomics information can also be retrieved. The CDSA is being developed and maintained by the Department of Biomedical Informatics and the Winship Cancer Institute, Emory University.


The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.



The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes.

User's Guide

The Cancer Genome Workbench

CGWB (The Cancer Genome Workbench) hosts mutation, copy number, expression, and methylation data from a number of projects, including TCGA and TARGET.

The Cancer Imaging Archive

TCIA is a large archive of medical images of cancer accessible for public download. Registering is free. The images are organized as “Collections”, typically patients related by a common disease (e.g., lung cancer), image modality (MRI, CT, etc.), or research focus.

User's Guide


Database Description



Search for information about chemical compounds, substances, and BioAssays

PubChem Help

PubChem Bioassay

The PubChem BioAssay Database contains bioactivity screens of chemical substances described in PubChem Substance. It provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to that screening procedure.

PubChem Help

PubChem Compound

The PubChem Compound Database contains validated chemical depiction information provided to describe substances in PubChem Substance. Structures stored within PubChem Compounds are pre-clustered and cross-referenced by identity and similarity groups.

PubChem Help

PubChem Substance

The PubChem Substance Database contains descriptions of samples, from a variety of sources, and links to biological screening results that are available in PubChem BioAssay. If the chemical contents of a sample are known, the description includes links to PubChem Compound.

PubChem Help

Database Description

ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world.

Help for Researchers


ClinVar aggregates information about genomic variation and its relationship to human health.

What is ClinVar?

Genetic Testing Registry

Find all types of GTR records, including tests, conditions/phenotypes, genes, and labs.

GTR Help


Organizes information related to human medical genetics, such as attributes of conditions with a genetic contribution.

MedGen Help


OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. OMIM is authored and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, under the direction of Dr. Ada Hamosh. Its official home is

Introduction to OMIM

Database Description



AmiGO is a search engine and database for the Gene Ontology (GO) project, a collaborative effort to address the need for consistent descriptions of gene products across databases.

AmiGO Manual

ArrayExpress

ArrayExpress is an archive of functional genomics data from high-throughput experiments that provides these data for reuse to the research community.

ArrayExpress Help

BioCyc

The BioCyc collection of Pathway/Genome Databases (PGDBs) provides a reference on the genomes and metabolic pathways of thousands of sequenced organisms.

User's Guide

ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, i.e., modified histones,

Getting started with ENCODE


Explore, view, and download genome-wide maps of DNA and histone modifications from our diverse collection of epigenomic data sets

Epigenomics Help

GEO datasets

This database stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository. Enter search terms to locate experiments of interest. DataSet records contain additional resources including cluster tools and differential expression queries.

GEO Documentation

GEO Profiles

This database stores individual gene expression profiles from curated DataSets in the Gene Expression Omnibus (GEO) repository. Search for specific profiles of interest based on gene annotation or pre-computed profile characteristics.

GEO Documentation


The PhenoGen Informatics website is a comprehensive toolbox for storing, analyzing, and integrating microarray data and related genotype and phenotype data. It provides a way to visualize genomic sequencing, RNA-Seq, and microarray data for rats and mice.

PhenoGen Help

TCGA

The Cancer Genome Atlas (TCGA)

TCGA Wiki


UniGene computationally identifies transcripts from the same locus, analyzes expression by tissue, age, and health status and reports related proteins (protEST) and clone resources.

UniGene help


Human Genome Resources

Database Description


NCBI Human Genome Resources

A challenge facing researchers today is that of piecing together and analyzing the plethora of data currently being generated through the Human Genome Project and scores of smaller projects. NCBI's website serves as an integrated, one-stop genomic information infrastructure for biomedical researchers from around the world so that they may use these data in their research efforts.

Using MapViewer


The HUGO Gene Nomenclature Committee (HGNC) maintains a curated online repository of HGNC-approved gene nomenclature, gene families and associated resources, including links to genomic, proteomic and phenotypic information.

Ensembl Genome Browser

The Ensembl project's representation of human genomic data, including genome assembly, gene annotation, comparative genomics, variation and regulation.

Ensembl Help

UCSC Genome Informatics

The UCSC Genome site contains the human reference sequence. It also provides portals to ENCODE data at UCSC (2003 to 2012) and to the Neandertal project.


UCSC Genome Browser: An Introduction

1000 Genomes Project

The 1000 Genomes Browser allows users to explore variant calls, genotype calls and supporting sequence read alignments that have been produced by the 1000 Genomes project.



Other Genome Databases





The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.

Ensembl Workshop Materials


EuPathDB Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Diseases is a portal for accessing genomic-scale datasets associated with the eukaryotic pathogens in the following websites: AmoebaDB, CryptoDB, FungiDB, GiardiaDB, MicrosporidiaDB, PiroplasmaDB, PlasmoDB, ToxoDB, TrichDB, TriTrypDB, OrthoMCL.

What is EuPathDB?

Integrated Microbial Genomes (IMG) and metagenomes supports the annotation, analysis and distribution of microbial genome and metagenome datasets sequenced at DOE's Joint Genome Institute (JGI).


Microbial Genome Database

The Microbial Genome Database (MBGD) facilitates comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly. The aim of MBGD is to facilitate comparative genomics from various points of view such as ortholog identification, paralog clustering, motif analysis and gene order comparison.

Microbial Genome Resources

Microbial Genomes Resources contains public data from prokaryotic genome sequencing projects. The sequence collection contains data from finished genomes as well as draft assemblies.


Mouse Genome Informatics

MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.

Tutorial From Sanger Institute

PortEco

PortEco is a next-generation resource for knowledge and data about the biology of Escherichia coli K-12 group strains (these are laboratory strains and are not pathogenic), its bacteriophages, plasmids, and mobile genetic elements. PortEco is being developed by a national consortium of both laboratory biologists and computational biologists, and is funded by a grant from the U.S. National Institutes of Health.


Rat Genome Database

The Rat Genome Database was created to serve as a repository of rat genetic and genomic data, as well as mapping, strain, and physiological information. It also facilitates investigators' research efforts by providing tools to search, mine, and analyze these data.

Rat Community

TAIR

The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. Data available from TAIR includes the complete genome sequence along with gene structure, gene product information, gene expression,

Help

UCSC Genome Browser

The UCSC Genome site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides portals to ENCODE data at UCSC (2003 to 2012) and to the Neandertal project.



Database Description


Antibodies Online

Antibodies Online is an online marketplace for proteomics that contains more than 1 million research antibodies, ELISA kits and related products from 150 suppliers. It makes comparing products easy by standardizing the relevant information and validating product data and experimental details.



A Database of Immunodominant B cell Epitopes

Information and Help

The dbMHC database provides an open, publicly accessible platform for DNA and clinical data related to the human Major Histocompatibility Complex (MHC).



IEDB contains experimental data characterizing antibody and T cell epitopes studied in humans, non-human primates, and other animals and includes epitopes involved in infectious disease, allergy, autoimmunity, and transplant.

Video Tutorial


IMGT specializes in the sequences, genes and structures of immunoglobulins (IG) or antibodies, T cell receptors (TR), major histocompatibility (MH) proteins of vertebrates, IgSF and MhSF superfamily proteins of vertebrates and invertebrates, fusion proteins for immunological applications (FPIA) and composite proteins for clinical applications (CPCA).


Find murine models of immune processes and immunological diseases.


NetMHC 3.4 Server

Predicts binding of peptides to a number of different HLA alleles using artificial neural networks (ANNs).

Database Description



The Conserved Domain Database is a resource for the annotation of functional units in proteins. Its collection of domain models includes a set curated by NCBI, which utilizes 3D structure to provide insights into sequence/structure/function relationships.

CDD Help


Protein Data Bank (PDB) is an archive of information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps students and researchers understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease.

Understanding PDB Data

The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.

Entrez Help

Protein Clusters

This collection of related protein sequences (clusters) consists of proteins derived from the annotations of whole genomes, organelles and plasmids. It is currently limited to Archaea, Bacteria, Plants, Fungi, Protozoans, and Viruses.

Protein Clusters Help


STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: genomic context, high-throughput experiments, coexpression, and previous knowledge.


The Structure database contains 3D protein structures and allows users to retrieve specific subsets of resolved protein structures, find structural templates for proteins, find structures that are similar in 3D shape, and view 3D structures. It is also referred to as the Molecular Modeling Database (MMDB).


How To

The Human Protein Atlas

The Human Protein Atlas (HPA) portal is a publicly available database with millions of high-resolution images showing the spatial distribution of proteins in 44 different normal human tissues and 20 different cancer types, as well as 46 different human cell lines.

About HPA

Database Description


Google Scholar

Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: articles, theses, books, abstracts and court opinions, from academic publishers, professional societies, online repositories, universities and other web sites. Google Scholar helps you find relevant work across the world of scholarly research.

Search Tips

PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

Quick Start

NLM Tutorials

Library Classes

PubMed PubReMiner

Detailed analysis of PubMed Search results


PubServer collects homologous sequences from the NR database, then retrieves and filters the associated publications.


Web of Science

The Web of Science (formerly Web of Knowledge) is today's premier research platform, helping you quickly find, analyze, and share information in the sciences, social sciences, arts, and humanities. You get integrated access to high quality literature through a unified platform that links a wide variety of content with one seamless search.



Database Description



Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

Gene Help

This resource organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations.


An automated system for constructing putative homology groups from the complete gene sets of a wide range of eukaryotic species.

Query Tips

The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.

Entrez sequences help

The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.

Entrez sequences help

Database Description



Access page for all NCBI variation databases (dbSNP, dbVar, dbGaP, ClinVar, GTR)

Variation Handbook



ClinVar aggregates information about genomic variation and its relationship to human health.

What is ClinVar?

The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.


dbGaP Tutorial



Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants.


dbSNP Handbook


Fact Sheet


Database of genomic structural variation including insertions, deletions, duplications, inversions, deletion-insertions, mobile element insertions, translocations, and complex rearrangements

Structural Variation Overview



Fact Sheet

1000 Genomes Browser 



The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied by performing low coverage sequencing on a large number of individuals.

Ensembl Tutorial



TCGA data portal

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes.

About TCGA


Database Description


CBS Prediction Servers

A list of sequence prediction tools provided by CBS.


EMBL-EBI Tools

A list of popular, free bioinformatics software provided by EMBL-EBI.

ExPASy

ExPASy is the SIB Bioinformatics Resource Portal, which provides access to scientific databases and software tools (i.e., resources) in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics, etc.

Galaxy

Galaxy is an open source, web-based platform for data-intensive biomedical research, including NGS sequence analysis, ChIP-Seq analysis and SNP/indel identification. It's a great "gateway" to using command-line tools.

Galaxy 101



GenomeSpace

GenomeSpace is a cloud-based interoperability framework to support integrative genomics analysis through an easy-to-use Web interface. GenomeSpace provides access to a diverse range of bioinformatics tools, and bridges the gaps between the tools, making it easy to leverage the available analyses and visualizations in each of them.

Support

NCBI Analyze

NCBI provides a wide variety of data analysis tools that allow users to manipulate, align, visualize and evaluate biological data.

Database Description


BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of homology between your query and a chosen search set (database). This site contains various algorithms for protein and nucleotide sequences.


ClustalW2

ClustalW2 is a general-purpose DNA or protein multiple sequence alignment program for three or more sequences.

Help

EMBL-EBI Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and the evolutionary relationships between the sequences studied.

EMBL-EBI Pairwise Sequence Alignment

Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid).

EMBOSS Water

EMBOSS Water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment of two sequences.

Help
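
To give a feel for what a Smith-Waterman local alignment computes, here is a minimal Python sketch of the textbook algorithm. It is not the EMBOSS Water implementation (which is described above as modified for speed): the function name `smith_waterman`, the scoring values (match +2, mismatch -1, gap -2), and the simple linear gap penalty are all illustrative assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman with a linear gap penalty (illustrative scores).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Flooring at 0 is what makes the alignment local rather than global
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

Only the shared region "ACG" is reported, which is the point of a local alignment: flanking sequence that does not match is ignored. Production tools typically use substitution matrices and affine gap penalties instead of the fixed scores assumed here.
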
Database Description


CDART

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture, defined as the sequential order of conserved domains in protein queries. CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity.

Help

Phobius

Phobius is a program for prediction of transmembrane topology and signal peptides from the amino acid sequence of a protein.

Help

EMBL-EBI Protein Functional Analysis

A compilation of protein analysis tools provided by EMBL-EBI.

PfamScan

PfamScan allows the user to search a FASTA sequence against a library of protein families with known function.

Help

SignalP

SignalP 4.1 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.

Instructions

VAST

Vector Alignment Search Tool (VAST) is a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures ("similar structures") by purely geometric criteria, and to identify distant homologs that cannot be recognized by sequence comparison.

Help

Database Description


Broad GDAC Firehose

Provides systematic pipelines for analyzing data from The Cancer Genome Atlas (TCGA). This resource includes versioning of datasets, analysis results, biologist-friendly reports, and custom runs.



Gene Spot

This tool provides a way to view TCGA data from a gene-centric point of view. It includes a number of interactive visualizations, and allows users to save their current exploration. This application also enables the user to select specific tumor types and genes of interest, and load data generated from a variety of TCGA analyses.

Quick Guide

IGV

Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

User's Guide


InSilico

InSilico DB aggregates more than 250,000 samples of microarray and RNA-Seq data (human, mouse and rat) coming from public repositories including GEO and TCGA.

Support

Regulome Explorer

Regulome Explorer facilitates the integrative exploration of associations in clinical and molecular TCGA data. Regulome Explorer is an effort by the Center for Systems Analysis of the Cancer Regulome (CSACR), a collaboration between the Institute for Systems Biology and The University of Texas MD Anderson Cancer Center. CSACR is a Genome Data Analysis Center within The Cancer Genome Atlas project.

User Guide

Quick Start Guide

User Group

TCGA Batch Effects Tool

This website is designed to help assess, diagnose and correct for any batch effects in TCGA data. It first allows the user to assess and quantify the presence of any batch effects via algorithms such as Hierarchical Clustering and Principal Component Analysis. The results from these algorithms are presented graphically as both simple and interactive diagrams. If significant batch effects are observed in the data, the user then has the option of downloading data that has been computationally corrected using methods such as Empirical Bayes (a.k.a. ComBat), Median Polish and ANOVA.

Tutorials

Database Description


NCBI variation tools

Main page for all NCBI variation tools.


Variation Reporter

NCBI Variation Reporter is a tool for accessing the content of human variation resources at NCBI.  You may query our data using your variant calls in a variety of formats.  We will match them to our data to produce a report that draws on dbSNP, dbVar, ClinVar, and NCBI's own human genomic annotation.




Variation Viewer

Variation Viewer is a tool for interactive examination and download of nucleotide variants for a specific locus. It supports both the GRCh38 and GRCh37.p13 assemblies. Variation Viewer integrates data from all of the NCBI Variation databases and presents them in a coupled graphical and tabular report. The resulting list of variants can be saved locally using the download function. You can also upload your own variant data to this browser.

NCBI Variation Viewer Fact sheet 

Introductory Video tutorial

1000 Genomes Browser

1000 Genomes Browser allows you to review sequence alignments and variant calls from the 1000 Genomes Project in the context of various genome annotations. Browse data by population or individual sample, as well as by sequencing platform, aligner or experiment type. Download genome slices of sequence and alignment data or genotype calls. Users can also upload their own data into the browser. All data is displayed in GRCh37.p13 coordinates.



NCBI Genome Remapping Service

The NCBI Genome Remapping Service is a tool that projects users' annotation data from one coordinate system to another. With the Assembly-Assembly remap, you can remap your variant calls between different assembly versions, while the Clinical Remap permits you to remap data between RefSeqGenes or LRGs and an assembly, using NCBI-calculated alignments.

NCBI Remap Service fact sheet

PheGenI

The Phenotype-Genotype Integrator is a tool that integrates the search and retrieval of associated genotype-phenotype data from the National Human Genome Research Institute (NHGRI) Genome-wide Association Study (GWAS) Catalog with data housed in Gene, dbGaP, OMIM, GTEx and dbSNP. It provides search by genotype and phenotype. The dbSNP data in NCBI PheGenI is only mapped to GRCh38 at this time.

NCBI Phenotype-Genotype Integrator fact sheet

Database Description


SBS EpiToolKit

SBS EpiToolKit is a service of the Division for Simulation of Biological Systems at University of Tübingen. The aim of this website and its services is to support immunological research. It provides a collection of methods from computational immunology for the prediction of MHC ligands or potential T-Cell epitopes.

NetMHC 3.4 Server

Predicts binding of peptides to a number of different HLA alleles using artificial neural networks (ANNs).

PIGS

PIGS (Prediction of ImmunoGlobulin Structure) is a web server for the automatic modeling of immunoglobulin variable domains based on the canonical structure method. It has a user-friendly and flexible interface that allows the user to choose templates (for the frameworks and the loops) and modeling strategies in an automatic or manual fashion. Its final output is a complete three-dimensional model of the target antibody that can be downloaded or displayed online. The server is freely accessible to academic users, with no restriction on the number of submitted sequences.

Help
