7  Heatmaps

7.1 Heatmaps from gene expression data

In this tutorial, we’ll learn how to create heatmaps using gene expression data. Heatmaps are powerful visualization tools for gene expression data, allowing researchers to identify patterns, clusters, and differential expression across samples and genes. We will use the GEOquery package to download and process data from the Gene Expression Omnibus (GEO) database. Specifically, we will analyze the dataset with accession number GSE9723, which contains gene expression profiles of human gingival keratinocytes (HIGK) infected with two different bacteria.

The data was published in this study

Summary of Dataset GSE9723

Transcriptional profiling was utilized to define the biological pathways of gingival epithelial cells modulated by co-culture with the oral pathogenic Porphyromonas gingivalis and Aggregatibacter actinomycetemcomitans. We used microarrays to detail the global programme of gene expression underlying infection and identified distinct classes of up- and down-regulated genes during this process.

The files required for this analysis can be downloaded from GEO database. For quick access, The series matrix file for GSE9723 can be found here. This file contains the processed expression data along with sample annotations. The platform file for GPL96 can be found here. This file contains the probe annotations for the Affymetrix Human Genome U133A Array used in this study.

7.2 Installation of Required Packages

We will use the GEOquery package to read the series matrix file and extract the expression data and sample annotations. For processing differential expression, we will use the limma package. For preparing heatmaps, we will use two different packages — pheatmap and ComplexHeatmap. To install the required packages, use the following commands:

install.packages("BiocManager")
BiocManager::install("GEOquery")
BiocManager::install("limma")
install.packages("pheatmap")
BiocManager::install("ComplexHeatmap")

7.3 Load libriaries and Read Data

To read the series matrix file, we will use the getGEO function from the GEOquery package. The getGPL keyword argument is set to FALSE to avoid downloading the platform data, as we only need the expression data and sample annotations for this analysis. Subsequently, we will extract the expression data and phenotype data (sample annotations) from the GEO object using the exprs() and pData() functions respectively.

library(GEOquery)
Loading required package: Biobase
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
    table, tapply, union, unique, unsplit, which.max, which.min
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
library(pheatmap)
Warning: package 'pheatmap' was built under R version 4.4.3
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::combine()    masks Biobase::combine(), BiocGenerics::combine()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)
Warning: package 'gt' was built under R version 4.4.3
# Read the series matrix file using GEOquery
series_matrix_file <- "GEO_data_downloaded/GSE9723_series_matrix.txt"
gse <- getGEO(filename = series_matrix_file, getGPL = FALSE)

# Extract expression data
expr_data <- exprs(gse)

# Extract phenotype data
pData <- pData(gse) %>% as_tibble(rownames = "sample_id")

7.4 Differential Expression Analysis

Next, using the limma package, we’ll identify differentially expressed genes (DEGs) between infected and control samples. We will create a design matrix based on the sample titles, fit a linear model, and apply empirical Bayes moderation to obtain the DEGs, and extract the top 50 DEGs for further analysis.

library(limma)

Attaching package: 'limma'
The following object is masked from 'package:BiocGenerics':

    plotMA
# Create design matrix
group_factor <- factor(ifelse(grepl("Sham", pData$title, ignore.case = TRUE), "Control", "Infected"))
design <- model.matrix(~0 + group_factor)
colnames(design) <- make.names(levels(group_factor))

# Run limma
fit <- lmFit(expr_data, design)
contrast_matrix <- makeContrasts(Infected - Control, levels = design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)

# Get top DEGs
deg <- topTable(fit2, adjust = "fdr", number = Inf)
top_degs <- rownames(deg)[1:50]

# Read platform annotation file to map probe IDs to gene symbols
platform_file <- "GEO_data_downloaded/GPL96-57554.txt"
platform_data <- read.delim(platform_file, sep = "\t", skip = 16, header = TRUE,
                            stringsAsFactors = FALSE, comment.char = "#")

# Create a mapping from probe ID to gene symbol
probe_to_gene <- platform_data %>%
  select(ID, Gene.Symbol) %>%
  mutate(Gene.Symbol = ifelse(Gene.Symbol == "", NA, Gene.Symbol)) %>%
  deframe()

# Add gene symbols to deg table
deg_with_genes <- deg %>%
  as.data.frame() %>%
  rownames_to_column("ProbeID") %>%
  mutate(Gene = probe_to_gene[ProbeID]) %>%
  mutate(Gene = ifelse(is.na(Gene), ProbeID, Gene))

# Display top 10 with gene symbols
head(deg_with_genes[, c("Gene", "logFC", "adj.P.Val")], 10) %>%
  gt() %>%
  tab_header(title = "Top 10 Differentially Expressed Genes")
Top 10 Differentially Expressed Genes
Gene logFC adj.P.Val
SNAI2 -2508.6750 0.005145103
IL1B 3268.4125 0.005145103
ZFP36L1 -685.6500 0.015821475
TMEM127 -205.4250 0.019209550
BUD31 -364.8375 0.019209550
PPP2R2D -117.5250 0.019209550
TNFRSF12A -2173.4750 0.019209550
FARS2 -45.0125 0.022918236
DIEXF 158.2125 0.026242237
INO80D 99.8000 0.029204904

7.5 Sample Information

pData %>% 
  select(sample_id, title, status, treatment_protocol_ch1) %>%
  gt() %>%
  tab_header(title = "Sample Information for GSE9723") %>%
  tab_options(container.height = "50vh")
Sample Information for GSE9723
sample_id title status treatment_protocol_ch1
GSM245725 HIGK cells infected with Porphyromonas gingivalis, biological rep1 Public on Apr 15 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 100 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245726 HIGK cells infected with Porphyromonas gingivalis, biological rep2 Public on Apr 15 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 100 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245727 HIGK cells infected with Porphyromonas gingivalis, biological rep3 Public on Apr 15 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 100 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245728 HIGK cells infected with Porphyromonas gingivalis, biological rep4 Public on Apr 15 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 100 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245729 Sham infected HIGK cells, biological rep1 (1777) Public on Mar 01 2008 HIGK cells were sham infected with antibiotic-free K-SFM media. HIGK cells (10e7) were cultured for 2 hours at 37C in 5% CO2, and were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245730 Sham infected HIGK cells, biological rep2 (1778) Public on Mar 01 2008 HIGK cells were sham infected with antibiotic-free K-SFM media. HIGK cells (10e7) were cultured for 2 hours at 37C in 5% CO2, and were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245731 Sham infected HIGK cells, biological rep3 (1779) Public on Mar 01 2008 HIGK cells were sham infected with antibiotic-free K-SFM media. HIGK cells (10e7) were cultured for 2 hours at 37C in 5% CO2, and were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245732 Sham infected HIGK cells, biological rep4 (1780) Public on Mar 01 2008 HIGK cells were sham infected with antibiotic-free K-SFM media. HIGK cells (10e7) were cultured for 2 hours at 37C in 5% CO2, and were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245733 HIGK cells infected with Aggregatibacter actinomycetemcomitans, biological rep1 Public on Mar 01 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 2500 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245734 HIGK cells infected with Aggregatibacter actinomycetemcomitans, biological rep2 Public on Mar 01 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 2500 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245735 HIGK cells infected with Aggregatibacter actinomycetemcomitans, biological rep3 Public on Mar 01 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 2500 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.
GSM245736 HIGK cells infected with Aggregatibacter actinomycetemcomitans, biological rep4 Public on Mar 01 2008 Bacteria at mid-log phase were harvested and washed by centrifugation, and resuspended in antibiotic-free K-SFM media. HIGK cells (10e7) were co-cultured with bacteria to give a total association (adhered plus invaded) of approximately 2500 organisms per epithelial cell. Numbers of adhered and invaded organisms were confirmed in parallel experiments by plate counting. After 2 hours at 37C in 5% CO2, the HIGK cells were lysed with Trizol (Invitrogen) prior to RNA extraction.

7.6 Heatmap using pheatmap

For our heatmap, we need to create an annotation data frame that categorizes samples based on the bacteria they were infected with. So we’ll add a new Group column to the phenotype data, which will classify samples into “Control”, “Porphyromonas”, and “Aggregatibacter” groups based on the sample titles. For the heatmap, the data source will be the expression values of the top 50 DEGs identified earlier. The scale = "row" argument will standardize the expression values for each gene across all samples, allowing us to visualize relative expression levels. In the heatmap, clustering of both rows (genes) and columns (samples) will be performed using correlation distance to group similar expression patterns together.

# Create annotation with specific bacteria names from title
sample_groups_detailed <- pData %>%
  mutate(Group = case_when(
    grepl("Sham", title, ignore.case = TRUE) ~ "Control",
    grepl("Porphyromonas", title, ignore.case = TRUE) ~ "Porphyromonas",
    grepl("Aggregatibacter", title, ignore.case = TRUE) ~ "Aggregatibacter",
    TRUE ~ "Other"
  )) %>%
  pull(Group)

annotation_df <- data.frame(
  Group = factor(sample_groups_detailed),
  row.names = colnames(expr_data)
)

# Define custom colors for annotation
annotation_colors <- list(
  Group = c("Control" = "#2E8B57", "Porphyromonas" = "#CD5C5C", "Aggregatibacter" = "#4169E1")
)

# Generate heatmap using top DEGs
# Create expression matrix with gene symbols as row names
expr_heatmap <- expr_data[top_degs, ]
rownames(expr_heatmap) <- probe_to_gene[rownames(expr_heatmap)]
# If any gene symbols are NA, keep the original probe ID
rownames(expr_heatmap) <- ifelse(is.na(rownames(expr_heatmap)), top_degs, rownames(expr_heatmap))

pheatmap(expr_heatmap,
         scale = "row",
         annotation_col = annotation_df,
         annotation_colors = annotation_colors,
         show_rownames = TRUE,
         show_colnames = TRUE,
         clustering_distance_rows = "correlation",
         clustering_distance_cols = "correlation",
         main = "Top 50 Differentially Expressed Genes")

7.7 Clustering Distance Metrics options

When creating heatmaps, choosing the right clustering parameters is crucial for revealing meaningful patterns in the data.

7.7.1 Distance Metrics in pheatmap

Distance Metric Best For Description
"euclidean" Absolute expression differences Straight-line distance between expression vectors. Good when magnitude matters.
"correlation" Shape of expression patterns Pearson correlation. Ideal for gene expression where co-regulation matters more than absolute values. Internally, it calculates distance as \[ d = 1 - correlation \]
"manhattan" High-dimensional data Sum of absolute differences. Less sensitive to outliers than Euclidean.
"maximum" Outlier robustness Maximum coordinate difference
"canberra" Sparse data Weighted Manhattan distance
"minkowski" Generalized distance Generalization of Euclidean and Manhattan
"binary" Presence/absence data For binary matrices

Note: pheatmap uses base R’s dist() function. For Spearman correlation, pre-calculate the distance matrix using as.dist(1 - cor(t(expr_matrix), method = "spearman")) and pass it to the clustering_distance_rows parameter.

7.8 Heatmap using ComplexHeatmap

Bioconductor has another powerful package for creating heatmaps called ComplexHeatmap. We’ll use the same dataset as before, but this time we’ll create three separate heatmaps for each group: Control (Sham), Porphyromonas infected, and Aggregatibacter infected samples. And then make one combined heatmap showing all three groups side by side for comparison. Note that with ComplexHeatmap, we need to manually scale the data by row (z-score) since it doesn’t have a built-in scaling option like pheatmap. This might look as extra work, but it gives us more control over the scaling process e.g. choosing a different method of scaling if needed and handling missing values more flexibly.

library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.22.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
    genomic data. Bioinformatics 2016.


The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
! pheatmap() has been masked by ComplexHeatmap::pheatmap(). Most of the arguments
   in the original pheatmap() are identically supported in the new function. You 
   can still use the original function by explicitly calling pheatmap::pheatmap().

Attaching package: 'ComplexHeatmap'
The following object is masked from 'package:gt':

    row_order
The following object is masked from 'package:pheatmap':

    pheatmap
library(circlize)
Warning: package 'circlize' was built under R version 4.4.3
========================================
circlize version 0.4.17
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: https://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization
  in R. Bioinformatics 2014.

This message can be suppressed by:
  suppressPackageStartupMessages(library(circlize))
========================================
# Get sample groups with bacteria names
sample_groups_detailed <- pData %>%
  mutate(Group = case_when(
    grepl("Sham", title, ignore.case = TRUE) ~ "Control",
    grepl("Porphyromonas", title, ignore.case = TRUE) ~ "Porphyromonas",
    grepl("Aggregatibacter", title, ignore.case = TRUE) ~ "Aggregatibacter",
    TRUE ~ "Other"
  )) %>%
  pull(Group)

# Split samples into groups
control_samples <- colnames(expr_data)[sample_groups_detailed == "Control"]
porphyromonas_samples <- colnames(expr_data)[sample_groups_detailed == "Porphyromonas"]
aggregatibacter_samples <- colnames(expr_data)[sample_groups_detailed == "Aggregatibacter"]

# Use top DEGs
# Check which top DEGs are in the expression matrix
available_degs <- top_degs[top_degs %in% rownames(expr_data)]
if (length(available_degs) == 0) {
  # If no match, use numeric indices
  available_degs <- rownames(expr_data)[as.numeric(top_degs)]
  available_degs <- available_degs[!is.na(available_degs)]
}

# Scale data by row (z-score) - same as pheatmap scale="row"
expr_scaled <- t(apply(expr_data[available_degs, ], 1, function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}))

# Replace probe IDs with gene symbols for row names
rownames(expr_scaled) <- probe_to_gene[rownames(expr_scaled)]
# If any gene symbols are NA, keep the original probe ID
rownames(expr_scaled) <- ifelse(is.na(rownames(expr_scaled)), available_degs, rownames(expr_scaled))

# Define color scheme - matching pheatmap default (blue-white-red)
col_fun <- colorRamp2(c(-2, 0, 2), c("#4575B4", "#FFFFBF", "#D73027"))

# Calculate row clustering using ALL samples (matching pheatmap behavior)
# Create distance matrix and dendrogram for rows using correlation
row_dist <- as.dist(1 - cor(t(expr_scaled)))
row_dend <- hclust(row_dist, method = "complete")

# Create heatmap for Control samples
ht_control <- Heatmap(expr_scaled[, control_samples],
                      name = "Expression",
                      col = col_fun,
                      column_title = "Control (Sham)",
                      row_title = "Genes",
                      cluster_rows = row_dend,  # Use dendrogram based on all samples
                      cluster_columns = TRUE,
                      clustering_distance_columns = "pearson",
                      show_row_names = TRUE,
                      row_names_side = "left",
                      row_dend_side = "left",
                      show_column_names = TRUE,
                      show_heatmap_legend = TRUE,
                      heatmap_legend_param = list(title = "Z-Score"))

# Create heatmap for Porphyromonas infected samples
ht_porphyromonas <- Heatmap(expr_scaled[, porphyromonas_samples],
                         col = col_fun,
                         column_title = "Porphyromonas",
                         cluster_rows = row_dend,  # Same row order as control
                         cluster_columns = TRUE,
                         clustering_distance_columns = "pearson",
                         show_row_names = FALSE,
                         show_column_names = TRUE,
                         show_heatmap_legend = FALSE)

# Create heatmap for Aggregatibacter infected samples
ht_aggregatibacter <- Heatmap(expr_scaled[, aggregatibacter_samples],
                         col = col_fun,
                         column_title = "Aggregatibacter",
                         cluster_rows = row_dend,  # Same row order as control
                         cluster_columns = TRUE,
                         clustering_distance_columns = "pearson",
                         show_row_names = FALSE,
                         show_column_names = TRUE,
                         show_heatmap_legend = FALSE)

# Combine heatmaps horizontally
ht_combined <- ht_control + ht_aggregatibacter + ht_porphyromonas
draw(ht_combined, column_title = "Gene Expression: Control vs Porphyromonas vs Aggregatibacter")

7.8.1 Distance Metrics in ComplexHeatmap

ComplexHeatmap supports all pheatmap distances plus some additional options for correlation-based distances.

Distance Metric Best For Description
"pearson" Linear relationships Same as correlation in pheatmap
"spearman" Non-linear relationships Rank-based correlation. Robust to outliers.
"kendall" Concordance Another rank-based measure