RNA-seq Analysis Guide

First, load the deseq2pip package:

suppressPackageStartupMessages(library(deseq2pip))

## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'

save_dir <- tempdir()
dir.create(save_dir, recursive = TRUE)

## Warning in dir.create(save_dir, recursive = TRUE): '/tmp/Rtmp2r7ayE' already
## exists

#setwd("/Users/hungm/The Francis Crick Dropbox/CaladoD/Matthew/github-repos/r-pipelines/deseq2pip/tests/rnaseq/module/")

Data Import and Setup

Let’s import and prepare your data. We’ll use the example dataset GSE189410 to demonstrate the workflow:

# load example DESeq2 object
rdata <- system.file("data", "GSE189410.dds.RData", package = "deseq2pip")
tx2gene <- gzfile(system.file("data", "GSE189410.tx2gene.tsv.gz", package = "deseq2pip"))
dds <- import_nfcore_rna(rdata = rdata, tx2gene = tx2gene)


# set group variable (optional to manually set reference level)
dds$Group2 <- factor(dds$Group2, c('IgM', 'IgG', 'IgA'))

# set dds design with desired variables
# for more information about setting contrasts in DESeq2, see: https://github.com/tavareshugo/tutorial_DESeq2_contrasts?tab=readme-ov-file
design(dds) <- ~ Group2

Detailed Overview of DESeq2 pipeline

The run_deseq2_pip() function is a wrapper function that runs the complete RNA-seq analysis for every PAIRWISE and ONE-TO-ALL comparison pairs in the specified group variable.

The analysis for each comparison pair can be broken down into three modules:

Quality Control: Filters genes, generates QC/PCA/distance plots, and saves expression data

run_qc_pip()
run_dist_pip()

Differential Expression Analysis: Performs DESeq2 analysis, functional gene annotation, and creates volcano plots

run_diffexp_pip()
plot_peak_annot_pip() (ATAC-seq only)

GSEA Analysis: Find enriched pathways using MSigDB collections and prepare results for EnrichmentMap visualization

run_gsea_pip()
run_enrichmentmap_pip()

For our example dataset, we have 3 comparison pairs in the group variable Group2: IgM vs IgG, IgM vs IgA, and IgG vs IgA.

Here’s how to run the complete pipeline with a single command: Note: Repeating run_deseq2_pip will overwrite the existing files if the same directory name is used. To prevent this, please specify a different save_dir directory.

# run deseq2 pipeline
#run_rna_pip(dds, experiment = "GSE189410", org = "mouse", group_by = "Group2", remove_xy = TRUE, remove_mt = TRUE, save_dir = save_dir)

Run individual modules

After an initial run of the DESeq2 pipeline, you may wish to run a specific module using different parameters or comparison groups without repeating the entire pipeline.

Here’s how to run each module separately. First, set the global arguments.

# set pipeline arguments
group_by <- "Group2"
org <- "mouse"
remove_xy <- TRUE
remove_mt <- TRUE
quantile <- 0.05
batch <- NULL
pals <- NULL
order <- "pxfc"
one_to_all <- FALSE
msigdbr <- import_msigdbr(org = org)

# set additional arguments
save_data <- FALSE
save_plot <- FALSE

Below are the commands to run each individual module:

# Module 1: Quality Control
dds <- run_qc_pip(dds, org = org, group_by = group_by, remove_xy = remove_xy, remove_mt = remove_mt, quantile = quantile, save_dir = save_dir)

## Removing 4050 XY genes out of 54513 total genes...

## Removing 37 MT genes out of 50463 total genes...

## Filtering genes with low expressions...

## Saving DESeq2 object & expressions...

dds <- run_dist_pip(dds, group_by = group_by, batch = batch, pals = pals, save_dir = save_dir)

## Running batch correction...

## Running PCA and distance analysis...

## group_by is not a factor, converting to factor

## using ntop=500 top features by variance

## using ntop=500 top features by variance

## group_by is not a factor, converting to factor
## using ntop=500 top features by variance

## using ntop=500 top features by variance

# Module 2: Differential Expression Analysis
res <- run_diffexp_pip(dds, org = org, group_by = group_by, order = order, one_to_all = one_to_all, save_dir = save_dir)

## Running PAIRWISE differential expression analysis...

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041

## Combining row metadata to results...

## Removed 8 rows with NA values or invalid gene symbols

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: ggrepel: 700 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 703 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: Removed 14978 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: Removed 14978 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041

## Combining row metadata to results...

## Removed 8 rows with NA values or invalid gene symbols

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: ggrepel: 1544 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 1546 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: Removed 16549 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: Removed 16549 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 30 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041

## Combining row metadata to results...

## Removed 8 rows with NA values or invalid gene symbols

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: ggrepel: 951 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#res <- plot_peak_annot_pip(res, org = org, save_dir = save_dir)

# Module 3: GSEA Analysis
group_dir <- paste0(save_dir, "/pairwise_", group_by)
gsea <- run_gsea_pip(res = res, org = org, msigdbr = , group_save_dir = group_dir)

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 20 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## leading edge analysis...

## done...

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 10 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"): Ignoring unknown aesthetics: stroke
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 17 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 11 rows containing missing values or values outside the scale range
## (`geom_text()`).

enrichmentmap_pip(dds, org = org, group_by = group_by, group_dir = group_dir, save_dir = group_dir)

## Formatting for enrichmentmap...

## Writing class files for PAIRWISE comparisons...

Module 1 Breakdown

The run_qc_pip() module contains a sequence of subprocesses that performs quality control checks on the data. This includes removing XY and MT genes, removing lowly expressed genes, and checking library size distribution.

Remove XY and MT genes

If your data contains samples taken from both males and females that you are not interested in, you can optionally remove gender-specific genes on X and Y chromosomes. This can be done with the remove_xy_genes() function.

Similarly, if your data contains mitochondrial genes that you are not interested in, you can optionally remove them with the remove_mt_genes() function.

This is optional but recommended if you see a gender-specific effect in downstream PCA and distance analysis.

# when remove_xy or remove_mt are TRUE
if(remove_xy) { 
  dds <- remove_xy_genes(dds, org = org)}

## Removing 0 XY genes out of 50426 total genes...

if(remove_mt) {
  dds <- remove_mt_genes(dds, org = org)}

## Removing 0 MT genes out of 50426 total genes...

Remove lowly expressed genes

RNA-seq is a high-throughput technology that can be noisy, and certain genes are poorly captured from RNA-seq. To control for this, we will remove these genes to improve statistical power for subsequent analysis.

By default we keep genes with at least 5% of total expression (remove_low_expression(..., quantile = 0.05)). The gene must also be expressed in at least n replicates, where n is the minimum replicates in the group.

# remove low expression genes
dds <- remove_low_expression(dds, quantile = quantile, group_by = group_by, save_plot = save_plot, save_dir = save_dir)

## Filtering genes with low expressions...

Check library size distribution

Expression of genes may vary between samples, and this may be due to sample/batch quality issues.

Here we draw a boxplot representing expression level of every gene and compare the gene expression distribution from each sample using the check_library() function.

If the boxplots show little variation in expression level across samples, this indicates that the samples are of high quality and suitable for differential expression analysis.

Otherwise, one should consider removing the samples with low quality or perform batch correction downstream to address quality issues. This will address technical variation between samples/batches and improve the statistical power for true biological variations.

# check library size distribution
check_library(dds, save_plot = save_plot, save_dir = save_dir)

Run PCA and distance analysis

Next we visualize the relationship between samples/groups using PCA and distance analysis using the run_pca() and run_distance() functions.

If there are true biological variability between groups, replicas from the same group should cluster closely in the PCA plot and have lower distance values in the distance heatmap.

Again, one should consider sample/batch quality if the samples do not separate as expected.

# run PCA for the samples with group information
vsd <- vst(dds, blind = FALSE)
run_pca(vsd, group_by = group_by, size = 4, save_data = save_data, save_plot = save_plot, save_dir = save_dir)

## using ntop=500 top features by variance

# run distance analysis between samples
run_distance(vsd, save_data = save_data, save_plot = save_plot, save_dir = save_dir)

Module 2: Differential Expression Analysis

The run_diffexp_pip() module performs statistical analysis to identify differentially expressed genes between groups using DESeq2::DESeq2() function, with p-values were generated using DESeq2 test.

Main function

The main function in the module is the run_diffexp_pip() function, which by default will perform all pairwise/one-to-all comparisons between groups in the DESeq2 object.

This returns a single dataframe all metrics from DESeq2::results(), and the differentially expressed genes for each comparison is separated by comparison column.

Fold change shrinkage is also performed by ashr method.

Additionally, chromosome loci and functional annotations of each gene was added to the dataframe using the strpip::run_annotation() function (see https://github.com/hungms/strpip).

# run differential expression analysis for all comparisons; skipped here
#res <- run_diffexp_pip(dds, group_by = group_by, one_to_all = one_to_all, save_data = save_data, save_dir = save_dir)
#head(res)

# run differential expression analysis for a specific comparison
res <- run_diffexp(dds, org = org, group_by = group_by, comparison = "IgG_vs_IgA", order = order, save_data = save_data, save_dir = save_dir)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041

## Combining row metadata to results...

## Removed 8 rows with NA values or invalid gene symbols

head(res)

##       baseMean log2FoldChange     lfcSE        pvalue          padj     gene
## 1 1609294.3823       5.888986 0.2486099 2.150302e-127 3.564985e-123   Ighg2b
## 2  414449.8996       7.416217 0.4127803  1.333262e-75  1.105207e-71    Ighg1
## 3 1109964.2111       6.136009 0.3462582  1.711502e-73  9.458330e-70   Ighg2c
## 4     116.5320       4.668852 0.4445062  9.408951e-30  1.949888e-26    Smad3
## 5     490.8044       3.150512 0.3229892  1.222197e-24  1.688567e-21 BC035044
## 6    1329.3438       3.792944 0.4491024  2.942726e-20  2.217612e-17     Pld4
##        baseVar allZero dispGeneEst dispGeneIter    dispFit dispersion dispIter
## 1 3.115152e+12   FALSE  0.05815237            7 0.05148407 0.05866094        5
## 2 2.634566e+11   FALSE  0.29667561           10 0.05149559 0.15946279       10
## 3 1.402000e+12   FALSE  0.12433466            8 0.05148587 0.11288714        9
## 4 1.567870e+04   FALSE  0.07974167            9 0.10667979 0.09091352        7
## 5 1.921200e+05   FALSE  0.11029032            8 0.06458617 0.09178199        5
## 6 2.076169e+06   FALSE  0.27614296            9 0.05631895 0.18122957        9
##   dispOutlier    dispMAP Intercept Group2_IgG_vs_IgA SE_Intercept
## 1       FALSE 0.05866094 15.661421          5.933155    0.1747477
## 2       FALSE 0.15946279 12.154245          7.498636    0.2883103
## 3       FALSE 0.11288714 14.846632          6.216173    0.2424103
## 4       FALSE 0.09091352  2.947762          4.871319    0.3664599
## 5       FALSE 0.09178199  6.498391          3.301159    0.2349418
## 6       FALSE 0.18122957  7.237719          4.054703    0.3142603
##   SE_Group2_IgG_vs_IgA WaldStatistic_Intercept WaldStatistic_Group2_IgG_vs_IgA
## 1            0.2471047               89.623064                       24.010693
## 2            0.4075524               42.156816                       18.399197
## 3            0.3427872               61.245867                       18.134207
## 4            0.4299801                8.043889                       11.329172
## 5            0.3221623               27.659581                       10.246884
## 6            0.4397236               23.030968                        9.221025
##   WaldPvalue_Intercept WaldPvalue_Group2_IgG_vs_IgA betaConv betaIter  deviance
## 1         0.000000e+00                2.150302e-127     TRUE        3 199.93005
## 2         0.000000e+00                 1.333262e-75     TRUE        3 179.66890
## 3         0.000000e+00                 1.711502e-73     TRUE        3 199.11238
## 4         8.703135e-16                 9.408951e-30     TRUE        3  59.25030
## 5        2.140447e-168                 1.222197e-24     TRUE        3  88.69925
## 6        2.282327e-117                 2.942726e-20     TRUE        3 107.26246
##     maxCooks comparison      pxfc      rank mouse_chromosome human_gene_symbol
## 1 0.06337948 IgG_vs_IgA 721.09417 721.09417             <NA>              <NA>
## 2 0.66246996 IgG_vs_IgA 526.22919 526.22919             <NA>              <NA>
## 3 0.50039836 IgG_vs_IgA 423.53300 423.53300             <NA>              <NA>
## 4 0.07789860 IgG_vs_IgA 120.03614 120.03614                9             SMAD3
## 5 2.02423141 IgG_vs_IgA  65.44395  65.44395             <NA>              <NA>
## 6 0.31871834 IgG_vs_IgA  63.16813  63.16813               12              PLD4
##   human_chromosome human_gene_symbol_unique human_chromosome_unique entity_type
## 1             <NA>                     <NA>                    <NA>        <NA>
## 2             <NA>                     <NA>                    <NA>        <NA>
## 3             <NA>                     <NA>                    <NA>        <NA>
## 4               15                    SMAD3                      15     protein
## 5             <NA>                     <NA>                    <NA>        <NA>
## 6               14                     PLD4                      14     protein
##                       category
## 1                         <NA>
## 2                         <NA>
## 3                         <NA>
## 4                intracellular
## 5                         <NA>
## 6 transmembrane, intracellular
##                                                  parent consensus_score is_cs
## 1                                                  <NA>              NA    NA
## 2                                                  <NA>              NA    NA
## 3                                                  <NA>              NA    NA
## 4                                         intracellular               3 FALSE
## 5                                                  <NA>              NA    NA
## 6 transmembrane_predicted, intracellular, transmembrane               4  TRUE
##                                                     database_cs is_tf
## 1                                                          <NA>    NA
## 2                                                          <NA>    NA
## 3                                                          <NA>    NA
## 4               OmniPath, GO_Intercell, ComPPI, LOCATE, UniProt  TRUE
## 5                                                          <NA>    NA
## 6 Phobius, GO_Intercell, UniProt, OmniPath, Ramilowski_location FALSE
##                                                                                                                                                                                                                                                                                                                                                                                           database_tf
## 1                                                                                                                                                                                                                                                                                                                                                                                                <NA>
## 2                                                                                                                                                                                                                                                                                                                                                                                                <NA>
## 3                                                                                                                                                                                                                                                                                                                                                                                                <NA>
## 4 CollecTRI;DoRothEA;ExTRI_CollecTRI;NTNU.Curated_CollecTRI;RegNetwork_DoRothEA;TFactS_CollecTRI;TFactS_DoRothEA;TRED_DoRothEA;TRRUST_CollecTRI;TRRUST_DoRothEA;HTRI_CollecTRI;HTRIdb;HTRIdb_DoRothEA;ReMap_DoRothEA;FANTOM4_DoRothEA;CytReg_CollecTRI;GEREDB_CollecTRI;SIGNOR;SIGNOR_CollecTRI;ARACNe-GTEx_DoRothEA;GOA_CollecTRI;IntAct_CollecTRI;IntAct_DoRothEA;HOCOMOCO_DoRothEA;JASPAR_DoRothEA
## 5                                                                                                                                                                                                                                                                                                                                                                                                <NA>
## 6                                                                                                                                                                                                                                                                                                                                                                                                <NA>

Generating MA & volcano plots

Here we will use plot_ma() and plot_volcano() to generate a single MA plot for the IgG_vs_IgA comparison as a demonstration.

# generate MA plot for the "IgG_vs_IgA" comparison
plot_ma(res, save_plot = save_plot, save_dir = save_dir)

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Generate volcano plot for the "IgA_vs_IgG" comparison
plot_volcano(res, save_plot = FALSE, save_dir = save_dir)

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).
## ggrepel: 37 unlabeled data points (too many overlaps). Consider increasing max.overlaps

Module 3: GSEA Analysis

The final run_gsea_pip() module performs gene set enrichment analysis to identify biological pathways and processes using a list of MSigDB gene set collections (HALLMARK, GOBP, KEGG, REACTOME, TFT, BIOCARTA). Gene sets must have at least 10 and no more than 1000 genes to be tested.

Main function

This returns a list of gseaResult class objects for each gene set collection and comparison.

# run gsea analysis
# gsea <- run_gsea_pip(res, org = org, save_data = save_data, save_dir = save_dir)

# run gsea analysis
gsea <- run_gsea(res, org = org, save_data = save_data, save_dir = save_dir)

## Converting baseMean to numeric

## Converting log2FoldChange to numeric

## Converting padj to numeric

## Converting rank to numeric

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

names(gsea)

##  [1] "ID"              "Description"     "setSize"         "enrichmentScore"
##  [5] "NES"             "pvalue"          "p.adjust"        "qvalue"         
##  [9] "rank"            "leading_edge"    "core_enrichment" "collection"     
## [13] "comparison"

Generate barplots for top enriched gene set terms

Here we will use plot_gsea_barplot() to generate a single barplot for the IgA_vs_IgG comparison using GOBP collection as a demonstration. By default, the barplot will show the top 10 upregulated and downregulated terms.

# generate barplot for the "IgA_vs_IgG" comparison
plot_gsea_barplot(gsea, n = 10, save_plot = FALSE, save_dir = save_dir)

## Converting NES to numeric

## Converting pvalue to numeric

## Converting qvalue to numeric

## Warning in sub(paste0("^", collection, "_"), "", gsea$ID): argument 'pattern'
## has length > 1 and only the first element will be used

## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

EnrichmentMap pipeline

Finally the module will run enrichmentmap_pip() which creates a enrichmentmap directory containing files formated for EnrichmentMap visualization.

By default, the function will only use the HALLMARK, GOBP, KEGG and REACTOME collections. We will skip this step here.

# run enrichmentmap pipeline
#enrichmentmap_pip(gsea, org = org, group_by = group_by, group_dir = group_dir, save_dir = group_dir)

Author: Matthew Hung

Last Updated: 2025/05/30

Data Import and Setup

Detailed Overview of DESeq2 pipeline

Run individual modules

Module 1 Breakdown

Remove XY and MT genes

Remove lowly expressed genes

Check library size distribution

Run PCA and distance analysis

Module 2: Differential Expression Analysis

Main function

Generating MA & volcano plots

Module 3: GSEA Analysis

Main function

Generate barplots for top enriched gene set terms

EnrichmentMap pipeline