RNA-seq Analysis Guide
Author: Matthew Hung
Last Updated: 2025/05/30
Source:vignettes/rna_guide.Rmd
rna_guide.RmdFirst, load the deseq2pip package:
## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
save_dir <- tempdir()
dir.create(save_dir, recursive = TRUE)## Warning in dir.create(save_dir, recursive = TRUE): '/tmp/Rtmp2r7ayE' already
## exists
#setwd("/Users/hungm/The Francis Crick Dropbox/CaladoD/Matthew/github-repos/r-pipelines/deseq2pip/tests/rnaseq/module/")Data Import and Setup
Let’s import and prepare your data. We’ll use the example dataset GSE189410 to demonstrate the workflow:
# load example DESeq2 object
rdata <- system.file("data", "GSE189410.dds.RData", package = "deseq2pip")
tx2gene <- gzfile(system.file("data", "GSE189410.tx2gene.tsv.gz", package = "deseq2pip"))
dds <- import_nfcore_rna(rdata = rdata, tx2gene = tx2gene)
# set group variable (optional to manually set reference level)
dds$Group2 <- factor(dds$Group2, c('IgM', 'IgG', 'IgA'))
# set dds design with desired variables
# for more information about setting contrasts in DESeq2, see: https://github.com/tavareshugo/tutorial_DESeq2_contrasts?tab=readme-ov-file
design(dds) <- ~ Group2Detailed Overview of DESeq2 pipeline
The run_deseq2_pip() function is a wrapper function that runs the complete RNA-seq analysis for every PAIRWISE and ONE-TO-ALL comparison pairs in the specified group variable.
The analysis for each comparison pair can be broken down into three modules:
- Quality Control: Filters genes, generates QC/PCA/distance plots, and saves expression data
- Differential Expression Analysis: Performs DESeq2 analysis, functional gene annotation, and creates volcano plots
run_diffexp_pip()-
plot_peak_annot_pip()(ATAC-seq only)
- GSEA Analysis: Find enriched pathways using MSigDB collections and prepare results for EnrichmentMap visualization
run_gsea_pip()run_enrichmentmap_pip()
For our example dataset, we have 3 comparison pairs in the group
variable Group2: IgM vs IgG,
IgM vs IgA, and IgG vs
IgA.
Here’s how to run the complete pipeline with a single command: Note: Repeating run_deseq2_pip will overwrite the existing files if the same directory name is used. To prevent this, please specify a different save_dir directory.
# run deseq2 pipeline
#run_rna_pip(dds, experiment = "GSE189410", org = "mouse", group_by = "Group2", remove_xy = TRUE, remove_mt = TRUE, save_dir = save_dir)Run individual modules
After an initial run of the DESeq2 pipeline, you may wish to run a specific module using different parameters or comparison groups without repeating the entire pipeline.
Here’s how to run each module separately. First, set the global arguments.
# set pipeline arguments
group_by <- "Group2"
org <- "mouse"
remove_xy <- TRUE
remove_mt <- TRUE
quantile <- 0.05
batch <- NULL
pals <- NULL
order <- "pxfc"
one_to_all <- FALSE
msigdbr <- import_msigdbr(org = org)
# set additional arguments
save_data <- FALSE
save_plot <- FALSEBelow are the commands to run each individual module:
# Module 1: Quality Control
dds <- run_qc_pip(dds, org = org, group_by = group_by, remove_xy = remove_xy, remove_mt = remove_mt, quantile = quantile, save_dir = save_dir)## Removing 4050 XY genes out of 54513 total genes...
## Removing 37 MT genes out of 50463 total genes...
## Filtering genes with low expressions...


## Saving DESeq2 object & expressions...
dds <- run_dist_pip(dds, group_by = group_by, batch = batch, pals = pals, save_dir = save_dir)## Running batch correction...
## Running PCA and distance analysis...
## group_by is not a factor, converting to factor
## using ntop=500 top features by variance

## using ntop=500 top features by variance

## group_by is not a factor, converting to factor
## using ntop=500 top features by variance

## using ntop=500 top features by variance


# Module 2: Differential Expression Analysis
res <- run_diffexp_pip(dds, org = org, group_by = group_by, order = order, one_to_all = one_to_all, save_dir = save_dir)## Running PAIRWISE differential expression analysis...
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
## Combining row metadata to results...
## Removed 8 rows with NA values or invalid gene symbols
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: ggrepel: 700 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 703 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: Removed 14978 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 14978 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
## Combining row metadata to results...
## Removed 8 rows with NA values or invalid gene symbols
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: ggrepel: 1544 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 1546 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: Removed 16549 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 16549 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 30 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
## Combining row metadata to results...
## Removed 8 rows with NA values or invalid gene symbols
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: ggrepel: 951 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#res <- plot_peak_annot_pip(res, org = org, save_dir = save_dir)
# Module 3: GSEA Analysis
group_dir <- paste0(save_dir, "/pairwise_", group_by)
gsea <- run_gsea_pip(res = res, org = org, msigdbr = , group_save_dir = group_dir)## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 20 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 8 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.03% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 10 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"): Ignoring unknown aesthetics: stroke
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 17 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 11 rows containing missing values or values outside the scale range
## (`geom_text()`).

enrichmentmap_pip(dds, org = org, group_by = group_by, group_dir = group_dir, save_dir = group_dir)## Formatting for enrichmentmap...
## Writing class files for PAIRWISE comparisons...
Module 1 Breakdown
The run_qc_pip() module contains a sequence of
subprocesses that performs quality control checks on the data. This
includes removing XY and MT genes, removing lowly expressed genes, and
checking library size distribution.
Remove XY and MT genes
If your data contains samples taken from both males and females that
you are not interested in, you can optionally remove gender-specific
genes on X and Y chromosomes. This can be done with the
remove_xy_genes() function.
Similarly, if your data contains mitochondrial genes that you are not
interested in, you can optionally remove them with the
remove_mt_genes() function.
This is optional but recommended if you see a gender-specific effect in downstream PCA and distance analysis.
# when remove_xy or remove_mt are TRUE
if(remove_xy) {
dds <- remove_xy_genes(dds, org = org)}## Removing 0 XY genes out of 50426 total genes...
if(remove_mt) {
dds <- remove_mt_genes(dds, org = org)}## Removing 0 MT genes out of 50426 total genes...
Remove lowly expressed genes
RNA-seq is a high-throughput technology that can be noisy, and certain genes are poorly captured from RNA-seq. To control for this, we will remove these genes to improve statistical power for subsequent analysis.
By default we keep genes with at least 5% of total expression
(remove_low_expression(..., quantile = 0.05)). The gene
must also be expressed in at least n replicates, where n is the minimum
replicates in the group.
# remove low expression genes
dds <- remove_low_expression(dds, quantile = quantile, group_by = group_by, save_plot = save_plot, save_dir = save_dir)## Filtering genes with low expressions...

Check library size distribution
Expression of genes may vary between samples, and this may be due to sample/batch quality issues.
Here we draw a boxplot representing expression level of every gene
and compare the gene expression distribution from each sample using the
check_library() function.
If the boxplots show little variation in expression level across samples, this indicates that the samples are of high quality and suitable for differential expression analysis.
Otherwise, one should consider removing the samples with low quality or perform batch correction downstream to address quality issues. This will address technical variation between samples/batches and improve the statistical power for true biological variations.
# check library size distribution
check_library(dds, save_plot = save_plot, save_dir = save_dir)
Run PCA and distance analysis
Next we visualize the relationship between samples/groups using PCA
and distance analysis using the run_pca() and
run_distance() functions.
If there are true biological variability between groups, replicas from the same group should cluster closely in the PCA plot and have lower distance values in the distance heatmap.
Again, one should consider sample/batch quality if the samples do not separate as expected.
# run PCA for the samples with group information
vsd <- vst(dds, blind = FALSE)
run_pca(vsd, group_by = group_by, size = 4, save_data = save_data, save_plot = save_plot, save_dir = save_dir)## using ntop=500 top features by variance


# run distance analysis between samples
run_distance(vsd, save_data = save_data, save_plot = save_plot, save_dir = save_dir)
Module 2: Differential Expression Analysis
The run_diffexp_pip() module performs statistical
analysis to identify differentially expressed genes between groups using
DESeq2::DESeq2() function, with p-values were generated
using DESeq2 test.
Main function
The main function in the module is the run_diffexp_pip()
function, which by default will perform all pairwise/one-to-all
comparisons between groups in the DESeq2 object.
This returns a single dataframe all metrics from
DESeq2::results(), and the differentially expressed genes
for each comparison is separated by comparison column.
Fold change shrinkage is also performed by ashr
method.
Additionally, chromosome loci and functional annotations of each gene
was added to the dataframe using the
strpip::run_annotation() function (see https://github.com/hungms/strpip).
# run differential expression analysis for all comparisons; skipped here
#res <- run_diffexp_pip(dds, group_by = group_by, one_to_all = one_to_all, save_data = save_data, save_dir = save_dir)
#head(res)
# run differential expression analysis for a specific comparison
res <- run_diffexp(dds, org = org, group_by = group_by, comparison = "IgG_vs_IgA", order = order, save_data = save_data, save_dir = save_dir)## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
## Combining row metadata to results...
## Removed 8 rows with NA values or invalid gene symbols
head(res)## baseMean log2FoldChange lfcSE pvalue padj gene
## 1 1609294.3823 5.888986 0.2486099 2.150302e-127 3.564985e-123 Ighg2b
## 2 414449.8996 7.416217 0.4127803 1.333262e-75 1.105207e-71 Ighg1
## 3 1109964.2111 6.136009 0.3462582 1.711502e-73 9.458330e-70 Ighg2c
## 4 116.5320 4.668852 0.4445062 9.408951e-30 1.949888e-26 Smad3
## 5 490.8044 3.150512 0.3229892 1.222197e-24 1.688567e-21 BC035044
## 6 1329.3438 3.792944 0.4491024 2.942726e-20 2.217612e-17 Pld4
## baseVar allZero dispGeneEst dispGeneIter dispFit dispersion dispIter
## 1 3.115152e+12 FALSE 0.05815237 7 0.05148407 0.05866094 5
## 2 2.634566e+11 FALSE 0.29667561 10 0.05149559 0.15946279 10
## 3 1.402000e+12 FALSE 0.12433466 8 0.05148587 0.11288714 9
## 4 1.567870e+04 FALSE 0.07974167 9 0.10667979 0.09091352 7
## 5 1.921200e+05 FALSE 0.11029032 8 0.06458617 0.09178199 5
## 6 2.076169e+06 FALSE 0.27614296 9 0.05631895 0.18122957 9
## dispOutlier dispMAP Intercept Group2_IgG_vs_IgA SE_Intercept
## 1 FALSE 0.05866094 15.661421 5.933155 0.1747477
## 2 FALSE 0.15946279 12.154245 7.498636 0.2883103
## 3 FALSE 0.11288714 14.846632 6.216173 0.2424103
## 4 FALSE 0.09091352 2.947762 4.871319 0.3664599
## 5 FALSE 0.09178199 6.498391 3.301159 0.2349418
## 6 FALSE 0.18122957 7.237719 4.054703 0.3142603
## SE_Group2_IgG_vs_IgA WaldStatistic_Intercept WaldStatistic_Group2_IgG_vs_IgA
## 1 0.2471047 89.623064 24.010693
## 2 0.4075524 42.156816 18.399197
## 3 0.3427872 61.245867 18.134207
## 4 0.4299801 8.043889 11.329172
## 5 0.3221623 27.659581 10.246884
## 6 0.4397236 23.030968 9.221025
## WaldPvalue_Intercept WaldPvalue_Group2_IgG_vs_IgA betaConv betaIter deviance
## 1 0.000000e+00 2.150302e-127 TRUE 3 199.93005
## 2 0.000000e+00 1.333262e-75 TRUE 3 179.66890
## 3 0.000000e+00 1.711502e-73 TRUE 3 199.11238
## 4 8.703135e-16 9.408951e-30 TRUE 3 59.25030
## 5 2.140447e-168 1.222197e-24 TRUE 3 88.69925
## 6 2.282327e-117 2.942726e-20 TRUE 3 107.26246
## maxCooks comparison pxfc rank mouse_chromosome human_gene_symbol
## 1 0.06337948 IgG_vs_IgA 721.09417 721.09417 <NA> <NA>
## 2 0.66246996 IgG_vs_IgA 526.22919 526.22919 <NA> <NA>
## 3 0.50039836 IgG_vs_IgA 423.53300 423.53300 <NA> <NA>
## 4 0.07789860 IgG_vs_IgA 120.03614 120.03614 9 SMAD3
## 5 2.02423141 IgG_vs_IgA 65.44395 65.44395 <NA> <NA>
## 6 0.31871834 IgG_vs_IgA 63.16813 63.16813 12 PLD4
## human_chromosome human_gene_symbol_unique human_chromosome_unique entity_type
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 15 SMAD3 15 protein
## 5 <NA> <NA> <NA> <NA>
## 6 14 PLD4 14 protein
## category
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 intracellular
## 5 <NA>
## 6 transmembrane, intracellular
## parent consensus_score is_cs
## 1 <NA> NA NA
## 2 <NA> NA NA
## 3 <NA> NA NA
## 4 intracellular 3 FALSE
## 5 <NA> NA NA
## 6 transmembrane_predicted, intracellular, transmembrane 4 TRUE
## database_cs is_tf
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 OmniPath, GO_Intercell, ComPPI, LOCATE, UniProt TRUE
## 5 <NA> NA
## 6 Phobius, GO_Intercell, UniProt, OmniPath, Ramilowski_location FALSE
## database_tf
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 CollecTRI;DoRothEA;ExTRI_CollecTRI;NTNU.Curated_CollecTRI;RegNetwork_DoRothEA;TFactS_CollecTRI;TFactS_DoRothEA;TRED_DoRothEA;TRRUST_CollecTRI;TRRUST_DoRothEA;HTRI_CollecTRI;HTRIdb;HTRIdb_DoRothEA;ReMap_DoRothEA;FANTOM4_DoRothEA;CytReg_CollecTRI;GEREDB_CollecTRI;SIGNOR;SIGNOR_CollecTRI;ARACNe-GTEx_DoRothEA;GOA_CollecTRI;IntAct_CollecTRI;IntAct_DoRothEA;HOCOMOCO_DoRothEA;JASPAR_DoRothEA
## 5 <NA>
## 6 <NA>
Generating MA & volcano plots
Here we will use plot_ma() and
plot_volcano() to generate a single MA plot for the
IgG_vs_IgA comparison as a demonstration.
# generate MA plot for the "IgG_vs_IgA" comparison
plot_ma(res, save_plot = save_plot, save_dir = save_dir)## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 952 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Generate volcano plot for the "IgA_vs_IgG" comparison
plot_volcano(res, save_plot = FALSE, save_dir = save_dir)## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: Removed 15579 rows containing missing values or values outside the scale range
## (`geom_text()`).
## ggrepel: 37 unlabeled data points (too many overlaps). Consider increasing max.overlaps

Module 3: GSEA Analysis
The final run_gsea_pip() module performs gene set
enrichment analysis to identify biological pathways and processes using
a list of MSigDB gene set collections (HALLMARK, GOBP, KEGG, REACTOME,
TFT, BIOCARTA). Gene sets must have at least 10 and no more than 1000
genes to be tested.
Main function
This returns a list of gseaResult class objects for each gene set collection and comparison.
# run gsea analysis
# gsea <- run_gsea_pip(res, org = org, save_data = save_data, save_dir = save_dir)
# run gsea analysis
gsea <- run_gsea(res, org = org, save_data = save_data, save_dir = save_dir)## Converting baseMean to numeric
## Converting log2FoldChange to numeric
## Converting padj to numeric
## Converting rank to numeric
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
names(gsea)## [1] "ID" "Description" "setSize" "enrichmentScore"
## [5] "NES" "pvalue" "p.adjust" "qvalue"
## [9] "rank" "leading_edge" "core_enrichment" "collection"
## [13] "comparison"
Generate barplots for top enriched gene set terms
Here we will use plot_gsea_barplot() to generate a
single barplot for the IgA_vs_IgG comparison using GOBP
collection as a demonstration. By default, the barplot will show the top
10 upregulated and downregulated terms.
# generate barplot for the "IgA_vs_IgG" comparison
plot_gsea_barplot(gsea, n = 10, save_plot = FALSE, save_dir = save_dir)## Converting NES to numeric
## Converting pvalue to numeric
## Converting qvalue to numeric
## Warning in sub(paste0("^", collection, "_"), "", gsea$ID): argument 'pattern'
## has length > 1 and only the first element will be used
## Warning in geom_col(aes(stroke = label), width = 0.75, col = "black"):
## Ignoring unknown aesthetics: stroke
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_text()`).

EnrichmentMap pipeline
Finally the module will run enrichmentmap_pip() which
creates a enrichmentmap directory containing files formated for
EnrichmentMap visualization.
By default, the function will only use the HALLMARK, GOBP, KEGG and REACTOME collections. We will skip this step here.
# run enrichmentmap pipeline
#enrichmentmap_pip(gsea, org = org, group_by = group_by, group_dir = group_dir, save_dir = group_dir)