Detecting selection on expression of individual genes

Last updated: 2020-09-28

Checks: 6 1

Knit directory: Blancetal/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200217)

The command set.seed(20200217) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
~/Blancetal/figures/phist.png	../figures/phist.png

Repository version: d923b0f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d923b0f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/quaint-results.rda
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  figures/Supplement_Ve.png
    Untracked:  figures/phist.png
    Untracked:  output/all_0.05_heatmap.png
    Untracked:  output/names_0.05_all.txt

Unstaged changes:
    Modified:   analysis/Environmental_variance.Rmd
    Modified:   analysis/Selection_on_Expression_of_Env_Rsponse_Genes.Rmd
    Modified:   analysis/Selection_on_expression_of_coexpreession_clusters.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Identifying_quaint.Rmd) and HTML (docs/Identifying_quaint.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	d923b0f	jgblanc	2020-09-28	added less stringent cutoff
html	b30fe87	jgblanc	2020-04-23	Build site.
Rmd	47b41db	jgblanc	2020-04-23	Ready to publish individual genes section
Rmd	8298d4d	GitHub	2020-04-16	Merge branch ‘master’ into master
html	8298d4d	GitHub	2020-04-16	Merge branch ‘master’ into master
Rmd	a98d9a4	em	2020-04-16	stuff
Rmd	1ac0628	em	2020-04-16	stuff
Rmd	be3ef3c	em	2020-04-10	ready for pull requset?
html	be3ef3c	em	2020-04-10	ready for pull requset?
Rmd	94badb2	em	2020-04-09	stuff
Rmd	cf43de2	em	2020-04-01	stuff
Rmd	fa0c86f	em	2020-03-30	stuff
html	fa0c86f	em	2020-03-30	stuff
Rmd	6b00f47	em	2020-03-27	stuff
html	6b00f47	em	2020-03-27	stuff
Rmd	640b45a	em	2020-03-24	drought genes
html	640b45a	em	2020-03-24	drought genes
Rmd	682f50e	em	2020-03-23	stuff
html	682f50e	em	2020-03-23	stuff
Rmd	450bded	em	2020-03-19	more plots
html	450bded	em	2020-03-19	more plots
Rmd	626d202	em	2020-03-19	num sig genes
Rmd	2688954	em	2020-03-18	merge conflict ugh
Rmd	8c52f3c	Em	2020-03-17	stuff
html	b7f1f85	jgblanc	2020-03-13	Build site.
Rmd	1b91cdb	jgblanc	2020-03-13	added quanit html
Rmd	dec95d3	Em	2020-03-04	pc figures
html	dec95d3	Em	2020-03-04	pc figures
Rmd	993e4b7	Em	2020-03-03	mroe stuff
html	993e4b7	Em	2020-03-03	mroe stuff
Rmd	a2e0aec	Em	2020-02-27	drought gene analysis
html	a2e0aec	Em	2020-02-27	drought gene analysis
Rmd	c018f1d	Em	2020-02-26	added drought info
Rmd	227a4b6	Em	2020-02-26	added figure
html	227a4b6	Em	2020-02-26	added figure
Rmd	a597c91	Em	2020-02-26	stuff
html	a597c91	Em	2020-02-26	stuff

Intro

Here we identify genes whose expression is under divergent selection along the first 5 principal components in the dataset using the R package QUAINT. We have data for 7 different tissue types, the number of lines and genes per tissue is recorded in table S1. The mean centered expression values for the genes for each tissue type are stored in “data/Mean_centered_expression/tissuename.txt”. The kinship matrices, generated from 78,324 randomly chosen SNPs for each tissue type, are stored in “../data/Kinship_matrices/F_tissuename”.

Code

This function takes in a tissue name as an argument and returns the results of calcQpc() for each gene. calcQpc() takes a vector of traits (\(Z_1, Z_2, ...Z_m\)), the matrix of eigenvectors (\(\vec{U_1}, \vec{U_2}, ...\vec{U_m}\)) and vector of eigenvalues (\(\lambda_1, \lambda_2, ... \lambda_m\)) generated form the eigen decomposition of the kinship matrix, the range of PCs used to estimate \(V_a\) and the range of PCs you want to test for selection. Here our trait values are the mean centered expression values and we use the last half of PCs to estimate \(V_a\) and test for selection along the first 5 PCs. calcQpc() outputs a list that includes the \(c_m\) values (see equation 2) for the 5 PCs we are testing for selection along with the p-value from the F-test (see equation 3). The testAllGenes function runs calcQpc() on all genes and outputs the above information for every gene.

############ function for testing all genes
testAllGenes <- function(myTissue){
# Read in mean-centered expression values
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
geneNames = names(df1)
# Read in tissue specific kinship matrix 
myF <- read.table(paste('../data/Kinship_matrices/F_',myTissue,'.txt',sep=""))

## Get Eigen Values and Vectors 
myE <- eigen(myF)
E_vectors <- myE$vectors
E_values <- myE$values

## Testing for selection on first 5 PCs
myM = 1:5

## Using the last 1/2 of PCs to estimate Va
myL = 6:dim(myF)[1]


## test for selection on each gene
allGeneOutput <- sapply(1:ncol(df1), function(x){
myQpc = calcQpc(myZ = df1[,x], myU = E_vectors, myLambdas = E_values, myL = myL, myM = myM)
return(c(geneNames[x],myQpc))
})

return(allGeneOutput)
}

Now we will run the test for selection on all of our tissue types.

###run on all genes
alltissues = c('GRoot',"Kern","LMAD26","LMAN26", "L3Tip","GShoot","L3Base")
alltissueresults = lapply(alltissues, testAllGenes)
names(alltissueresults) <- alltissues

First let’s look at the number of genes with a raw p-value < 0.05.

##look at sig results
sigresults = lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs

## how many are significant?
numsig <- apply(pvals, 2, function(x){sum(x<0.05)})
numsig
})

###make a big image of how many sig results we have
sigTable_raw = as.data.frame(matrix(unlist(sigresults), ncol=5, byrow=T))
names(sigTable_raw) = c('PC1','PC2','PC3','PC4','PC5')
sigTable_raw$nicename = c('Germinating root', 'Kernel','Adult Leaf Day', 'Adult Leaf Night', '3rd leaf tip', 'Germinating shoot','3rd leaf base')
sigTable_raw

  PC1 PC2 PC3 PC4 PC5          nicename
1  74 153  46  77  92  Germinating root
2 507 331 376 131 116            Kernel
3 330 551 293 222 617    Adult Leaf Day
4 345 344 511 257 432  Adult Leaf Night
5  99 306  88  68  77      3rd leaf tip
6 106 191  61 123 151 Germinating shoot
7 390 352  79  70  91     3rd leaf base

We can get the names of all the genes with p-value < 0.05 in at least 1 pc/tissue combination. Here we have 4932 unique genes.

# Get names of genes w/ p-value < 0.05 
siggenes = unlist(lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs
pdf = as.data.frame(pvals)
pdf = pdf %>% mutate(sig = ifelse(V1 <= 0.05 | V2 <= 0.05 | V3 <= 0.05 | V4 <= 0.05 | V5 <= 0.05, T, F))
pdf$genes = unlist(alltissueresults[[i]][1,])
return(dplyr::filter(pdf, sig==T)$genes)
}))
unique_sig <- unique(siggenes) ##number of unique genes
length(unique_sig)

[1] 4932

#write.table(unique_sig,"../output/names_0.05_all.txt", row.names = F, quote = F)

Now lets look at the number of gene with FDR < 0.01 for each tissue

##look at sig results
sigresults = lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs

##look at pvalue distributions
# par(mfrow=c(3,2), mar=c(4,4,1,1))
# sapply(1:5, function(x){
# hist(pvals[,x], border="white", col = "darkgray", main="", breaks = 20, xlab = paste('PC ',x,' ',alltissues[i],sep=""))
# })

## test fpr significance with the fdr
pfdr = data.frame(apply(pvals,2, function(x){p.adjust(x, method='fdr')}))

## how many are significant?
numsig <- apply(pfdr, 2, function(x){sum(x<0.1)})
numsig
})

###make a big image of how many sig results we have
sigTable = as.data.frame(matrix(unlist(sigresults), ncol=5, byrow=T))
names(sigTable) = c('PC1','PC2','PC3','PC4','PC5')
sigTable$nicename = c('Germinating root', 'Kernel','Adult Leaf Day', 'Adult Leaf Night', '3rd leaf tip', 'Germinating shoot','3rd leaf base')
sigTable

  PC1 PC2 PC3 PC4 PC5          nicename
1   0   0   0   0   0  Germinating root
2   0   3   6   1   0            Kernel
3   1   5   3   0  27    Adult Leaf Day
4   1   0   6   0   8  Adult Leaf Night
5   0   1   0   1   0      3rd leaf tip
6   0   2   0   0   0 Germinating shoot
7   0   1   1   0   0     3rd leaf base

sigLong = tidyr::gather(sigTable, 'variable','value', -nicename)
sigLong[sigLong == 0] <- NA

###save output for other analyses
#save(alltissueresults, sigTable, sigLong, file = "../data/quaint-results.rda")

We can also check how many of our significant genes are unique. Here 60 of 66 significant genes are unqiue to their tissue/PC combination.

#how many unique genes???
siggenes = unlist(lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs
pfdr = data.frame(apply(pvals,2, function(x){p.adjust(x, method='fdr')})) ## test fpr significance with the fdr
pfdr$numsig <- apply(pfdr, 1, function(x){sum(x<0.1)})
pfdr$genes = unlist(alltissueresults[[i]][1,])
return(dplyr::filter(pfdr, numsig>0)$genes)
}))
length(unique(siggenes)) ##number of unique genes

[1] 60

Figures

Now that we have a table with the number of significant genes per tissue. let’s make a heat map of them with ggplot. This is Figure 1A in the paper.

pl <- ggplot(data=sigLong,aes(x=variable,y=nicename)) + 
  geom_tile(aes(fill=value),color='black') + scale_fill_gradient(low = 'lightyellow', high = "#CC79A7", guide = FALSE, na.value = "white") + theme_bw() + xlab("\n") + ylab("Tissue") + 
  theme(axis.ticks=element_blank(),panel.border=element_blank(),panel.grid.major = element_blank(), axis.text.y = element_text(size=10,angle=0), axis.title.y = element_text(size=16),axis.title.x  = element_text(size=16),axis.text.x = element_text(angle = 0, hjust = 0.5,size=14)) + geom_text(aes(label=value),colour="grey15",size=5.5)
pl

Warning: Removed 20 rows containing missing values (geom_text).

Make a table of the sample sizes - Table S1.

mySampleTable = sapply(alltissues, function(myTissue){
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
numGenes = dim(df1)[2]
numInds = dim(df1)[1]
return(c(numGenes,numInds))
})
rownames(mySampleTable) = c('genes','individuals')

mySampleTable

            GRoot Kern LMAD26 LMAN26 L3Tip GShoot L3Base
genes       10500 9814   8879   8435  8489  10195  11555
individuals   232  207    109    110   237    239    236

Here we have a function that allow’s us to plot expression vs PC

### function for making plots
makeGenePlot <- function(tissue, geneIndex, pc){

## read in population data
pop <- read.csv("../data/FlintGarciaTableS1_fixednew.csv", stringsAsFactors = F)

## read in expression data
exp <- read.table(paste("../data/Mean_centered_expression/",tissue,".txt", sep=""), stringsAsFactors = F)
expgene = data.frame(Inbred = row.names(exp), geneexp = exp[,geneIndex], stringsAsFactors = F)

## merge data together
pop_dat <- dplyr::inner_join(pop, expgene, by= "Inbred")

#get the eigenvalues
myF <- read.table(paste('../data/Kinship_matrices/F_',tissue,'.txt', sep=""))
myE = eigen(myF)
myPC = data.frame(pc = myE$vectors[,pc], stringsAsFactors = F)
pop_dat <- dplyr::bind_cols(pop_dat, myPC)

lambda <- myE$values[pc]

##calculate the CIs
generesults = alltissueresults[[tissue]][,geneIndex]
myVaest = var0(generesults$cml)
myCI = 1.96*sqrt(myVaest*lambda)

##gene name
geneName = names(exp)[geneIndex]

lR <- lm(pop_dat$geneexp ~ pop_dat$pc) 
coeff <- lR$coefficients[[2]]

if(tissue=="LMAD26"){mylab = c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "tropical")} else{mylab = c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "sweet", "tropical")} ##no sweets in LMAD26

col <- c('#E69F00', '#56B4E9', "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pl1 <- ggplot(data=pop_dat, aes(x = pc, y= geneexp , color=Subpopulation)) + scale_colour_manual(values = col, labels=mylab) + xlab(paste("PC",pc)) +  ylab("Expression") + theme_bw() + theme(panel.grid.major = element_blank(), text = element_text(size=15), panel.grid.minor = element_blank(), axis.title.y = element_text(size=18), axis.title.x  = element_text(size=18), legend.position = "right", legend.title = element_text(size = 12), legend.text = element_text(size = 20), plot.title = element_text(hjust = 0.5)) + geom_point(size = 3.5) + geom_abline(slope = myCI, linetype = 2, col = '#56B4E9', size=1.2) + geom_abline(slope = coeff, size = 1.5)+  geom_abline(slope = -myCI, linetype = 2, col = '#56B4E9', size=1.2)

return(pl1)
}

Here we will plot Figure 1B and 1C. The code bellow can be modified to make expression plots for different gene and different tissues by changing the tissue name in the first line and giving the index of the gene you want to plot. The index can be found by finding the column number

myTissue = 'LMAD26'
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
geneNames = names(df1)  

## Pick name of gene to plot and plot vs PC 
gene_index <- which(geneNames == "GRMZM2G152686")
pc1_plot = makeGenePlot(tissue = myTissue,geneIndex = gene_index, pc = 1)
pc1_plot

## Pick name of gene to plot and plot vs PC 
gene_index <- which(geneNames == "GRMZM2G069762")
pc5_plot = makeGenePlot(tissue = myTissue,geneIndex = gene_index, pc = 5)
pc5_plot

Make Final Figure

final <- ggarrange(pl,                                                 
          ggarrange(pc1_plot, pc5_plot, ncol = 2, labels = c("B", "C"),common.legend = T, legend = "bottom"), 
          nrow = 2, 
          labels = "A"                                       
          )

Warning: Removed 20 rows containing missing values (geom_text).

final

Plot the p-value histrogram for all tissue/PC combinations

pvals = matrix(unlist(alltissueresults[[1]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
df <- melt(pvals)
df$tissue <- "Germinating Root"

pvals = matrix(unlist(alltissueresults[[2]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Kernel"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[3]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Adult Leaf Day"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[4]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Adult Leaf Night"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[5]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "3rd Leaf Tip"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[6]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Germinating Shoot"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[7]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "3rd Leaf Night"
df <- rbind(df, dat)

pl <- ggplot(data = df, aes(x=value, fill = tissue)) + geom_histogram(bins = 25) + facet_grid(Var2 ~ tissue)  + theme_bw() +  xlim(c(0,1)) + xlab("p-value") + ylab("Number of Genes") + 
  scale_fill_manual(values=c('#E69F00', '#56B4E9', "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) + theme(legend.position = "none", axis.text.x =element_text(size = 6), strip.text.x =  element_text(size = 7))

#ggsave("~/Blancetal/figures/phist.png",pl, width = 10, height = 10)

sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.1.0       quaint_0.0.0.9000 ggpubr_0.3.0      reshape2_1.4.4   
[5] ggplot2_3.3.2     workflowr_1.6.2  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.15         purrr_0.3.4       haven_2.3.1      
 [5] lattice_0.20-41   carData_3.0-4     colorspace_1.4-1  vctrs_0.3.1      
 [9] generics_0.0.2    htmltools_0.5.0   yaml_2.2.1        rlang_0.4.6      
[13] later_1.1.0.1     pillar_1.4.4      foreign_0.8-72    glue_1.4.1       
[17] withr_2.2.0       readxl_1.3.1      lifecycle_0.2.0   plyr_1.8.6       
[21] stringr_1.4.0     cellranger_1.1.0  munsell_0.5.0     ggsignif_0.6.0   
[25] gtable_0.3.0      zip_2.0.4         evaluate_0.14     labeling_0.3     
[29] knitr_1.29        rio_0.5.16        forcats_0.5.0     httpuv_1.5.4     
[33] curl_4.3          highr_0.8         broom_0.5.6       Rcpp_1.0.4.6     
[37] promises_1.1.1    scales_1.1.1      backports_1.1.8   abind_1.4-5      
[41] farver_2.0.3      fs_1.4.1          gridExtra_2.3     hms_0.5.3        
[45] digest_0.6.25     stringi_1.4.6     openxlsx_4.1.5    rstatix_0.6.0    
[49] dplyr_1.0.0       cowplot_1.0.0     grid_3.6.2        rprojroot_1.3-2  
[53] tools_3.6.2       magrittr_1.5      tibble_3.0.1      crayon_1.3.4     
[57] whisker_0.4       car_3.0-8         pkgconfig_2.0.3   ellipsis_0.3.1   
[61] data.table_1.12.8 rmarkdown_2.3     R6_2.4.1          nlme_3.1-148     
[65] git2r_0.27.1      compiler_3.6.2

Detecting selection on expression of individual genes

Emily Josephs

Intro

Code

Figures