Last updated: 2020-09-28

Checks: 6 1

Knit directory: Blancetal/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200217) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute relative
~/Blancetal/figures/phist.png ../figures/phist.png

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d923b0f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/quaint-results.rda
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  figures/Supplement_Ve.png
    Untracked:  figures/phist.png
    Untracked:  output/all_0.05_heatmap.png
    Untracked:  output/names_0.05_all.txt

Unstaged changes:
    Modified:   analysis/Environmental_variance.Rmd
    Modified:   analysis/Selection_on_Expression_of_Env_Rsponse_Genes.Rmd
    Modified:   analysis/Selection_on_expression_of_coexpreession_clusters.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Identifying_quaint.Rmd) and HTML (docs/Identifying_quaint.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd d923b0f jgblanc 2020-09-28 added less stringent cutoff
html b30fe87 jgblanc 2020-04-23 Build site.
Rmd 47b41db jgblanc 2020-04-23 Ready to publish individual genes section
Rmd 8298d4d GitHub 2020-04-16 Merge branch ‘master’ into master
html 8298d4d GitHub 2020-04-16 Merge branch ‘master’ into master
Rmd a98d9a4 em 2020-04-16 stuff
Rmd 1ac0628 em 2020-04-16 stuff
Rmd be3ef3c em 2020-04-10 ready for pull requset?
html be3ef3c em 2020-04-10 ready for pull requset?
Rmd 94badb2 em 2020-04-09 stuff
Rmd cf43de2 em 2020-04-01 stuff
Rmd fa0c86f em 2020-03-30 stuff
html fa0c86f em 2020-03-30 stuff
Rmd 6b00f47 em 2020-03-27 stuff
html 6b00f47 em 2020-03-27 stuff
Rmd 640b45a em 2020-03-24 drought genes
html 640b45a em 2020-03-24 drought genes
Rmd 682f50e em 2020-03-23 stuff
html 682f50e em 2020-03-23 stuff
Rmd 450bded em 2020-03-19 more plots
html 450bded em 2020-03-19 more plots
Rmd 626d202 em 2020-03-19 num sig genes
Rmd 2688954 em 2020-03-18 merge conflict ugh
Rmd 8c52f3c Em 2020-03-17 stuff
html b7f1f85 jgblanc 2020-03-13 Build site.
Rmd 1b91cdb jgblanc 2020-03-13 added quanit html
Rmd dec95d3 Em 2020-03-04 pc figures
html dec95d3 Em 2020-03-04 pc figures
Rmd 993e4b7 Em 2020-03-03 mroe stuff
html 993e4b7 Em 2020-03-03 mroe stuff
Rmd a2e0aec Em 2020-02-27 drought gene analysis
html a2e0aec Em 2020-02-27 drought gene analysis
Rmd c018f1d Em 2020-02-26 added drought info
Rmd 227a4b6 Em 2020-02-26 added figure
html 227a4b6 Em 2020-02-26 added figure
Rmd a597c91 Em 2020-02-26 stuff
html a597c91 Em 2020-02-26 stuff

Intro

Here we identify genes whose expression is under divergent selection along the first 5 principal components in the dataset using the R package QUAINT. We have data for 7 different tissue types, the number of lines and genes per tissue is recorded in table S1. The mean centered expression values for the genes for each tissue type are stored in “data/Mean_centered_expression/tissuename.txt”. The kinship matrices, generated from 78,324 randomly chosen SNPs for each tissue type, are stored in “../data/Kinship_matrices/F_tissuename”.

Code

This function takes in a tissue name as an argument and returns the results of calcQpc() for each gene. calcQpc() takes a vector of traits (\(Z_1, Z_2, ...Z_m\)), the matrix of eigenvectors (\(\vec{U_1}, \vec{U_2}, ...\vec{U_m}\)) and vector of eigenvalues (\(\lambda_1, \lambda_2, ... \lambda_m\)) generated form the eigen decomposition of the kinship matrix, the range of PCs used to estimate \(V_a\) and the range of PCs you want to test for selection. Here our trait values are the mean centered expression values and we use the last half of PCs to estimate \(V_a\) and test for selection along the first 5 PCs. calcQpc() outputs a list that includes the \(c_m\) values (see equation 2) for the 5 PCs we are testing for selection along with the p-value from the F-test (see equation 3). The testAllGenes function runs calcQpc() on all genes and outputs the above information for every gene.

############ function for testing all genes
testAllGenes <- function(myTissue){
# Read in mean-centered expression values
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
geneNames = names(df1)
# Read in tissue specific kinship matrix 
myF <- read.table(paste('../data/Kinship_matrices/F_',myTissue,'.txt',sep=""))

## Get Eigen Values and Vectors 
myE <- eigen(myF)
E_vectors <- myE$vectors
E_values <- myE$values

## Testing for selection on first 5 PCs
myM = 1:5

## Using the last 1/2 of PCs to estimate Va
myL = 6:dim(myF)[1]


## test for selection on each gene
allGeneOutput <- sapply(1:ncol(df1), function(x){
myQpc = calcQpc(myZ = df1[,x], myU = E_vectors, myLambdas = E_values, myL = myL, myM = myM)
return(c(geneNames[x],myQpc))
})

return(allGeneOutput)
}

Now we will run the test for selection on all of our tissue types.

###run on all genes
alltissues = c('GRoot',"Kern","LMAD26","LMAN26", "L3Tip","GShoot","L3Base")
alltissueresults = lapply(alltissues, testAllGenes)
names(alltissueresults) <- alltissues

First let’s look at the number of genes with a raw p-value < 0.05.

##look at sig results
sigresults = lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs

## how many are significant?
numsig <- apply(pvals, 2, function(x){sum(x<0.05)})
numsig
})

###make a big image of how many sig results we have
sigTable_raw = as.data.frame(matrix(unlist(sigresults), ncol=5, byrow=T))
names(sigTable_raw) = c('PC1','PC2','PC3','PC4','PC5')
sigTable_raw$nicename = c('Germinating root', 'Kernel','Adult Leaf Day', 'Adult Leaf Night', '3rd leaf tip', 'Germinating shoot','3rd leaf base')
sigTable_raw
  PC1 PC2 PC3 PC4 PC5          nicename
1  74 153  46  77  92  Germinating root
2 507 331 376 131 116            Kernel
3 330 551 293 222 617    Adult Leaf Day
4 345 344 511 257 432  Adult Leaf Night
5  99 306  88  68  77      3rd leaf tip
6 106 191  61 123 151 Germinating shoot
7 390 352  79  70  91     3rd leaf base

We can get the names of all the genes with p-value < 0.05 in at least 1 pc/tissue combination. Here we have 4932 unique genes.

# Get names of genes w/ p-value < 0.05 
siggenes = unlist(lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs
pdf = as.data.frame(pvals)
pdf = pdf %>% mutate(sig = ifelse(V1 <= 0.05 | V2 <= 0.05 | V3 <= 0.05 | V4 <= 0.05 | V5 <= 0.05, T, F))
pdf$genes = unlist(alltissueresults[[i]][1,])
return(dplyr::filter(pdf, sig==T)$genes)
}))
unique_sig <- unique(siggenes) ##number of unique genes
length(unique_sig)
[1] 4932
#write.table(unique_sig,"../output/names_0.05_all.txt", row.names = F, quote = F)

Now lets look at the number of gene with FDR < 0.01 for each tissue

##look at sig results
sigresults = lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs

##look at pvalue distributions
# par(mfrow=c(3,2), mar=c(4,4,1,1))
# sapply(1:5, function(x){
# hist(pvals[,x], border="white", col = "darkgray", main="", breaks = 20, xlab = paste('PC ',x,' ',alltissues[i],sep=""))
# })

## test fpr significance with the fdr
pfdr = data.frame(apply(pvals,2, function(x){p.adjust(x, method='fdr')}))

## how many are significant?
numsig <- apply(pfdr, 2, function(x){sum(x<0.1)})
numsig
})

###make a big image of how many sig results we have
sigTable = as.data.frame(matrix(unlist(sigresults), ncol=5, byrow=T))
names(sigTable) = c('PC1','PC2','PC3','PC4','PC5')
sigTable$nicename = c('Germinating root', 'Kernel','Adult Leaf Day', 'Adult Leaf Night', '3rd leaf tip', 'Germinating shoot','3rd leaf base')
sigTable
  PC1 PC2 PC3 PC4 PC5          nicename
1   0   0   0   0   0  Germinating root
2   0   3   6   1   0            Kernel
3   1   5   3   0  27    Adult Leaf Day
4   1   0   6   0   8  Adult Leaf Night
5   0   1   0   1   0      3rd leaf tip
6   0   2   0   0   0 Germinating shoot
7   0   1   1   0   0     3rd leaf base
sigLong = tidyr::gather(sigTable, 'variable','value', -nicename)
sigLong[sigLong == 0] <- NA

###save output for other analyses
#save(alltissueresults, sigTable, sigLong, file = "../data/quaint-results.rda")

We can also check how many of our significant genes are unique. Here 60 of 66 significant genes are unqiue to their tissue/PC combination.

#how many unique genes???
siggenes = unlist(lapply(1:length(alltissues), function(i){
# extract the pvalues
pvals = matrix(unlist(alltissueresults[[i]][5,]), ncol=5, byrow=TRUE) #each row corresponds to a gene, columns are to PCs
pfdr = data.frame(apply(pvals,2, function(x){p.adjust(x, method='fdr')})) ## test fpr significance with the fdr
pfdr$numsig <- apply(pfdr, 1, function(x){sum(x<0.1)})
pfdr$genes = unlist(alltissueresults[[i]][1,])
return(dplyr::filter(pfdr, numsig>0)$genes)
}))
length(unique(siggenes)) ##number of unique genes
[1] 60

Figures

Now that we have a table with the number of significant genes per tissue. let’s make a heat map of them with ggplot. This is Figure 1A in the paper.

pl <- ggplot(data=sigLong,aes(x=variable,y=nicename)) + 
  geom_tile(aes(fill=value),color='black') + scale_fill_gradient(low = 'lightyellow', high = "#CC79A7", guide = FALSE, na.value = "white") + theme_bw() + xlab("\n") + ylab("Tissue") + 
  theme(axis.ticks=element_blank(),panel.border=element_blank(),panel.grid.major = element_blank(), axis.text.y = element_text(size=10,angle=0), axis.title.y = element_text(size=16),axis.title.x  = element_text(size=16),axis.text.x = element_text(angle = 0, hjust = 0.5,size=14)) + geom_text(aes(label=value),colour="grey15",size=5.5)
pl
Warning: Removed 20 rows containing missing values (geom_text).

Make a table of the sample sizes - Table S1.

mySampleTable = sapply(alltissues, function(myTissue){
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
numGenes = dim(df1)[2]
numInds = dim(df1)[1]
return(c(numGenes,numInds))
})
rownames(mySampleTable) = c('genes','individuals')

mySampleTable
            GRoot Kern LMAD26 LMAN26 L3Tip GShoot L3Base
genes       10500 9814   8879   8435  8489  10195  11555
individuals   232  207    109    110   237    239    236

Here we have a function that allow’s us to plot expression vs PC

### function for making plots
makeGenePlot <- function(tissue, geneIndex, pc){

## read in population data
pop <- read.csv("../data/FlintGarciaTableS1_fixednew.csv", stringsAsFactors = F)

## read in expression data
exp <- read.table(paste("../data/Mean_centered_expression/",tissue,".txt", sep=""), stringsAsFactors = F)
expgene = data.frame(Inbred = row.names(exp), geneexp = exp[,geneIndex], stringsAsFactors = F)

## merge data together
pop_dat <- dplyr::inner_join(pop, expgene, by= "Inbred")

#get the eigenvalues
myF <- read.table(paste('../data/Kinship_matrices/F_',tissue,'.txt', sep=""))
myE = eigen(myF)
myPC = data.frame(pc = myE$vectors[,pc], stringsAsFactors = F)
pop_dat <- dplyr::bind_cols(pop_dat, myPC)

lambda <- myE$values[pc]

##calculate the CIs
generesults = alltissueresults[[tissue]][,geneIndex]
myVaest = var0(generesults$cml)
myCI = 1.96*sqrt(myVaest*lambda)

##gene name
geneName = names(exp)[geneIndex]

lR <- lm(pop_dat$geneexp ~ pop_dat$pc) 
coeff <- lR$coefficients[[2]]

if(tissue=="LMAD26"){mylab = c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "tropical")} else{mylab = c("mixed", "non-stiff stalk", "popcorn", "stiff stalk", "sweet", "tropical")} ##no sweets in LMAD26

col <- c('#E69F00', '#56B4E9', "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pl1 <- ggplot(data=pop_dat, aes(x = pc, y= geneexp , color=Subpopulation)) + scale_colour_manual(values = col, labels=mylab) + xlab(paste("PC",pc)) +  ylab("Expression") + theme_bw() + theme(panel.grid.major = element_blank(), text = element_text(size=15), panel.grid.minor = element_blank(), axis.title.y = element_text(size=18), axis.title.x  = element_text(size=18), legend.position = "right", legend.title = element_text(size = 12), legend.text = element_text(size = 20), plot.title = element_text(hjust = 0.5)) + geom_point(size = 3.5) + geom_abline(slope = myCI, linetype = 2, col = '#56B4E9', size=1.2) + geom_abline(slope = coeff, size = 1.5)+  geom_abline(slope = -myCI, linetype = 2, col = '#56B4E9', size=1.2)

return(pl1)
}

Here we will plot Figure 1B and 1C. The code bellow can be modified to make expression plots for different gene and different tissues by changing the tissue name in the first line and giving the index of the gene you want to plot. The index can be found by finding the column number

myTissue = 'LMAD26'
df1 <- read.table(paste("../data/Mean_centered_expression/",myTissue,".txt",sep=""))
geneNames = names(df1)  

## Pick name of gene to plot and plot vs PC 
gene_index <- which(geneNames == "GRMZM2G152686")
pc1_plot = makeGenePlot(tissue = myTissue,geneIndex = gene_index, pc = 1)
pc1_plot

## Pick name of gene to plot and plot vs PC 
gene_index <- which(geneNames == "GRMZM2G069762")
pc5_plot = makeGenePlot(tissue = myTissue,geneIndex = gene_index, pc = 5)
pc5_plot

Make Final Figure

final <- ggarrange(pl,                                                 
          ggarrange(pc1_plot, pc5_plot, ncol = 2, labels = c("B", "C"),common.legend = T, legend = "bottom"), 
          nrow = 2, 
          labels = "A"                                       
          )
Warning: Removed 20 rows containing missing values (geom_text).
final

Plot the p-value histrogram for all tissue/PC combinations

pvals = matrix(unlist(alltissueresults[[1]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
df <- melt(pvals)
df$tissue <- "Germinating Root"

pvals = matrix(unlist(alltissueresults[[2]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Kernel"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[3]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Adult Leaf Day"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[4]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Adult Leaf Night"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[5]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "3rd Leaf Tip"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[6]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "Germinating Shoot"
df <- rbind(df, dat)

pvals = matrix(unlist(alltissueresults[[7]][5,]), ncol=5, byrow=TRUE)
colnames(pvals) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
dat <- melt(pvals)
dat$tissue <- "3rd Leaf Night"
df <- rbind(df, dat)

pl <- ggplot(data = df, aes(x=value, fill = tissue)) + geom_histogram(bins = 25) + facet_grid(Var2 ~ tissue)  + theme_bw() +  xlim(c(0,1)) + xlab("p-value") + ylab("Number of Genes") + 
  scale_fill_manual(values=c('#E69F00', '#56B4E9', "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) + theme(legend.position = "none", axis.text.x =element_text(size = 6), strip.text.x =  element_text(size = 7))

#ggsave("~/Blancetal/figures/phist.png",pl, width = 10, height = 10)

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.1.0       quaint_0.0.0.9000 ggpubr_0.3.0      reshape2_1.4.4   
[5] ggplot2_3.3.2     workflowr_1.6.2  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.15         purrr_0.3.4       haven_2.3.1      
 [5] lattice_0.20-41   carData_3.0-4     colorspace_1.4-1  vctrs_0.3.1      
 [9] generics_0.0.2    htmltools_0.5.0   yaml_2.2.1        rlang_0.4.6      
[13] later_1.1.0.1     pillar_1.4.4      foreign_0.8-72    glue_1.4.1       
[17] withr_2.2.0       readxl_1.3.1      lifecycle_0.2.0   plyr_1.8.6       
[21] stringr_1.4.0     cellranger_1.1.0  munsell_0.5.0     ggsignif_0.6.0   
[25] gtable_0.3.0      zip_2.0.4         evaluate_0.14     labeling_0.3     
[29] knitr_1.29        rio_0.5.16        forcats_0.5.0     httpuv_1.5.4     
[33] curl_4.3          highr_0.8         broom_0.5.6       Rcpp_1.0.4.6     
[37] promises_1.1.1    scales_1.1.1      backports_1.1.8   abind_1.4-5      
[41] farver_2.0.3      fs_1.4.1          gridExtra_2.3     hms_0.5.3        
[45] digest_0.6.25     stringi_1.4.6     openxlsx_4.1.5    rstatix_0.6.0    
[49] dplyr_1.0.0       cowplot_1.0.0     grid_3.6.2        rprojroot_1.3-2  
[53] tools_3.6.2       magrittr_1.5      tibble_3.0.1      crayon_1.3.4     
[57] whisker_0.4       car_3.0-8         pkgconfig_2.0.3   ellipsis_0.3.1   
[61] data.table_1.12.8 rmarkdown_2.3     R6_2.4.1          nlme_3.1-148     
[65] git2r_0.27.1      compiler_3.6.2