ATAC seq

使用Tn5转酶切割染色质的开放区域、对这些片段进行测序；只需要少量样本(500~50000 cell)；可结合基因表达数据、挖掘开放区域的潜在转录因子结合区域(TFBS)，得到关键的TF作为潜在的疾病的Marker或进行Atlas绘制

promoter: 接近TSS区域，包含TF结合位点、募集 RNA polymerase
如果TF接触 promoter region 时结合了 enhancer，则基因的表达量升高；如果结合了 silencer，则基因的表达量降低
enhancer 最远距离 promoter region 上下游1Mb

现成的流程：ATAC-seq Pipeline， ChIP-seq Pipeline

表观组学的数据基本上是 genomic sites/regions 中出现与否的信息，主要看它周边的基因、或查看区域/点位周边的序列motif

Pipeline

与 ChipSeq 流程有许多重合：mapping后callPeaks，然后idr合并重复实验的重合peaks作为这个样本最终的peaks。Peaks实际上是chr上的一段区间，可以提取这个区间的序列进行motif富集/denovo预测，然后选取motif基序扫描基因组得到潜在的TFBS，Footprinting验证其是否已经结合TF。同时也可注释Peak所在区域，提取Promoter对应的基因进行GO/KEGG等分析。也可查看不同样本间结合位点的差异性（DiffBind）

在设计实验时一般会有重复实验和control组作为背景，callpeaks/QC/Diffbind时会消除掉背景的影响

Task	Tools	说明
Reads QC + Trim	FastQC + Trimmomatic	--
Mapping to Ref	BWA/Bowtie2	得到 BAM file
Pre-analysis	Samtools	- Mark PCR Duplicate - Filter Insert Size - Filter off mitochondria reads
Peak Calling	MACS2 `--shift -75(ReadsLen)` (断点才是Peak中心)	已经初步过滤
View Peak	IGV	提供bam和gff，对照Peak Calling结果查看peak是否合理
Peak QC	samtools + ChIPQC + phantompeakqualtools + Greenscreen	查看PeakCalling的大致的情况，确定数据是否合理 - InsertSize, FRiP, ... - ENCODE Blacklist: 去除因富含GC/重复片段而总是高信号的区域，有可能过度严格 - Greenscreen 去除假阳性Peaks
Merge Replicates	IDR	寻找、合并重复样本间一致性的peaks
-------	Normalization + Analysis	-------
Diff Peak	DiffBind	两个样本间开放程度的差异（对peak进行deseq），输入BED file
Peak Annotation	ChIPseeker	Peak位于promoter/gene/..处；如果Bioconductor没有合适的TxDb，则需要使用GenomicFeatures包`makeTxDbFromxxx`制作
--	annotatr	一般设置基因TSS上游1000bp为启动子区域，需提供该物种基因组的结构注释.gff/.bed
Motif	MEME（All） homer（denovo） chromvar（Enrich） TFBSTools（Scan） ggmotif_plot（画图）	- deNovo: 根据Peak序列从头预测Motif基序 - Enrichment: 扫描Motif库（例如JASPAR数据库），获取本样品中富含的Motif基序 -Scan: 已知晓Motif，扫描基因组寻找潜在TFBS
Footprinting	HINT-ATAC	验证潜在TFBS（寻找已结合TF/Ribo的TFBS）：如果某一开放区域已经被结合，则此区域不会被测得，表现为peak中间有凹陷
GO, Pathway ...	--	(基于基因组上距离)相关基因的功能信息

注：可视化推荐用IGV。deepTools - bamCoverage可以将bam转换为BigWig.bw文件，作为IGV或UCSC的输入；deepTools本身也可以画展示TSS附近的coverage的图（UCSC下载refTSS），例如plotProfile、plotHeatmap

## pip install deeptools
## 支持--blackListFileName
## 参考 https://cloud.tencent.com/developer/article/1842761
samtools index reads.bam
bamCoverage --normalizineUsing -b reads.bam -o coverage.bw


## UCSC 下载 ref.TSS.bed
## 参考 https://www.jianshu.com/p/d6cb795af22a
## 参考 https://y-bai.github.io/posts/omics-find-tss-for-genes/
## 然后 computeMatrix reference-point --referencePoint TSS -R refTSS.bed -S my.bw   ...

Run

SetupDir

mkdir ref
mkdir raw
mkdir trimmed
mkdir mapped
mkdir peak   # *__peaks.narrowPeak

Prepare Data

cd ref
wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.51.gff3.gz
gunzip *.gz
# sed -i 's/>/>chr/g' Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
mv Arabidopsis_thaliana.TAIR10.dna.toplevel.fa Arabidopsis_thaliana.fa
mv Arabidopsis_thaliana.TAIR10.51.gff3  Arabidopsis_thaliana.gff3
bwa index Arabidopsis_thaliana.fa
cd ..

cd raw
echo "SRR5874658\nSRR5874659\n" > SraAccList.txt
cat SraAccList.txt | while read i; do
x=$(echo $i | cut -b1-6)
y=`echo ${i: -1}`
echo $id
ascp -QT -l 500m -P33001  -i /home/a/miniconda3/envs/bulkatac/etc/asperaweb_id_dsa.openssh \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${x}/00${y}/${i}/${i}_1.fastq.gz \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${x}/00${y}/${i}/${i}_2.fastq.gz ./
done
cd ..

cd raw     ## extract a small portion of it
cat SraAccList.txt | while read i; do
less  ${i}_1.fastq.gz|head -500000 >  t${i}_1.fastq
less  ${i}_2.fastq.gz|head -500000 >  t${i}_2.fastq
less  ${i}_1.fastq.gz|head -50000 >  c${i}_1.fastq
less  ${i}_2.fastq.gz|head -50000 >  c${i}_2.fastq
done
gzip *.fastq

Call Peak

暂且不用对照组

echo -e "SampleID\tTissue\tFactor\tCondition\tReplicate\tbamReads\tControlID\tbamControl\tPeaks\tPeakCaller" > peak/Sheet.csv

for dd in SRR5874658 SRR5874659 ; do
SID=t${dd}
# CID=c${dd}
# echo -e "${SID}\tNA\tNA\tNA\t1\t../mapped/${SID}.rmdup.sorted.bam\tControlID\t../mapped/${CID}.rmdup.sorted.bam\t${SID}_out/${SID}_peaks.narrowPeak\tnarrow" >> peak/Sheet.csv
echo -e "${SID}\tNA\tNA\tNA\t1\t../mapped/${SID}.rmdup.sorted.bam\tControlID\tNA\t${SID}_out/${SID}_peaks.narrowPeak\tnarrow" >> peak/Sheet.csv
sh mapping.sh ${SID} Arabidopsis_thaliana.fa 
# sh mapping.sh ${CID} Arabidopsis_thaliana.fa 
# sh callpeakC.sh  ${SID}  ${CID}  135000000
sh callpeak.sh  ${SID}  135000000
done
cd peak

然后导入IGV视情况、或根据 /_peaks.xls 里的 -log10(pvalue) 等消除不合格的peaks（直接从MACS结果的.narrowPeak/.bed文件中提取感兴趣的peaks？？）

Peak QC

QC只是展示了PeakCalling的大致的情况（类似FastQC），并没有提示某个peak

一些快速QC小技巧

## insertsize List
samtools view ../mapped/tSRR5874658.rmdup.sorted.bam | cut -f 9 | less

## RIP%/FRiP = reads_in_peak/all_reads
bedtools intersect -a ../mapped/tSRR5874658.bed  -b tSRR5874658_out/tSRR5874658_peaks.narrowPeak | wc -l
wc -l ../mapped/tSRR5874658.bed

ChIPQC：似乎已经包含了DiffBind

library(ChIPQC)
## browseVignettes("ChIPQC")
samples <- read.csv('Sheet.csv',sep='\t')
chipObj <- ChIPQC(samples, annotation = NULL) 
ChIPQCreport(chipObj, reportName="ChIP QC report", reportFolder="ChIPQCreport")

# 如果使用SampleList（它也支持单个），则至少有2个sample。否则报错（另外留下的peak太少有可能报错？）   
# 内置的annotation支持blacklist，不过暂时没有找到自定义anno的方法。。

Merge Replicates

找到重复实验的重合peaks，合并它们。

粗暴Merge:

bedtools intersect -a tSRR5874658_out/tSRR5874658_peaks.narrowPeak -b tSRR5874659_out/tSRR5874659_peaks.narrowPeak -wo > overlaps.bed
## if A B overlaps, return -wo
## : Write the original A and B entries plus the number of base pairs of overlap between the two features

IDR: 标准流程

## 需要先排序
sort -k8,8nr tSRR5874658_out/tSRR5874658_peaks.narrowPeak > s.tSRR5874658_peaks.narrowPeak
sort -k8,8nr tSRR5874659_out/tSRR5874659_peaks.narrowPeak > s.tSRR5874659_peaks.narrowPeak

## 
## WITHOUT --peak-list 
## 如果 SampleA_peak1 与 SampleB_peak9 有overlap，则把它们merge在一起；--use-nonoverlapping-peaks 才能显示无overlap的peaks
idr --samples  s.tSRR5874658_peaks.narrowPeak  s.tSRR5874659_peaks.narrowPeak  --plot


## WITH --peak-list  一组自定义的peak范围
## 从各样本中选取与peaklist中OraclPeak重合者，如果多个样本Peak同时match一个OraclePeak，则依次选择：
## 1. 与summit最近  2. largest overlap 3. highest score
idr --samples  s.tSRR5874658_peaks.narrowPeak  s.tSRR5874659_peaks.narrowPeak  --peak-list  xxx???


## idrValues.txt format
## https://github.com/nboley/idr/blob/master/README.md

Peak Annotation

ChIPseeker covplot 也可以作图

library("ChIPseeker")  
library("GenomicFeatures")

ara_TxDb <- makeTxDbFromGFF("../ref/Arabidopsis_thaliana.gff3")
peaks <-readPeakFile("tSRR5874658_out/tSRR5874658_peaks.narrowPeak")
seqlevels(peaks)  ## chr名称应该与gff中一致！

## Anno
peaksAnno <- annotatePeak(peaks, tssRegion=c(-3000,3000),TxDb=ara_TxDb)
peaksAnno_df  <- as.data.frame(peaksAnno)


## Plot1
covplot(peaks,weightCol=5)
## Plot2:  peakHeatmap(peaks, TxDb=ara_TxDb, upstream=1500, downstream=1500, color="blue")
promoter <- getPromoters(TxDb=ara_TxDb, upstream=1500, downstream=1500)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
tagHeatmap(tagMatrix, xlim=c(-1500, 1500), color="green")
## Plot3
# files <- c("tSRR5874658_out/tSRR5874658_peaks.narrowPeak","tSRR5874659_out/tSRR5874659_peaks.narrowPeak")
# names(files_list) <- c("tSRR5874658","tSRR5874659")
# tagMatrixList <- lapply(files_list, getTagMatrix, windows=promoter)
# plotAvgProf(tagMatrixList, xlim=c(-1000, 1000), conf=0.95,resample=500, facet="row")

Motif （TBA）

homer：由于测试数据太少，没有结果

awk '{print $4"\t"$1"\t"$2"\t"$3"\t+"}' tSRR5874658_out/tSRR5874658_peaks.narrowPeak > testhomer.bed
findMotifsGenome.pl testhomer.bed ../ref/Arabidopsis_thaliana.fa homer_out

chromVAR：peaks此处使用单个试验的，其实也可以是merged（自行从idrValues.txt里提取？？）

library(chromVAR)
library(TFBSTools)
library(ggmotif)   ## https://github.com/AliciaSchep/ggmotif


peaks <- readNarrowpeaks("tSRR5874658_out/tSRR5874658_peaks.narrowPeak")  
bamfiles <- c("../mapped/tSRR5874658.rmdup.sorted.bam","../mapped/tSRR5874659.rmdup.sorted.bam")

fragment_counts <- getCounts(bamfiles, peaks, 
                              paired =  TRUE, 
                              by_rg = TRUE, 
                              format = "bam")

motifs <- getJasparMotifs()

## FAIL TBA

Bash Files

mapping.sh

SID=$1
REF=$2
echo "Processing ${SID}   $REF  $GSIZE at $PWD"

cd trimmed
trimmomatic PE -phred33  ../raw/${SID}_1.fastq.gz ../raw/${SID}_2.fastq.gz  ${SID}_R1.trimmed.fq.gz ${SID}_R1.U.fq.gz  ${SID}_R2.trimmed.fq.gz ${SID}_R2.U.fq.gz SLIDINGWINDOW:4:15 MINLEN:75   -summary ${SID}.summary.txt  -trimlog ${SID}.log.txt  -threads 64  -quiet
fastqc ${SID}_R1.trimmed.fq.gz -o ./
fastqc ${SID}_R2.trimmed.fq.gz -o ./
cd ..


cd mapped
bwa mem ../ref/$REF ../trimmed/${SID}_R1.trimmed.fq.gz ../trimmed/${SID}_R2.trimmed.fq.gz | grep -v Mt | samtools view -b - | samtools sort -  > ${SID}.sorted.bam
ulimit -n 8000 
sambamba markdup -r ${SID}.sorted.bam ${SID}.rmdup.bam --tmpdir=./ -t 2
bedtools bamtobed -i ${SID}.rmdup.bam  > ${SID}.bed 
samtools sort ${SID}.rmdup.bam  > ${SID}.rmdup.sorted.bam
samtools index ${SID}.rmdup.sorted.bam
cd ..

## sh mapping.sh ct Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

callpeak

callpeak.sh (no control)

SID=$1
GSIZE=$2
echo "Processing ${SID}    $GSIZE at $PWD"

cd peak
macs2 callpeak -t ../mapped/${SID}.rmdup.sorted.bam -n ${SID} --shift -100 --extsize 200 --nomodel -B --SPMR -g $GSIZE --outdir ${SID}_out 2> ${SID}.macs2.log
cd ..

## sh callpeak.sh SID GSIZE

callpeakC.sh (with control)

SID=$1
CID=$2
GSIZE=$3
echo "Processing ${SID} ${CID} $GSIZE  at $PWD"

cd peak
macs2 callpeak -c ../mapped/${CID}.rmdup.sorted.bam -t ../mapped/${SID}.rmdup.sorted.bam -n ${SID} --shift -100 --extsize 200 --nomodel -B --SPMR -g $GSIZE --outdir ${SID}_out 2> ${SID}.macs2.log
cd ..

## sh callpeakC.sh SID CID  GSIZE

Install Env

conda create --name idr  ## py11
conda activate idr
conda install python=3.6 -y
conda install -c bioconda idr -y
conda deactivate

conda create --name bulkatac
conda activate bulkatac
conda install -c hcc aspera-cli -y
conda install -c bioconda trimmomatic -y
conda install -c bioconda fastqc -y 
conda install conda-forge::libgcc-ng -y
conda install -c conda-forge ncurses -y
conda install -c bioconda samtools=1.18 --force-reinstall -y
conda install -c bioconda bwa -y
conda install -c bioconda sambamba -y
conda install -c bioconda bedtools
conda install -c bioconda homer -y

sudo apt install macs

wget https://github.com/kundajelab/idr/archive/2.0.4.zip
unzip 2.0.4.zip
cd idr-2.0.4
conda install anaconda::numpy -y 
conda install conda-forge::setuptools -y
pip install  matplotlib  dev
python3 setup.py install




if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ChIPQC")
devtools::install_github("YuLab-SMU/ggtree")
BiocManager::install("ChIPseeker")
BiocManager::install("chromVAR")  ## sudo apt install libgsl-dev 
BiocManager::install("TFBSTools")
devtools::install_github("AliciaSchep/ggmotif")

参考

https://zhuanlan.zhihu.com/p/415718382（拟南芥）

https://www.plob.org/article/24683.html（人系列教程）

https://cloud.tencent.com/developer/article/1360799（人）

https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html（人）

https://zhuanlan.zhihu.com/p/590862767（概念）

https://zhuanlan.zhihu.com/p/471350610（motif-在线集合）

https://www.jianshu.com/p/6aba8f1dea56（footprinting）

https://cloud.tencent.com/developer/article/1624517（esATAC pipeline）

https://zhuanlan.zhihu.com/p/57516178（chromVAR---参考数据库或denovo，找到富含的Motif）

https://www.jianshu.com/p/c65a1cb35d50（TFBSTools---已知Motif扫描，基因组）

https://meme-suite.org/meme/（MEME---denovo/enrichment/scanGenome）

https://zhuanlan.zhihu.com/p/602992278（Greenscreen）