CDHIT
快速进行序列去重/序列相似性聚类,原理:
- words:若需要xx%相似度,则至少有yy个长度为zz的相同word
- index table for words
关联:UCLUST
Install
conda按照,或Clone,或于此处下载:https://github.com/weizhongli/cdhit/releases
conda install -c bioconda cd-hit
############################## OR #####################(4.8.1)###
sudo apt install zlib1g-dev -y
git clone https://github.com/weizhongli/cdhit.git
cd cdhit
make openmp=no
cd cd-hit-auxtools
make
export PATH=$PATH:/mnt/d/WSL_dir/workdir/cdhit/:/mnt/d/WSL_dir/workdir/cdhit/cd-hit-auxtools/:/mnt/d/WSL_dir/workdir/cdhit/psi-cd-hit/
注意,Conda安装仅支持cd-hit、cd-hit-2d、cd-hit-est、cd-hit-est-2d、cd-hit-454、cd-hit-div;其它命令须Git Clone + MAKE 安装才支持
Commands
- 后缀
.faa = FASTA Amino Acid file
.ffn = FASTA nucleotide coding regions file
.fna = FASTA Nucleic Acid file
- word_length & thresholds
####### protein
-n word_length
5 for thresholds 0.7 ~ 1.0 (default)
4 for thresholds 0.6 ~ 0.7
3 for thresholds 0.5 ~ 0.6
2 for thresholds 0.4 ~ 0.5
####### nucleotide
-n word_length
10-11 for thresholds 0.95 ~ 1.0
8-9 for thresholds 0.90 ~ 0.95
7 for thresholds 0.88 ~ 0.9
6 for thresholds 0.85 ~ 0.88
5 for thresholds 0.80 ~ 0.85
4 for thresholds 0.75 ~ 0.8
- local/global sequence identity
-G If 1 (default):"global sequence identity" =
total length of all co-linear and non-overlapping HSPs
divided by length of the shorter sequence;
If 0: "local sequence identity" =
identity of the top high score HSP;
Only use -G 0 when using alignment coverage controls -aL, -AL, -aS, -AS
cd-hit
Clusters a protein dataset,每个cluster会有一个代表序列
cd-hit -i input.faa -o cdhit.faa -c 0.9 -G 1 -n 5 -M 16000 -T 8 -bak 1
-i input.fa(.gz)*
-o output.fa*
-c sequence identity threshold, default 0.9
-G sequence identity;1:global (default), 0:local
-n word_length (default 5), selection method see above
-M memory limit (in MB)
-T number of threads
-bak write backup cluster file (1 or 0-default)
cd-hit-2d
Compares protein dataset B to A
cd-hit-2d -i inputA.faa -i2 inputB.faa -o inB.notinA.faa -c 0.9 -n 5 -M 16000 -T 8 -bak 1
cd-hit-est
Clusters a nucleotide dataset (non-intron containing); -j -op for paired end (PE) files
cd-hit-est -i input.ffn -o cdhitest.ffn -c 0.95 -G 1 -n 10 -M 16000 -T 8 -bak 1
cd-hit-est -i input_F.ffn -j input_R.ffn -o cdhitest_F.ffn -op cdhitest_R.ffn -c 0.95 -G 1 -n 10 -M 16000 -T 8 -bak 1
cd-hit-est-2d
Compares nucleotide dataset B to A (both non-intron containing);
- -j/-op for paired end (PE) files
- 1/2 for A/B datasets
cd-hit-2d -i inputA.ffn -i2 inputB.ffn -o BmapA.ffn -c 0.9 -n 5 -M 16000 -T 8 -bak 1
cd-hit-2d -i inputA_F.ffn -j inputA_R.ffn -i2 inputB_F.ffn -j2 inputB_R.ffn -o BmapA_F.ffn -o BmapA_R.ffn -c 0.9 -n 5 -M 16000 -T 8 -bak 1
## -mask NX: mask out both 'N' and 'X'
psi-cd-hit.pl
PSI-CD-HIT clusters proteins at very low threshold (cd-hit support threshold >40%), it also cluster long DNA sequences (cd-hit-est can not handle that)
- Sort seq by decreasing length
- 1st-seq as the first representative
- Blast 1st-seq aganist all remaining sequences, pick up seqs that fits clustering threshold
- Repeat until done
psi-cd-hit.pl -i input.faa -o psicdhit.faa -c 0.3 -prog blastp
psi-cd-hit.pl -i Long_contigs.fa -o Long_contigs_dRep.fa -c 0.9 -prog blastn
-prog blastp(default), blastn, megablast, psiblast
-circle treat sequences as circular sequence (0:no-default, 1:yes)
-g a seq clustered to the first fitted cluster (1:fast-default) or the most similar cluster (0:slow)
-exec qsub, local
-s blast para, "-seg yes -evalue 0.000001 -max_target_seqs 100000" (default)
(need BLAST) & apply it via Hierarchical clustering!!
Sequencing reads
- cd-hit-454:Identify duplicated 454 reads
- cd-hit-dup:Removing duplicates from sequencing reads (single/paired-end), remove chimeric reads (opt.)
- cd-hit-lap:Extracting pairs of overlapping reads (with perfect matching, tail-head overlaps)
- read-linker:Concatenate pair-end reads into single ones
TBA
Tools
- cd-hit-div: Sort & Divide a FASTA file into pieces
- plot_len.pl: Print out distributions of clusters & sequences
- clstr_sort_by.pl: Sort clusters in .clstr file
- clstr_sort_prot_by.pl: Sort sequences within clusters in .clstr file
- clstr_merge_noorder.pl: Merges two or more .clstr files
- clstr_ renumber.pl: Renumbers clusters and sequences within clusters in .clstr file
- clstr_rev.pl: Combines a .clstr file with its parent .clstr file
- make_multi_seq.pl: Reads in .clstr file and generates a separate fasta file for each oversized clusters
- clstr2xml.pl: Convert .clstr file to xml format
cd-hit-div -i input.faa -o div_out.faa -div 3 ## Output: div_out.faa-0 div_out.faa-1 div_out.faa-2
TBA
Output Files
- Clustering's
cd-hit/cd-hit-est -o out_F.fa -op out_R.fa ...
out_*.fa: rep seq of each clusters
out_*.fa.clstr: list of clusters (by Cluster)
out_*.fa.bak.clstr: list of clusters (by Seq)
- Comparing's
cd-hit-2d/cd-hit-est-2d -o out_F.fa -op out_R.fa ...
out_*.fa: seq in B that connot maps to A
out_*.fa.clstr: list of similar sequences between A,B (by Cluster)
out_*.fa.bak.clstr: list of similar sequences between A,B (by Seq)
- Clustering's: psicdhit
psicdhit -o psi_out.fa ...
psi_out.fa
psi_out.fa.clstr
psi_out.fa.log
psi_out.fa.out
psi_out.fa.restart
.clstr
>Cluster 0
0 1806nt, >X15065.1... at +/98.50%
1 80423nt, >L07835.1... *
>Cluster 1
0 54962nt, >L10351.1... *
>Cluster 2
0 784nt, >V00880.1... at +/100.00%
1 521nt, >V00881.1... at +/99.81%
2 893nt, >X02215.1... at +/96.86%
3 44594nt, >M18818.1... *
4 1604nt, >J00661.1... at +/100.00%
5 2638nt, >J00664.1... at +/99.73%
6 1827nt, >J00659.1... at +/99.45%
.....
idx Seq_len,>Seq_ID... at <+-Strand>/Seq_identity
有*者为代表序列
单位可为aa/nt
.bak.clstr
13604 556nt, >X17276.1... *
14615 437nt, >X51700.1... at +/100.00%
7026 1512nt, >X68321.1... *
3759 2367nt, >X55027.1... *
2311 540nt, >Z12029.1... at +/98.33%
5859 1759nt, >X52700.1... *
5859 1758nt, >X52701.1... at +/95.56%
5859 1758nt, >X52702.1... at +/95.28%
15101 422nt, >X52706.1... *
15101 410nt, >X52703.1... at +/95.12%
15101 422nt, >X52704.1... at +/97.87%
.....
Cluster_ID Seq_len,>Seq_ID... at <+-Strand>/Seq_identity
有*者为代表序列
单位可为aa/nt
.restart
if program crash, stoped, termitated, you can restart with .restart file
1259 0 * 1
1510 0 0.0/6344:660aa/96.87% 1
1521 0 0.0/4860:1387:656aa/95.13% 1
5465 0 0.0/6344:660aa/97.75% 1
5514 0 0.0/6344:660aa/97.6% 1
1349 0 0.0/6343:660aa/96.61% 1
2668 0 0.0/6343:660aa/96.96% 1
401 0 0.0/6344:660aa/97.7% 1
2723 0 0.0/6344:660aa/97.57% 1
3707 0 0.0/6344:660aa/97.53% 1
2395 0 0.0/4059:2190:660aa/96.2% 1
2628 0 0.0/6344:660aa/97.63% 1
3225 0 0.0/6344:660aa/97.48% 1
2762 0 0.0/2942:1967:1246:660aa/93.98% 1
3534 0 0.0/6227:660aa/97.49% 1
1859 0 0.0/6039:426:39aa/95.25% 1
418 1 * 1
5154 2 * 1
3624 3 * 1
4086 4 * 1
Seq files
header是原文件中header
>KJX92028.1 hypothetical protein TI39_contig5958g00003 [Zymoseptoria brevis]
......
Tricks
Incremental clustering
- 往大文件里加新序列
- Find new seqs in Additional.fa (cd-hit/-est-2d)
- Dereplicat new seqs (cd-hit/-est)
- Add dereplicated seqs to all.fa
cd-hit-est-2d -i all.fa -i2 add.fa -o new.fa -c 0.95 -n 10 -M 16000 -T 16
cd-hit-est -i new.fa -o new_drep.fa -c 0.95 -n 10 -M 16000 -T 16
cat new_drep.fa >> all.fa
Hierarchical clustering
- 多步迭代运行, clustering via neighbor-joining method, 从而生成层次结构
- 比一步聚类更快、更准确, 可以避免两个非常相似的序列A和B被归于不同的聚类中
cd-hit -i nr -o nr80 -c 0.8 -n 5 ## 0.8
cd-hit -i nr80 -o nr60 -c 0.6 -n 4 ## 0.6
psi-cd-hit.pl -i nr60 -o nr30 -c 0.3 ## 0.3
clstr_rev.pl nr80.clstr nr60.clstr > nr80-60.clstr
clstr_rev.pl nr80-60.clstr nr30.clstr > nr80-60-30.clstr
可以0.9-0.6-0.3,总之不要一下子降很快
参考
https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide