BLAST

注:由于不确定翻译起始,nt-->aa 有6种可能
faa: protein
fna: nucleotide
另:各种参数的选择建议参考同类论文
Install
conda install -c bioconda blast
conda install -c bioconda diamond
apt,conda,release 都可以安装;只是我的wsl中diamond只能成功安装v0.9
BLAST+ package
- search tools: blastn, blastp, blastx, tblastx, tblastn, psiblast, rpsblast, rpstblastn
- database tools: makeblastdb, blastdb_aliastool, makeprofiledb, blastdbcmd
makeblastdb
makeblastdb -in DB.fna -input_type fasta -dbtype nucl -out DB_nt -parse_seqids
makeblastdb -in DB.faa -input_type fasta -dbtype prot -out DB_aa -parse_seqids
## -dbtype nucl/prot/guess
## without -parse_seqids:
### DB_nt.nhr DB_nt.nin DB_nt.nsq
## with -parse_seqids:
### DB_nt.nhr DB_nt.nin DB_nt.nsq DB_nt.nog DB_nt.nsd DB_nt.nsi
blastdbcmd
## date & info
blastdbcmd -db DB_nt -info | less
## back to FASTA
blastdbcmd -db DB_nt -entry all | less
blastn
blastn -query in.fna -db DB_nt -out DB_nt.m8 -outfmt 6 -evalue 1e-5 -max_target_seqs 5 -num_threads 20
-outfmt可以控制输出列,详情见下文
-outfmt '6 qseqid sseqid qstart qend sstart send evalue'
对于短序列(e.g. 20bp),建议
-task blastn-short -word_size 4 -evalue 1
## -task: 'blastn' 'blastn-short' 'dc-megablast'
psiblast
- 迭代PSSM矩阵、不断发现新的match
- 一次为一条蛋白序列计算PSSM
- 对于SWISS数据库需要query序列>15bp,否则需要试试更大的数据库(e.g. NR)
- 生成 L*20维 矩阵,但长度不固定(L为input长度);后续作为机器学习输入前需要进行信息提取
psiblast -query single_in.faa -db DB_aa -out pssm.m8 -outfmt 6 -evalue 1e-3 -num_iterations 3 -out_ascii_pssm pssm.txt
Diamond
Diamond的blastp似乎比BLAST+更快,于是和蛋白数据库相关的一般用它;注意:无root权限时运行比对命令需要指定临时文件夹(需创建)
--tmpdir /tmp/mytmpdir/
makedb
Diamond只支持blastp、blastx;它俩使用protein DB
diamond makedb --in DB.faa --db DB_aa.dmnd
## db info
diamond dbinfo -d DB_aa.dmnd |less
## back to FASTA
diamond getseq -d DB_aa.dmnd |less
blastp
mkdir temp_dir
diamond blastp -d DB_aa.dmnd -q in.faa -o out_p.m8 -f 6 -e 1e-5 -k 5 -b 2 -t ./temp_dir
-f 6 (outformat 6: BLAST table format)
-k 10 (max-target-seqs,设置每个query比对结果的最大匹配数目)
-b sequence block size in billions of letters (default=2.0)
blastx
diamond blastx -d DB_aa.dmnd -q in.fna -o out_x.m8 -f 6 -e 1e-5 -k 5 -b 2 -t ./temp_dir \
--threads 20 --quiet --id 10 --subject-cover 50 --query-cover 50
Output Format
PSSM
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V
1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
3 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
4 G 0 -2 0 -1 -2 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
5 G 0 -2 0 -1 -2 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
6 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 4 -1 -3 -1 0 -1 -3 -2 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
8 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
9 G 0 -2 0 -1 -2 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00
......
m8
常用的-outfmt 6输出为:
X17276.1 X17276 100.000 556 0 0 1 556 1 556 0.0 1027
X51700.1 X53699 100.000 437 0 0 1 437 1 437 0.0 808
X51700.1 X51700 100.000 437 0 0 1 437 1 437 0.0 808
X68321.1 X68321 100.000 1512 0 0 1 1512 1 1512 0.0 2793
.....
1 2 3 4 5 6 7 8 9 10 11 12
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
Details
blastn -help
-outfmt <String>
alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
17 = Sequence Alignment/Map (SAM),
18 = Organism Report
Options 6, 7, 10 and 17 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers for options 6, 7 and 10 are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
When not provided, the default value is:
'qacc sacc pident length mismatch gapopen qstart qend sstart send evalue
bitscore', which is equivalent to the keyword 'std'
The supported format specifier for option 17 is:
SQ means Include Sequence Data
Default = `0'
diamond help
--outfmt (-f) output format
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM
Value 6 may be followed by a space-separated list of these keywords:
qseqid means Query Seq - id
qlen means Query sequence length
sseqid means Subject Seq - id
sallseqid means All subject Seq - id(s), separated by a ';'
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive - scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive - scoring matches
qframe means Query frame
btop means Blast traceback operations(BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
qcovhsp means Query Coverage Per HSP
qtitle means Query title
Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
参考
BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
Diamond: https://github.com/bbuchfink/diamond
BLAST比对结果: https://www.jianshu.com/p/9aa1a131473e
比对软件默认eval: https://www.jianshu.com/p/ea8218a3ff18
PSSM: https://blog.csdn.net/cbb_ft/article/details/124623766