You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
First, thank you all for the work and effort on developing foldseek (and mmseqs). Really appreciate it :)
I was doing a search of a set of proteins to themselves. I made a db using ProstT5 and used easy-search with default parameters, when I noticed unusual bit scores for a protein (GC_00008862). At the time, I was using v9:
qseqid
sseqid
pident
length
mismatch
gapopen
qstart
qend
sstart
send
evalue
bitscore
GC_00008862
GC_00019083
0.347
23
14
0
27804
27826
46
68
8.129E-80
8
...
...
...
...
...
...
...
...
...
...
...
...
GC_00008862
GC_00008862
1.000
4685
0
0
1
4685
1
4685
4.201E-72
0
GC_00008862
GC_00024198
0.307
13
8
0
968
980
155
167
3.868E-71
-1
...
...
...
...
...
...
...
...
...
...
...
...
I found it suspicious and indeed that protein had a unique length: ~28,000 AA. Is it a gene call artefact, a real protein? Good questions. But I would still imagine that it would be possible to include that protein in my search. I can see that there is a max sequence length parameter, but the default value is 65,535 so it should be fine.
I updated my foldseek installation to v10 and ran a couple of test with the full length protein, and a truncated version of it. For the short version, no problem: the reciprocal easy-search gives a 100% identity over the whole length of that protein. But for the long version of the protein, the result was now empty (compared to the wrong alignment length reported with v9).
Here are all the file to reproduce the error (fasta and dbs):
...
Do you have any idea as to why a very long protein could cause this unexpected result?
Bonus mmseqs
While I was messing with this unusual protein, I found an unexpected behaviour with mmseqs2. I first noticed that mmseqs easy-search of my long protein was also outputting an incomplete alignment, just like the above output for foldseek v9. But when I used search of the db against itself, the result was correct. I could reproduce the error of easy-search by using search between two database made from the same sequence (a.k.a, I re-did what easy-search actually does).
To reproduce that error with the same dataset above:
# make two db with the same sequence
mmseqs createdb full_protein.fa db_1 --dbtype 1
mmseqs createdb full_protein.fa db_2 --dbtype 1
# search db_1 against itself, good output
mmseqs search db_1 db_1 align_1v1 tmp
mmseqs convertalis db_1 db_1 align_1v1 result_1v1
Could you try again with the latest foldseek commit? I think I fixed the root cause why this happened.
Basically, alignments of very long sequences can result in very large smith waterman scores, especially for foldseek, that combines AA and 3Di scores in the alignment. We have implemented a fast int8 and a fast int16 alignment, but had never bothered to implement a SW for larger alignment scores, as these where already very unlikely to occur.
I implemented the int32 fallback now, so it should work correctly.
Hello!
First, thank you all for the work and effort on developing foldseek (and mmseqs). Really appreciate it :)
I was doing a search of a set of proteins to themselves. I made a db using ProstT5 and used
easy-search
with default parameters, when I noticed unusual bit scores for a protein (GC_00008862). At the time, I was using v9:I found it suspicious and indeed that protein had a unique length: ~28,000 AA. Is it a gene call artefact, a real protein? Good questions. But I would still imagine that it would be possible to include that protein in my search. I can see that there is a max sequence length parameter, but the default value is 65,535 so it should be fine.
I updated my foldseek installation to v10 and ran a couple of test with the full length protein, and a truncated version of it. For the short version, no problem: the reciprocal
easy-search
gives a 100% identity over the whole length of that protein. But for the long version of the protein, the result was now empty (compared to the wrong alignment length reported with v9).Here are all the file to reproduce the error (fasta and dbs):
...
Do you have any idea as to why a very long protein could cause this unexpected result?
Bonus mmseqs
While I was messing with this unusual protein, I found an unexpected behaviour with mmseqs2. I first noticed that mmseqs
easy-search
of my long protein was also outputting an incomplete alignment, just like the above output for foldseek v9. But when I usedsearch
of the db against itself, the result was correct. I could reproduce the error ofeasy-search
by usingsearch
between two database made from the same sequence (a.k.a, I re-did what easy-search actually does).To reproduce that error with the same dataset above:
# make two db with the same sequence mmseqs createdb full_protein.fa db_1 --dbtype 1 mmseqs createdb full_protein.fa db_2 --dbtype 1
# search db_1 against itself, good output mmseqs search db_1 db_1 align_1v1 tmp mmseqs convertalis db_1 db_1 align_1v1 result_1v1
Output:
# search db_1 against db_2, wrong output mmseqs search db_1 db_2 align_1v2 tmp mmseqs convertalis db_1 db_2 align_1v2 result_1v2
Output:
Let me know if I should also open an issue for mmseqs.
Thanks a lot for your help!
The text was updated successfully, but these errors were encountered: