Skip to content

No result for a reciprocal search of a very long protein #422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
FlorianTrigodet opened this issue Feb 10, 2025 · 3 comments
Open

No result for a reciprocal search of a very long protein #422

FlorianTrigodet opened this issue Feb 10, 2025 · 3 comments

Comments

@FlorianTrigodet
Copy link

Hello!

First, thank you all for the work and effort on developing foldseek (and mmseqs). Really appreciate it :)

I was doing a search of a set of proteins to themselves. I made a db using ProstT5 and used easy-search with default parameters, when I noticed unusual bit scores for a protein (GC_00008862). At the time, I was using v9:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
GC_00008862 GC_00019083 0.347 23 14 0 27804 27826 46 68 8.129E-80 8
... ... ... ... ... ... ... ... ... ... ... ...
GC_00008862 GC_00008862 1.000 4685 0 0 1 4685 1 4685 4.201E-72 0
GC_00008862 GC_00024198 0.307 13 8 0 968 980 155 167 3.868E-71 -1
... ... ... ... ... ... ... ... ... ... ... ...

I found it suspicious and indeed that protein had a unique length: ~28,000 AA. Is it a gene call artefact, a real protein? Good questions. But I would still imagine that it would be possible to include that protein in my search. I can see that there is a max sequence length parameter, but the default value is 65,535 so it should be fine.

I updated my foldseek installation to v10 and ran a couple of test with the full length protein, and a truncated version of it. For the short version, no problem: the reciprocal easy-search gives a 100% identity over the whole length of that protein. But for the long version of the protein, the result was now empty (compared to the wrong alignment length reported with v9).

Here are all the file to reproduce the error (fasta and dbs):
...

# full length protein
foldseek createdb full_protein.fa db_full --prostt5-model weights --threads 50
foldseek easy-search db_full db_full result_full tmp --threads 50

# truncated protein
foldseek createdb truncated_protein.fa db_truncated --prostt5-model weights --threads 50
foldseek easy-search db_truncated db_truncated result_truncated tmp --threads 4

Do you have any idea as to why a very long protein could cause this unexpected result?

Bonus mmseqs

While I was messing with this unusual protein, I found an unexpected behaviour with mmseqs2. I first noticed that mmseqs easy-search of my long protein was also outputting an incomplete alignment, just like the above output for foldseek v9. But when I used search of the db against itself, the result was correct. I could reproduce the error of easy-search by using search between two database made from the same sequence (a.k.a, I re-did what easy-search actually does).

To reproduce that error with the same dataset above:

# make two db with the same sequence
mmseqs createdb full_protein.fa db_1 --dbtype 1
mmseqs createdb full_protein.fa db_2 --dbtype 1
# search db_1 against itself, good output
mmseqs search db_1 db_1 align_1v1 tmp
mmseqs convertalis db_1 db_1 align_1v1 result_1v1

Output:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
GC_00008862 GC_00008862 1.000 28178 0 0 1 28178 1 28178 0.000E+00 1695305458
# search db_1 against db_2, wrong output
mmseqs search db_1 db_2 align_1v2 tmp
mmseqs convertalis db_1 db_2 align_1v2 result_1v2

Output:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
GC_00008862 GC_00008862 0.869 7188 941 0 1 7188 1 7188 0.000E+00 12938

Let me know if I should also open an issue for mmseqs.

Thanks a lot for your help!

@milot-mirdita
Copy link
Member

Could you try again with the latest foldseek commit? I think I fixed the root cause why this happened.

Basically, alignments of very long sequences can result in very large smith waterman scores, especially for foldseek, that combines AA and 3Di scores in the alignment. We have implemented a fast int8 and a fast int16 alignment, but had never bothered to implement a SW for larger alignment scores, as these where already very unlikely to occur.

I implemented the int32 fallback now, so it should work correctly.

@milot-mirdita
Copy link
Member

I have not looked at the bonus issue with the difference of search and easy-search yet

@FlorianTrigodet
Copy link
Author

Thank you so much @milot-mirdita. Works perfectly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants