No result for a reciprocal search of a very long protein #422

FlorianTrigodet · 2025-02-10T09:34:48Z

Hello!

First, thank you all for the work and effort on developing foldseek (and mmseqs). Really appreciate it :)

I was doing a search of a set of proteins to themselves. I made a db using ProstT5 and used easy-search with default parameters, when I noticed unusual bit scores for a protein (GC_00008862). At the time, I was using v9:

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
GC_00008862	GC_00019083	0.347	23	14	0	27804	27826	46	68	8.129E-80	8
...	...	...	...	...	...	...	...	...	...	...	...
GC_00008862	GC_00008862	1.000	4685	0	0	1	4685	1	4685	4.201E-72	0
GC_00008862	GC_00024198	0.307	13	8	0	968	980	155	167	3.868E-71	-1
...	...	...	...	...	...	...	...	...	...	...	...

I found it suspicious and indeed that protein had a unique length: ~28,000 AA. Is it a gene call artefact, a real protein? Good questions. But I would still imagine that it would be possible to include that protein in my search. I can see that there is a max sequence length parameter, but the default value is 65,535 so it should be fine.

I updated my foldseek installation to v10 and ran a couple of test with the full length protein, and a truncated version of it. For the short version, no problem: the reciprocal easy-search gives a 100% identity over the whole length of that protein. But for the long version of the protein, the result was now empty (compared to the wrong alignment length reported with v9).

Here are all the file to reproduce the error (fasta and dbs):
...

# full length protein
foldseek createdb full_protein.fa db_full --prostt5-model weights --threads 50
foldseek easy-search db_full db_full result_full tmp --threads 50

# truncated protein
foldseek createdb truncated_protein.fa db_truncated --prostt5-model weights --threads 50
foldseek easy-search db_truncated db_truncated result_truncated tmp --threads 4

Do you have any idea as to why a very long protein could cause this unexpected result?

Bonus mmseqs

While I was messing with this unusual protein, I found an unexpected behaviour with mmseqs2. I first noticed that mmseqs easy-search of my long protein was also outputting an incomplete alignment, just like the above output for foldseek v9. But when I used search of the db against itself, the result was correct. I could reproduce the error of easy-search by using search between two database made from the same sequence (a.k.a, I re-did what easy-search actually does).

To reproduce that error with the same dataset above:

# make two db with the same sequence
mmseqs createdb full_protein.fa db_1 --dbtype 1
mmseqs createdb full_protein.fa db_2 --dbtype 1

# search db_1 against itself, good output
mmseqs search db_1 db_1 align_1v1 tmp
mmseqs convertalis db_1 db_1 align_1v1 result_1v1

Output:

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
GC_00008862	GC_00008862	1.000	28178	0	0	1	28178	1	28178	0.000E+00	1695305458

# search db_1 against db_2, wrong output
mmseqs search db_1 db_2 align_1v2 tmp
mmseqs convertalis db_1 db_2 align_1v2 result_1v2

Output:

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
GC_00008862	GC_00008862	0.869	7188	941	0	1	7188	1	7188	0.000E+00	12938

Let me know if I should also open an issue for mmseqs.

Thanks a lot for your help!

The text was updated successfully, but these errors were encountered:

milot-mirdita · 2025-02-25T08:09:59Z

Could you try again with the latest foldseek commit? I think I fixed the root cause why this happened.

Basically, alignments of very long sequences can result in very large smith waterman scores, especially for foldseek, that combines AA and 3Di scores in the alignment. We have implemented a fast int8 and a fast int16 alignment, but had never bothered to implement a SW for larger alignment scores, as these where already very unlikely to occur.

I implemented the int32 fallback now, so it should work correctly.

milot-mirdita · 2025-02-25T08:10:39Z

I have not looked at the bonus issue with the difference of search and easy-search yet

FlorianTrigodet · 2025-02-25T12:51:58Z

Thank you so much @milot-mirdita. Works perfectly!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No result for a reciprocal search of a very long protein #422

No result for a reciprocal search of a very long protein #422

FlorianTrigodet commented Feb 10, 2025

milot-mirdita commented Feb 25, 2025

milot-mirdita commented Feb 25, 2025

FlorianTrigodet commented Feb 25, 2025

No result for a reciprocal search of a very long protein #422

No result for a reciprocal search of a very long protein #422

Comments

FlorianTrigodet commented Feb 10, 2025

Bonus mmseqs

milot-mirdita commented Feb 25, 2025

milot-mirdita commented Feb 25, 2025

FlorianTrigodet commented Feb 25, 2025